--- title: "Using simphony to evaluate rhythm detection" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using simphony to evaluate rhythm detection} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE} knitr::opts_chunk$set(message = FALSE, collapse = TRUE, comment = '#>', fig.retina = 2) ``` `simphony` is a framework for simulating rhythmic data, especially gene expression data. Here we show an example of using it to benchmark a method for detecting rhythmicity. ## Load the packages we'll use Internally, `simphony` uses the `data.table` package, which provides an enhanced version of the standard R `data.frame`. We'll use `data.table` for this example as well. ```{r} library('data.table') library('ggplot2') library('kableExtra') library('knitr') library('limma') library('precrec') library('simphony') ``` ## Simulate the data Here we create a `data.table` called `featureGroups` that specifies the desired properties of the simulated genes. We want 75% of simulated genes to be non-rhythmic and 25% to have a rhythm amplitude of 1.1. Properties not specified in `featureGroups` will be given their default values. Our simulated experiment will have 200 genes. Expression values will be sampled from the negative binomial family, which models read counts from next-generation sequencing data. The interval between time points will be 2 (default period of 24), with one replicate per time point. We also use the default time range of our simulated data points of between 0 and 48 hours. ```{r} set.seed(44) featureGroups = data.table(fracFeatures = c(0.75, 0.25), amp = c(0, 0.3)) simData = simphony(featureGroups, nFeatures = 200, interval = 2, nReps = 1, family = 'negbinom') ``` The output of `simphony` has three components: `abundData`, `sampleMetadata`, and `featureMetadata`. `abundData` is a matrix that contains the simulated expression values. Each row of corresponds to a gene, each column corresponds to a sample. Since we sampled from the negative binomial family, all expression values are integers. ```{r} kable(simData$abundData[1:3, 1:3]) ``` `sampleMetadata` is a `data.table` that contains the condition (`cond`) and time for each sample. Here we simulated one condition, so `cond` is 1 for all samples. ```{r} kable(simData$sampleMetadata[1:3]) ``` `featureMetadata` is a `data.table` that contains the properties of each simulated gene in each condition. The `group` column corresponds to the row in `featureGroups` to which the gene belongs. ```{r} kable(simData$featureMetadata[149:151, !'dispFunc']) %>% kable_styling(font_size = 12) ``` ## Plot the simulated time-course for selected genes Here we plot the simulated time-course for a non-rhythmic gene and a rhythmic gene. We use the `mergeSimData` function to merge the expression values, the sample metadata, and the gene metadata. ```{r} fmExample = simData$featureMetadata[feature %in% c('feature_150', 'feature_151')] dExample = mergeSimData(simData, fmExample$feature) ``` We also want to compare the simulated expression values with their underlying distributions over time, for which we can use the `getExpectedAbund` function. Since we sampled from the negative binomial family, the resulting `mu` column corresponds to the expected log~2~ counts. ```{r} dExpect = getExpectedAbund(fmExample, 24, times = seq(0, 48, 0.25)) ``` Then it all comes together with `ggplot`. ```{r, fig.width = 6, fig.height = 2.75} dExample[, featureLabel := paste(feature, ifelse(amp0 == 0, '(non-rhythmic)', '(rhythmic)'))] dExpect[, featureLabel := paste(feature, ifelse(amp0 == 0, '(non-rhythmic)', '(rhythmic)'))] ggplot(dExample) + facet_wrap(~ featureLabel, nrow = 1) + geom_line(aes(x = time, y = log2(2^mu + 1)), size = 0.25, data = dExpect) + geom_point(aes(x = time, y = log2(abund + 1)), shape = 21, size = 2.5) + labs(x = 'Time (h)', y = expression(log[2](counts + 1))) + scale_x_continuous(limits = c(0, 48), breaks = seq(0, 48, 8)) ``` ## Detect rhythmic genes We can use the `limma` package to detect rhythmic genes based on a linear model that corresponds to cosinor regression. ```{r} sampleMetadata = copy(simData$sampleMetadata) sampleMetadata[, timeCos := cos(time * 2 * pi / 24)] sampleMetadata[, timeSin := sin(time * 2 * pi / 24)] design = model.matrix(~ timeCos + timeSin, data = sampleMetadata) ``` Here we follow the typical `limma` workflow: fit the linear model for each gene, run empirical Bayes, and extract the relevant summary statistics. We pass `lmFit` the log~2~ transformed counts. ```{r} fit = lmFit(log2(simData$abundData + 1), design) fit = eBayes(fit, trend = TRUE) rhyLimma = topTable(fit, coef = 2:3, number = Inf) ``` ## Evaluate accuracy of rhythmic gene detection First we merge the results from `limma` with the known amplitudes from `featureMetadata`. ```{r} rhyLimma$feature = rownames(rhyLimma) rhyLimma = merge(data.table(rhyLimma), simData$featureMetadata[, .(feature, amp0)], by = 'feature') ``` We can plot the distributions of p-values of rhythmicity for non-rhythmic and rhythmic genes. P-values for non-rhythmic genes are uniformly distributed between 0 and 1, as they should be under the null hypothesis. P-values for rhythmic genes, on the other hand, tend to be closer to 0. ```{r, fig.width = 3.5, fig.height = 3} ggplot(rhyLimma) + geom_jitter(aes(x = factor(amp0), y = P.Value), shape = 21, width = 0.2) + labs(x = expression('Rhythm amplitude ' * (log[2] ~ counts)), y = 'P-value of rhythmicity') ``` Finally, we can summarize the ability to distinguish non-rhythmic and rhythmic genes using a receiver operating characteristic (ROC) curve (here we use the `precrec` package). ```{r, fig.width = 3, fig.height = 3} rocprc = evalmod(scores = -log(rhyLimma$P.Value), labels = rhyLimma$amp0 > 0) autoplot(rocprc, 'ROC') ```