May 29, 2016

Improving your simulation study

Simulation can be used to determine how well a statistical or bioinformatics procedure works when the data are generated by a known mechanism.

For example, we can obtain more accurate p-values and false discovery rates or determine if a confidence interval is about the right size.

Improving your simulation study S2

In some cases, the best we can do is to assume a "best case" scenario and know that the results will be worse with our actual data.

e.g. Assuming that the data are Normally distributed and independent is the "best case" for many statistical procedues.

We can do better if we can create synthetic data that is more like the real data.

Generating synthetic data

Synthetic data can be created

  1. in vivo (in model systems)


  1. in vitro (by creating synthetic samples)



  1. in silico (by building numerical models)

Generating synthetic data

e.g. to determine how well a method works for differential expression:

  1. in vivo (in model systems)

use WT and knock-down in a simple system

  1. in vitro (by creating synthetic samples)

use a titration series consisting of mixtures of human and mouse mRNA at different proportions

  1. in silico (by building numerical models)

generate sythetic gene expression from a mathematical model with some genes that do and others that do no differentially express

Generating synthetic data: in silica

Building your own computational method is like building your own lab equipment: you should only do it if you have no other choice.

Generating synthetic data: in silica

But if you must …

Mimic the most important features of the data.

The Bootstrap Idea

The bootstrap is a statistical method to use the data as a basis for generating random samples that are similar to the original data generating mechanism.

3 types of bootstrap:

  1. Resampling (nonparametric) bootstrap.
  2. Noisy resampling (semiparametric) bootstrap.
  3. Parametric bootstrap.

The Bootstrap Idea S2

To estimate p-values: generate random samples as if there is no difference among conditions (Null distribution)

To estimate confidence intervals: generate random samples within condition (non-Null distribution)

For other estimates: consider the appropriate sampling distributions

Resampling Bootstrap

Assume that there are N samples with n samples from each "condition".

For each sample, multiple features (gene expression, SNPs, …) are measured.

NULL distribution: take N samples at random with replacement from all the samples

Non-NULL distribution: take n samples at random with replacement within each condition

Resampling Bootstrap S2

Pro:

  1. Preserves the association among features (e.g. gene and protein networks)
  2. Retains the empirical distribution of the data.

Con:

  1. With small sample sizes may not provide enough different bootstrap samples
  2. Difficult to preserve correlations among samples such as pairing, familial correlations (unless the sample is the cluster)
  3. Does not consider any values more extreme than observed.

Resampling Bootstrap Example

We will look at generating a resampling bootstrap sample from RNA-seq data. These are liver samples from humans and chimpanzees, mapped to the human genome. Genes with 8 or fewer reads were removed before bootstrap resampling.

There are N=12 samples and n=6 samples within each species.

Noisy Resampling Bootstrap

We add noise to every observation. Notice that this will be another sample from the observed (usually non-Null) distribution.

The Noisy Bootstrap is not used to generate samples from the Null distribution.

For intensity data, this can be done by adding a small amount of additive noise - e.g. adding Normal noise with mean zero and a small SD.

It is not so easy to see how to do this for count data such as RNA-seq data, since we need to maintain the data as whole numbers.

Counts of 0 are particularly problematic.

Noisy Resampling Bootstrap S2

Pro:

  1. Preserves some of the association among features (e.g. gene and protein networks).
  2. Preserves some of the association among samples (e.g. pairing or familial).
  3. Can be done with smaller sample sizes than the resampling bootstrap.
  4. Introduces values plausible from but not observed in the original data.

Noisy Resampling Bootstrap S3

Con:

  1. Introduces additional variation in the observations.
  2. Weakens the association among features.
  3. Weakens the association among samples.
  4. Since noise is added to the observed data, you cannot simulated from the Null distribution.

Noisy Resampling: Count Data

Count data is often modeled with a Poisson Distribution.

One idea is to use the observed read count as the mean of a Poisson distribution and generate a new read count from that Poisson.

Noisy Resampling: Count Data S2

par(mfrow=c(2,2))
hist(rpois(1000,.25),main="mean=0.25")
hist(rpois(1000,1),main="mean=1")
hist(rpois(1000,10),main="mean=10")
hist(rpois(100,50),main="mean=50")

Noisy Resampling: Count Data S3

We can see that if the Poisson mean is 0.25, we generate 0 about 60% of the time, so we can replace the 0's in the data by 0.25 for data generation.

The median is very close to the mean+0.3, so we are approximately equally likely to produce a synthetic count above or below the observed count.

For small means, the data are skewed towards larger numbers, so that if we observe 1 read and generate new data with this mean, we could generate a count as large as 4 or 5.

It is better to subtract 0.3 from the counts to move the noisy counts closer to the observed counts.

Noisy Resampling: Count Data S3

We will try this simple method:

  1. Replace all the non-zero counts by count-.3
  2. Replace all the zeroes by 0.25

Use these adjusted counts as the mean and generate Poisson data.

Parametric Bootstrap

For the parametric bootstrap, we assume that the data come from a distribution known up to a few parameters (e.g. mean and variance) which we will estimate from the data.

We then generate the bootstrap samples from the distribution with the estimated parameters plugged in.

e.g. For continuous data, we might assume that the data are Normal and estimate the mean and variance. We then generate random Normals with that mean and variance.

Parametric Bootstrap S2

Pro:

  1. It is simple to do.
  2. Can be done with smaller sample sizes than the resampling bootstrap.
  3. Introduces values plausible from but not observed in the original data.

Con:

  1. It is very difficult to capture assocation among features or samples unless this is also estimated.
  2. Often the distributions selected may be too simple - e.g. the Normal distribution does not have enough outliers to model gene expression data.

Parametric Bootstrap: Count Data

Count data has both Poisson variation (due to capture probability) and biological variation (due to different proportions of reads coming from each gene in each sample).

Sophisticated distributions can be fitted using the Negative Binomial distribution using e.g. RNA-seq software which estimates the mean and the dispersion.

Parametric Bootstrap: Count Data S2

We will use a simpler approach for our RNA-seq data. For each gene:

  1. Estimate the mean and variance of the percentage of the reads for each gene (after replacing the zeroes by 0.25).
  2. Use the mean and variance to generate proportions
  3. Multiple the proportion by the library size to obtain a mean for the Poisson count.
  4. Generate the Poisson count.

This can be done using the Null distribution (one distribution for each gene) or under the non-Null, a distribution for each gene under each condition.

Parametric Bootstrap: Count Data S3

These data have a very unusual feature - a few genes make up a large proportion of the sample.

For this reason, instead of generating the proportions by matching the mean and variance to a Beta distribution (which is usually used for generating random proportions) we will use another strategy.

Session Information

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.4     tools_3.2.2     htmltools_0.3.5
##  [5] yaml_2.1.13     Rcpp_0.12.5     stringi_1.0-1   rmarkdown_0.9.6
##  [9] knitr_1.13      stringr_1.0.0   digest_0.6.9    evaluate_0.9