Introduction

We will do an example of a bootstrapping for RNA-seq data collected for a differential expression analysis.

The Data

Blekhman, et al, (2010) used RNA-seq to interrogate liver samples in male and female human, chimpanzee and Rhesus macaque.
Samples were mapped to the human genome for read identification. This is described in the supplement to the paper.
As in any cross-species comparison, differences in sample characteristics and the underlying genome are confounded with expression differences. The similarity in coding region is about 98% for humans and chimpanzees, but it is undoubtedly less for Rhesus macaques.

There were 3 biological replicates of each combination of gender and species, each divided in 2 sequencing lanes. 20689 features were tabulated in each of the 36 lanes, with a total of 71 million mappable 35 bp reads. (This is an “old” dataset - typically now each sample will have about 25 million reads with minimum length 50 bp.) The raw reads are downloadable from GEO dataset GSE17274. The processed counts per feature can be downloaded from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17274.

In this example, I preprocessed the data by adding the reads across the 2 lanes per sample. (In more recent studies, we would preprocess by splitting bar-coded data from individual lanes into individual samples.) As well, I removed genes that have 10 or fewer than 10 reads across all the samples, and reduce the data to just human and chimpanzee.

To start, we need to load the data into R from the csv text file

ReadCounts=read.csv("liverReads.csv",header=T,row.names=1)
ReadCounts[1:5,1:5]
##                 HS1 HS2 HS3 HS4 HS5
## ENSG00000000003 121 329 421 300 359
## ENSG00000000005   0   0   1   0   0
## ENSG00000000419  39  81  78  61  62
## ENSG00000000457 114  91  73  62  95
## ENSG00000000460  15   6   7  17  17
dim(ReadCounts)
## [1] 14387    12

We have N=12 samples, with n=6 samples for each condition=species. We have 14387 genes.

Generating a Resampling Bootstrap Sample

The resampling bootstrap is the simplest method for creating synthetic data.

To generate a sample from the null distribution, we simply select 12 samples with replacement from the 12 original samples. We would retain the species label.

samp=sample(1:12,12,replace=TRUE)
bReadCounts=ReadCounts[,samp]
colnames(bReadCounts)=colnames(ReadCounts)

This produces a single sample from the multivariate null distribution. It preserves the distribution of gene expression in each gene and also the correlation structure.

We repeat this several thousand times to obtain an estimate of the multivariate distribution of any statistical test under the null hypothesis.

Notice that the library sizes are also maintained.

We could use these bootstrap samples in various ways.

e.g. We could do a differential expression analysis using our favorite RNA-seq software such as cuffdiff, edgeR or DESeq. However, we could adjust our p-values or FDR estimates using these samples from the null.

To generate a sample from the non-Null distribution, we sample with replacement from the original species labels. Humans are in columns 1:6 and chimpanzees are in columns 7:12.

samp=c(sample(1:6,6,replace=TRUE),sample(7:12,6,replace=TRUE))
bReadCounts=ReadCounts[,samp]
colnames(bReadCounts)=colnames(ReadCounts)

We could use these samples to obtain confidence intervals for the effect sizes for each gene.

Generating a Noisy Resampling Bootstrap Sample

We want to generate Poisson random counts using the observed counts as the mean.

We need to change the zeroes in the count data to 0.25 and also store the data as a matrix instead of a dataframe. We also need to compute the number of entries in the matrix which we call nn.

ReadCountsAdj=as.matrix(ReadCounts)
ReadCountsAdj=ReadCountsAdj-.3
ReadCountsAdj[ReadCountsAdj<0]=0.25
nn=nrow(ReadCountsAdj)*ncol(ReadCountsAdj)

We now have a matrix with our Poisson means. The Poisson random number generator can simultaneously generate nn random Poisson counts with nn different means. However, we need to reorganize these into the data matrix.

bNoisyReads=matrix(rpois(nn,ReadCountsAdj),ncol=ncol(ReadCountsAdj))

Lets have a look at how these compare to the actual read counts and the adjusted read counts.

ReadCounts[1:5,1:5]    # actual data
##                 HS1 HS2 HS3 HS4 HS5
## ENSG00000000003 121 329 421 300 359
## ENSG00000000005   0   0   1   0   0
## ENSG00000000419  39  81  78  61  62
## ENSG00000000457 114  91  73  62  95
## ENSG00000000460  15   6   7  17  17
ReadCountsAdj[1:5,1:5] # estimated Poisson means
##                    HS1    HS2   HS3    HS4    HS5
## ENSG00000000003 120.70 328.70 420.7 299.70 358.70
## ENSG00000000005   0.25   0.25   0.7   0.25   0.25
## ENSG00000000419  38.70  80.70  77.7  60.70  61.70
## ENSG00000000457 113.70  90.70  72.7  61.70  94.70
## ENSG00000000460  14.70   5.70   6.7  16.70  16.70
bNoisyReads[1:5,1:5]   # synthetic data
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  128  294  440  288  352
## [2,]    0    0    0    1    0
## [3,]   54   84   87   61   51
## [4,]  132  100   88   58   93
## [5,]    8    4    8   14   18

Finally, lets have a look at what happened to the library sizes:

colSums(ReadCounts)
##     HS1     HS2     HS3     HS4     HS5     HS6     PT1     PT2     PT3 
## 2843800 3384520 4384348 3420649 3798676 2843405 5344280 3175223 3791033 
##     PT4     PT5     PT6 
## 3553779 3651618 3548120
colSums(bNoisyReads)
##  [1] 2839979 3380886 4380614 3415833 3793699 2841290 5339896 3171835
##  [9] 3790150 3553355 3647385 3546851

We see that we have done a fairly good job of reproducing at least one feature of the data (library size) even though we added noise.

By generating many bootstrap samples, we could e.g. create confidence intervals for quantities estimated from the data.

Generating a Parametric Bootstrap

We start by generating data from the Null distribution. We will estimate a mean and variance of the percentages for each gene.

We start by replacing the zeroes by 0.25.

ReadCounts0=as.matrix(ReadCounts)
ReadCounts0[ReadCounts0==0]=.25

We also need the library sizes, which are the total reads for each sample (column).

libSizes=colSums(ReadCounts0)

Next we convert all the counts to proportions within each column:

ReadProp=ReadCounts0
for (i in 1:ncol(ReadCounts0)) ReadProp[,i]=ReadCounts0[,i]/libSizes[i]

To generate a parametric bootstrap sample under the Null distribution, we assume that all the proportions for a single gene (row) come from the same distribution

means=rowMeans(ReadProp)
SDs=apply(ReadProp,1,sd)

These data have an unusual feature, a few genes make up a large percentage of the data. These genes also have very large SD.

oGenes=order(means,decreasing=TRUE)
median(means)
## [1] 1.129729e-05
head(means[oGenes[1:10]])
## ENSG00000163631 ENSG00000197249 ENSG00000171564 ENSG00000118137 
##      0.09309684      0.03823129      0.02379628      0.01872519 
## ENSG00000171557 ENSG00000171560 
##      0.01758394      0.01711802
median(SDs)
## [1] 4.625687e-06
head(SDs[oGenes[1:10]])
## ENSG00000163631 ENSG00000197249 ENSG00000171564 ENSG00000118137 
##     0.044706895     0.012728878     0.012559645     0.004976314 
## ENSG00000171557 ENSG00000171560 
##     0.009499684     0.009249042

While it might make sense to transform the means and SDs of the proportions into the parameters of a distribution like the Beta, which generates data between 0 and 1, this method did not work well for these data, as it created distributions for some genes that essentially had only 1 proportion. Instead, we add Normal noise, scaled by the SD, to the mean. After scaling up by the library size to obtain a mean count, we replaced all the means that are less than 0.25 by 0.25.

bProp=ReadProp
bMeans=bProp
for (i in 1:nrow(ReadProp)) bProp[i,]=rnorm(ncol(ReadProp),means[i],SDs[i])
for (i in 1:ncol(bProp))  bMeans[,i]=bProp[,i]*libSizes[i]
bMeans[bMeans<=0.25]=0.25

Finally, we generate our samples as Poisson with our generated mean count. (Note that it is important tat bMeans is a matrix and not a dataframe.)

bReads=matrix(rpois(nn,bMeans),ncol=ncol(bMeans))
colnames(bReads)=colnames(ReadCounts)

We can check to see if the synthetic data resemble the observed data.

head(ReadCounts)
##                 HS1 HS2 HS3 HS4 HS5 HS6 PT1 PT2 PT3 PT4 PT5 PT6
## ENSG00000000003 121 329 421 300 359 168 574 409 429 685 386 428
## ENSG00000000005   0   0   1   0   0   0   1   0   4   1   1   1
## ENSG00000000419  39  81  78  61  62  56 100  59  66  58  65  93
## ENSG00000000457 114  91  73  62  95  76 131 274 229 239  87 149
## ENSG00000000460  15   6   7  17  17  12   8  12   8   7   5  10
## ENSG00000000938  73  44  43  65  65 210  84 198 104  31  76  58
head(bReads)
##      HS1 HS2 HS3 HS4 HS5 HS6 PT1 PT2 PT3 PT4 PT5 PT6
## [1,] 160 519 665 470 321 192 520 218 408 316 239 211
## [2,]   2   0   1   1   0   0   4   0   1   0   3   1
## [3,]  50  68 102  56  50  54  87  57  79  82  71  63
## [4,] 102 238 216 110 263 159 106  96 202  64 132  21
## [5,]   6  24   5  22   5   5  14  16   5   5   1   0
## [6,]  69  68 112  51   0 158 211   9  84   0  31 110

To generate the null distribution for any statistic, we then generate many (say 1000) or these samples.

If we want to generate data under the non-Null distribution, we proceed in exactly the same way except the we compute the mean and SD of each gene for each condition.

meansHS=rowMeans(ReadProp[,1:6])
meansPT=rowMeans(ReadProp[,7:12])
SDsHS=apply(ReadProp[,1:6],1,sd)
SDsPT=apply(ReadProp[,7:12],1,sd)

We then generate the read proportions and read count means and the means in the same way, but using both sets of means and SDs.

bPropNon=ReadProp
bMeansNon=bProp
for (i in 1:nrow(ReadProp)) bPropNon[i,]=c(rnorm(6,meansHS[i],SDsHS[i]),rnorm(6,meansPT[i],SDsPT[i]))
for (i in 1:ncol(bPropNon))  bMeansNon[,i]=bPropNon[,i]*libSizes[i]
bMeansNon[bMeansNon<=0.25]=0.25

Finally, we generate our samples as Poisson with our generated mean count. (Note that it is important tat bMeans is a matrix and not a dataframe.)

bReadsNon=matrix(rpois(nn,bMeansNon),ncol=ncol(bMeansNon))
colnames(bReadsNon)=colnames(ReadCounts)

We can check to see if the synthetic data resemble the observed data.

head(ReadCounts)
##                 HS1 HS2 HS3 HS4 HS5 HS6 PT1 PT2 PT3 PT4 PT5 PT6
## ENSG00000000003 121 329 421 300 359 168 574 409 429 685 386 428
## ENSG00000000005   0   0   1   0   0   0   1   0   4   1   1   1
## ENSG00000000419  39  81  78  61  62  56 100  59  66  58  65  93
## ENSG00000000457 114  91  73  62  95  76 131 274 229 239  87 149
## ENSG00000000460  15   6   7  17  17  12   8  12   8   7   5  10
## ENSG00000000938  73  44  43  65  65 210  84 198 104  31  76  58
head(bReadsNon)
##      HS1 HS2 HS3 HS4 HS5 HS6 PT1 PT2 PT3 PT4 PT5 PT6
## [1,] 143 163 440 241 256 123 647 354 460 251 562 392
## [2,]   0   0   1   0   0   1   4   2   0   1   3   0
## [3,]  54  69  96  65  94  64  80  53  53  86  74  50
## [4,] 104  81 131 100 115  84 184 167 271 302 289 194
## [5,]   4  18  25  25  12   7  15   6  19   2  12  11
## [6,]   7  13 290  73 240 163 146  75  37  37 144  65

As always, we end with session information.

Session Information

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.4     tools_3.2.2     htmltools_0.3.5
##  [5] yaml_2.1.13     Rcpp_0.12.5     stringi_1.0-1   rmarkdown_0.9.6
##  [9] knitr_1.13      stringr_1.0.0   digest_0.6.9    evaluate_0.9