We will work through an example of bootstrapping RNA-seq data collected for a differential expression analysis.
Blekhman et al. (2010) used RNA-seq to interrogate liver samples from male and female humans, chimpanzees and Rhesus macaques.
Samples were mapped to the human genome for read identification. This is described in the supplement to the paper.
As in any cross-species comparison, differences in sample characteristics and the underlying genome are confounded with expression differences. The similarity in coding regions is about 98% between humans and chimpanzees, but it is undoubtedly less for Rhesus macaques.
There were 3 biological replicates of each combination of gender and species, each divided into 2 sequencing lanes. 20689 features were tabulated in each of the 36 lanes, with a total of 71 million mappable 35 bp reads. (This is an “old” dataset; typically now each sample will have about 25 million reads with a minimum length of 50 bp.) The raw reads are downloadable from GEO dataset GSE17274. The processed counts per feature can be downloaded from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17274.
In this example, I preprocessed the data by adding the reads across the 2 lanes per sample. (In more recent studies, we would preprocess by splitting bar-coded data from individual lanes into individual samples.) I also removed genes with 10 or fewer reads across all the samples, and reduced the data to just the human and chimpanzee samples.
To start, we need to load the data into R from the csv text file:
ReadCounts=read.csv("liverReads.csv",header=T,row.names=1)
ReadCounts[1:5,1:5]
## HS1 HS2 HS3 HS4 HS5
## ENSG00000000003 121 329 421 300 359
## ENSG00000000005 0 0 1 0 0
## ENSG00000000419 39 81 78 61 62
## ENSG00000000457 114 91 73 62 95
## ENSG00000000460 15 6 7 17 17
dim(ReadCounts)
## [1] 14387 12
We have N=12 samples, with n=6 samples for each condition (species), and 14387 genes.
The resampling bootstrap is the simplest method for creating synthetic data.
To generate a sample from the null distribution, we simply select 12 samples with replacement from the 12 original samples, retaining the original species labels.
samp=sample(1:12,12,replace=TRUE)
bReadCounts=ReadCounts[,samp]
colnames(bReadCounts)=colnames(ReadCounts)
This produces a single sample from the multivariate null distribution. It preserves the distribution of gene expression in each gene and also the correlation structure.
We repeat this several thousand times to obtain an estimate of the multivariate distribution of any statistical test under the null hypothesis.
Notice that realistic library sizes are also maintained, since each resampled column is an actual sample with its original library size.
We could use these bootstrap samples in various ways.
For example, we could do a differential expression analysis using our favorite RNA-seq software, such as Cuffdiff, edgeR or DESeq, and then adjust our p-values or FDR estimates using these samples from the null.
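As a minimal sketch, suppose we summarize each gene by the difference in mean log2(count + 1) between the two species groups, standing in for whatever statistic our DE software would report; geneStat, obsStat, nullStat and empP are names introduced here just for illustration.
geneStat=function(counts){ # per-gene difference in mean log2 counts, human minus chimpanzee
  logC=log2(as.matrix(counts)+1)
  rowMeans(logC[,1:6])-rowMeans(logC[,7:12])
}
obsStat=geneStat(ReadCounts) # observed statistics
B=1000 # number of bootstrap samples; several thousand in practice
nullStat=replicate(B,{
  samp=sample(1:12,12,replace=TRUE) # resample all 12 columns, keeping the species labels
  geneStat(ReadCounts[,samp])
}) # a genes x B matrix of null statistics
empP=rowMeans(abs(nullStat)>=abs(obsStat)) # empirical two-sided p-value for each gene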
To generate a sample from the non-null distribution, we sample with replacement within each species. Humans are in columns 1:6 and chimpanzees are in columns 7:12.
samp=c(sample(1:6,6,replace=TRUE),sample(7:12,6,replace=TRUE))
bReadCounts=ReadCounts[,samp]
colnames(bReadCounts)=colnames(ReadCounts)
We could use these samples to obtain confidence intervals for the effect sizes for each gene.
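For example, continuing the sketch above with geneStat (the difference in mean log2 counts) as the effect size, a 95% percentile bootstrap interval for each gene could be computed as follows; bootFC and ciFC are again names chosen just for this sketch.
bootFC=replicate(B,{
  samp=c(sample(1:6,6,replace=TRUE),sample(7:12,6,replace=TRUE)) # resample within each species
  geneStat(ReadCounts[,samp])
}) # genes x B matrix of bootstrap effect sizes
ciFC=t(apply(bootFC,1,quantile,probs=c(0.025,0.975))) # 95% percentile interval per gene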
Another way to generate synthetic data is to add noise to the observed counts: we generate Poisson random counts using the observed counts as the means.
The code below subtracts 0.3 from each count and replaces the resulting negative values (the original zeroes) by 0.25, so that every zero becomes a small positive Poisson mean. We also need to store the data as a matrix instead of a dataframe, and to compute the number of entries in the matrix, which we call nn.
ReadCountsAdj=as.matrix(ReadCounts)
ReadCountsAdj=ReadCountsAdj-.3
ReadCountsAdj[ReadCountsAdj<0]=0.25
nn=nrow(ReadCountsAdj)*ncol(ReadCountsAdj)
We now have a matrix of Poisson means. The Poisson random number generator can generate nn random Poisson counts with nn different means in a single call; we then need to reorganize these back into a data matrix.
bNoisyReads=matrix(rpois(nn,ReadCountsAdj),ncol=ncol(ReadCountsAdj))
Let's have a look at how these compare to the actual read counts and the adjusted read counts.
ReadCounts[1:5,1:5] # actual data
## HS1 HS2 HS3 HS4 HS5
## ENSG00000000003 121 329 421 300 359
## ENSG00000000005 0 0 1 0 0
## ENSG00000000419 39 81 78 61 62
## ENSG00000000457 114 91 73 62 95
## ENSG00000000460 15 6 7 17 17
ReadCountsAdj[1:5,1:5] # estimated Poisson means
## HS1 HS2 HS3 HS4 HS5
## ENSG00000000003 120.70 328.70 420.7 299.70 358.70
## ENSG00000000005 0.25 0.25 0.7 0.25 0.25
## ENSG00000000419 38.70 80.70 77.7 60.70 61.70
## ENSG00000000457 113.70 90.70 72.7 61.70 94.70
## ENSG00000000460 14.70 5.70 6.7 16.70 16.70
bNoisyReads[1:5,1:5] # synthetic data
## [,1] [,2] [,3] [,4] [,5]
## [1,] 128 294 440 288 352
## [2,] 0 0 0 1 0
## [3,] 54 84 87 61 51
## [4,] 132 100 88 58 93
## [5,] 8 4 8 14 18
Finally, let's have a look at what happened to the library sizes:
colSums(ReadCounts)
## HS1 HS2 HS3 HS4 HS5 HS6 PT1 PT2 PT3
## 2843800 3384520 4384348 3420649 3798676 2843405 5344280 3175223 3791033
## PT4 PT5 PT6
## 3553779 3651618 3548120
colSums(bNoisyReads)
## [1] 2839979 3380886 4380614 3415833 3793699 2841290 5339896 3171835
## [9] 3790150 3553355 3647385 3546851
We see that we have done a fairly good job of reproducing at least one feature of the data (library size) even though we added noise.
By generating many such bootstrap samples, we could, for example, create confidence intervals for quantities estimated from the data.
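As one illustration, again using geneStat from the earlier sketch as the quantity of interest, we could regenerate the Poisson noise many times and take percentile intervals; bootNoisy and ciNoisy are names chosen just for this sketch.
bootNoisy=replicate(B,{
  b=matrix(rpois(nn,ReadCountsAdj),ncol=ncol(ReadCountsAdj)) # a fresh noisy data set
  geneStat(b)
})
ciNoisy=t(apply(bootNoisy,1,quantile,probs=c(0.025,0.975))) # 95% percentile interval per gene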
We next consider a parametric bootstrap based on the proportion of reads in each gene. To generate data from the null distribution, we will estimate a mean and SD of the proportions for each gene.
We start by replacing the zeroes by 0.25.
ReadCounts0=as.matrix(ReadCounts)
ReadCounts0[ReadCounts0==0]=.25
We also need the library sizes, which are the total reads for each sample (column).
libSizes=colSums(ReadCounts0)
Next we convert all the counts to proportions within each column:
ReadProp=ReadCounts0
for (i in 1:ncol(ReadCounts0)) ReadProp[,i]=ReadCounts0[,i]/libSizes[i]
To generate a parametric bootstrap sample under the null distribution, we assume that all the proportions for a single gene (row) come from the same distribution.
means=rowMeans(ReadProp)
SDs=apply(ReadProp,1,sd)
These data have an unusual feature: a few genes make up a large percentage of the reads, and these genes also have very large SDs.
oGenes=order(means,decreasing=TRUE)
median(means)
## [1] 1.129729e-05
head(means[oGenes[1:10]])
## ENSG00000163631 ENSG00000197249 ENSG00000171564 ENSG00000118137
## 0.09309684 0.03823129 0.02379628 0.01872519
## ENSG00000171557 ENSG00000171560
## 0.01758394 0.01711802
median(SDs)
## [1] 4.625687e-06
head(SDs[oGenes[1:10]])
## ENSG00000163631 ENSG00000197249 ENSG00000171564 ENSG00000118137
## 0.044706895 0.012728878 0.012559645 0.004976314
## ENSG00000171557 ENSG00000171560
## 0.009499684 0.009249042
While it might make sense to transform the means and SDs of the proportions into the parameters of a distribution such as the Beta, which generates data between 0 and 1, this did not work well for these data, as it created distributions for some genes that essentially took only one value. Instead, we add Normal noise, scaled by the SD, to the mean proportion. After scaling up by the library size to obtain a mean count, we replace all the means that are 0.25 or less by 0.25.
bProp=ReadProp
bMeans=bProp
for (i in 1:nrow(ReadProp)) bProp[i,]=rnorm(ncol(ReadProp),means[i],SDs[i])
for (i in 1:ncol(bProp)) bMeans[,i]=bProp[,i]*libSizes[i]
bMeans[bMeans<=0.25]=0.25
Finally, we generate our samples as Poisson with our generated mean counts. (Note that it is important that bMeans is a matrix and not a dataframe.)
bReads=matrix(rpois(nn,bMeans),ncol=ncol(bMeans))
colnames(bReads)=colnames(ReadCounts)
We can check to see if the synthetic data resemble the observed data.
head(ReadCounts)
## HS1 HS2 HS3 HS4 HS5 HS6 PT1 PT2 PT3 PT4 PT5 PT6
## ENSG00000000003 121 329 421 300 359 168 574 409 429 685 386 428
## ENSG00000000005 0 0 1 0 0 0 1 0 4 1 1 1
## ENSG00000000419 39 81 78 61 62 56 100 59 66 58 65 93
## ENSG00000000457 114 91 73 62 95 76 131 274 229 239 87 149
## ENSG00000000460 15 6 7 17 17 12 8 12 8 7 5 10
## ENSG00000000938 73 44 43 65 65 210 84 198 104 31 76 58
head(bReads)
## HS1 HS2 HS3 HS4 HS5 HS6 PT1 PT2 PT3 PT4 PT5 PT6
## [1,] 160 519 665 470 321 192 520 218 408 316 239 211
## [2,] 2 0 1 1 0 0 4 0 1 0 3 1
## [3,] 50 68 102 56 50 54 87 57 79 82 71 63
## [4,] 102 238 216 110 263 159 106 96 202 64 132 21
## [5,] 6 24 5 22 5 5 14 16 5 5 1 0
## [6,] 69 68 112 51 0 158 211 9 84 0 31 110
To generate the null distribution for any statistic, we then generate many (say 1000) of these samples.
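One way to organize this is to wrap the generation steps above into a function returning one synthetic data set, and keep only the statistic of interest (here geneStat from the earlier sketch) for each of the 1000 samples. The function below is a vectorized sketch of the row-by-row loop above; genNullReads and nullParam are names chosen just for illustration.
genNullReads=function(){
  # Normal noise on the proportions; recycling means and SDs (one per gene)
  # down the columns matches the row-by-row loop above
  bP=matrix(rnorm(nn,means,SDs),nrow=nrow(ReadProp))
  bM=sweep(bP,2,libSizes,"*") # scale up by the library sizes
  bM[bM<=0.25]=0.25           # truncate small or negative means
  matrix(rpois(nn,bM),ncol=ncol(bM))
}
nullParam=replicate(1000,geneStat(genNullReads())) # genes x 1000 matrix of null statistics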
If we want to generate data under the non-null distribution, we proceed in exactly the same way except that we compute the mean and SD of each gene separately for each condition.
meansHS=rowMeans(ReadProp[,1:6])
meansPT=rowMeans(ReadProp[,7:12])
SDsHS=apply(ReadProp[,1:6],1,sd)
SDsPT=apply(ReadProp[,7:12],1,sd)
We then generate the read proportions and read count means in the same way, but using the condition-specific means and SDs.
bPropNon=ReadProp
bMeansNon=bPropNon
for (i in 1:nrow(ReadProp)) bPropNon[i,]=c(rnorm(6,meansHS[i],SDsHS[i]),rnorm(6,meansPT[i],SDsPT[i]))
for (i in 1:ncol(bPropNon)) bMeansNon[,i]=bPropNon[,i]*libSizes[i]
bMeansNon[bMeansNon<=0.25]=0.25
Finally, we generate our samples as Poisson with our generated mean counts. (Note that it is important that bMeansNon is a matrix and not a dataframe.)
bReadsNon=matrix(rpois(nn,bMeansNon),ncol=ncol(bMeansNon))
colnames(bReadsNon)=colnames(ReadCounts)
We can check to see if the synthetic data resemble the observed data.
head(ReadCounts)
## HS1 HS2 HS3 HS4 HS5 HS6 PT1 PT2 PT3 PT4 PT5 PT6
## ENSG00000000003 121 329 421 300 359 168 574 409 429 685 386 428
## ENSG00000000005 0 0 1 0 0 0 1 0 4 1 1 1
## ENSG00000000419 39 81 78 61 62 56 100 59 66 58 65 93
## ENSG00000000457 114 91 73 62 95 76 131 274 229 239 87 149
## ENSG00000000460 15 6 7 17 17 12 8 12 8 7 5 10
## ENSG00000000938 73 44 43 65 65 210 84 198 104 31 76 58
head(bReadsNon)
## HS1 HS2 HS3 HS4 HS5 HS6 PT1 PT2 PT3 PT4 PT5 PT6
## [1,] 143 163 440 241 256 123 647 354 460 251 562 392
## [2,] 0 0 1 0 0 1 4 2 0 1 3 0
## [3,] 54 69 96 65 94 64 80 53 53 86 74 50
## [4,] 104 81 131 100 115 84 184 167 271 302 289 194
## [5,] 4 18 25 25 12 7 15 6 19 2 12 11
## [6,] 7 13 290 73 240 163 146 75 37 37 144 65
As always, we end with session information.
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.4 tools_3.2.2 htmltools_0.3.5
## [5] yaml_2.1.13 Rcpp_0.12.5 stringi_1.0-1 rmarkdown_0.9.6
## [9] knitr_1.13 stringr_1.0.0 digest_0.6.9 evaluate_0.9