We start by generating 20 random normals to represent the phenotypic scores of 20 samples and saving them in a variable called pheno.
nsamp=20
pheno=rnorm(nsamp)
Then we generate 1000 gene expressions. Each has a value for each of the 20 samples. Typically gene expression values are stored in a matrix with genes in the rows and samples in the columns. This matrix is therefore 1000 by 20.
ngene=1000
genes=matrix(rnorm(nsamp*ngene),ncol=nsamp)
Next we compute the correlation between the phenotypic score and gene expression. The R cor command computes the correlation between the columns of two matrices.
Since the genes are in the rows, we need to transpose the gene expression matrix to compute the correlations.
cors=cor(t(genes),pheno)
To visualize the correlations, we draw a histogram. Since we have 1000 correlations, we can have quite a few bins - I selected 50 but R adjusts this based on some algorithm.
hist(cors,nclass=50,xlab="Correlations",main="Correlation between Gene Expression and Phenotype")
Next we select the 10 genes with the largest absolute correlation. These are the genes the are individually most predictive of phenotypic score (in this data set).
nsel=10
selgenes=order(cors,decreasing=FALSE)[1:nsel]
sigGenes=genes[selgenes,]
Finally, we do the linear regression on the fitted genes and compute the \(R^2\).
regout=lm(pheno~t(sigGenes))
R2=var(fitted(regout))/var(pheno)
anova(regout)
## Analysis of Variance Table
##
## Response: pheno
## Df Sum Sq Mean Sq F value Pr(>F)
## t(sigGenes) 10 19.6963 1.96963 13.757 0.0002724 ***
## Residuals 9 1.2886 0.14317
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R2
## [1] 0.9385954