---
title: "Simulation 1"
author: "Naomi Altman"
date: "May 30, 2016"
output: html_document
---
We start by generating 20 random normals to represent the phenotypic scores of 20 samples and saving them in a variable called *pheno*.
```{r phenotype}
nsamp=20
pheno=rnorm(nsamp)
```
Then we generate 1000 gene expressions. Each has a value for each of the 20 samples. Typically gene expression values are stored in a matrix with genes in the rows and samples in the columns. This matrix is therefore 1000 by 20.
```{r genes}
ngene=1000
genes=matrix(rnorm(nsamp*ngene),ncol=nsamp)
```
Next we compute the correlation between the phenotypic score and gene expression. The R *cor* command computes the correlation between the columns of two matrices.
Since the genes are in the rows, we need to transpose the gene expression matrix to compute the correlations.
```{r correlation}
cors=cor(t(genes),pheno)
```
To visualize the correlations, we draw a histogram. Since we have 1000 correlations, we can have quite a few bins - I selected 50 but R adjusts this based on some algorithm.
```{r histCor}
hist(cors,nclass=50,xlab="Correlations",main="Correlation between Gene Expression and Phenotype")
```
Next we select the 10 genes with the largest absolute correlation. These are the genes the are individually most predictive of phenotypic score (in this data set).
```{r select}
nsel=10
selgenes=order(cors,decreasing=FALSE)[1:nsel]
sigGenes=genes[selgenes,]
```
Finally, we do the linear regression on the fitted genes and compute the $R^2$.
```{r regression}
regout=lm(pheno~t(sigGenes))
R2=var(fitted(regout))/var(pheno)
anova(regout)
R2
```