How to obtain a list of genes with significance information that can be used for identifying gene-active networks? @ dnet 1.0.11

      # Prior to this question, please allow us to talk about the procedure of how to identify gene-active networks/subgraphs. The dnet package takes as input a list of genes with the significance information (p-values or FDR), and superposes these genes onto a gene interaction network. From this network with imposed node information, dnet sets up the pipeline to search for a maximum-scoring subgraph. Given the threshold of tolerable p-value, it gives positive scores for nodes with p-values below the threshold (nodes of interest), and negative scores for nodes with threshold-above p-values (intolerable). After score transformation, the search for a maximum scoring subgraph is deduced to find the connected subgraph that is enriched with positive-score nodes, allowing for a few negative-score nodes as linkers. This objective is met through minimum spanning tree finding and post-processing. The solver is deterministic, only determined by the given tolerable p-value threshold. For identification of the subgraph with a desired number of nodes, an iterative procedure is also developed to fine-tune tolerable thresholds. This pipeline procedure is implemented in the function dNetPipeline.

# The calculation of gene significance depends on the nature of input data (especially samples under consideration). It can be: 1) samples that are grouped, 2) samples collected as a time series, and 3) samples in a clinical setting (with survival data).

# For grouped samples, the the p-value significance for genes can be calculated using the package 'limma' which is widely used for idnetification of differentially expressed genes. The example for this can be found in the Demo for multilayer replication timing dataset page.

# For time-course samples, the dnet package has developed a singular value decomposition-based method to obtain gene significance (see dSVDsignif). The example for this can be found in the Demo for time-course expression dataset page.

# For survival samples, survival analysis should be applied using Cox proportional hazards model, from which Cox hazard ratio and associated p-value for each gene can be calculated. This can be done via the 'survival' package, and is demonstrated in the Demo for TCGA mutation and survival dataset page.

###########################################################################
# Below we will use human embryo dataset from Fang et al to show how to calculate the gene signficance. This involves six successive developmental stages (S9-S14) with three replicates (R1-R3) for each stage, including:
## Fang: an expression matrix of 5,441 genes/probesets X 18 samples;
## Fang.geneinfo: a matrix of 5,441 X 3 containing gene/probeset information;
## Fang.sampleinfo: a matrix of 18 X 3 containing sample information.
###########################################################################

## import data containing three variables ('Fang', 'Fang.geneinfo' and 'Fang.sampleinfo')
data(Fang)
data <- Fang
fdata <- as.data.frame(Fang.geneinfo[,2:3])
rownames(fdata) <- Fang.geneinfo[,1]
pdata <- as.data.frame(Fang.sampleinfo[,2:3])
rownames(pdata) <- Fang.sampleinfo[,1]

## for the ease of managing this dataset, we create the 'eset' object:
colmatch <- match(rownames(pdata),colnames(data))
rowmatch <- match(rownames(fdata),rownames(data))
data <- data[rowmatch,colmatch]
eset <- new("ExpressionSet", exprs=as.matrix(data), phenoData=as(pdata,"AnnotatedDataFrame"), featureData=as(fdata,"AnnotatedDataFrame"))

## Since this dataset is generated by Affymetrix, so data at the probeset level should be converted to data at the gene level. For this, we define a function 'prob2gene' to convert probeset-centric to entrezgene-centric expression levels:
prob2gene <- function(eset){
    fdat <- fData(eset)
    tmp <- as.matrix(unique(fdat[c("EntrezGene", "Symbol")]))
    forder <- tmp[order(as.numeric(tmp[,1])),]
    forder <- forder[!is.na(forder[,1]),]
    rownames(forder) <- forder[,2]
    system.time({
        dat <- exprs(eset)
        edat <- matrix(data=NA, nrow=nrow(forder), ncol=ncol(dat))
        for (i in 1:nrow(forder)){
            ind <- which(fdat$EntrezGene==as.numeric(forder[i,1]))
            if (length(ind) == 1){
                edat[i,] <- dat[ind,]
            }else{
                edat[i,] <- apply(dat[ind,],2,mean)
            }
        }
    })
    
    rownames(edat) <- rownames(forder) # as gene symbols
    colnames(edat) <- rownames(pData(eset))
    esetGene <- new('ExpressionSet',exprs=data.frame(edat),phenoData=as(pData(eset),"AnnotatedDataFrame"),featureData=as(data.frame(forder),"AnnotatedDataFrame"))
    return(esetGene)
}
## 'esetGene' stores the gene-level expression data
esetGene <- prob2gene(eset)
esetGene

ExpressionSet (storageMode: lockedEnvironment)
assayData: 4139 features, 18 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: S9_R1 S9_R2 ... S14_R3 (18 total)
  varLabels: Stage Replicate
  varMetadata: labelDescription
featureData
  featureNames: A2M AADAC ... LOC100131613 (4139 total)
  fvarLabels: EntrezGene Symbol
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:  


## Next, we use 'limma' package to calcluate gene signficance according to two sample groups (S10 versus S9).
### define the design matrix in a order manner
all <- as.vector(pData(esetGene)$Stage)
level <- levels(factor(all))
index_level <- sapply(level, function(x) which(all==x)[1])
level_sorted <- all[sort(index_level, decreasing=F)]
design <- sapply(level_sorted, function(x) as.numeric(all==x)) # Convert a factor column to multiple boolean columns
### a linear model is fitted for every gene by the function lmFit
fit <- lmFit(exprs(esetGene), design)
### define a contrast matrix
contrasts <- dContrast(level_sorted, contrast.type="sequential")
contrast.matrix <- makeContrasts(contrasts=contrasts$each, levels=design)
colnames(contrast.matrix) <- contrasts$name
### computes moderated t-statistics and log-odds of differential expression by empirical Bayes shrinkage of the standard errors towards a common value
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
### for p-value
pvals <- as.matrix(fit2$p.value)
### for adjusted p-value
adjpvals <- sapply(1:ncol(pvals),function(x) {
    p.adjust(pvals[,x], method="BH")
})
colnames(adjpvals) <- colnames(pvals)
### num of differentially expressed genes
apply(adjpvals<1e-2, 2, sum)

 S10_S9 S11_S10 S12_S11 S13_S12 S14_S13 
    994     215     165     204     330 

### extract gene signficance according to two sample groups (S10 versus S9):
pvalue_S10_S9 <- pvals[,1]
hist(pvalue_S10_S9, 100)



## In addition to pair-wise comparisons, we also use this dataset for calculating gene signficance across the time course.
D <- as.matrix(exprs(esetGene))
D <- D - as.matrix(apply(D,1,mean),ncol=1)[,rep(1,ncol(D))]
fdr <- dSVDsignif(data=D, num.eigen=NULL, pval.eigen=1e-2, signif="fdr", orient.permutation="row", num.permutation=100, fdr.procedure="stepup", verbose=T)

Start at 2017-03-27 20:06:51

First, singular value decomposition of the input matrix (with 4139 rows and 18 columns)...
Second, determinate the eigens...
	via automatically deciding on the number of dominant eigens under the cutoff of 1.00e-02 pvalue
	number of the eigens in consideration: 2
Third, construct the gene-specific projection vector,and calculate distance statistics...
Finally, obtain gene significance (fdr) based on 100 permutations...
	doing row-wise permutations...
	estimating fdr...
1 (out of 4139) at 2017-03-27 20:07:02
414 (out of 4139) at 2017-03-27 20:07:09
828 (out of 4139) at 2017-03-27 20:07:17
1242 (out of 4139) at 2017-03-27 20:07:25
1656 (out of 4139) at 2017-03-27 20:07:33
2070 (out of 4139) at 2017-03-27 20:07:41
2484 (out of 4139) at 2017-03-27 20:07:48
2898 (out of 4139) at 2017-03-27 20:07:55
3312 (out of 4139) at 2017-03-27 20:08:02
3726 (out of 4139) at 2017-03-27 20:08:08
4139 (out of 4139) at 2017-03-27 20:08:14
	using stepup procedure...

Finish at 2017-03-27 20:08:14
Runtime in total is: 83 secs

hist(fdr, 100)
Source faq

FAQ3.r
How to obtain a list of genes with significance information that can be used for identifying gene-active networks?

Source faq

Functions used in this FAQ