How to obtain a list of genes with significance information that can be used for identifying gene-active networks?

Notes:
  • All results are based on dnet (version 1.0.11).
  • R scripts (i.e. R expressions) plus necessary comments are highlighted in light-cyan background, and the rest are outputs in the screen.
  • Images displayed below may be distorted, but should be normal in your screen.
  • Functions contained in dnet 1.0.11 are hyperlinked in-place and also listed on the right side.
  • Key texts are underlined, in bold and in pumpkin-orange color.
  •       
    # Prior to this question, please allow us to talk about the procedure of how to identify gene-active networks/subgraphs. The dnet package takes as input a list of genes with the significance information (p-values or FDR), and superposes these genes onto a gene interaction network. From this network with imposed node information, dnet sets up the pipeline to search for a maximum-scoring subgraph. Given the threshold of tolerable p-value, it gives positive scores for nodes with p-values below the threshold (nodes of interest), and negative scores for nodes with threshold-above p-values (intolerable). After score transformation, the search for a maximum scoring subgraph is deduced to find the connected subgraph that is enriched with positive-score nodes, allowing for a few negative-score nodes as linkers. This objective is met through minimum spanning tree finding and post-processing. The solver is deterministic, only determined by the given tolerable p-value threshold. For identification of the subgraph with a desired number of nodes, an iterative procedure is also developed to fine-tune tolerable thresholds. This pipeline procedure is implemented in the function dNetPipeline. # The calculation of gene significance depends on the nature of input data (especially samples under consideration). It can be: 1) samples that are grouped, 2) samples collected as a time series, and 3) samples in a clinical setting (with survival data). # For grouped samples, the the p-value significance for genes can be calculated using the package 'limma' which is widely used for idnetification of differentially expressed genes. The example for this can be found in the Demo for multilayer replication timing dataset page. # For time-course samples, the dnet package has developed a singular value decomposition-based method to obtain gene significance (see dSVDsignif). The example for this can be found in the Demo for time-course expression dataset page. # For survival samples, survival analysis should be applied using Cox proportional hazards model, from which Cox hazard ratio and associated p-value for each gene can be calculated. This can be done via the 'survival' package, and is demonstrated in the Demo for TCGA mutation and survival dataset page. ########################################################################### # Below we will use human embryo dataset from Fang et al to show how to calculate the gene signficance. This involves six successive developmental stages (S9-S14) with three replicates (R1-R3) for each stage, including: ## Fang: an expression matrix of 5,441 genes/probesets X 18 samples; ## Fang.geneinfo: a matrix of 5,441 X 3 containing gene/probeset information; ## Fang.sampleinfo: a matrix of 18 X 3 containing sample information. ########################################################################### ## import data containing three variables ('Fang', 'Fang.geneinfo' and 'Fang.sampleinfo') data(Fang) data <- Fang fdata <- as.data.frame(Fang.geneinfo[,2:3]) rownames(fdata) <- Fang.geneinfo[,1] pdata <- as.data.frame(Fang.sampleinfo[,2:3]) rownames(pdata) <- Fang.sampleinfo[,1] ## for the ease of managing this dataset, we create the 'eset' object: colmatch <- match(rownames(pdata),colnames(data)) rowmatch <- match(rownames(fdata),rownames(data)) data <- data[rowmatch,colmatch] eset <- new("ExpressionSet", exprs=as.matrix(data), phenoData=as(pdata,"AnnotatedDataFrame"), featureData=as(fdata,"AnnotatedDataFrame")) ## Since this dataset is generated by Affymetrix, so data at the probeset level should be converted to data at the gene level. For this, we define a function 'prob2gene' to convert probeset-centric to entrezgene-centric expression levels: prob2gene <- function(eset){ fdat <- fData(eset) tmp <- as.matrix(unique(fdat[c("EntrezGene", "Symbol")])) forder <- tmp[order(as.numeric(tmp[,1])),] forder <- forder[!is.na(forder[,1]),] rownames(forder) <- forder[,2] system.time({ dat <- exprs(eset) edat <- matrix(data=NA, nrow=nrow(forder), ncol=ncol(dat)) for (i in 1:nrow(forder)){ ind <- which(fdat$EntrezGene==as.numeric(forder[i,1])) if (length(ind) == 1){ edat[i,] <- dat[ind,] }else{ edat[i,] <- apply(dat[ind,],2,mean) } } }) rownames(edat) <- rownames(forder) # as gene symbols colnames(edat) <- rownames(pData(eset)) esetGene <- new('ExpressionSet',exprs=data.frame(edat),phenoData=as(pData(eset),"AnnotatedDataFrame"),featureData=as(data.frame(forder),"AnnotatedDataFrame")) return(esetGene) } ## 'esetGene' stores the gene-level expression data esetGene <- prob2gene(eset) esetGene
    ExpressionSet (storageMode: lockedEnvironment) assayData: 4139 features, 18 samples element names: exprs protocolData: none phenoData sampleNames: S9_R1 S9_R2 ... S14_R3 (18 total) varLabels: Stage Replicate varMetadata: labelDescription featureData featureNames: A2M AADAC ... LOC100131613 (4139 total) fvarLabels: EntrezGene Symbol fvarMetadata: labelDescription experimentData: use 'experimentData(object)' Annotation:
    ## Next, we use 'limma' package to calcluate gene signficance according to two sample groups (S10 versus S9). ### define the design matrix in a order manner all <- as.vector(pData(esetGene)$Stage) level <- levels(factor(all)) index_level <- sapply(level, function(x) which(all==x)[1]) level_sorted <- all[sort(index_level, decreasing=F)] design <- sapply(level_sorted, function(x) as.numeric(all==x)) # Convert a factor column to multiple boolean columns ### a linear model is fitted for every gene by the function lmFit fit <- lmFit(exprs(esetGene), design) ### define a contrast matrix contrasts <- dContrast(level_sorted, contrast.type="sequential") contrast.matrix <- makeContrasts(contrasts=contrasts$each, levels=design) colnames(contrast.matrix) <- contrasts$name ### computes moderated t-statistics and log-odds of differential expression by empirical Bayes shrinkage of the standard errors towards a common value fit2 <- contrasts.fit(fit, contrast.matrix) fit2 <- eBayes(fit2) ### for p-value pvals <- as.matrix(fit2$p.value) ### for adjusted p-value adjpvals <- sapply(1:ncol(pvals),function(x) { p.adjust(pvals[,x], method="BH") }) colnames(adjpvals) <- colnames(pvals) ### num of differentially expressed genes apply(adjpvals<1e-2, 2, sum)
    S10_S9 S11_S10 S12_S11 S13_S12 S14_S13 994 215 165 204 330
    ### extract gene signficance according to two sample groups (S10 versus S9): pvalue_S10_S9 <- pvals[,1] hist(pvalue_S10_S9, 100)
    ## In addition to pair-wise comparisons, we also use this dataset for calculating gene signficance across the time course. D <- as.matrix(exprs(esetGene)) D <- D - as.matrix(apply(D,1,mean),ncol=1)[,rep(1,ncol(D))] fdr <- dSVDsignif(data=D, num.eigen=NULL, pval.eigen=1e-2, signif="fdr", orient.permutation="row", num.permutation=100, fdr.procedure="stepup", verbose=T)
    Start at 2017-03-27 20:06:51 First, singular value decomposition of the input matrix (with 4139 rows and 18 columns)... Second, determinate the eigens... via automatically deciding on the number of dominant eigens under the cutoff of 1.00e-02 pvalue number of the eigens in consideration: 2 Third, construct the gene-specific projection vector,and calculate distance statistics... Finally, obtain gene significance (fdr) based on 100 permutations... doing row-wise permutations... estimating fdr... 1 (out of 4139) at 2017-03-27 20:07:02 414 (out of 4139) at 2017-03-27 20:07:09 828 (out of 4139) at 2017-03-27 20:07:17 1242 (out of 4139) at 2017-03-27 20:07:25 1656 (out of 4139) at 2017-03-27 20:07:33 2070 (out of 4139) at 2017-03-27 20:07:41 2484 (out of 4139) at 2017-03-27 20:07:48 2898 (out of 4139) at 2017-03-27 20:07:55 3312 (out of 4139) at 2017-03-27 20:08:02 3726 (out of 4139) at 2017-03-27 20:08:08 4139 (out of 4139) at 2017-03-27 20:08:14 using stepup procedure... Finish at 2017-03-27 20:08:14 Runtime in total is: 83 secs
    hist(fdr, 100)

    Source faq

    FAQ3.r

    Functions used in this FAQ