dGSEA
is supposed to conduct gene set enrichment analysis given
the input data and the ontology in query. It returns an object of class
"eTerm".
dGSEA(data, identity = c("symbol", "entrez"), check.symbol.identity = FALSE, genome = c("Hs", "Mm", "Rn", "Gg", "Ce", "Dm", "Da", "At"), ontology = c("GOBP", "GOMF", "GOCC", "PS", "PS2", "SF", "DO", "HPPA", "HPMI", "HPCM", "HPMA", "MP", "MsigdbH", "MsigdbC1", "MsigdbC2CGP", "MsigdbC2CP", "MsigdbC2KEGG", "MsigdbC2REACTOME", "MsigdbC2BIOCARTA", "MsigdbC3TFT", "MsigdbC3MIR", "MsigdbC4CGN", "MsigdbC4CM", "MsigdbC5BP", "MsigdbC5MF", "MsigdbC5CC", "MsigdbC6", "MsigdbC7", "DGIdb", "Customised"), customised.genesets = NULL, sizeRange = c(10, 20000), which_distance = NULL, weight = 1, nperm = 1000, fast = T, sigTail = c("two-tails", "one-tail"), p.adjust.method = c("BH", "BY", "bonferroni", "holm", "hochberg", "hommel"), verbose = T, RData.location = "https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7")
RData.location="."
. Surely, the location can be anywhere as long
as the user provides the correct path pointing to (otherwise, the
script will have to remotely download each time). Here is the UNIX
command for downloading all RData files (preserving the directory
structure): wget -r -l2 -A "*.RData" -np -nH --cut-dirs=0
"http://dnet.r-forge.r-project.org/RData"
an object of class "eTerm", a list with following components:
set_info
: a matrix of nSet X 4 containing gene set
information, where nSet is the number of gene set in consideration, and
the 4 columns are "setID" (i.e. "Term ID"), "name" (i.e. "Term Name"),
"namespace" and "distance"
gs
: a list of gene sets, each storing gene members.
Always, gene sets are identified by "setID" and gene members identified
by "Entrez ID"
data
: a matrix of nGene X nSample containing input data in
consideration. It is not always the same as the input data as only
those mappable are retained
es
: a matrix of nSet X nSample containing enrichment
score, where nSample is the number of samples (i.e. the number of
columns in input data
nes
: a matrix of nSet X nSample containing normalised
enrichment score. It is the version of enrichment score but after being
normalised by gene set size
pvalue
: a matrix of nSet X nSample containing nominal p
value
adjp
: a matrix of nSet X nSample containing adjusted p
value. It is the p value but after being adjusted for multiple
comparisons
gadjp
: a matrix of nSet X nSample containing globally
adjusted p value in terms of all samples
fdr
: a matrix of nSet X nSample containing false discovery
rate (FDR). It is the estimated probability that the normalised
enrichment score represents a false positive finding
qvalue
: a matrix of nSet X nSample containing q value. It
is the monotunically increasing FDR
weight
: the input type of score weight
call
: the call that produced this result
The interpretation of returned components:
# load data library(Biobase) TCGA_mutations <-dRDataLoader(RData='TCGA_mutations')'TCGA_mutations' (from package 'dnet' version 1.1.2) has been loaded into the working environment (at 2018-01-19 12:35:34)# gene set enrichment analysis (GSEA) using KEGG pathways ## calculate the total mutations for each gene tol <- apply(exprs(TCGA_mutations), 1, sum) data <- data.frame(tol=tol) eTerm <- dGSEA(data, identity="symbol", genome="Hs", ontology="MsigdbC2KEGG")Start at 2018-01-19 12:35:36 First, load the ontology MsigdbC2KEGG and its gene associations in the genome Hs (2018-01-19 12:35:36) ... 'org.Hs.eg' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.eg.RData?raw=true) has been loaded into the working environment (at 2018-01-19 12:35:37) 'org.Hs.egMsigdbC2KEGG' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egMsigdbC2KEGG.RData?raw=true) has been loaded into the working environment (at 2018-01-19 12:35:39) Then, do mapping based on symbol (2018-01-19 12:35:39) ... Among 19420 symbols of input data, there are 19038 mappable via official gene symbols but 382 left unmappable Third, perform GSEA analysis based on 1000 permutations for 186 gene sets (2018-01-19 12:35:43) ... Sample 1 is being processed at (2018-01-19 12:35:43) ... 10 of 186 gene sets have been processed (2018-01-19 12:35:45) ... 20 of 186 gene sets have been processed (2018-01-19 12:35:48) ... 30 of 186 gene sets have been processed (2018-01-19 12:35:50) ... 40 of 186 gene sets have been processed (2018-01-19 12:35:52) ... 50 of 186 gene sets have been processed (2018-01-19 12:35:54) ... 60 of 186 gene sets have been processed (2018-01-19 12:35:57) ... 70 of 186 gene sets have been processed (2018-01-19 12:35:59) ... 80 of 186 gene sets have been processed (2018-01-19 12:36:02) ... 90 of 186 gene sets have been processed (2018-01-19 12:36:04) ... 100 of 186 gene sets have been processed (2018-01-19 12:36:06) ... 110 of 186 gene sets have been processed (2018-01-19 12:36:08) ... 120 of 186 gene sets have been processed (2018-01-19 12:36:11) ... 130 of 186 gene sets have been processed (2018-01-19 12:36:13) ... 140 of 186 gene sets have been processed (2018-01-19 12:36:15) ... 150 of 186 gene sets have been processed (2018-01-19 12:36:17) ... 160 of 186 gene sets have been processed (2018-01-19 12:36:19) ... 170 of 186 gene sets have been processed (2018-01-19 12:36:21) ... 180 of 186 gene sets have been processed (2018-01-19 12:36:23) ... 186 of 186 gene sets have been processed (2018-01-19 12:36:25) ... End at 2018-01-19 12:36:26 Runtime in total is: 50 secsres <- dGSEAview(eTerm, which_sample=1, top_num=5, sortBy="adjp", decreasing=FALSE, details=TRUE) visGSEA(eTerm, which_sample=1, which_term=rownames(res)[1]) output <- dGSEAwrite(eTerm, which_content="gadjp", which_score="gadjp", filename="eTerm.txt")A file called eTerm.txt has been successfully written!## based on customised gene sets eTerm <- dGSEA(data, identity="symbol", genome="Hs", ontology="Customised", customised.genesets=sample(rownames(data),100))Start at 2018-01-19 12:36:27 First, load the ontology Customised and its gene associations in the genome Hs (2018-01-19 12:36:27) ... 'org.Hs.eg' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.eg.RData?raw=true) has been loaded into the working environment (at 2018-01-19 12:36:28) Among 100 symbols of input data, there are 95 mappable via official gene symbols but 5 left unmappable Then, do mapping based on symbol (2018-01-19 12:36:28) ... Among 19420 symbols of input data, there are 19038 mappable via official gene symbols but 382 left unmappable Third, perform GSEA analysis based on 1000 permutations for 1 gene sets (2018-01-19 12:36:30) ... Sample 1 is being processed at (2018-01-19 12:36:30) ... 1 of 1 gene sets have been processed (2018-01-19 12:36:30) ... End at 2018-01-19 12:36:30 Runtime in total is: 3 secsres <- dGSEAview(eTerm, which_sample=1, top_num=5, sortBy="adjp", decreasing=FALSE, details=TRUE) visGSEA(eTerm, which_sample=1, which_term=rownames(res)[1])