What are the built-in data and how are they used? @ dnet 1.0.11

# The built-in data are the backend of the analytical utilities supported in the dnet package, spanning a wide range of known gene-centric knowledge across well-studied organisms. They are provided as RData-formatted files, which are regularly updated. New knowledge is also added over time, for example upon request by users. The built-in RData files are briefly summarised on the RData page.

# Users usually do not need to download these files themselves. Instead, users are encouraged to find what they need by looking up keywords on the Documentations page. The package provides functions that either import the data or work with them directly.

# The function dRDataLoader imports the data that users want to use.

# For ease of use, the names of organism-specific data start with 'org', followed by the specific organism ('Hs' for human) and then the data content: 'eg' alone means information about Entrez Genes, and a further suffix (for example, 'GOBP') means their annotations by Gene Ontology Biological Process (GOBP).
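
# The naming convention can be sketched as a small helper. This is a minimal illustration only: build_rdata_name is a hypothetical function (not part of dnet), and it covers only the 'org.<organism>.eg<content>' names; other built-in data, such as the network 'org.Hs.string' used later, do not follow the 'eg' pattern.

```r
# Hypothetical helper (an assumption for illustration, not a dnet function):
# compose the name of a built-in RData from the organism code and the content suffix
build_rdata_name <- function(organism = "Hs", content = "") {
  paste0("org.", organism, ".eg", content)
}

build_rdata_name("Hs")          # "org.Hs.eg"
build_rdata_name("Hs", "GOBP")  # "org.Hs.egGOBP"

# The resulting name could then be passed to dRDataLoader (requires dnet and network access):
# org.Hs.egDO <- dRDataLoader(RData=build_rdata_name("Hs", "DO"))
```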

## load human Entrez Genes (EG), and list the first 3 genes
org.Hs.eg <- dRDataLoader(RData='org.Hs.eg')

'org.Hs.eg' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.eg.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:17)
org.Hs.eg$gene_info[1:3,]

  GeneID Symbol                        description chromosome map_location
1      1   A1BG             alpha-1-B glycoprotein         19      19q13.4
2      2    A2M              alpha-2-macroglobulin         12     12p13.31
3      3  A2MP1 alpha-2-macroglobulin pseudogene 1         12     12p13.31
                   Synonyms
1      A1B|ABG|GAB|HYST2477
2 A2MD|CPAMD5|FWP007|S863-7
3                      A2MP
                                                                            dbXrefs
1 MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410|HPRD:00726|Vega:OTTHUMG00000183507
2 MIM:103950|HGNC:HGNC:7|Ensembl:ENSG00000175899|HPRD:00072|Vega:OTTHUMG00000150267
3                                               HGNC:HGNC:8|Ensembl:ENSG00000256069

## load annotations of human Entrez Genes by Gene Ontology Biological Process (GOBP), inspect the content and list the first 3 terms
org.Hs.egGOBP <- dRDataLoader(RData='org.Hs.egGOBP')

'org.Hs.egGOBP' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egGOBP.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:21)
names(org.Hs.egGOBP)

[1] "gs"       "set_info"

org.Hs.egGOBP$set_info[1:3,]

                setID                             name namespace distance
GO:0000002 GO:0000002 mitochondrial genome maintenance   Process        6
GO:0000003 GO:0000003                     reproduction   Process        2
GO:0000011 GO:0000011              vacuole inheritance   Process        6

## load annotations of human Entrez Genes by Disease Ontology (DO), inspect the content and list the first 3 terms
org.Hs.egDO <- dRDataLoader(RData='org.Hs.egDO')

'org.Hs.egDO' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egDO.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:23)
org.Hs.egDO$set_info[1:3,]

                    setID                  name        namespace distance
DOID:0001816 DOID:0001816          angiosarcoma Disease_Ontology        5
DOID:0002116 DOID:0002116             pterygium Disease_Ontology        7
DOID:0014667 DOID:0014667 disease of metabolism Disease_Ontology        1

## load phylostratific age (PS) information of human Entrez Genes, inspect the content and list all our ancestors
org.Hs.egPS <- dRDataLoader(RData='org.Hs.egPS')

'org.Hs.egPS' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egPS.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:24)
org.Hs.egPS$set_info

   setID                  name    namespace   distance
3      3        2759:Eukaryota superkingdom 0.00000000
4      4    33154:Opisthokonta      no rank 0.02227541
5      5    33154:Opisthokonta      no rank 0.02677301
6      6    33154:Opisthokonta      no rank 0.03026936
7      7    33154:Opisthokonta      no rank 0.03573534
8      8    33154:Opisthokonta      no rank 0.03880849
9      9         33208:Metazoa      kingdom 0.04949159
10    10         33208:Metazoa      kingdom 0.06686750
11    11         33208:Metazoa      kingdom 0.09260898
12    12        6072:Eumetazoa      no rank 0.10459007
13    13        6072:Eumetazoa      no rank 0.11176118
14    14       33213:Bilateria      no rank 0.12058364
15    15       33213:Bilateria      no rank 0.12660301
16    16   33511:Deuterostomia      no rank 0.13884801
17    17   33511:Deuterostomia      no rank 0.14852778
18    18         7711:Chordata       phylum 0.15759842
19    19       7742:Vertebrata      no rank 0.16953129
20    20   117571:Euteleostomi      no rank 0.18295445
21    21    8287:Sarcopterygii      no rank 0.18554672
22    22       32523:Tetrapoda      no rank 0.18855901
23    23         32524:Amniota      no rank 0.19241034
24    24        40674:Mammalia        class 0.19552877
25    25          32525:Theria      no rank 0.19917128
26    26         9347:Eutheria      no rank 0.20262687
27    27 1437010:Boreoeutheria      no rank 0.20409224
29    29         9443:Primates        order 0.20521882
30    30         9443:Primates        order 0.20708817
32    32    314293:Simiiformes   infraorder 0.21351030
33    33       9526:Catarrhini    parvorder 0.21636349
34    34     314295:Hominoidea  superfamily 0.21875281
35    35        9604:Hominidae       family 0.22019688
36    36      207598:Homininae    subfamily 0.22313298
37    37     9606:Homo sapiens      species 0.23340461

## load domain superfamily (SF) information of human Entrez Genes, inspect the content and list the first 3 superfamilies
org.Hs.egSF <- dRDataLoader(RData='org.Hs.egSF')

'org.Hs.egSF' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egSF.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:26)
org.Hs.egSF$set_info[1:3,]

      setID                         name namespace distance
46458 46458                  Globin-like        sf    a.1.1
46548 46548     alpha-helical ferredoxin        sf    a.1.2
46561 46561 Ribosomal protein L29 (L29p)        sf    a.2.2

## load KEGG pathways for human Entrez Genes, inspect the content and list the first 3 pathways
org.Hs.egMsigdbC2KEGG <- dRDataLoader(RData='org.Hs.egMsigdbC2KEGG')

'org.Hs.egMsigdbC2KEGG' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egMsigdbC2KEGG.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:32)
org.Hs.egMsigdbC2KEGG$set_info[1:3,]

                                                  setID                   name
KEGG_ABC_TRANSPORTERS             KEGG_ABC_TRANSPORTERS       ABC transporters
KEGG_ACUTE_MYELOID_LEUKEMIA KEGG_ACUTE_MYELOID_LEUKEMIA Acute myeloid leukemia
KEGG_ADHERENS_JUNCTION           KEGG_ADHERENS_JUNCTION      Adherens junction
                            namespace
KEGG_ABC_TRANSPORTERS              C2
KEGG_ACUTE_MYELOID_LEUKEMIA        C2
KEGG_ADHERENS_JUNCTION             C2
                            distance
KEGG_ABC_TRANSPORTERS       The ATP-binding cassette (ABC) transporters form one of the largest known protein families, and are widespread in bacteria, archaea, and eukaryotes. They couple ATP hydrolysis to active transport of a wide variety of substrates such as ions, sugars, lipids, sterols, peptides, proteins, and drugs. The structure of a prokaryotic ABC transporter usually consists of three components; typically two integral membrane proteins each having six transmembrane segments, two peripheral proteins that bind and hydrolyze ATP, and a periplasmic (or lipoprotein) substrate-binding protein. Many of the genes for the three components form operons as in fact observed in many bacterial and archaeal genomes. On the other hand, in a typical eukaryotic ABC transporter, the membrane spanning protein and the ATP-binding protein are fused, forming a multi-domain protein with the membrane-spanning domain (MSD) and the nucleotide-binding domain (NBD).
KEGG_ACUTE_MYELOID_LEUKEMIA Two major types of genetic events are crucial for the molecular pathogenesis of acute myeloid leukemias (AML) : activating mutations of signal transduction intermediates and alterations in myeloid transcription factors governing hematopoietic differentiation. Both aberrant and constitutive activation of signal transduction molecules are found in about 50% of primary AML bone marrow samples, and seem to contribute to the increased proliferation and apoptosis resistance. The most common of these activating events were observed in the RTK Flt3, in N-Ras and K-Ras, in Kit, and sporadically in other RTKs. Specific haematopoietic transcription factors are crucial for differentiation to particular lineages during normal differentiation, but are frequently disrupted in AML. Some mechanisms of disruption involve the effect of fusion proteins that are generated by chromosomal translocations on haematopoietic transcription factors. In other cases, the transcription factors themselves are mutated.
KEGG_ADHERENS_JUNCTION      Cell-cell adherens junctions (AJs), the most common type of intercellular adhesions, are important for maintaining tissue architecture and cell polarity and can limit cell movement and proliferation. At AJs, E-cadherin serves as an essential cell adhesion molecules (CAMs). The cytoplasmic tail binds beta-catenin, which in turn binds alpha-catenin. Alpha-catenin is associated with F-actin bundles directly and indirectly. The integrity of the cadherin-catenin complex is negatively regulated by phosphorylation of beta-catenin by receptor tyrosine kinases (RTKs) and cytoplasmic tyrosine kinases (Fer, Fyn, Yes, and Src), which leads to dissociation of the cadherin-catenin complex. Integrity of this complex is positively regulated by beta -catenin phosphorylation by casein kinase II, and dephosphorylation by protein tyrosine phosphatases. Changes in the phosphorylation state of beta-catenin affect cell-cell adhesion, cell migration and the level of signaling beta-catenin. Wnt signaling acts as a positive regulator of beta-catenin by inhibiting beta-catenin degradation, which stabilizes beta-catenin, and causes its accumulation. Cadherin may acts as a negative regulator of signaling beta-catenin as it binds beta-catenin at the cell surface and thereby sequesters it from the nucleus. Nectins also function as CAMs at AJs, but are more highly concentrated at AJs than E-cadherin. Nectins transduce signals through Cdc42 and Rac, which reorganize the actin cytoskeleton, regulate the formation of AJs, and strengthen cell-cell adhesion.

## load the network for human Entrez Genes as an 'igraph' object
org.Hs.string <- dRDataLoader(RData='org.Hs.string')

'org.Hs.string' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.string.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:39)
org.Hs.string

IGRAPH UN-- 18492 728141 -- 
+ attr: name (v/c), seqid (v/c), geneid (v/n), symbol (v/c),
| description (v/c), neighborhood_score (e/n), fusion_score (e/n),
| cooccurence_score (e/n), coexpression_score (e/n), experimental_score
| (e/n), database_score (e/n), textmining_score (e/n), combined_score
| (e/n)
+ edges (vertex names):
 [1] 3025671--3031737 3021358--3027795 3021358--3027929 3021358--3027741
 [5] 3024278--3029186 3029373--3031543 3031543--3034385 3019006--3030823
 [9] 3015391--3030823 3021191--3028634 3021191--3024550 3021191--3033402
[13] 3031876--3031959 3015324--3031959 3023108--3033954 3016546--3020273
+ ... omitted several edges

## This network is extracted from the STRING database. Only associations with at least medium confidence (score>=400) are retained. Users can further restrict the network to high-confidence edges (for example, score>=700)
network <- igraph::subgraph.edges(org.Hs.string, eids=E(org.Hs.string)[combined_score>=700])
network

IGRAPH UN-- 15341 316170 -- 
+ attr: name (v/c), seqid (v/c), geneid (v/n), symbol (v/c),
| description (v/c), neighborhood_score (e/n), fusion_score (e/n),
| cooccurence_score (e/n), coexpression_score (e/n), experimental_score
| (e/n), database_score (e/n), textmining_score (e/n), combined_score
| (e/n)
+ edges (vertex names):
 [1] 3017550--3023854 3023931--3028317 3019304--3028317 3028317--3033319
 [5] 3023602--3028317 3014709--3028317 3024678--3026825 3023468--3030905
 [9] 3026117--3030905 3026845--3029085 3017265--3027473 3015527--3033837
[13] 3019960--3033973 3021862--3033174 3015979--3025568 3015355--3025568
+ ... omitted several edges
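
# The filtered network above has fewer vertices as well as fewer edges (15341 versus 18492 vertices). This is because igraph::subgraph.edges by default also drops vertices left without any retained edge. A minimal sketch on a toy graph illustrates this; the vertex names and scores are made up for illustration, and only the igraph package is assumed to be installed.

```r
library(igraph)

# Toy edge list (made-up scores); extra columns become edge attributes
edges <- data.frame(
  from = c("a", "b", "c"),
  to = c("b", "c", "d"),
  combined_score = c(450, 720, 900)
)
toy <- graph_from_data_frame(edges, directed = FALSE)

# Keep only edges with combined_score >= 700, as done for org.Hs.string
high_conf <- subgraph.edges(toy, eids = E(toy)[combined_score >= 700])

vcount(toy); ecount(toy)              # 4 vertices, 3 edges
vcount(high_conf); ecount(high_conf)  # 3 vertices, 2 edges: 'a' is dropped
```

# To keep all vertices regardless of whether they retain an edge, subgraph.edges accepts delete.vertices=FALSE.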


# In addition to importing the data, the package also has functions (see below) that work with the built-in data directly. With these functions, users only need to specify which genome/organism and which ontology to use.

# Here, we use the human TCGA mutation dataset as an example
data(TCGA_mutations)
symbols <- as.character(fData(TCGA_mutations)$Symbol)

## Enrichment analysis using Disease Ontology (DO)
data <- symbols[1:100] # select the first 100 human genes
eTerm <- dEnricher(data, identity="symbol", genome="Hs", ontology="DO")

Start at 2017-03-27 20:05:43

First, load the ontology DO and its gene associations in the genome Hs (2017-03-27 20:05:43) ...
'org.Hs.eg' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.eg.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:46)
'org.Hs.egDO' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egDO.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:47)
Then, do mapping based on symbol (2017-03-27 20:05:47) ...
	Among 100 symbols of input data, there are 100 mappable via official gene symbols but 0 left unmappable
Third, perform enrichment analysis using HypergeoTest (2017-03-27 20:05:48) ...
	There are 917 terms being used, each restricted within [10,1000] annotations
Last, adjust the p-values using the BH method (2017-03-27 20:05:48) ...

End at 2017-03-27 20:05:48
Runtime in total is: 5 secs
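
# The log above mentions a hypergeometric test followed by BH adjustment. The idea behind these two steps can be sketched in base R as follows. This is an illustration with made-up counts, not the internal implementation of dEnricher; all variable names and numbers below are assumptions.

```r
# Made-up counts for a single term (assumptions, not taken from the run above)
N <- 20000  # genes in the background
K <- 150    # background genes annotated to the term
n <- 100    # genes in the input list
k <- 5      # input genes annotated to the term

# One-sided hypergeometric p-value: probability of observing
# at least k annotated genes among n genes drawn at random
p <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# Benjamini-Hochberg adjustment across several terms
# (termB and termC are made-up p-values for illustration)
pvals <- c(termA = p, termB = 0.04, termC = 0.5)
adjp <- p.adjust(pvals, method = "BH")
adjp
```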


## gene set enrichment analysis (GSEA) using KEGG pathways
tol <- apply(exprs(TCGA_mutations), 1, sum) # calculate the total mutations for each gene
data <- data.frame(tol=tol)
eTerm <- dGSEA(data, identity="symbol", genome="Hs", ontology="MsigdbC2KEGG")
Start at 2017-03-27 20:05:52

First, load the ontology MsigdbC2KEGG and its gene associations in the genome Hs (2017-03-27 20:05:52) ...
'org.Hs.eg' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.eg.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:55)
'org.Hs.egMsigdbC2KEGG' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egMsigdbC2KEGG.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:06:00)
Then, do mapping based on symbol (2017-03-27 20:06:00) ...
	Among 19420 symbols of input data, there are 19038 mappable via official gene symbols but 382 left unmappable
Third, perform GSEA analysis based on 1000 permutations for 186 gene sets (2017-03-27 20:06:05) ...
	Sample 1 is being processed at (2017-03-27 20:06:05) ...
	10 of 186 gene sets have been processed (2017-03-27 20:06:08) ...
	20 of 186 gene sets have been processed (2017-03-27 20:06:11) ...
	30 of 186 gene sets have been processed (2017-03-27 20:06:13) ...
	40 of 186 gene sets have been processed (2017-03-27 20:06:15) ...
	50 of 186 gene sets have been processed (2017-03-27 20:06:17) ...
	60 of 186 gene sets have been processed (2017-03-27 20:06:19) ...
	70 of 186 gene sets have been processed (2017-03-27 20:06:22) ...
	80 of 186 gene sets have been processed (2017-03-27 20:06:24) ...
	90 of 186 gene sets have been processed (2017-03-27 20:06:26) ...
	100 of 186 gene sets have been processed (2017-03-27 20:06:28) ...
	110 of 186 gene sets have been processed (2017-03-27 20:06:31) ...
	120 of 186 gene sets have been processed (2017-03-27 20:06:33) ...
	130 of 186 gene sets have been processed (2017-03-27 20:06:35) ...
	140 of 186 gene sets have been processed (2017-03-27 20:06:37) ...
	150 of 186 gene sets have been processed (2017-03-27 20:06:39) ...
	160 of 186 gene sets have been processed (2017-03-27 20:06:41) ...
	170 of 186 gene sets have been processed (2017-03-27 20:06:44) ...
	180 of 186 gene sets have been processed (2017-03-27 20:06:46) ...
	186 of 186 gene sets have been processed (2017-03-27 20:06:47) ...

End at 2017-03-27 20:06:49
Runtime in total is: 57 secs