What are the built-in data and how are they used?

Notes:
  • All results are based on dnet (version 1.0.11).
  • R scripts (i.e. R expressions) and their comments are highlighted with a light-cyan background; everything else is screen output.
  • Images displayed below may be distorted, but should display normally on your screen.
  • Functions contained in dnet 1.0.11 are hyperlinked in-place and also listed on the right side.
  • Key texts are underlined, in bold and in pumpkin-orange color.
    # These built-in data are the backend of the analytical utilities supported in the dnet package, and span a wide range of known gene-centric knowledge across well-studied organisms. They are provided as RData-formatted files that are regularly updated. We will also extend them with new knowledge, for example upon request by users. The built-in RData are summarised in brief and available in the RData page.
    # Usually, users do not need to download these files themselves. Instead, users are encouraged to find what they want to use by looking up keywords in the Documentations page. The package has functions to import the data or to work with them directly.
    # The function dRDataLoader imports the data that users want to use.
    # For ease of use, the names of organism-specific data start with 'org', followed by the specific organism ('Hs' for human) and then the data content: 'eg' alone means information about Entrez Genes, and a further suffix (for example, 'GOBP') means annotations of those genes by Gene Ontology Biological Process (GOBP).
    ## load human Entrez Genes (EG), and list the first 3 genes
    org.Hs.eg <- dRDataLoader(RData='org.Hs.eg')
    'org.Hs.eg' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.eg.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:17)
    org.Hs.eg$gene_info[1:3,]
      GeneID Symbol description                        chromosome map_location
    1 1      A1BG   alpha-1-B glycoprotein             19         19q13.4
    2 2      A2M    alpha-2-macroglobulin              12         12p13.31
    3 3      A2MP1  alpha-2-macroglobulin pseudogene 1 12         12p13.31
      Synonyms
    1 A1B|ABG|GAB|HYST2477
    2 A2MD|CPAMD5|FWP007|S863-7
    3 A2MP
      dbXrefs
    1 MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410|HPRD:00726|Vega:OTTHUMG00000183507
    2 MIM:103950|HGNC:HGNC:7|Ensembl:ENSG00000175899|HPRD:00072|Vega:OTTHUMG00000150267
    3 HGNC:HGNC:8|Ensembl:ENSG00000256069
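The Synonyms and dbXrefs columns printed above pack several values into one string, separated by "|". A minimal base-R sketch of unpacking such a field is given below. It works on a hand-typed copy of the three rows shown (not on the real org.Hs.eg object), and the name alias2symbol is only illustrative.

```r
# Toy copy of the three rows printed above (only the columns needed here)
gene_info <- data.frame(
  GeneID   = c(1, 2, 3),
  Symbol   = c("A1BG", "A2M", "A2MP1"),
  Synonyms = c("A1B|ABG|GAB|HYST2477", "A2MD|CPAMD5|FWP007|S863-7", "A2MP"),
  stringsAsFactors = FALSE
)

# Split on a literal "|"; fixed = TRUE stops "|" being read as regex alternation
synonyms <- strsplit(gene_info$Synonyms, "|", fixed = TRUE)
names(synonyms) <- gene_info$Symbol

# Long-format lookup table: one row per (alias, official symbol) pair
alias2symbol <- data.frame(
  alias  = unlist(synonyms, use.names = FALSE),
  symbol = rep(names(synonyms), lengths(synonyms)),
  stringsAsFactors = FALSE
)

# Look up the official symbol for one alias
alias2symbol$symbol[alias2symbol$alias == "CPAMD5"]
```

The same pattern should apply to dbXrefs, where each piece is itself a "source:identifier" pair.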
    ## load annotations of human Entrez Genes by Gene Ontology Biological Process (GOBP), inspect the content and list the first 3 terms
    org.Hs.egGOBP <- dRDataLoader(RData='org.Hs.egGOBP')
    'org.Hs.egGOBP' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egGOBP.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:21)
    names(org.Hs.egGOBP)
    [1] "gs" "set_info"
    org.Hs.egGOBP$set_info[1:3,]
               setID      name                             namespace distance
    GO:0000002 GO:0000002 mitochondrial genome maintenance Process   6
    GO:0000003 GO:0000003 reproduction                     Process   2
    GO:0000011 GO:0000011 vacuole inheritance              Process   6
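The naming convention described earlier ('org', then the organism code, then 'eg' plus an optional annotation suffix) can be sketched as a small helper. The function rdata_name is hypothetical and not part of dnet; it only composes the string that would be passed to dRDataLoader. Names that do not follow the 'eg' pattern, such as 'org.Hs.string', are not covered.

```r
# Hypothetical helper (not part of dnet): compose a built-in RData name
# from an organism code and an optional annotation suffix
rdata_name <- function(genome = "Hs", ontology = NULL) {
  base <- paste0("org.", genome, ".eg")
  if (is.null(ontology)) base else paste0(base, ontology)
}

rdata_name("Hs")          # gene information only
rdata_name("Hs", "GOBP")  # Gene Ontology Biological Process annotations

# Names for the other annotation sources used in this FAQ
sapply(c("DO", "PS", "SF", "MsigdbC2KEGG"),
       function(x) rdata_name("Hs", x))

# Intended use (needs the dnet package and network access, so not run here):
# obj <- dRDataLoader(RData = rdata_name("Hs", "GOBP"))
```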
    ## load annotations of human Entrez Genes by Disease Ontology (DO), inspect the content and list the first 3 terms
    org.Hs.egDO <- dRDataLoader(RData='org.Hs.egDO')
    'org.Hs.egDO' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egDO.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:23)
    org.Hs.egDO$set_info[1:3,]
                 setID        name                  namespace        distance
    DOID:0001816 DOID:0001816 angiosarcoma          Disease_Ontology 5
    DOID:0002116 DOID:0002116 pterygium             Disease_Ontology 7
    DOID:0014667 DOID:0014667 disease of metabolism Disease_Ontology 1
    ## load phylostratific age (PS) information of human Entrez Genes, inspect the content and list all our ancestors
    org.Hs.egPS <- dRDataLoader(RData='org.Hs.egPS')
    'org.Hs.egPS' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egPS.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:24)
    org.Hs.egPS$set_info
       setID name                  namespace    distance
    3  3     2759:Eukaryota        superkingdom 0.00000000
    4  4     33154:Opisthokonta    no rank      0.02227541
    5  5     33154:Opisthokonta    no rank      0.02677301
    6  6     33154:Opisthokonta    no rank      0.03026936
    7  7     33154:Opisthokonta    no rank      0.03573534
    8  8     33154:Opisthokonta    no rank      0.03880849
    9  9     33208:Metazoa         kingdom      0.04949159
    10 10    33208:Metazoa         kingdom      0.06686750
    11 11    33208:Metazoa         kingdom      0.09260898
    12 12    6072:Eumetazoa        no rank      0.10459007
    13 13    6072:Eumetazoa        no rank      0.11176118
    14 14    33213:Bilateria       no rank      0.12058364
    15 15    33213:Bilateria       no rank      0.12660301
    16 16    33511:Deuterostomia   no rank      0.13884801
    17 17    33511:Deuterostomia   no rank      0.14852778
    18 18    7711:Chordata         phylum       0.15759842
    19 19    7742:Vertebrata       no rank      0.16953129
    20 20    117571:Euteleostomi   no rank      0.18295445
    21 21    8287:Sarcopterygii    no rank      0.18554672
    22 22    32523:Tetrapoda       no rank      0.18855901
    23 23    32524:Amniota         no rank      0.19241034
    24 24    40674:Mammalia        class        0.19552877
    25 25    32525:Theria          no rank      0.19917128
    26 26    9347:Eutheria         no rank      0.20262687
    27 27    1437010:Boreoeutheria no rank      0.20409224
    29 29    9443:Primates         order        0.20521882
    30 30    9443:Primates         order        0.20708817
    32 32    314293:Simiiformes    infraorder   0.21351030
    33 33    9526:Catarrhini       parvorder    0.21636349
    34 34    314295:Hominoidea     superfamily  0.21875281
    35 35    9604:Hominidae        family       0.22019688
    36 36    207598:Homininae      subfamily    0.22313298
    37 37    9606:Homo sapiens     species      0.23340461
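In the table above, each name appears to combine an NCBI taxonomy identifier and a taxon label separated by a colon, and several consecutive strata share the same taxon. A minimal base-R sketch of splitting the name and collapsing repeated taxa follows. It uses five rows copied by hand from the output (not the real org.Hs.egPS object), and the column names taxid and taxon are introduced only for illustration.

```r
# A few rows copied from the printed set_info (not the full table)
set_info <- data.frame(
  setID    = c("3", "4", "5", "9", "37"),
  name     = c("2759:Eukaryota", "33154:Opisthokonta", "33154:Opisthokonta",
               "33208:Metazoa", "9606:Homo sapiens"),
  distance = c(0.00000000, 0.02227541, 0.02677301, 0.04949159, 0.23340461),
  stringsAsFactors = FALSE
)

# Split "taxid:label" at the first colon
set_info$taxid <- sub(":.*$", "", set_info$name)
set_info$taxon <- sub("^[^:]*:", "", set_info$name)

# Several strata can share one taxon; keep the first row seen for each taxid
# (rows are already ordered by increasing distance in the printed table)
unique_taxa <- set_info[!duplicated(set_info$taxid),
                        c("taxid", "taxon", "distance")]
unique_taxa
```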
    ## load domain superfamily (SF) information of human Entrez Genes, inspect the content and list the first 3 superfamilies
    org.Hs.egSF <- dRDataLoader(RData='org.Hs.egSF')
    'org.Hs.egSF' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egSF.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:26)
    org.Hs.egSF$set_info[1:3,]
          setID name                         namespace distance
    46458 46458 Globin-like                  sf        a.1.1
    46548 46548 alpha-helical ferredoxin     sf        a.1.2
    46561 46561 Ribosomal protein L29 (L29p) sf        a.2.2
    ## load KEGG pathways for human Entrez Genes, inspect the content and list the first 3 pathways
    org.Hs.egMsigdbC2KEGG <- dRDataLoader(RData='org.Hs.egMsigdbC2KEGG')
    'org.Hs.egMsigdbC2KEGG' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egMsigdbC2KEGG.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:32)
    org.Hs.egMsigdbC2KEGG$set_info[1:3,]
                                setID                       name
    KEGG_ABC_TRANSPORTERS       KEGG_ABC_TRANSPORTERS       ABC transporters
    KEGG_ACUTE_MYELOID_LEUKEMIA KEGG_ACUTE_MYELOID_LEUKEMIA Acute myeloid leukemia
    KEGG_ADHERENS_JUNCTION      KEGG_ADHERENS_JUNCTION      Adherens junction
                                namespace
    KEGG_ABC_TRANSPORTERS       C2
    KEGG_ACUTE_MYELOID_LEUKEMIA C2
    KEGG_ADHERENS_JUNCTION      C2
                                distance
    KEGG_ABC_TRANSPORTERS       The ATP-binding cassette (ABC) transporters form one of the largest known protein families, and are widespread in bacteria, archaea, and eukaryotes. They couple ATP hydrolysis to active transport of a wide variety of substrates such as ions, sugars, lipids, sterols, peptides, proteins, and drugs. The structure of a prokaryotic ABC transporter usually consists of three components; typically two integral membrane proteins each having six transmembrane segments, two peripheral proteins that bind and hydrolyze ATP, and a periplasmic (or lipoprotein) substrate-binding protein. Many of the genes for the three components form operons as in fact observed in many bacterial and archaeal genomes. On the other hand, in a typical eukaryotic ABC transporter, the membrane spanning protein and the ATP-binding protein are fused, forming a multi-domain protein with the membrane-spanning domain (MSD) and the nucleotide-binding domain (NBD).
    KEGG_ACUTE_MYELOID_LEUKEMIA Two major types of genetic events are crucial for the molecular pathogenesis of acute myeloid leukemias (AML) : activating mutations of signal transduction intermediates and alterations in myeloid transcription factors governing hematopoietic differentiation. Both aberrant and constitutive activation of signal transduction molecules are found in about 50% of primary AML bone marrow samples, and seem to contribute to the increased proliferation and apoptosis resistance. The most common of these activating events were observed in the RTK Flt3, in N-Ras and K-Ras, in Kit, and sporadically in other RTKs. Specific haematopoietic transcription factors are crucial for differentiation to particular lineages during normal differentiation, but are frequently disrupted in AML. Some mechanisms of disruption involve the effect of fusion proteins that are generated by chromosomal translocations on haematopoietic transcription factors. In other cases, the transcription factors themselves are mutated.
    KEGG_ADHERENS_JUNCTION      Cell-cell adherens junctions (AJs), the most common type of intercellular adhesions, are important for maintaining tissue architecture and cell polarity and can limit cell movement and proliferation. At AJs, E-cadherin serves as an essential cell adhesion molecules (CAMs). The cytoplasmic tail binds beta-catenin, which in turn binds alpha-catenin. Alpha-catenin is associated with F-actin bundles directly and indirectly. The integrity of the cadherin-catenin complex is negatively regulated by phosphorylation of beta-catenin by receptor tyrosine kinases (RTKs) and cytoplasmic tyrosine kinases (Fer, Fyn, Yes, and Src), which leads to dissociation of the cadherin-catenin complex. Integrity of this complex is positively regulated by beta-catenin phosphorylation by casein kinase II, and dephosphorylation by protein tyrosine phosphatases. Changes in the phosphorylation state of beta-catenin affect cell-cell adhesion, cell migration and the level of signaling beta-catenin. Wnt signaling acts as a positive regulator of beta-catenin by inhibiting beta-catenin degradation, which stabilizes beta-catenin, and causes its accumulation. Cadherin may acts as a negative regulator of signaling beta-catenin as it binds beta-catenin at the cell surface and thereby sequesters it from the nucleus. Nectins also function as CAMs at AJs, but are more highly concentrated at AJs than E-cadherin. Nectins transduce signals through Cdc42 and Rac, which reorganize the actin cytoskeleton, regulate the formation of AJs, and strengthen cell-cell adhesion.
    ## load the network for human Entrez Genes as an 'igraph' object
    org.Hs.string <- dRDataLoader(RData='org.Hs.string')
    'org.Hs.string' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.string.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:39)
    org.Hs.string
    IGRAPH UN-- 18492 728141 --
    + attr: name (v/c), seqid (v/c), geneid (v/n), symbol (v/c),
    | description (v/c), neighborhood_score (e/n), fusion_score (e/n),
    | cooccurence_score (e/n), coexpression_score (e/n), experimental_score
    | (e/n), database_score (e/n), textmining_score (e/n), combined_score
    | (e/n)
    + edges (vertex names):
     [1] 3025671--3031737 3021358--3027795 3021358--3027929 3021358--3027741
     [5] 3024278--3029186 3029373--3031543 3031543--3034385 3019006--3030823
     [9] 3015391--3030823 3021191--3028634 3021191--3024550 3021191--3033402
    [13] 3031876--3031959 3015324--3031959 3023108--3033954 3016546--3020273
    + ... omitted several edges
    ## This network is extracted from the STRING database. Only associations with at least medium confidence (score>=400) are retained. Users can further restrict the network to edges with high confidence (score>=700, for example)
    network <- igraph::subgraph.edges(org.Hs.string, eids=E(org.Hs.string)[combined_score>=700])
    network
    IGRAPH UN-- 15341 316170 --
    + attr: name (v/c), seqid (v/c), geneid (v/n), symbol (v/c),
    | description (v/c), neighborhood_score (e/n), fusion_score (e/n),
    | cooccurence_score (e/n), coexpression_score (e/n), experimental_score
    | (e/n), database_score (e/n), textmining_score (e/n), combined_score
    | (e/n)
    + edges (vertex names):
     [1] 3017550--3023854 3023931--3028317 3019304--3028317 3028317--3033319
     [5] 3023602--3028317 3014709--3028317 3024678--3026825 3023468--3030905
     [9] 3026117--3030905 3026845--3029085 3017265--3027473 3015527--3033837
    [13] 3019960--3033973 3021862--3033174 3015979--3025568 3015355--3025568
    + ... omitted several edges
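Both counts drop after filtering: edges from 728141 to 316170 and vertices from 18492 to 15341. The vertex count falls because igraph::subgraph.edges, by default (delete.vertices=TRUE), removes vertices that no longer touch any retained edge. The base-R sketch below illustrates the same logic on a made-up edge table; the vertex names and scores are invented and do not come from STRING.

```r
# Toy edge table with invented vertex names and scores
edges <- data.frame(
  from = c("a", "a", "b", "c", "d"),
  to   = c("b", "c", "c", "d", "e"),
  combined_score = c(950, 420, 710, 400, 650),
  stringsAsFactors = FALSE
)

# All vertices present before filtering
all_vertices <- unique(c(edges$from, edges$to))

# Keep only high-confidence edges
high <- edges[edges$combined_score >= 700, ]

# Vertices still attached to at least one retained edge
kept_vertices <- unique(c(high$from, high$to))

# Vertices that would vanish from the filtered network
dropped <- setdiff(all_vertices, kept_vertices)

nrow(high)     # number of retained edges
kept_vertices  # vertices that survive the filter
dropped        # vertices left without any edge
```

If the isolated vertices should be kept in the igraph object, passing delete.vertices=FALSE to subgraph.edges is the documented way to do so.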
    # In addition to data import, the package also has functions (see below) that work with these data directly. In these functions, users only need to specify which genome/organism and which ontology to use.
    # Here, we use the human TCGA mutation dataset as an example
    data(TCGA_mutations)
    symbols <- as.character(fData(TCGA_mutations)$Symbol)
    ## Enrichment analysis using Disease Ontology (DO)
    data <- symbols[1:100] # select the first 100 human genes
    eTerm <- dEnricher(data, identity="symbol", genome="Hs", ontology="DO")
    Start at 2017-03-27 20:05:43
    First, load the ontology DO and its gene associations in the genome Hs (2017-03-27 20:05:43) ...
    'org.Hs.eg' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.eg.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:46)
    'org.Hs.egDO' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egDO.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:47)
    Then, do mapping based on symbol (2017-03-27 20:05:47) ...
    Among 100 symbols of input data, there are 100 mappable via official gene symbols but 0 left unmappable
    Third, perform enrichment analysis using HypergeoTest (2017-03-27 20:05:48) ...
    There are 917 terms being used, each restricted within [10,1000] annotations
    Last, adjust the p-values using the BH method (2017-03-27 20:05:48) ...
    End at 2017-03-27 20:05:48
    Runtime in total is: 5 secs
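The log above names three steps: restricting terms to those with 10 to 1000 annotations, a hypergeometric test, and BH adjustment of the p-values. A minimal base-R sketch of these steps is given below. All numbers and term names are invented for illustration, and this is not the internal code of dEnricher, whose exact background definition and options may differ.

```r
# Toy numbers (hypothetical, not taken from the run above)
N <- 20000  # genes in the background
n <- 100    # genes in the input list

terms <- data.frame(
  term    = c("termA", "termB", "termC"),
  K       = c(200, 500, 50),  # background genes annotated to each term
  overlap = c(8, 3, 0),       # input genes annotated to each term
  stringsAsFactors = FALSE
)

# Keep terms whose annotation count lies within [10, 1000]
terms <- terms[terms$K >= 10 & terms$K <= 1000, ]

# One-sided hypergeometric test: P(X >= overlap) when drawing n genes
# from N, of which K are annotated to the term
terms$pvalue <- phyper(terms$overlap - 1, terms$K, N - terms$K, n,
                       lower.tail = FALSE)

# Benjamini-Hochberg adjustment across all tested terms
terms$adjp <- p.adjust(terms$pvalue, method = "BH")

terms[order(terms$adjp), ]
```

Subtracting 1 from the overlap matters: phyper with lower.tail=FALSE returns P(X > q), so q must be overlap - 1 to include the observed overlap itself.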
    ## gene set enrichment analysis (GSEA) using KEGG pathways
    tol <- apply(exprs(TCGA_mutations), 1, sum) # calculate the total mutations for each gene
    data <- data.frame(tol=tol)
    eTerm <- dGSEA(data, identity="symbol", genome="Hs", ontology="MsigdbC2KEGG")
    Start at 2017-03-27 20:05:52
    First, load the ontology MsigdbC2KEGG and its gene associations in the genome Hs (2017-03-27 20:05:52) ...
    'org.Hs.eg' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.eg.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:05:55)
    'org.Hs.egMsigdbC2KEGG' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dnet/1.0.7/org.Hs.egMsigdbC2KEGG.RData?raw=true) has been loaded into the working environment (at 2017-03-27 20:06:00)
    Then, do mapping based on symbol (2017-03-27 20:06:00) ...
    Among 19420 symbols of input data, there are 19038 mappable via official gene symbols but 382 left unmappable
    Third, perform GSEA analysis based on 1000 permutations for 186 gene sets (2017-03-27 20:06:05) ...
    Sample 1 is being processed at (2017-03-27 20:06:05) ...
    10 of 186 gene sets have been processed (2017-03-27 20:06:08) ...
    20 of 186 gene sets have been processed (2017-03-27 20:06:11) ...
    30 of 186 gene sets have been processed (2017-03-27 20:06:13) ...
    40 of 186 gene sets have been processed (2017-03-27 20:06:15) ...
    50 of 186 gene sets have been processed (2017-03-27 20:06:17) ...
    60 of 186 gene sets have been processed (2017-03-27 20:06:19) ...
    70 of 186 gene sets have been processed (2017-03-27 20:06:22) ...
    80 of 186 gene sets have been processed (2017-03-27 20:06:24) ...
    90 of 186 gene sets have been processed (2017-03-27 20:06:26) ...
    100 of 186 gene sets have been processed (2017-03-27 20:06:28) ...
    110 of 186 gene sets have been processed (2017-03-27 20:06:31) ...
    120 of 186 gene sets have been processed (2017-03-27 20:06:33) ...
    130 of 186 gene sets have been processed (2017-03-27 20:06:35) ...
    140 of 186 gene sets have been processed (2017-03-27 20:06:37) ...
    150 of 186 gene sets have been processed (2017-03-27 20:06:39) ...
    160 of 186 gene sets have been processed (2017-03-27 20:06:41) ...
    170 of 186 gene sets have been processed (2017-03-27 20:06:44) ...
    180 of 186 gene sets have been processed (2017-03-27 20:06:46) ...
    186 of 186 gene sets have been processed (2017-03-27 20:06:47) ...
    End at 2017-03-27 20:06:49
    Runtime in total is: 57 secs
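The log above reports a permutation-based analysis (1000 permutations per gene set). To give a feel for what such a procedure computes, here is a deliberately simplified sketch in base R: an unweighted running-sum enrichment score with a gene-label permutation null. It is in the spirit of GSEA only. The weighting, normalisation and multiple-testing steps used by dGSEA are not reproduced, and all scores and gene names are simulated.

```r
set.seed(1)

# Simulated per-gene scores (hypothetical), named by gene
scores <- setNames(rnorm(200), paste0("g", 1:200))

# Toy gene set: the first 20 genes, shifted upward so it is enriched by design
gene_set <- paste0("g", 1:20)
scores[gene_set] <- scores[gene_set] + 2

# Unweighted running-sum statistic: walk down the ranked list, step up at
# genes in the set and down otherwise; report the largest deviation from zero
enrichment_score <- function(scores, gene_set) {
  ranked <- names(sort(scores, decreasing = TRUE))
  hit    <- ranked %in% gene_set
  n_hit  <- sum(hit)
  n_miss <- length(ranked) - n_hit
  running <- cumsum(ifelse(hit, 1 / n_hit, -1 / n_miss))
  running[which.max(abs(running))]
}

observed <- enrichment_score(scores, gene_set)

# Null distribution: shuffle gene labels and recompute the statistic
n_perm  <- 1000
null_es <- replicate(n_perm, {
  permuted <- setNames(scores, sample(names(scores)))
  enrichment_score(permuted, gene_set)
})

# Empirical two-sided p-value, with +1 so that it can never be exactly zero
pvalue <- (sum(abs(null_es) >= abs(observed)) + 1) / (n_perm + 1)

observed
pvalue
```

The runtime of the real analysis scales with the number of gene sets times the number of permutations, which is consistent with the progress lines in the log.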

    Source faq

    FAQ2.r
