Function to calculate pair-wise semantic similarity between input terms based on a directed acyclic graph (DAG) with annotated data

Description

dDAGtermSim calculates pair-wise semantic similarity between input terms based on a directed acyclic graph (DAG) with annotated data. Parallel computing is also supported on Linux and Mac operating systems.

Usage

dDAGtermSim(g, terms = NULL, method = c("Resnik", "Lin", "Schlicker", "Jiang", "Pesquita"), 
  fast = TRUE, parallel = TRUE, multicores = NULL, verbose = TRUE)

Arguments

g
an object of class "igraph" or "graphNEL". It must contain a vertex attribute called 'annotations' that stores the annotation data (see the Examples section for how to prepare it)
terms
the terms/nodes between which pair-wise semantic similarity is calculated. If NULL, all terms in the input DAG will be used for the calculation, which is prohibitively expensive!
method
the method used to measure semantic similarity between input terms. It can be "Resnik" for the information content (IC) of the most informative common ancestor (MICA) (see http://arxiv.org/pdf/cmp-lg/9511007.pdf); "Lin" for 2*IC at the MICA divided by the sum of IC at the pair of terms (see https://www.cse.iitb.ac.in/~cs626-449/Papers/WordSimilarity/3.pdf); "Schlicker" for a version of 'Lin' weighted by 1-prob(MICA) (see http://www.ncbi.nlm.nih.gov/pubmed/16776819); "Jiang" for 1 minus the difference between the sum of IC at the pair of terms and 2*IC at the MICA (see http://arxiv.org/pdf/cmp-lg/9709008.pdf); "Pesquita" for graph information content similarity related to the Tanimoto-Jaccard index, i.e. the summed information content of common ancestors divided by the summed information content of all ancestors of term1 and term2 (see http://www.ncbi.nlm.nih.gov/pubmed/18460186). By default, it uses "Schlicker" method
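The five measures described above can be sketched on toy numbers. This is a minimal illustration and not the package's internal code: the probabilities, ancestor sets and IC values are hypothetical, and IC is assumed here to be the negative log10 of the annotation probability.

```r
# Hypothetical annotation probabilities for two terms and their MICA
p <- c(term1 = 0.05, term2 = 0.04, mica = 0.1)
ic <- -log10(p)  # assumption: IC = -log10(probability)

# Resnik: IC of the most informative common ancestor
resnik <- ic[["mica"]]

# Lin: 2*IC(MICA) divided by the summed IC of the two terms
lin <- 2 * ic[["mica"]] / (ic[["term1"]] + ic[["term2"]])

# Schlicker: Lin weighted by 1 - prob(MICA)
schlicker <- lin * (1 - p[["mica"]])

# Jiang: 1 minus (summed IC of the two terms - 2*IC(MICA))
jiang <- 1 - ((ic[["term1"]] + ic[["term2"]]) - 2 * ic[["mica"]])

# Pesquita: summed IC of common ancestors over summed IC of all ancestors
anc1 <- c("root", "A", "B", "term1")  # hypothetical ancestors of term1
anc2 <- c("root", "A", "C", "term2")  # hypothetical ancestors of term2
ic_all <- c(root = 0, A = 1, B = 1.5, C = 1.2, term1 = 2, term2 = 1.7)
pesquita <- sum(ic_all[intersect(anc1, anc2)]) / sum(ic_all[union(anc1, anc2)])

print(c(Resnik = resnik, Lin = lin, Schlicker = schlicker,
        Jiang = jiang, Pesquita = pesquita))
```

In this toy case the Schlicker score is always below the Lin score because the weight 1-prob(MICA) is less than 1.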
fast
logical to indicate whether a vectorised fast computation is used. The default is TRUE. Using the vectorised fast computation is always advisable; the conventional computation is provided only to make the scripts easier to understand
parallel
logical to indicate whether parallel computation with multicores is used. The default is TRUE, but parallel computation is only actually used if the two packages "foreach" and "doParallel" have been installed. They can be installed via: source("http://bioconductor.org/biocLite.R"); biocLite(c("foreach","doParallel")). If they are not yet installed, this option will be disabled
multicores
an integer to specify how many cores will be registered as the multicore parallel backend to the 'foreach' package. If NULL, it will use half of the cores available on the user's computer. This option only works when parallel computation is enabled
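Before relying on parallel=TRUE, it can help to check the two conditions described above by hand. The following is a minimal sketch, not the package's own logic: the variable names are hypothetical, and rounding half of the cores up (with a floor of one core) is an assumption.

```r
library(parallel)  # ships with base R; provides detectCores()

# Check whether both packages needed for the parallel backend are installed
pkgs <- c("foreach", "doParallel")
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
use_parallel <- all(installed)

# Half of the available cores; detectCores() may return NA on some platforms
n_available <- detectCores()
multicores <- if (is.na(n_available)) 1L else max(1L, as.integer(ceiling(n_available / 2)))

message("Parallel backend available: ", use_parallel,
        "; cores to register: ", multicores)
```

The resulting values could then be passed as the parallel and multicores arguments.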
verbose
logical to indicate whether messages will be displayed on the screen. The default is TRUE (messages are displayed)

Value

It returns a sparse matrix containing pair-wise semantic similarity between input terms. This sparse matrix can be converted to a full matrix via the function as.matrix.
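The conversion to a full matrix can be sketched as follows. The matrix below is a hypothetical stand-in for a dDAGtermSim result, built directly with the 'Matrix' package; the term names and values are made up for illustration.

```r
library(Matrix)

# Hypothetical term IDs and self-similarity values
terms <- c("T1", "T2", "T3")
sim <- sparseMatrix(i = 1:3, j = 1:3, x = c(0.96, 0.99, 0.97),
                    dims = c(3, 3), dimnames = list(terms, terms),
                    symmetric = TRUE)

# Convert the sparse matrix into an ordinary dense matrix
full <- as.matrix(sim)
print(full)
```

Entries shown as "." in the sparse print-out become explicit zeros in the full matrix.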

Note

none

Examples

# 1) load HPPA as igraph object
ig.HPPA <- dRDataLoader(RData='ig.HPPA')
'ig.HPPA' (from package 'dnet' version 1.1.2) has been loaded into the working environment (at 2018-01-19 12:35:06)
g <- ig.HPPA
# 2) load human genes annotated by HPPA
org.Hs.egHPPA <- dRDataLoader(RData='org.Hs.egHPPA')
'org.Hs.egHPPA' (from package 'dnet' version 1.1.2) has been loaded into the working environment (at 2018-01-19 12:35:06)
# 3) prepare for ontology and its annotation information
dag <- dDAGannotate(g, annotations=org.Hs.egHPPA, path.mode="all_paths", verbose=TRUE)
At level 13, there are 5 nodes, and 12 incoming neighbors.
At level 12, there are 17 nodes, and 27 incoming neighbors.
At level 11, there are 50 nodes, and 65 incoming neighbors.
At level 10, there are 144 nodes, and 145 incoming neighbors.
At level 9, there are 332 nodes, and 282 incoming neighbors.
At level 8, there are 518 nodes, and 374 incoming neighbors.
At level 7, there are 625 nodes, and 389 incoming neighbors.
At level 6, there are 710 nodes, and 382 incoming neighbors.
At level 5, there are 587 nodes, and 232 incoming neighbors.
At level 4, there are 297 nodes, and 91 incoming neighbors.
At level 3, there are 105 nodes, and 23 incoming neighbors.
At level 2, there are 23 nodes, and 1 incoming neighbors.
At level 1, there are 1 nodes, and 0 incoming neighbors.
# 4) calculate pair-wise semantic similarity between 5 randomly chosen terms
terms <- sample(V(dag)$name, 5)
sim <- dDAGtermSim(g=dag, terms=terms, method="Schlicker", parallel=FALSE)
Start at 2018-01-19 12:35:19
Calculate semantic similarity between 5 terms using Schlicker method (2018-01-19 12:35:19)...
Build a sparse matrix of children x ancestors (with 5 rows and 3414 columns (2018-01-19 12:35:19)...
1 out of 5 (2018-01-19 12:35:19)
2 out of 5 (2018-01-19 12:35:19)
3 out of 5 (2018-01-19 12:35:19)
4 out of 5 (2018-01-19 12:35:19)
5 out of 5 (2018-01-19 12:35:19)
Finish at 2018-01-19 12:35:19
Runtime in total is: 0 secs
sim
5 x 5 sparse Matrix of class "dsCMatrix"
           HP:0410008 HP:0009141 HP:0001178 HP:0002926 HP:0010765
HP:0410008  0.9615623          .          .          .          .
HP:0009141          .  0.9978301          .          .          .
HP:0001178          .          .  0.9975201          .          .
HP:0002926          .          .          .   0.966832          .
HP:0010765          .          .          .          .  0.9913205