Function to set up a pipeline to estimate RWR-based contact strength between samples from an input gene-sample data matrix and an input graph

Description

dRWRpipeline estimates sample relationships (i.e. contact strength between samples) from an input gene-sample matrix and an input graph. The pipeline includes: 1) random walk with restart (RWR) on the input graph using the input matrix as seeds; 2) calculation of contact strength (inner products of RWR-smoothed columns of the input matrix); 3) estimation of the contact significance by a randomisation procedure. It supports two ways of applying RWR: 'direct' for directly applying RWR to the given seeds; 'indirect' for first pre-computing the affinity matrix of the input graph, and then deriving the affinity score. Parallel computing is also supported on Linux and Mac operating systems.
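
As a rough illustration of steps 1) and 2), the following base-R sketch runs an iterative RWR on a toy graph and then takes inner products of the smoothed columns. It is not the package's implementation: the toy graph, the helper name rwr_direct, the column-wise normalisation and the convergence tolerance are all assumptions made for this example.

```r
# Toy undirected graph on 5 nodes (a path 1-2-3-4-5) as an adjacency matrix.
adj <- matrix(0, 5, 5)
for (i in 1:4) {
  adj[i, i + 1] <- 1
  adj[i + 1, i] <- 1
}

# Column-normalise so that each column sums to 1 (assumed here for simplicity).
W <- sweep(adj, 2, colSums(adj), "/")

# Seed matrix: rows are nodes (genes), columns are samples.
seeds <- cbind(aSeeds = c(1, 0, 1, 0, 1), bSeeds = c(0, 0, 1, 0, 1))
P0 <- sweep(seeds, 2, colSums(seeds), "/")  # each column sums to 1

# Step 1 (hypothetical helper): iterate P <- (1 - r) * W %*% P + r * P0
# until the update is smaller than the tolerance.
rwr_direct <- function(W, P0, restart = 0.75, tol = 1e-10, max_iter = 1000) {
  P <- P0
  for (iter in seq_len(max_iter)) {
    P_next <- (1 - restart) * W %*% P + restart * P0
    converged <- max(abs(P_next - P)) < tol
    P <- P_next
    if (converged) break
  }
  P
}
P <- rwr_direct(W, P0)

# Step 2: contact strength as inner products of the RWR-smoothed columns.
contact <- crossprod(P)  # same as t(P) %*% P; a symmetric 2 x 2 matrix
print(contact)
```

Step 3) would repeat the same calculation on permuted seeds to obtain a null distribution of contact strength.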

Usage

dRWRpipeline(data, g, method = c("direct", "indirect"),
  normalise = c("laplacian", "row", "column", "none"), restart = 0.75,
  normalise.affinity.matrix = c("none", "quantile"),
  permutation = c("random", "degree"), num.permutation = 10,
  p.adjust.method = c("BH", "BY", "bonferroni", "holm", "hochberg", "hommel"),
  adjp.cutoff = 0.05, parallel = TRUE, multicores = NULL, verbose = T)

Arguments

data
an input gene-sample data matrix used for seeds. Values in the matrix do not have to be binary: non-zero entries are used as weights, and should be non-negative for easy interpretation
g
an object of class "igraph" or "graphNEL"
method
the method used to calculate RWR. It can be 'direct' for directly applying RWR, or 'indirect' for indirectly applying RWR (first pre-computing the affinity matrix and then deriving the affinity score)
normalise
the way to normalise the adjacency matrix of the input graph. It can be 'laplacian' for laplacian normalisation, 'row' for row-wise normalisation, 'column' for column-wise normalisation, or 'none'
restart
the restart probability used for RWR. The restart probability takes a value from 0 to 1, controlling how far from the starting nodes/seeds the walker will explore. The higher the value, the more likely the walker is to visit the nodes centered on the starting nodes. At the extreme when the restart probability is zero, the walker moves freely to the neighbors at each step without restarting from seeds, i.e., following a random walk (RW)
normalise.affinity.matrix
the way to normalise the output affinity matrix. It can be 'none' for no normalisation, 'quantile' for quantile normalisation to ensure that columns (if multiple) of the output affinity matrix have the same quantiles
permutation
how to perform the permutation. It can be 'degree' for degree-preserving permutation, or 'random' for random permutation
num.permutation
the number of permutations used to generate the distribution of contact strength under randomisation
p.adjust.method
the method used to adjust p-values. It can be one of "BH", "BY", "bonferroni", "holm", "hochberg" and "hommel". The first two methods "BH" (widely used) and "BY" control the false discovery rate (FDR: the expected proportion of false discoveries amongst the rejected hypotheses); the last four methods "bonferroni", "holm", "hochberg" and "hommel" are designed to give strong control of the family-wise error rate (FWER). Note that FDR is a less stringent condition than FWER
adjp.cutoff
the cutoff of adjusted p-value used to construct the contact graph
parallel
logical to indicate whether parallel computation with multicores is used. By default, it is set to TRUE, but parallel computation is only used if the two packages "foreach" and "doParallel" have been installed. They can be installed via: source("http://bioconductor.org/biocLite.R"); biocLite(c("foreach","doParallel")). If they are not installed, this option will be disabled
multicores
an integer to specify how many cores will be registered as the multicore parallel backend to the 'foreach' package. If NULL, it will use half of the cores available on the user's computer. This option only works when parallel computation is enabled
verbose
logical to indicate whether messages will be displayed on the screen. By default, it is set to TRUE for display
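
To make the 'normalise' and 'restart' arguments more concrete, here is a small base-R sketch. The helper names normalise_adj and rwr_steady, the toy graph and the closed-form steady state are assumptions for illustration and are not taken from the package source.

```r
# Toy undirected graph: node 1 is linked to nodes 2, 3 and 4; nodes 2 and 3 are linked.
adj <- matrix(0, 4, 4)
adj[1, 2:4] <- 1
adj[2:4, 1] <- 1
adj[2, 3] <- 1
adj[3, 2] <- 1

# Hypothetical helper mirroring the four 'normalise' choices.
normalise_adj <- function(adj, how = c("laplacian", "row", "column", "none")) {
  how <- match.arg(how)
  deg <- rowSums(adj)  # node degrees (the toy graph has no isolated nodes)
  switch(how,
    laplacian = diag(1 / sqrt(deg)) %*% adj %*% diag(1 / sqrt(deg)),  # D^-1/2 A D^-1/2
    row       = diag(1 / deg) %*% adj,   # each row sums to 1
    column    = adj %*% diag(1 / deg),   # each column sums to 1
    none      = adj
  )
}

# Hypothetical helper: steady state p = r * (I - (1 - r) W)^-1 p0.
rwr_steady <- function(W, p0, restart) {
  as.vector(restart * solve(diag(nrow(W)) - (1 - restart) * W, p0))
}

W <- normalise_adj(adj, "column")
p0 <- c(0, 0, 0, 1)  # a single seed at node 4

# Probability mass that stays on the seed node for increasing restart values.
seed_mass <- sapply(c(0.25, 0.75, 1), function(r) rwr_steady(W, p0, r)[4])
print(seed_mass)
```

In this toy case, a higher restart probability keeps more of the mass on the seed node, and a restart probability of 1 returns the seed vector unchanged.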

Value

an object of class "dContact", a list with the following components:

  • ratio: a symmetric matrix storing the ratio (the observed against the expected) between pairwise samples
  • zscore: a symmetric matrix storing the z-score between pairwise samples
  • pval: a symmetric matrix storing the p-value between pairwise samples
  • adjpval: a symmetric matrix storing the adjusted p-value between pairwise samples
  • cgraph: the constructed contact graph (as an 'igraph' object) under the cutoff of adjusted p-value
  • Amatrix: a pre-computed affinity matrix when using the 'indirect' method; NULL otherwise
  • call: the call that produced this result
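
The sketch below shows one plausible way statistics of this kind could be derived from observed and permuted contact strengths. The numbers are made up for illustration, and the exact formulas (the use of '>=' for the empirical p-value, the standard deviation for the z-score, and the strict '<' cutoff) are assumptions rather than the package's confirmed definitions. Only p.adjust is a standard function from the 'stats' package.

```r
# Hypothetical data: observed contact strength for three sample pairs,
# and 10 permuted values per pair (one row per permutation).
observed <- c(ab = 0.30, ac = 0.12, bc = 0.21)
set.seed(1)
permuted <- matrix(runif(30, min = 0.05, max = 0.25), nrow = 10,
                   dimnames = list(NULL, names(observed)))

expected <- colMeans(permuted)

# Ratio of the observed against the expected (mean over permutations).
ratio <- observed / expected

# z-score of the observed value relative to the permuted distribution.
zscore <- (observed - expected) / apply(permuted, 2, sd)

# Empirical p-value: fraction of permuted values at least as large as the observed one.
pval <- colMeans(sweep(permuted, 2, observed, ">="))

# Multiple-testing adjustment, here with the "BH" method.
adjpval <- p.adjust(pval, method = "BH")

# Sample pairs that would be linked in a contact graph at a cutoff of 0.05.
linked <- names(adjpval)[adjpval < 0.05]

print(rbind(ratio, zscore, pval, adjpval))
```

With an empirical p-value of this form, 10 permutations can only yield multiples of 0.1, so more permutations give a finer resolution.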

Note

The choice of which method to use for RWR depends on the number of seed sets and the number of permutations used for the statistical test. If the product of these two numbers is large, it is better to use the 'indirect' method (for a single run). However, if the user wants to re-use the pre-computed affinity matrix (i.e. re-use the input graph many times), it is highly recommended to use dRWR and dRWRcontact sequentially instead.
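
The reasoning behind the 'indirect' route can be sketched in base R: the affinity matrix is computed once, and every permutation then needs only a matrix product. This is a simplified illustration, not the package's code. The toy ring graph, the column-wise normalisation, the helper contact_from and the joint shuffling of seed rows are all assumptions.

```r
# Toy graph: a ring of 6 nodes.
n <- 6
adj <- matrix(0, n, n)
for (i in 1:n) {
  j <- i %% n + 1
  adj[i, j] <- 1
  adj[j, i] <- 1
}
W <- sweep(adj, 2, colSums(adj), "/")  # column-normalised (an assumption)
restart <- 0.75

# Pre-compute the affinity matrix once; column j is the steady state for a seed at node j.
affinity <- restart * solve(diag(n) - (1 - restart) * W)

# Two seed sets (columns), scaled so that each column sums to 1.
seeds <- cbind(a = c(1, 0, 1, 0, 1, 0), b = c(0, 0, 1, 0, 1, 0))
P0 <- sweep(seeds, 2, colSums(seeds), "/")

# Hypothetical helper: contact strength from (possibly permuted) seeds.
contact_from <- function(P0) crossprod(affinity %*% P0)

observed <- contact_from(P0)

# Re-use the same affinity matrix for every permutation; only the seed rows are shuffled.
set.seed(42)
num_permutation <- 10
null_ab <- vapply(seq_len(num_permutation), function(b) {
  shuffled <- P0[sample(n), , drop = FALSE]
  contact_from(shuffled)["a", "b"]
}, numeric(1))

print(observed["a", "b"])
print(summary(null_ab))
```

The 'direct' route would instead run one RWR per seed set and per permutation, which is why the product of the two numbers drives the choice.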

Examples

library(igraph)
library(dnet)

# 1) generate a random graph according to the ER model
g <- erdos.renyi.game(100, 1/100)

# 2) produce the induced subgraph only based on the nodes in query
subg <- dNetInduce(g, V(g), knn=0)
V(subg)$name <- 1:vcount(subg)

# 3) estimate RWR-based sample relationships
# define sets of seeds as data
# each seed with equal weight (i.e. all non-zero entries are '1')
aSeeds <- c(1,0,1,0,1)
bSeeds <- c(0,0,1,0,1)
data <- data.frame(aSeeds,bSeeds)
rownames(data) <- 1:5
# calculate their contact graph
dContact <- dRWRpipeline(data=data, g=subg, parallel=FALSE)
Start at 2018-01-19 12:36:56

First, RWR on input graph (15 nodes and 14 edges) using input matrix (5 rows and 2 columns) as seeds (2018-01-19 12:36:56)...
Second, calculate contact strength (2018-01-19 12:36:56)...
Third, generate the distribution of contact strength based on 10 permutations on nodes respecting random (2018-01-19 12:36:56)...
	1 out of 10 (2018-01-19 12:36:56)
	2 out of 10 (2018-01-19 12:36:56)
	3 out of 10 (2018-01-19 12:36:56)
	4 out of 10 (2018-01-19 12:36:57)
	5 out of 10 (2018-01-19 12:36:57)
	6 out of 10 (2018-01-19 12:36:57)
	7 out of 10 (2018-01-19 12:36:57)
	8 out of 10 (2018-01-19 12:36:57)
	9 out of 10 (2018-01-19 12:36:57)
	10 out of 10 (2018-01-19 12:36:57)
Last, estimate the significance of contact strength: zscore, pvalue, and BH adjusted-pvalue (2018-01-19 12:36:58)...
Also, construct the contact graph under the cutoff 5.0e-02 of adjusted-pvalue (2018-01-19 12:36:58)...

Finish at 2018-01-19 12:36:58
Runtime in total is: 2 secs
dContact
$ratio
         [,1]     [,2]
[1,] 1.181814 1.191843
[2,] 1.191843 1.141409

$zscore
         [,1]     [,2]
[1,] 2.170178 2.121734
[2,] 2.121734 1.185599

$pval
     [,1] [,2]
[1,]    0  0.0
[2,]    0  0.1

$adjpval
     [,1] [,2]
[1,]    0  0.0
[2,]    0  0.1

$cgraph
IGRAPH 85ace61 U-W- 2 1 --
+ attr: weight (e/n)
+ edge from 85ace61:
[1] 1--2

$Amatrix
NULL

$call
dRWRpipeline(data = data, g = subg, parallel = FALSE)

$method
[1] "dnet"

attr(,"class")
[1] "dContact"