espresso
is a package for domain estimation of cells only from gene expression profile based on graph-structured stochastic self-organizing map (GraphSOM) and Markov chain Monte Carlo (MCMC) method.
espresso
espresso
can be installed by the following command.
# Installation (X.X.X is a version number)
install.packages("espresso-X.X.X.tar.gz", repos = NULL, type = "source")
If an error message “ERROR: dependencies ‘XXX’ are not available for package ‘espresso’” is displayed in the installation step, please try to install required packages, which are listed as follows.
CRAN packages
# Installation example of CRAN packages
install.packages(c("igraph", "mclust", "progress", "uwot", "rgl",
"rFerns", "Boruta", "doParallel", "foreach", "pheatmap",
"RColorBrewer", "scatterplot3d", "tagcloud", "plotrix"),
dependencies = TRUE)
Bioconductor packages
# Installation example of Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install(c("biomaRt","GO.db")) BiocManager
Note
If you try to install or load espresso
on a machine that does not have X11, a warning message will be displayed because rgl
package requires X11 to be installed. In such case, install X11, or ignore the warning if you do not need UMAP plots.
Altanative
espresso
can be installed directly from Github repository, if devtools
is installed.
library(devtools)
install_github("tmorikuicr/espresso")
This vignette was generated by the following versions of R and packages.
version#> _
#> platform x86_64-apple-darwin17.0
#> arch x86_64
#> os darwin17.0
#> system x86_64, darwin17.0
#> status
#> major 4
#> minor 0.3
#> year 2020
#> month 10
#> day 10
#> svn rev 79318
#> language R
#> version.string R version 4.0.3 (2020-10-10)
#> nickname Bunny-Wunnies Freak Out
packageVersion("igraph")
#> [1] '1.2.11'
packageVersion("mclust")
#> [1] '5.4.9'
packageVersion("progress")
#> [1] '1.2.2'
packageVersion("uwot")
#> [1] '0.1.11'
packageVersion("rgl")
#> [1] '0.108.3'
packageVersion("rFerns")
#> [1] '5.0.0'
packageVersion("Boruta")
#> [1] '7.0.0'
packageVersion("doParallel")
#> [1] '1.0.17'
packageVersion("foreach")
#> [1] '1.5.2'
packageVersion("pheatmap")
#> [1] '1.0.12'
packageVersion("RColorBrewer")
#> [1] '1.1.2'
packageVersion("scatterplot3d")
#> [1] '0.3.41'
packageVersion("tagcloud")
#> [1] '0.6'
packageVersion("plotrix")
#> [1] '3.8.2'
packageVersion("biomaRt")
#> [1] '2.46.3'
packageVersion("GO.db")
#> [1] '3.12.1'
Note that some versions of R may not be consistent with the biomaRt
and dplyr
pacakages. In such case, please install the above versions or updates R and the packages.
espresso
can be easily executed by a few functions. Here, the most simple code is shown in the following, assuming that a gene expression profile exprs
, a domain topology topology
, and a domain assignment asgmt
are already defined. The details of these inputs are explained in Data format.
library(espresso)
# Initialize espresso object
<- initEsp(exprs, topology, asgmt)
eobj
# GraphSOM clustering
<- initGraphSOM(eobj)
eobj <- graphSOM(eobj)
eobj
# Show prediction scores
@summary eobj
If you need more details of the espresso
functions, refer to the following sections and help manuals by help
function (e.g., help(espresso)
, help(graphSOM)
etc.)
espresso
requires three inputs, i.e., expression profile, domain topology, and domain assignment information for cell samples.
exprs
Gene 1 | Gene 2 | Gene 3 | … | |
Cell 1 | 0.15 | 0.28 | 1.33 | … |
Cell 2 | 1.08 | 0.62 | 0.01 | … |
Cell 3 | 0.74 | 0.42 | 0.87 | … |
… | … | … | … | … |
topology
1
and 0
mean that corresponding domains are connected (i.e., adjacent each other in living a tissue or an organ) and unconnected, respectively. Note that this matrix must be symmetric and its diagonals must be filled by 0
since self-loops are not assumed in espresso
.Domain 1 | Domain 2 | Domain 3 | Domain 4 | |
Domain 1 | 0 | 1 | 1 | 0 |
Domain 2 | 1 | 0 | 1 | 1 |
Domain 3 | 1 | 1 | 0 | 1 |
Domain 4 | 0 | 1 | 1 | 0 |
asgmt
data.frame
with two columns which indicate cell samples and corresponding domains.sample | domain |
Cell 1 | Domain 1 |
Cell 2 | Domain 2 |
Cell 3 | Domain 1 |
Cell 4 | Domain 4 |
… | … |
espresso
contains example inputs of raw datasets for the above three inputs: “exprs_soysa19_e825.txt”, “topology_soysa19_e825.txt”, and “asgmt_soysa19_e825.txt”, which can be loaded by the following commands. The expression profile “exprs_soysa19_e825.txt” was created from GSE126128_E775E825_WTKO_10X.csv.gz by random selection of 10 single-cells from each domain i.e., 70 cell-samples in total.
# Get paths of input files
<- system.file("extdata", "exprs_soysa19_e825.txt", package = "espresso")
path_exprs <- system.file("extdata", "topology_soysa19_e825.txt", package = "espresso")
path_topology <- system.file("extdata", "asgmt_soysa19_e825.txt", package = "espresso")
path_asgmt
# Load data
<- t(read.table(path_exprs, header = TRUE, row.names = 1, sep = "\t",
exprs as.is = TRUE, check.names = FALSE))
<- as.matrix(read.table(path_topology, header = T, row.names = 1))
topology <- read.table(path_asgmt, header = T, as.is = TRUE)
asgmt
# Show the contents of the objects
1:5, 1:5]
exprs[#> 0610005C13Rik 0610007P08Rik 0610007P14Rik 0610007P22Rik
#> TGCCCATGTGCTGTAT-9 0 0.0000000 0.0000000 0.0000000
#> GCTGCGAAGGTGCAAC-10 0 0.0000000 0.5096020 0.8455493
#> CATCGAAAGTGTCCAT-10 0 0.0000000 0.0000000 0.4440263
#> GAACGGATCTGATTCT-10 0 0.3703834 0.6400520 0.3703834
#> ACTGCTCTCATTGCCC-10 0 0.0000000 0.6228278 0.8312940
#> 0610008F07Rik
#> TGCCCATGTGCTGTAT-9 0
#> GCTGCGAAGGTGCAAC-10 0
#> CATCGAAAGTGTCCAT-10 0
#> GAACGGATCTGATTCT-10 0
#> ACTGCTCTCATTGCCC-10 0
topology#> AHF OFT RV LV Atrial SV pSHF
#> AHF 0 1 0 0 0 0 1
#> OFT 1 0 1 0 0 0 0
#> RV 0 1 0 1 0 0 0
#> LV 0 0 1 0 1 0 0
#> Atrial 0 0 0 1 0 1 0
#> SV 0 0 0 0 1 0 1
#> pSHF 1 0 0 0 0 1 0
head(asgmt)
#> sample domain
#> 1 TGCCCATGTGCTGTAT-9 Atrial
#> 2 GCTGCGAAGGTGCAAC-10 Atrial
#> 3 CATCGAAAGTGTCCAT-10 Atrial
#> 4 GAACGGATCTGATTCT-10 Atrial
#> 5 ACTGCTCTCATTGCCC-10 Atrial
#> 6 AAGGAGCTCATACGGT-10 Atrial
Before executing GraphSOM clustering, espresso
object should be prepared. The variables exprs
, topology
, and asgmt
are inputs explained in Data format. initEsp
is a function to initialize and return an espresso
object.
<- initEsp(exprs, topology, asgmt) eobj
The espresso
object is defined as S4
class, and its slots can be accessed by @
symbol.
str(eobj)
#> Formal class 'Espresso' [package "espresso"] with 28 slots
#> ..@ exprs : num [1:70, 1:23749] 0 0 0 0 0 0 0 0 0 0 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:70] "TGCCCATGTGCTGTAT-9" "GCTGCGAAGGTGCAAC-10" "CATCGAAAGTGTCCAT-10" "GAACGGATCTGATTCT-10" ...
#> .. .. ..$ : chr [1:23749] "0610005C13Rik" "0610007P08Rik" "0610007P14Rik" "0610007P22Rik" ...
#> ..@ exprs_som : NULL
#> ..@ rept_som : NULL
#> ..@ topology : int [1:7, 1:7] 0 1 0 0 0 0 1 1 0 1 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:7] "AHF" "OFT" "RV" "LV" ...
#> .. .. ..$ : chr [1:7] "AHF" "OFT" "RV" "LV" ...
#> ..@ dist : NULL
#> ..@ map : NULL
#> ..@ map_method : NULL
#> ..@ radius : NULL
#> ..@ lsteps : NULL
#> ..@ rept : NULL
#> ..@ bmu : NULL
#> ..@ bmu_method : NULL
#> ..@ stochastic : NULL
#> ..@ rmin : NULL
#> ..@ rmax : NULL
#> ..@ domain2name:'data.frame': 7 obs. of 2 variables:
#> .. ..$ domain: int [1:7] 1 2 3 4 5 6 7
#> .. ..$ name : chr [1:7] "AHF" "OFT" "RV" "LV" ...
#> ..@ gset : chr [1:23749] "0610005C13Rik" "0610007P08Rik" "0610007P14Rik" "0610007P22Rik" ...
#> ..@ ssets : NULL
#> ..@ asgmt :'data.frame': 70 obs. of 2 variables:
#> .. ..$ sample: chr [1:70] "TGCCCATGTGCTGTAT-9" "GCTGCGAAGGTGCAAC-10" "CATCGAAAGTGTCCAT-10" "GAACGGATCTGATTCT-10" ...
#> .. ..$ domain: chr [1:70] "Atrial" "Atrial" "Atrial" "Atrial" ...
#> ..@ score : NULL
#> ..@ summary : NULL
#> ..@ mcmc : NULL
#> ..@ seed : NULL
#> ..@ coef : NULL
#> ..@ nmin : NULL
#> ..@ nmax : NULL
#> ..@ swap : NULL
#> ..@ umap : NULL
espresso
provides simple preprocessing functions for log-scaling of expression profile and filtering out of low expressed and low variable genes. If the input expression profile need not be preprocessed, this procedure can be ignored.
The input gene expression data can be scaled by log10 as preprocessing. The base can be changed by specifying base = "2"
or base = "e"
.
# Log-scaling of the expression profile
# In this example, log-scaling is unnecessary since the expression values is already log-scaled.
# eobj <- logScale(eobj)
The low expressed and low variable genes can be filtered out by the following command. As default setting, genes with expression level of higher than 1 in at least two cells and whose standard deviation of expression levels through all cells more than 0.05 can pass the filtering.
# Filter out genes with low expression and low variance
<- filterGenes(eobj)
eobj #> Filtering genes...
#> 19699 genes are filtered out (23749 -> 4050 genes).
The parameters can be changed by specifying the arguments ncell
, expressed
, and sd
, i.e., the default setting is ncell = 2
, expressed = 1.0
, and sd = 0.05
.
# Filter out genes with low expression and low variance
<- filterGenes(eobj, ncell = 2, expressed = 1.0, sd = 0.05) eobj
Before starting GraphSOM clustering, the parameters for GraphSOM clustering should be initialized.
<- initGraphSOM(eobj) eobj
In this case, all cell samples and genes are used for the clustering, so that it will take a long time to execute when the input gene expression profile is quite large.
Therefore, if downsampling is required, graphSOM
can be executed with subsets of cell samples selected uniformly random from each domain, and repeats the clustering with different sets. The nsamples
indicates the number of samples extracted from each domain, i.e., nsamples = 10
under a dataset with 4 domains means that 40 cell samples will be selected. If the total number of samples of a certain domain is less than the value specified nsamples
, all samples of the domain are selected. The rept
is the number of repeats for GraphSOM clustering with different sample sets.
In addition, the coef
is the coefficient of ARI value for score function. The GraphSOM results are evaluated by prediction accuracy, ARI, and a score function (= accuracy + coef
* ARI). The default is coef = 1.0
.
<- initGraphSOM(eobj, nsamples = 10, coef = 0.5, rept = 3) eobj
If the sample set after downsampling needs to be fixed, set a value to the parameter seed
.
Note
Even if the same random seed is given, the selected samples may differ depending on the version of R.
<- initGraphSOM(eobj, nsamples = 10, coef = 0.5, rept = 3, seed = 0) eobj
From the version 0.2.8, GraphSOM clustering introduces the cluster swap method by default. If the cluster swap method is not needed, set FALSE
to the option swap
of initGraphSOM
function.
<- initGraphSOM(eobj, nsamples = 10, coef = 0.5, rept = 3, swap = FALSE, seed = 0) eobj
After initializing GraphSOM setting, GraphSOM can be executed. The most simple way to execute GraphSOM clustering under the parameter setting given by initGraphSOM
is as follow.
<- graphSOM(eobj) eobj
As another option, graphSOM
can be executed with a specific gene set by setting it to the parameter gset
as a vector
.
<- graphSOM(eobj, gset = c("Actn2", "Hand2", "Vegfb", "Wnt5a"))
eobj #> GraphSOM clustering
#> Evaluation
#> Complete
In addition to these, graphSOM
has several parameters, so please refer to the manual for their details.
The graphSOM
clustering results can be shown by the following.
@summary
eobj#> $mean_score
#> [1] 0.7524326
#>
#> $var_score
#> [1] 0.0003480211
#>
#> $max_score
#> [1] 0.7730137
#>
#> $min_score
#> [1] 0.7366345
#>
#> $mean_acc
#> [1] 0.6433402
#>
#> $var_acc
#> [1] 0.0001070489
#>
#> $max_acc
#> [1] 0.6546584
#>
#> $min_acc
#> [1] 0.6343685
#>
#> $mean_ari
#> [1] 0.2181847
#>
#> $var_ari
#> [1] 0.0002766807
#>
#> $max_ari
#> [1] 0.2367107
#>
#> $min_ari
#> [1] 0.2045319
The results are mean values and mean variance of the score, the prediction accuracy, and ARI for the repeats specified in graphSOM
function. The result of each repeat can also be shown.
@score
eobj#> score acc ari
#> rept.1 0.7366345 0.6343685 0.2045319
#> rept.2 0.7730137 0.6546584 0.2367107
#> rept.3 0.7476495 0.6409938 0.2133114
The convergence curve can be shown. rept
indicates the repeat numbers to be plotted.
plotConvCurve(eobj, rept = 1:2)
The multiple results are plotted at the same time by the following example code.
par(mfrow = c(2,2))
plotConvCurve(eobj)
dev.off()
espresso
provides a function for feature gene selection selectFeatures
. This function is a rapper function of Boruta
function of Boruta
package, which is a feature gene selection method based on random forest. The usage of this function is very simple.
<- selectFeatures(eobj, maxRuns = 100)
fgenes
fgenes#> [1] "Mylk3" "Ncl" "Ptma" "Tnni1"
#> [5] "Actc1" "Ak1" "Apoe" "Cryab"
#> [9] "Csrp2" "Cst3" "Fbxl22" "Fbxo32"
#> [13] "Hspb2" "Myh7" "Myl3" "Myl4"
#> [17] "Myl7" "Pgam2" "Pkm" "Slc25a4"
#> [21] "Tagln" "Tmod1" "Tnni3" "Tpi1"
#> [25] "Tpm1" "Tuba1a" "Actn2" "Atp5c1"
#> [29] "Hcfc1r1" "Myl1" "Myl6" "Npm3"
#> [33] "Tgfb1i1" "Bnip3" "Crip1" "Gyg"
#> [37] "Hmgb1" "Dlk1" "Hspe1" "Ldha"
#> [41] "Pdlim1" "Pdlim5" "Ptpla" "Marcksl1"
#> [45] "Naca" "Rps7" "Atp5o" "Gm10075"
#> [49] "Apobec2" "Isl1" "Lta4h" "Actb"
#> [53] "Atp2a2" "Dcp1b" "Itga6" "Nr2f2"
#> [57] "Anxa2" "Cox6a2" "Rassf5" "Tdgf1"
#> [61] "Wbp5" "Suclg2" "Tkt" "Krt19"
#> [65] "Lrrc39" "Nop16" "Atp5g1" "Rpl7a"
#> [69] "Rplp2" "Cdh2" "Hnrnpab" "Fundc2"
#> [73] "4833424O15Rik" "Ndufs6"
Please note that the result of Boruta
depends on random seed. Thus, if the results need to be fixed, set a value to the parameter seed
.
<- selectFeatures(eobj, maxRuns = 100, seed = 0) fgenes
Boruta
classifies genes into the three categories, “Confirmed”, “Tentative”, and “Rejected”, according to their importance. This function returns a vector of “NOT Rejected” genes as features by default. If only “Confirmed” genes are required, set "c"
to the parameter decision
.
<- selectFeatures(eobj, maxRuns = 100, decision = "c", seed = 0) fgenes
In order to resolve genes left as “Tentative”, maxRuns
may be increased, where the parameter maxRuns
is the maximum number of importance source runs (refer the manual of Boruta
package). Empirically, it is good to determine the parameters so that around 100 genes are selected.
In order to execute graphSOM
with functional gene sets already defined, espresso
provides us with gene sets in GO database.
# Mouse gene sets (Ensembl release version 101)
data(mm_symbol_v101)
# Human gene sets (Ensembl release version 101)
data(hs_symbol_v101)
If another version, another species, or another gene identifier are required, the following command enable a user to create another gene set via R biomaRt
package. The parameters to be set to species
are gid
according to biomaRt
. Note that if the latest version is required, please do not specify version
in the function.
# Get GO information
<- getGO(species = 'hsapiens', gid = 'hgnc_symbol', version = '98')
hs_symbol_v98
# Save 'hs_symbol_v98' as an R data file.
save(hs_symbol_v98, file = "hs_symbol_v98")
# Load 'hs_symbol_v98'
load(hs_symbol_v98)
The GO information are stored as a list, and each element of list has two vectors, term
and genes
.
"GO:0003209"]][["term"]]
mm_symbol_v101[[#> [1] "cardiac atrium morphogenesis"
"GO:0003209"]][["genes"]]
mm_symbol_v101[[#> [1] "Acvr1" "Bmp10" "Bmpr2" "Ccm2l" "Ccn1" "Cfc1" "Dll4" "Eng"
#> [9] "Gata4" "Gja5" "Heg1" "Hey2" "Isl1" "Mesp1" "Myh6" "Nkx2-5"
#> [17] "Nog" "Notch1" "Notch2" "Nsd2" "Pitx2" "Prox1" "Shox2" "Smo"
#> [25] "Sos1" "Sox4" "Tbx20" "Tbx5" "Tgfb2" "Tnnt2" "Wnt2" "Wnt5a"
#> [33] "Zfpm1"
Thus, graphSOM
execution using GO genes can be done by
<- graphSOM(eobj, gset = mm_symbol_v101[["GO:0003209"]][["genes"]]) eobj
espresso
optimizes gene sets by replica exchange Markov chain Monte Carlo method (RX-MCMC). The parameters itr
, n_ex
, and n_repl
are the number of iterations of MCMC until each replica exchange, the number of times of exchanges, and the number of replicas, respectively. Thus, the total number of MCMC steps is itr * n_ex
. If the user’s machine has multi-core processors, the MCMC computations of each replica are parallelized.
<- initGraphSOM(eobj, nsamples = 10, rept = 1, coef = 0.5)
eobj <- rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4) eobj
Although the number of cores is automatically detected, it can be specified by n_cl
. In addition, as the default setting, MCMC optimizations are started with 3 genes randomly selected from the gene set specified by gset
. However it can be also modified by n_ig
. Note that the length of fgenes
must be larger than n_ig
.
<- rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4, n_cl = 4, n_ig = 3) eobj
The random seed can be fixed by the seed
option in the rxmcmc
function.
<- rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4, n_cl = 4, n_ig = 3, seed = 0) eobj
The results of RX-MCMC are stored in espresso
objects and they can be accessed by @mcmc
.
eobj@mcmc[["sampling"]]
: History of samplingeobj@mcmc[["max"]]
: History of the samples with the best scoreeobj@mcmc[["exchange"]]
: History of replica exchangeeobj@mcmc[["final_genes"]]
: The final genes for each replicaseobj@mcmc[["best_repl"]]
: Information of the best replicaseobj@mcmc[["best_genes"]]
: The best genes@mcmc[["sampling"]][["repl.1"]]
eobj#> mean_score var_score max_score min_score mean_acc var_acc max_acc
#> step.1 0.7761541 NA 0.7761541 0.7761541 0.6360248 NA 0.6360248
#> step.2 0.9046352 NA 0.9046352 0.9046352 0.7064182 NA 0.7064182
#> step.3 1.0212497 NA 1.0212497 1.0212497 0.7726708 NA 0.7726708
#> step.4 1.0891609 NA 1.0891609 1.0891609 0.8053830 NA 0.8053830
#> step.5 1.0895137 NA 1.0895137 1.0895137 0.8033126 NA 0.8033126
#> step.6 1.1095087 NA 1.1095087 1.1095087 0.8132505 NA 0.8132505
#> step.7 1.1616804 NA 1.1616804 1.1616804 0.8414079 NA 0.8414079
#> step.8 1.1616804 NA 1.1616804 1.1616804 0.8414079 NA 0.8414079
#> step.9 1.1663870 NA 1.1663870 1.1663870 0.8451346 NA 0.8451346
#> step.10 1.1682801 NA 1.1682801 1.1682801 0.8476190 NA 0.8476190
#> min_acc mean_ari var_ari max_ari min_ari gene_size
#> step.1 0.6360248 0.2802585 NA 0.2802585 0.2802585 3
#> step.2 0.7064182 0.3964340 NA 0.3964340 0.3964340 4
#> step.3 0.7726708 0.4971578 NA 0.4971578 0.4971578 5
#> step.4 0.8053830 0.5675557 NA 0.5675557 0.5675557 6
#> step.5 0.8033126 0.5724020 NA 0.5724020 0.5724020 7
#> step.6 0.8132505 0.5925164 NA 0.5925164 0.5925164 8
#> step.7 0.8414079 0.6405451 NA 0.6405451 0.6405451 9
#> step.8 0.8414079 0.6405451 NA 0.6405451 0.6405451 10
#> step.9 0.8451346 0.6425049 NA 0.6425049 0.6425049 11
#> step.10 0.8476190 0.6413221 NA 0.6413221 0.6413221 12
@mcmc[["max"]][["repl.1"]]
eobj#> mean_score var_score max_score min_score mean_acc var_acc max_acc
#> step.1 0.7761541 NA 0.7761541 0.7761541 0.6360248 NA 0.6360248
#> step.2 0.9046352 NA 0.9046352 0.9046352 0.7064182 NA 0.7064182
#> step.3 1.0212497 NA 1.0212497 1.0212497 0.7726708 NA 0.7726708
#> step.4 1.0891609 NA 1.0891609 1.0891609 0.8053830 NA 0.8053830
#> step.5 1.0895137 NA 1.0895137 1.0895137 0.8033126 NA 0.8033126
#> step.6 1.1095087 NA 1.1095087 1.1095087 0.8132505 NA 0.8132505
#> step.7 1.1616804 NA 1.1616804 1.1616804 0.8414079 NA 0.8414079
#> step.8 1.1616804 NA 1.1616804 1.1616804 0.8414079 NA 0.8414079
#> step.9 1.1663870 NA 1.1663870 1.1663870 0.8451346 NA 0.8451346
#> step.10 1.1682801 NA 1.1682801 1.1682801 0.8476190 NA 0.8476190
#> min_acc mean_ari var_ari max_ari min_ari gene_size
#> step.1 0.6360248 0.2802585 NA 0.2802585 0.2802585 3
#> step.2 0.7064182 0.3964340 NA 0.3964340 0.3964340 4
#> step.3 0.7726708 0.4971578 NA 0.4971578 0.4971578 5
#> step.4 0.8053830 0.5675557 NA 0.5675557 0.5675557 6
#> step.5 0.8033126 0.5724020 NA 0.5724020 0.5724020 7
#> step.6 0.8132505 0.5925164 NA 0.5925164 0.5925164 8
#> step.7 0.8414079 0.6405451 NA 0.6405451 0.6405451 9
#> step.8 0.8414079 0.6405451 NA 0.6405451 0.6405451 9
#> step.9 0.8451346 0.6425049 NA 0.6425049 0.6425049 11
#> step.10 0.8476190 0.6413221 NA 0.6413221 0.6413221 12
@mcmc[["exchange"]]
eobj#> repl.1 repl.2 repl.3 repl.4
#> step.1 1.000 0.100 0.01 0.001
#> step.2 1.000 0.010 0.10 0.001
#> step.3 1.000 0.010 0.10 0.001
#> step.4 0.100 0.001 1.00 0.010
#> step.5 0.100 0.001 1.00 0.010
#> step.6 0.010 0.001 1.00 0.100
#> step.7 0.010 0.001 1.00 0.100
#> step.8 0.001 0.010 0.10 1.000
#> step.9 0.001 0.010 0.10 1.000
#> step.10 0.001 0.100 0.01 1.000
@mcmc[["final_genes"]]
eobj#> $repl.1
#> [1] "Rassf5" "Myl3" "Atp5o" "Apoe" "Dlk1" "Marcksl1"
#> [7] "Fbxo32" "Atp5c1" "Ncl" "Itga6" "Tmod1" "Tdgf1"
#>
#> $repl.2
#> [1] "Wbp5" "Myl3" "Actn2" "Ncl" "Apoe" "Fbxo32" "Bnip3"
#> [8] "Slc25a4" "Atp5o" "Tdgf1" "Cst3"
#>
#> $repl.3
#> [1] "Crip1" "Hmgb1" "Ptma" "Apoe" "Tdgf1" "Bnip3" "Tpi1" "Fbxo32"
#> [9] "Pdlim1" "Rps7" "Naca" "Tagln"
#>
#> $repl.4
#> [1] "Anxa2" "Fbxl22" "Apoe" "Dlk1" "Myl1" "Tuba1a" "Csrp2" "Actn2"
@mcmc[["best_repl"]]
eobj#> repl step
#> 1 3 10
@mcmc[["best_genes"]]
eobj#> $repl.3
#> [1] "Crip1" "Hmgb1" "Ptma" "Apoe" "Tdgf1" "Bnip3" "Tpi1" "Fbxo32"
#> [9] "Pdlim1" "Rps7" "Naca" "Tagln"
The results of RX-MCMC can be also obtained as text files as follows, where dir
is a path to an output directory.
writeMCMC(eobj, dir = "output")
The histories of RX-MCMC optimization and replica exchanges can be plotted by plotMCMC
function.
plotMCMC(eobj)
In order to visualize the distance between domains and between cell samples before/after clustering for all replications, espresso
provides plotDistMap
. The usage is quite simple.
# size_d: The font and cell sizes for distance map of domains (default: NULL).
# size_s: The font and cell sizes for distance map of cell samples (default: 5).
plotDistMap(eobj, size_s = 5)
espresso
provides a function for dimension reduction method named Uniform Manifold Approximation and Projection (UMAP) to show the distribution of the cell samples in 2D or 3D space (the default space is 2D). To compute and plot UMAP, use runUMAP
and plotUMAP
functions, respectively. The UMAP result is stored @umap
.
<- runUMAP(eobj)
eobj head(eobj@umap)
#> [,1] [,2]
#> TGCCCATGTGCTGTAT-9 -0.0225598 -2.2024913
#> GCTGCGAAGGTGCAAC-10 0.4925257 2.6053568
#> CATCGAAAGTGTCCAT-10 0.4897042 -2.5706015
#> GAACGGATCTGATTCT-10 0.7458983 -2.0646923
#> ACTGCTCTCATTGCCC-10 -0.8827290 1.5327852
#> AAGGAGCTCATACGGT-10 2.3228327 -0.6908234
plotUMAP(eobj, file = 'umap_2D.png')
#> quartz_off_screen
#> 2
UMAP
This function imports uwot
packages, thus all parameters for UMAP computation can be given according to the original package. When using custom settings for UMAP, give the UMAP parameters umap_param
as a list to runUMAP
by the following way. For the details of the parameters, refer to uwot::umap
function. The feature genes can be given by gset
parameter while UMAP is performed with genes stored in @gset
if gset = NULL
.
<- runUMAP(eobj, gset = eobj@mcmc[["best_genes"]][[1]],
eobj umap_param = list("n_neighbors" = 15, "n_components" = 3, "n_epochs" = 1000))
head(eobj@umap)
#> [,1] [,2] [,3]
#> TGCCCATGTGCTGTAT-9 -0.6878206 -0.1425771 0.65135581
#> GCTGCGAAGGTGCAAC-10 -0.4250149 -0.2897149 0.69028134
#> CATCGAAAGTGTCCAT-10 -0.8812406 0.2021387 0.24999088
#> GAACGGATCTGATTCT-10 -0.7272825 0.4317129 -0.03455686
#> ACTGCTCTCATTGCCC-10 -0.6010484 -0.3180909 1.52406007
#> AAGGAGCTCATACGGT-10 -1.1745350 0.3119029 -0.00599524
# size: the size for plotted points.
# magnify: Multiplicative factor to apply to size of window when producing legend (default: 1).
plotUMAP(eobj, size = 1.5, magnify = 1, file = "umap_3D.png")
UMAP
Although the plotUMAP
imports rgl
package, scatterplot3d
package can be used instead of rgl
by rgl = FALSE
(default: TRUE
) when rgl
is not available.
plotUMAP(eobj, file = "umap_3D_scatterplot.png", rgl = FALSE)
#> quartz_off_screen
#> 2
UMAP
The export format is automatically detected from its extension (Supported: png, pdf, eps)
plotUMAP(eobj, file = 'umap.pdf')
The RGL window can be closed after plotting by rglclose = TRUE
.
plotUMAP(eobj, file = 'umap.pdf', rglclose = TRUE)
The file saving can be suppress by save = FALSE
. This is useful when you want to use png()
, pdf()
, and postscript()
functions directly in order to specify their parameters (e.g., width
, height
, and so on) manually.
pdf(file = 'umap.pdf', width = 6, height = 6)
plotUMAP(eobj, save = FALSE)
dev.off()
Note
When the exported format is ‘pdf’ or ‘eps’, and rgl = TRUE
, the legend information is not printed due to the design of rgl
package. In addition, it often takes much time to draw UMAP plot as a PDF file when the number of samples is large. In such case, please output as a PNG file.
Plotting colors can be changed by any colors according to the domains.
<- list(AHF = "orange",
domain_colors Atrial = "pink",
LV = "green",
OFT = "lightblue",
pSHF = "purple",
RV = "blue",
SV = "plum"
)plotUMAP(eobj, domcol = domain_colors, file = "umap_3D_color1.png", size = 1.5)
UMAP
Color coding in groups different from the domains is possible with the color_by_group
and grpcol
options. The color_by_group
must be a data.frame
of two columns with a header given by the following format, where the header must be ‘sample’ and ‘group’.
sample | group |
Cell 1 | Group 1 |
Cell 2 | Group 2 |
Cell 3 | Group 1 |
Cell 4 | Group 4 |
… | … |
The color of each group can be specified by grpcol
of the same format as domcol
. The following is an example.
<- system.file("extdata", "group_soysa19_e825.txt", package = "espresso")
path_group <- read.table(path_group, header = T, as.is = TRUE)
group head(group)
#> sample group
#> 1 TGCCCATGTGCTGTAT-9 FHF
#> 2 GCTGCGAAGGTGCAAC-10 FHF
#> 3 CATCGAAAGTGTCCAT-10 FHF
#> 4 GAACGGATCTGATTCT-10 FHF
#> 5 ACTGCTCTCATTGCCC-10 FHF
#> 6 AAGGAGCTCATACGGT-10 FHF
<- list(FHF = "orange", SHF = "blue")
group_colors plotUMAP(eobj, color_by_group = group, grpcol = group_colors, size = 1.5, file = "umap_3D_color2.png")
UMAP
In addition, the coloring of each sample can also be based on the expression levels of a gene rather than the domains by specifying a gene name to the argument color_by_gene
. In such case, the domcol
is ignored.
plotUMAP(eobj, color_by_gene = "Actn2", file = "umap_3D_color3.png", size = 1.5)
UMAP
If movie = TRUE
, a movie of the resulting UMAP is generated. This option is valid when rgl = TRUE
.
plotUMAP(eobj, movie = TRUE)
The above script uses all cell samples for UMAP plotting. If you want to plot UMAP with downsampled sets, the following script can work.
<- eobj
eobj.tmp <- eobj.tmp@ssets[["set.1"]]
samples @exprs <- eobj.tmp@exprs[match(samples, rownames(eobj.tmp@exprs)), , drop = FALSE]
eobj.tmp<- runUMAP(eobj.tmp, gset = eobj.tmp@mcmc[["best_genes"]][[1]],
eobj.tmp umap_param = list("n_neighbors" = 5, "n_components" = 3, "n_epochs" = 1000))
plotUMAP(eobj.tmp, movie = TRUE, domcol = domain_colors)
UMAP is even more effective for overviewing distribution of cell samples when the number of cell samples is large. The following figure is UMAP using GO:0003209 for 3,331 cell samples obtained from GSE126128_E775E825_WTKO_10X.csv.gz.
UMAP
graphSOM
and rxmcmc
of previous version (0)In espresso
of version 0, the reproducibility between the results of graphSOM
and rxmcmc
was not guaranteed, and a random number generation method different from version 1.0.0 or later was employed. The followings are examples of the way to run graphSOM
and rxmcmc
of version 0.
graphSOM(eobj, gset = fgenes, seed = 0, version = "0")
rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4, n_cl = 4, seed = 0, version = "0")
The gene description can be obtained via biomaRt
package as followings:
library(biomaRt)
<- c("Actn2", "Hand2", "Vegfb", "Wnt5a")
fgenes <- useEnsembl(biomart = 'ensembl', dataset = 'mmusculus_gene_ensembl')
ds <- getBM(attributes = c("mgi_symbol", "description"),
ann mart = ds, filters = "mgi_symbol", values = fgenes)
Note
If the version of R is 3.6 or later, the above script may not work depending on the version of biomaRt (e.g., >= 2.42.0). In that case, add the option useCache = False
to the getBM
function.
<- getBM(attributes = c("mgi_symbol", "description"),
ann mart = ds, filters = "mgi_symbol", values = fgenes, useCache = FALSE)
To cite your use of the espresso
software, please reference the following paper: