espresso is a package for domain estimation of cells only from gene expression profile based on graph-structured stochastic self-organizing map (GraphSOM) and Markov chain Monte Carlo (MCMC) method.
espressoespresso can be installed by the following command.
# Installation (X.X.X is a version number)
install.packages("espresso-X.X.X.tar.gz", repos = NULL, type = "source")If an error message “ERROR: dependencies ‘XXX’ are not available for package ‘espresso’” is displayed in the installation step, please try to install required packages, which are listed as follows.
CRAN packages
# Installation example of CRAN packages
install.packages(c("igraph", "mclust", "progress", "uwot", "rgl",
"rFerns", "Boruta", "doParallel", "foreach", "pheatmap",
"RColorBrewer", "scatterplot3d", "tagcloud", "plotrix"),
dependencies = TRUE)Bioconductor packages
# Installation example of Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("biomaRt","GO.db"))Note
If you try to install or load espresso on a machine that does not have X11, a warning message will be displayed because rgl package requires X11 to be installed. In such case, install X11, or ignore the warning if you do not need UMAP plots.
Altanative
espresso can be installed directly from Github repository, if devtools is installed.
library(devtools)
install_github("tmorikuicr/espresso")This vignette was generated by the following versions of R and packages.
version
#> _
#> platform x86_64-apple-darwin17.0
#> arch x86_64
#> os darwin17.0
#> system x86_64, darwin17.0
#> status
#> major 4
#> minor 0.3
#> year 2020
#> month 10
#> day 10
#> svn rev 79318
#> language R
#> version.string R version 4.0.3 (2020-10-10)
#> nickname Bunny-Wunnies Freak Out
packageVersion("igraph")
#> [1] '1.2.11'
packageVersion("mclust")
#> [1] '5.4.9'
packageVersion("progress")
#> [1] '1.2.2'
packageVersion("uwot")
#> [1] '0.1.11'
packageVersion("rgl")
#> [1] '0.108.3'
packageVersion("rFerns")
#> [1] '5.0.0'
packageVersion("Boruta")
#> [1] '7.0.0'
packageVersion("doParallel")
#> [1] '1.0.17'
packageVersion("foreach")
#> [1] '1.5.2'
packageVersion("pheatmap")
#> [1] '1.0.12'
packageVersion("RColorBrewer")
#> [1] '1.1.2'
packageVersion("scatterplot3d")
#> [1] '0.3.41'
packageVersion("tagcloud")
#> [1] '0.6'
packageVersion("plotrix")
#> [1] '3.8.2'
packageVersion("biomaRt")
#> [1] '2.46.3'
packageVersion("GO.db")
#> [1] '3.12.1'Note that some versions of R may not be consistent with the biomaRt and dplyr pacakages. In such case, please install the above versions or updates R and the packages.
espresso can be easily executed by a few functions. Here, the most simple code is shown in the following, assuming that a gene expression profile exprs, a domain topology topology, and a domain assignment asgmt are already defined. The details of these inputs are explained in Data format.
library(espresso)
# Initialize espresso object
eobj <- initEsp(exprs, topology, asgmt)
# GraphSOM clustering
eobj <- initGraphSOM(eobj)
eobj <- graphSOM(eobj)
# Show prediction scores
eobj@summaryIf you need more details of the espresso functions, refer to the following sections and help manuals by help function (e.g., help(espresso), help(graphSOM) etc.)
espresso requires three inputs, i.e., expression profile, domain topology, and domain assignment information for cell samples.
exprs| Gene 1 | Gene 2 | Gene 3 | … | |
| Cell 1 | 0.15 | 0.28 | 1.33 | … |
| Cell 2 | 1.08 | 0.62 | 0.01 | … |
| Cell 3 | 0.74 | 0.42 | 0.87 | … |
| … | … | … | … | … |
topology1 and 0 mean that corresponding domains are connected (i.e., adjacent each other in living a tissue or an organ) and unconnected, respectively. Note that this matrix must be symmetric and its diagonals must be filled by 0 since self-loops are not assumed in espresso.| Domain 1 | Domain 2 | Domain 3 | Domain 4 | |
| Domain 1 | 0 | 1 | 1 | 0 |
| Domain 2 | 1 | 0 | 1 | 1 |
| Domain 3 | 1 | 1 | 0 | 1 |
| Domain 4 | 0 | 1 | 1 | 0 |
asgmtdata.frame with two columns which indicate cell samples and corresponding domains.| sample | domain |
| Cell 1 | Domain 1 |
| Cell 2 | Domain 2 |
| Cell 3 | Domain 1 |
| Cell 4 | Domain 4 |
| … | … |
espresso contains example inputs of raw datasets for the above three inputs: “exprs_soysa19_e825.txt”, “topology_soysa19_e825.txt”, and “asgmt_soysa19_e825.txt”, which can be loaded by the following commands. The expression profile “exprs_soysa19_e825.txt” was created from GSE126128_E775E825_WTKO_10X.csv.gz by random selection of 10 single-cells from each domain i.e., 70 cell-samples in total.
# Get paths of input files
path_exprs <- system.file("extdata", "exprs_soysa19_e825.txt", package = "espresso")
path_topology <- system.file("extdata", "topology_soysa19_e825.txt", package = "espresso")
path_asgmt <- system.file("extdata", "asgmt_soysa19_e825.txt", package = "espresso")
# Load data
exprs <- t(read.table(path_exprs, header = TRUE, row.names = 1, sep = "\t",
as.is = TRUE, check.names = FALSE))
topology <- as.matrix(read.table(path_topology, header = T, row.names = 1))
asgmt <- read.table(path_asgmt, header = T, as.is = TRUE)
# Show the contents of the objects
exprs[1:5, 1:5]
#> 0610005C13Rik 0610007P08Rik 0610007P14Rik 0610007P22Rik
#> TGCCCATGTGCTGTAT-9 0 0.0000000 0.0000000 0.0000000
#> GCTGCGAAGGTGCAAC-10 0 0.0000000 0.5096020 0.8455493
#> CATCGAAAGTGTCCAT-10 0 0.0000000 0.0000000 0.4440263
#> GAACGGATCTGATTCT-10 0 0.3703834 0.6400520 0.3703834
#> ACTGCTCTCATTGCCC-10 0 0.0000000 0.6228278 0.8312940
#> 0610008F07Rik
#> TGCCCATGTGCTGTAT-9 0
#> GCTGCGAAGGTGCAAC-10 0
#> CATCGAAAGTGTCCAT-10 0
#> GAACGGATCTGATTCT-10 0
#> ACTGCTCTCATTGCCC-10 0
topology
#> AHF OFT RV LV Atrial SV pSHF
#> AHF 0 1 0 0 0 0 1
#> OFT 1 0 1 0 0 0 0
#> RV 0 1 0 1 0 0 0
#> LV 0 0 1 0 1 0 0
#> Atrial 0 0 0 1 0 1 0
#> SV 0 0 0 0 1 0 1
#> pSHF 1 0 0 0 0 1 0
head(asgmt)
#> sample domain
#> 1 TGCCCATGTGCTGTAT-9 Atrial
#> 2 GCTGCGAAGGTGCAAC-10 Atrial
#> 3 CATCGAAAGTGTCCAT-10 Atrial
#> 4 GAACGGATCTGATTCT-10 Atrial
#> 5 ACTGCTCTCATTGCCC-10 Atrial
#> 6 AAGGAGCTCATACGGT-10 AtrialBefore executing GraphSOM clustering, espresso object should be prepared. The variables exprs, topology, and asgmt are inputs explained in Data format. initEsp is a function to initialize and return an espresso object.
eobj <- initEsp(exprs, topology, asgmt)The espresso object is defined as S4 class, and its slots can be accessed by @ symbol.
str(eobj)
#> Formal class 'Espresso' [package "espresso"] with 28 slots
#> ..@ exprs : num [1:70, 1:23749] 0 0 0 0 0 0 0 0 0 0 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:70] "TGCCCATGTGCTGTAT-9" "GCTGCGAAGGTGCAAC-10" "CATCGAAAGTGTCCAT-10" "GAACGGATCTGATTCT-10" ...
#> .. .. ..$ : chr [1:23749] "0610005C13Rik" "0610007P08Rik" "0610007P14Rik" "0610007P22Rik" ...
#> ..@ exprs_som : NULL
#> ..@ rept_som : NULL
#> ..@ topology : int [1:7, 1:7] 0 1 0 0 0 0 1 1 0 1 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:7] "AHF" "OFT" "RV" "LV" ...
#> .. .. ..$ : chr [1:7] "AHF" "OFT" "RV" "LV" ...
#> ..@ dist : NULL
#> ..@ map : NULL
#> ..@ map_method : NULL
#> ..@ radius : NULL
#> ..@ lsteps : NULL
#> ..@ rept : NULL
#> ..@ bmu : NULL
#> ..@ bmu_method : NULL
#> ..@ stochastic : NULL
#> ..@ rmin : NULL
#> ..@ rmax : NULL
#> ..@ domain2name:'data.frame': 7 obs. of 2 variables:
#> .. ..$ domain: int [1:7] 1 2 3 4 5 6 7
#> .. ..$ name : chr [1:7] "AHF" "OFT" "RV" "LV" ...
#> ..@ gset : chr [1:23749] "0610005C13Rik" "0610007P08Rik" "0610007P14Rik" "0610007P22Rik" ...
#> ..@ ssets : NULL
#> ..@ asgmt :'data.frame': 70 obs. of 2 variables:
#> .. ..$ sample: chr [1:70] "TGCCCATGTGCTGTAT-9" "GCTGCGAAGGTGCAAC-10" "CATCGAAAGTGTCCAT-10" "GAACGGATCTGATTCT-10" ...
#> .. ..$ domain: chr [1:70] "Atrial" "Atrial" "Atrial" "Atrial" ...
#> ..@ score : NULL
#> ..@ summary : NULL
#> ..@ mcmc : NULL
#> ..@ seed : NULL
#> ..@ coef : NULL
#> ..@ nmin : NULL
#> ..@ nmax : NULL
#> ..@ swap : NULL
#> ..@ umap : NULLespresso provides simple preprocessing functions for log-scaling of expression profile and filtering out of low expressed and low variable genes. If the input expression profile need not be preprocessed, this procedure can be ignored.
The input gene expression data can be scaled by log10 as preprocessing. The base can be changed by specifying base = "2" or base = "e".
# Log-scaling of the expression profile
# In this example, log-scaling is unnecessary since the expression values is already log-scaled.
# eobj <- logScale(eobj)The low expressed and low variable genes can be filtered out by the following command. As default setting, genes with expression level of higher than 1 in at least two cells and whose standard deviation of expression levels through all cells more than 0.05 can pass the filtering.
# Filter out genes with low expression and low variance
eobj <- filterGenes(eobj)
#> Filtering genes...
#> 19699 genes are filtered out (23749 -> 4050 genes).The parameters can be changed by specifying the arguments ncell, expressed, and sd, i.e., the default setting is ncell = 2, expressed = 1.0, and sd = 0.05.
# Filter out genes with low expression and low variance
eobj <- filterGenes(eobj, ncell = 2, expressed = 1.0, sd = 0.05)Before starting GraphSOM clustering, the parameters for GraphSOM clustering should be initialized.
eobj <- initGraphSOM(eobj)In this case, all cell samples and genes are used for the clustering, so that it will take a long time to execute when the input gene expression profile is quite large.
Therefore, if downsampling is required, graphSOM can be executed with subsets of cell samples selected uniformly random from each domain, and repeats the clustering with different sets. The nsamples indicates the number of samples extracted from each domain, i.e., nsamples = 10 under a dataset with 4 domains means that 40 cell samples will be selected. If the total number of samples of a certain domain is less than the value specified nsamples, all samples of the domain are selected. The rept is the number of repeats for GraphSOM clustering with different sample sets.
In addition, the coef is the coefficient of ARI value for score function. The GraphSOM results are evaluated by prediction accuracy, ARI, and a score function (= accuracy + coef * ARI). The default is coef = 1.0.
eobj <- initGraphSOM(eobj, nsamples = 10, coef = 0.5, rept = 3)If the sample set after downsampling needs to be fixed, set a value to the parameter seed.
Note
Even if the same random seed is given, the selected samples may differ depending on the version of R.
eobj <- initGraphSOM(eobj, nsamples = 10, coef = 0.5, rept = 3, seed = 0)From the version 0.2.8, GraphSOM clustering introduces the cluster swap method by default. If the cluster swap method is not needed, set FALSE to the option swap of initGraphSOM function.
eobj <- initGraphSOM(eobj, nsamples = 10, coef = 0.5, rept = 3, swap = FALSE, seed = 0)After initializing GraphSOM setting, GraphSOM can be executed. The most simple way to execute GraphSOM clustering under the parameter setting given by initGraphSOM is as follow.
eobj <- graphSOM(eobj)As another option, graphSOM can be executed with a specific gene set by setting it to the parameter gset as a vector.
eobj <- graphSOM(eobj, gset = c("Actn2", "Hand2", "Vegfb", "Wnt5a"))
#> GraphSOM clustering
#> Evaluation
#> CompleteIn addition to these, graphSOM has several parameters, so please refer to the manual for their details.
The graphSOM clustering results can be shown by the following.
eobj@summary
#> $mean_score
#> [1] 0.7524326
#>
#> $var_score
#> [1] 0.0003480211
#>
#> $max_score
#> [1] 0.7730137
#>
#> $min_score
#> [1] 0.7366345
#>
#> $mean_acc
#> [1] 0.6433402
#>
#> $var_acc
#> [1] 0.0001070489
#>
#> $max_acc
#> [1] 0.6546584
#>
#> $min_acc
#> [1] 0.6343685
#>
#> $mean_ari
#> [1] 0.2181847
#>
#> $var_ari
#> [1] 0.0002766807
#>
#> $max_ari
#> [1] 0.2367107
#>
#> $min_ari
#> [1] 0.2045319The results are mean values and mean variance of the score, the prediction accuracy, and ARI for the repeats specified in graphSOM function. The result of each repeat can also be shown.
eobj@score
#> score acc ari
#> rept.1 0.7366345 0.6343685 0.2045319
#> rept.2 0.7730137 0.6546584 0.2367107
#> rept.3 0.7476495 0.6409938 0.2133114The convergence curve can be shown. rept indicates the repeat numbers to be plotted.
plotConvCurve(eobj, rept = 1:2)The multiple results are plotted at the same time by the following example code.
par(mfrow = c(2,2))
plotConvCurve(eobj)
dev.off()espresso provides a function for feature gene selection selectFeatures. This function is a rapper function of Boruta function of Boruta package, which is a feature gene selection method based on random forest. The usage of this function is very simple.
fgenes <- selectFeatures(eobj, maxRuns = 100)
fgenes
#> [1] "Mylk3" "Ncl" "Ptma" "Tnni1"
#> [5] "Actc1" "Ak1" "Apoe" "Cryab"
#> [9] "Csrp2" "Cst3" "Fbxl22" "Fbxo32"
#> [13] "Hspb2" "Myh7" "Myl3" "Myl4"
#> [17] "Myl7" "Pgam2" "Pkm" "Slc25a4"
#> [21] "Tagln" "Tmod1" "Tnni3" "Tpi1"
#> [25] "Tpm1" "Tuba1a" "Actn2" "Atp5c1"
#> [29] "Hcfc1r1" "Myl1" "Myl6" "Npm3"
#> [33] "Tgfb1i1" "Bnip3" "Crip1" "Gyg"
#> [37] "Hmgb1" "Dlk1" "Hspe1" "Ldha"
#> [41] "Pdlim1" "Pdlim5" "Ptpla" "Marcksl1"
#> [45] "Naca" "Rps7" "Atp5o" "Gm10075"
#> [49] "Apobec2" "Isl1" "Lta4h" "Actb"
#> [53] "Atp2a2" "Dcp1b" "Itga6" "Nr2f2"
#> [57] "Anxa2" "Cox6a2" "Rassf5" "Tdgf1"
#> [61] "Wbp5" "Suclg2" "Tkt" "Krt19"
#> [65] "Lrrc39" "Nop16" "Atp5g1" "Rpl7a"
#> [69] "Rplp2" "Cdh2" "Hnrnpab" "Fundc2"
#> [73] "4833424O15Rik" "Ndufs6"Please note that the result of Boruta depends on random seed. Thus, if the results need to be fixed, set a value to the parameter seed.
fgenes <- selectFeatures(eobj, maxRuns = 100, seed = 0)Boruta classifies genes into the three categories, “Confirmed”, “Tentative”, and “Rejected”, according to their importance. This function returns a vector of “NOT Rejected” genes as features by default. If only “Confirmed” genes are required, set "c" to the parameter decision.
fgenes <- selectFeatures(eobj, maxRuns = 100, decision = "c", seed = 0)In order to resolve genes left as “Tentative”, maxRuns may be increased, where the parameter maxRuns is the maximum number of importance source runs (refer the manual of Boruta package). Empirically, it is good to determine the parameters so that around 100 genes are selected.
In order to execute graphSOM with functional gene sets already defined, espresso provides us with gene sets in GO database.
# Mouse gene sets (Ensembl release version 101)
data(mm_symbol_v101)
# Human gene sets (Ensembl release version 101)
data(hs_symbol_v101)If another version, another species, or another gene identifier are required, the following command enable a user to create another gene set via R biomaRt package. The parameters to be set to species are gid according to biomaRt. Note that if the latest version is required, please do not specify version in the function.
# Get GO information
hs_symbol_v98 <- getGO(species = 'hsapiens', gid = 'hgnc_symbol', version = '98')
# Save 'hs_symbol_v98' as an R data file.
save(hs_symbol_v98, file = "hs_symbol_v98")
# Load 'hs_symbol_v98'
load(hs_symbol_v98)The GO information are stored as a list, and each element of list has two vectors, term and genes.
mm_symbol_v101[["GO:0003209"]][["term"]]
#> [1] "cardiac atrium morphogenesis"
mm_symbol_v101[["GO:0003209"]][["genes"]]
#> [1] "Acvr1" "Bmp10" "Bmpr2" "Ccm2l" "Ccn1" "Cfc1" "Dll4" "Eng"
#> [9] "Gata4" "Gja5" "Heg1" "Hey2" "Isl1" "Mesp1" "Myh6" "Nkx2-5"
#> [17] "Nog" "Notch1" "Notch2" "Nsd2" "Pitx2" "Prox1" "Shox2" "Smo"
#> [25] "Sos1" "Sox4" "Tbx20" "Tbx5" "Tgfb2" "Tnnt2" "Wnt2" "Wnt5a"
#> [33] "Zfpm1"Thus, graphSOM execution using GO genes can be done by
eobj <- graphSOM(eobj, gset = mm_symbol_v101[["GO:0003209"]][["genes"]])espresso optimizes gene sets by replica exchange Markov chain Monte Carlo method (RX-MCMC). The parameters itr, n_ex, and n_repl are the number of iterations of MCMC until each replica exchange, the number of times of exchanges, and the number of replicas, respectively. Thus, the total number of MCMC steps is itr * n_ex. If the user’s machine has multi-core processors, the MCMC computations of each replica are parallelized.
eobj <- initGraphSOM(eobj, nsamples = 10, rept = 1, coef = 0.5)
eobj <- rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4)Although the number of cores is automatically detected, it can be specified by n_cl. In addition, as the default setting, MCMC optimizations are started with 3 genes randomly selected from the gene set specified by gset. However it can be also modified by n_ig. Note that the length of fgenes must be larger than n_ig.
eobj <- rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4, n_cl = 4, n_ig = 3)The random seed can be fixed by the seed option in the rxmcmc function.
eobj <- rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4, n_cl = 4, n_ig = 3, seed = 0)The results of RX-MCMC are stored in espresso objects and they can be accessed by @mcmc.
eobj@mcmc[["sampling"]]: History of samplingeobj@mcmc[["max"]]: History of the samples with the best scoreeobj@mcmc[["exchange"]]: History of replica exchangeeobj@mcmc[["final_genes"]]: The final genes for each replicaseobj@mcmc[["best_repl"]]: Information of the best replicaseobj@mcmc[["best_genes"]]: The best geneseobj@mcmc[["sampling"]][["repl.1"]]
#> mean_score var_score max_score min_score mean_acc var_acc max_acc
#> step.1 0.7761541 NA 0.7761541 0.7761541 0.6360248 NA 0.6360248
#> step.2 0.9046352 NA 0.9046352 0.9046352 0.7064182 NA 0.7064182
#> step.3 1.0212497 NA 1.0212497 1.0212497 0.7726708 NA 0.7726708
#> step.4 1.0891609 NA 1.0891609 1.0891609 0.8053830 NA 0.8053830
#> step.5 1.0895137 NA 1.0895137 1.0895137 0.8033126 NA 0.8033126
#> step.6 1.1095087 NA 1.1095087 1.1095087 0.8132505 NA 0.8132505
#> step.7 1.1616804 NA 1.1616804 1.1616804 0.8414079 NA 0.8414079
#> step.8 1.1616804 NA 1.1616804 1.1616804 0.8414079 NA 0.8414079
#> step.9 1.1663870 NA 1.1663870 1.1663870 0.8451346 NA 0.8451346
#> step.10 1.1682801 NA 1.1682801 1.1682801 0.8476190 NA 0.8476190
#> min_acc mean_ari var_ari max_ari min_ari gene_size
#> step.1 0.6360248 0.2802585 NA 0.2802585 0.2802585 3
#> step.2 0.7064182 0.3964340 NA 0.3964340 0.3964340 4
#> step.3 0.7726708 0.4971578 NA 0.4971578 0.4971578 5
#> step.4 0.8053830 0.5675557 NA 0.5675557 0.5675557 6
#> step.5 0.8033126 0.5724020 NA 0.5724020 0.5724020 7
#> step.6 0.8132505 0.5925164 NA 0.5925164 0.5925164 8
#> step.7 0.8414079 0.6405451 NA 0.6405451 0.6405451 9
#> step.8 0.8414079 0.6405451 NA 0.6405451 0.6405451 10
#> step.9 0.8451346 0.6425049 NA 0.6425049 0.6425049 11
#> step.10 0.8476190 0.6413221 NA 0.6413221 0.6413221 12
eobj@mcmc[["max"]][["repl.1"]]
#> mean_score var_score max_score min_score mean_acc var_acc max_acc
#> step.1 0.7761541 NA 0.7761541 0.7761541 0.6360248 NA 0.6360248
#> step.2 0.9046352 NA 0.9046352 0.9046352 0.7064182 NA 0.7064182
#> step.3 1.0212497 NA 1.0212497 1.0212497 0.7726708 NA 0.7726708
#> step.4 1.0891609 NA 1.0891609 1.0891609 0.8053830 NA 0.8053830
#> step.5 1.0895137 NA 1.0895137 1.0895137 0.8033126 NA 0.8033126
#> step.6 1.1095087 NA 1.1095087 1.1095087 0.8132505 NA 0.8132505
#> step.7 1.1616804 NA 1.1616804 1.1616804 0.8414079 NA 0.8414079
#> step.8 1.1616804 NA 1.1616804 1.1616804 0.8414079 NA 0.8414079
#> step.9 1.1663870 NA 1.1663870 1.1663870 0.8451346 NA 0.8451346
#> step.10 1.1682801 NA 1.1682801 1.1682801 0.8476190 NA 0.8476190
#> min_acc mean_ari var_ari max_ari min_ari gene_size
#> step.1 0.6360248 0.2802585 NA 0.2802585 0.2802585 3
#> step.2 0.7064182 0.3964340 NA 0.3964340 0.3964340 4
#> step.3 0.7726708 0.4971578 NA 0.4971578 0.4971578 5
#> step.4 0.8053830 0.5675557 NA 0.5675557 0.5675557 6
#> step.5 0.8033126 0.5724020 NA 0.5724020 0.5724020 7
#> step.6 0.8132505 0.5925164 NA 0.5925164 0.5925164 8
#> step.7 0.8414079 0.6405451 NA 0.6405451 0.6405451 9
#> step.8 0.8414079 0.6405451 NA 0.6405451 0.6405451 9
#> step.9 0.8451346 0.6425049 NA 0.6425049 0.6425049 11
#> step.10 0.8476190 0.6413221 NA 0.6413221 0.6413221 12
eobj@mcmc[["exchange"]]
#> repl.1 repl.2 repl.3 repl.4
#> step.1 1.000 0.100 0.01 0.001
#> step.2 1.000 0.010 0.10 0.001
#> step.3 1.000 0.010 0.10 0.001
#> step.4 0.100 0.001 1.00 0.010
#> step.5 0.100 0.001 1.00 0.010
#> step.6 0.010 0.001 1.00 0.100
#> step.7 0.010 0.001 1.00 0.100
#> step.8 0.001 0.010 0.10 1.000
#> step.9 0.001 0.010 0.10 1.000
#> step.10 0.001 0.100 0.01 1.000
eobj@mcmc[["final_genes"]]
#> $repl.1
#> [1] "Rassf5" "Myl3" "Atp5o" "Apoe" "Dlk1" "Marcksl1"
#> [7] "Fbxo32" "Atp5c1" "Ncl" "Itga6" "Tmod1" "Tdgf1"
#>
#> $repl.2
#> [1] "Wbp5" "Myl3" "Actn2" "Ncl" "Apoe" "Fbxo32" "Bnip3"
#> [8] "Slc25a4" "Atp5o" "Tdgf1" "Cst3"
#>
#> $repl.3
#> [1] "Crip1" "Hmgb1" "Ptma" "Apoe" "Tdgf1" "Bnip3" "Tpi1" "Fbxo32"
#> [9] "Pdlim1" "Rps7" "Naca" "Tagln"
#>
#> $repl.4
#> [1] "Anxa2" "Fbxl22" "Apoe" "Dlk1" "Myl1" "Tuba1a" "Csrp2" "Actn2"
eobj@mcmc[["best_repl"]]
#> repl step
#> 1 3 10
eobj@mcmc[["best_genes"]]
#> $repl.3
#> [1] "Crip1" "Hmgb1" "Ptma" "Apoe" "Tdgf1" "Bnip3" "Tpi1" "Fbxo32"
#> [9] "Pdlim1" "Rps7" "Naca" "Tagln"The results of RX-MCMC can be also obtained as text files as follows, where dir is a path to an output directory.
writeMCMC(eobj, dir = "output")The histories of RX-MCMC optimization and replica exchanges can be plotted by plotMCMC function.
plotMCMC(eobj)In order to visualize the distance between domains and between cell samples before/after clustering for all replications, espresso provides plotDistMap. The usage is quite simple.
# size_d: The font and cell sizes for distance map of domains (default: NULL).
# size_s: The font and cell sizes for distance map of cell samples (default: 5).
plotDistMap(eobj, size_s = 5)espresso provides a function for dimension reduction method named Uniform Manifold Approximation and Projection (UMAP) to show the distribution of the cell samples in 2D or 3D space (the default space is 2D). To compute and plot UMAP, use runUMAP and plotUMAP functions, respectively. The UMAP result is stored @umap.
eobj <- runUMAP(eobj)
head(eobj@umap)
#> [,1] [,2]
#> TGCCCATGTGCTGTAT-9 -0.0225598 -2.2024913
#> GCTGCGAAGGTGCAAC-10 0.4925257 2.6053568
#> CATCGAAAGTGTCCAT-10 0.4897042 -2.5706015
#> GAACGGATCTGATTCT-10 0.7458983 -2.0646923
#> ACTGCTCTCATTGCCC-10 -0.8827290 1.5327852
#> AAGGAGCTCATACGGT-10 2.3228327 -0.6908234
plotUMAP(eobj, file = 'umap_2D.png')
#> quartz_off_screen
#> 2UMAP
This function imports uwot packages, thus all parameters for UMAP computation can be given according to the original package. When using custom settings for UMAP, give the UMAP parameters umap_param as a list to runUMAP by the following way. For the details of the parameters, refer to uwot::umap function. The feature genes can be given by gset parameter while UMAP is performed with genes stored in @gset if gset = NULL.
eobj <- runUMAP(eobj, gset = eobj@mcmc[["best_genes"]][[1]],
umap_param = list("n_neighbors" = 15, "n_components" = 3, "n_epochs" = 1000))
head(eobj@umap)
#> [,1] [,2] [,3]
#> TGCCCATGTGCTGTAT-9 -0.6878206 -0.1425771 0.65135581
#> GCTGCGAAGGTGCAAC-10 -0.4250149 -0.2897149 0.69028134
#> CATCGAAAGTGTCCAT-10 -0.8812406 0.2021387 0.24999088
#> GAACGGATCTGATTCT-10 -0.7272825 0.4317129 -0.03455686
#> ACTGCTCTCATTGCCC-10 -0.6010484 -0.3180909 1.52406007
#> AAGGAGCTCATACGGT-10 -1.1745350 0.3119029 -0.00599524
# size: the size for plotted points.
# magnify: Multiplicative factor to apply to size of window when producing legend (default: 1).
plotUMAP(eobj, size = 1.5, magnify = 1, file = "umap_3D.png") UMAP
Although the plotUMAP imports rgl package, scatterplot3d package can be used instead of rgl by rgl = FALSE (default: TRUE) when rgl is not available.
plotUMAP(eobj, file = "umap_3D_scatterplot.png", rgl = FALSE)
#> quartz_off_screen
#> 2UMAP
The export format is automatically detected from its extension (Supported: png, pdf, eps)
plotUMAP(eobj, file = 'umap.pdf')The RGL window can be closed after plotting by rglclose = TRUE.
plotUMAP(eobj, file = 'umap.pdf', rglclose = TRUE)The file saving can be suppress by save = FALSE. This is useful when you want to use png(), pdf(), and postscript() functions directly in order to specify their parameters (e.g., width, height, and so on) manually.
pdf(file = 'umap.pdf', width = 6, height = 6)
plotUMAP(eobj, save = FALSE)
dev.off()Note
When the exported format is ‘pdf’ or ‘eps’, and rgl = TRUE, the legend information is not printed due to the design of rgl package. In addition, it often takes much time to draw UMAP plot as a PDF file when the number of samples is large. In such case, please output as a PNG file.
Plotting colors can be changed by any colors according to the domains.
domain_colors <- list(AHF = "orange",
Atrial = "pink",
LV = "green",
OFT = "lightblue",
pSHF = "purple",
RV = "blue",
SV = "plum"
)
plotUMAP(eobj, domcol = domain_colors, file = "umap_3D_color1.png", size = 1.5)UMAP
Color coding in groups different from the domains is possible with the color_by_group and grpcol options. The color_by_group must be a data.frame of two columns with a header given by the following format, where the header must be ‘sample’ and ‘group’.
| sample | group |
| Cell 1 | Group 1 |
| Cell 2 | Group 2 |
| Cell 3 | Group 1 |
| Cell 4 | Group 4 |
| … | … |
The color of each group can be specified by grpcol of the same format as domcol. The following is an example.
path_group <- system.file("extdata", "group_soysa19_e825.txt", package = "espresso")
group <- read.table(path_group, header = T, as.is = TRUE)
head(group)
#> sample group
#> 1 TGCCCATGTGCTGTAT-9 FHF
#> 2 GCTGCGAAGGTGCAAC-10 FHF
#> 3 CATCGAAAGTGTCCAT-10 FHF
#> 4 GAACGGATCTGATTCT-10 FHF
#> 5 ACTGCTCTCATTGCCC-10 FHF
#> 6 AAGGAGCTCATACGGT-10 FHF
group_colors <- list(FHF = "orange", SHF = "blue")
plotUMAP(eobj, color_by_group = group, grpcol = group_colors, size = 1.5, file = "umap_3D_color2.png")UMAP
In addition, the coloring of each sample can also be based on the expression levels of a gene rather than the domains by specifying a gene name to the argument color_by_gene. In such case, the domcol is ignored.
plotUMAP(eobj, color_by_gene = "Actn2", file = "umap_3D_color3.png", size = 1.5)UMAP
If movie = TRUE, a movie of the resulting UMAP is generated. This option is valid when rgl = TRUE.
plotUMAP(eobj, movie = TRUE)The above script uses all cell samples for UMAP plotting. If you want to plot UMAP with downsampled sets, the following script can work.
eobj.tmp <- eobj
samples <- eobj.tmp@ssets[["set.1"]]
eobj.tmp@exprs <- eobj.tmp@exprs[match(samples, rownames(eobj.tmp@exprs)), , drop = FALSE]
eobj.tmp <- runUMAP(eobj.tmp, gset = eobj.tmp@mcmc[["best_genes"]][[1]],
umap_param = list("n_neighbors" = 5, "n_components" = 3, "n_epochs" = 1000))
plotUMAP(eobj.tmp, movie = TRUE, domcol = domain_colors)UMAP is even more effective for overviewing distribution of cell samples when the number of cell samples is large. The following figure is UMAP using GO:0003209 for 3,331 cell samples obtained from GSE126128_E775E825_WTKO_10X.csv.gz.
UMAP
graphSOM and rxmcmc of previous version (0)In espresso of version 0, the reproducibility between the results of graphSOM and rxmcmc was not guaranteed, and a random number generation method different from version 1.0.0 or later was employed. The followings are examples of the way to run graphSOM and rxmcmc of version 0.
graphSOM(eobj, gset = fgenes, seed = 0, version = "0")rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4, n_cl = 4, seed = 0, version = "0")The gene description can be obtained via biomaRt package as followings:
library(biomaRt)
fgenes <- c("Actn2", "Hand2", "Vegfb", "Wnt5a")
ds <- useEnsembl(biomart = 'ensembl', dataset = 'mmusculus_gene_ensembl')
ann <- getBM(attributes = c("mgi_symbol", "description"),
mart = ds, filters = "mgi_symbol", values = fgenes)Note
If the version of R is 3.6 or later, the above script may not work depending on the version of biomaRt (e.g., >= 2.42.0). In that case, add the option useCache = False to the getBM function.
ann <- getBM(attributes = c("mgi_symbol", "description"),
mart = ds, filters = "mgi_symbol", values = fgenes, useCache = FALSE)To cite your use of the espresso software, please reference the following paper: