espresso

31-Mar-2022: espresso 1.0.1 released.

espresso is a package for domain estimation of cells only from gene expression profile based on graph-structured stochastic self-organizing map (GraphSOM) and Markov chain Monte Carlo (MCMC) method.

Table of Contents


Installation

espresso can be installed by the following command.

# Installation (X.X.X is a version number)
install.packages("espresso-X.X.X.tar.gz", repos = NULL, type = "source")

If an error message “ERROR: dependencies ‘XXX’ are not available for package ‘espresso’” is displayed in the installation step, please try to install required packages, which are listed as follows.

CRAN packages

# Installation example of CRAN packages
install.packages(c("igraph", "mclust", "progress", "uwot", "rgl", 
                   "rFerns", "Boruta", "doParallel", "foreach", "pheatmap", 
                   "RColorBrewer", "scatterplot3d", "tagcloud", "plotrix"), 
                 dependencies = TRUE)

Bioconductor packages

# Installation example of Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("biomaRt","GO.db"))

Note

If you try to install or load espresso on a machine that does not have X11, a warning message will be displayed because rgl package requires X11 to be installed. In such case, install X11, or ignore the warning if you do not need UMAP plots.

Altanative

espresso can be installed directly from Github repository, if devtools is installed.

library(devtools)
install_github("tmorikuicr/espresso")


Confirmed version

This vignette was generated by the following versions of R and packages.

version
#>                _                           
#> platform       x86_64-apple-darwin17.0     
#> arch           x86_64                      
#> os             darwin17.0                  
#> system         x86_64, darwin17.0          
#> status                                     
#> major          4                           
#> minor          0.3                         
#> year           2020                        
#> month          10                          
#> day            10                          
#> svn rev        79318                       
#> language       R                           
#> version.string R version 4.0.3 (2020-10-10)
#> nickname       Bunny-Wunnies Freak Out
packageVersion("igraph")
#> [1] '1.2.11'
packageVersion("mclust")
#> [1] '5.4.9'
packageVersion("progress")
#> [1] '1.2.2'
packageVersion("uwot")
#> [1] '0.1.11'
packageVersion("rgl")
#> [1] '0.108.3'
packageVersion("rFerns")
#> [1] '5.0.0'
packageVersion("Boruta")
#> [1] '7.0.0'
packageVersion("doParallel")
#> [1] '1.0.17'
packageVersion("foreach")
#> [1] '1.5.2'
packageVersion("pheatmap")
#> [1] '1.0.12'
packageVersion("RColorBrewer")
#> [1] '1.1.2'
packageVersion("scatterplot3d")
#> [1] '0.3.41'
packageVersion("tagcloud")
#> [1] '0.6'
packageVersion("plotrix")
#> [1] '3.8.2'
packageVersion("biomaRt")
#> [1] '2.46.3'
packageVersion("GO.db")
#> [1] '3.12.1'

Note that some versions of R may not be consistent with the biomaRt and dplyr pacakages. In such case, please install the above versions or updates R and the packages.


Quick example

espresso can be easily executed by a few functions. Here, the most simple code is shown in the following, assuming that a gene expression profile exprs, a domain topology topology, and a domain assignment asgmt are already defined. The details of these inputs are explained in Data format.

library(espresso)

# Initialize espresso object
eobj <- initEsp(exprs, topology, asgmt)

# GraphSOM clustering
eobj <- initGraphSOM(eobj)
eobj <- graphSOM(eobj)

# Show prediction scores
eobj@summary

If you need more details of the espresso functions, refer to the following sections and help manuals by help function (e.g., help(espresso), help(graphSOM) etc.)


Usage

Data format

espresso requires three inputs, i.e., expression profile, domain topology, and domain assignment information for cell samples.

Gene 1 Gene 2 Gene 3
Cell 1 0.15 0.28 1.33
Cell 2 1.08 0.62 0.01
Cell 3 0.74 0.42 0.87
Domain 1 Domain 2 Domain 3 Domain 4
Domain 1 0 1 1 0
Domain 2 1 0 1 1
Domain 3 1 1 0 1
Domain 4 0 1 1 0
sample domain
Cell 1 Domain 1
Cell 2 Domain 2
Cell 3 Domain 1
Cell 4 Domain 4

espresso contains example inputs of raw datasets for the above three inputs: “exprs_soysa19_e825.txt”, “topology_soysa19_e825.txt”, and “asgmt_soysa19_e825.txt”, which can be loaded by the following commands. The expression profile “exprs_soysa19_e825.txt” was created from GSE126128_E775E825_WTKO_10X.csv.gz by random selection of 10 single-cells from each domain i.e., 70 cell-samples in total.

# Get paths of input files
path_exprs <- system.file("extdata", "exprs_soysa19_e825.txt", package = "espresso")
path_topology <- system.file("extdata", "topology_soysa19_e825.txt", package = "espresso")
path_asgmt <- system.file("extdata", "asgmt_soysa19_e825.txt", package = "espresso")

# Load data
exprs <- t(read.table(path_exprs, header = TRUE, row.names = 1, sep = "\t", 
                      as.is = TRUE, check.names = FALSE))
topology <- as.matrix(read.table(path_topology, header = T, row.names = 1))
asgmt <- read.table(path_asgmt, header = T, as.is = TRUE)

# Show the contents of the objects
exprs[1:5, 1:5]
#>                     0610005C13Rik 0610007P08Rik 0610007P14Rik 0610007P22Rik
#> TGCCCATGTGCTGTAT-9              0     0.0000000     0.0000000     0.0000000
#> GCTGCGAAGGTGCAAC-10             0     0.0000000     0.5096020     0.8455493
#> CATCGAAAGTGTCCAT-10             0     0.0000000     0.0000000     0.4440263
#> GAACGGATCTGATTCT-10             0     0.3703834     0.6400520     0.3703834
#> ACTGCTCTCATTGCCC-10             0     0.0000000     0.6228278     0.8312940
#>                     0610008F07Rik
#> TGCCCATGTGCTGTAT-9              0
#> GCTGCGAAGGTGCAAC-10             0
#> CATCGAAAGTGTCCAT-10             0
#> GAACGGATCTGATTCT-10             0
#> ACTGCTCTCATTGCCC-10             0
topology
#>        AHF OFT RV LV Atrial SV pSHF
#> AHF      0   1  0  0      0  0    1
#> OFT      1   0  1  0      0  0    0
#> RV       0   1  0  1      0  0    0
#> LV       0   0  1  0      1  0    0
#> Atrial   0   0  0  1      0  1    0
#> SV       0   0  0  0      1  0    1
#> pSHF     1   0  0  0      0  1    0
head(asgmt)
#>                sample domain
#> 1  TGCCCATGTGCTGTAT-9 Atrial
#> 2 GCTGCGAAGGTGCAAC-10 Atrial
#> 3 CATCGAAAGTGTCCAT-10 Atrial
#> 4 GAACGGATCTGATTCT-10 Atrial
#> 5 ACTGCTCTCATTGCCC-10 Atrial
#> 6 AAGGAGCTCATACGGT-10 Atrial

Initialization of eSPRESSO object

Before executing GraphSOM clustering, espresso object should be prepared. The variables exprs, topology, and asgmt are inputs explained in Data format. initEsp is a function to initialize and return an espresso object.

eobj <- initEsp(exprs, topology, asgmt)

The espresso object is defined as S4 class, and its slots can be accessed by @ symbol.

str(eobj)
#> Formal class 'Espresso' [package "espresso"] with 28 slots
#>   ..@ exprs      : num [1:70, 1:23749] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:70] "TGCCCATGTGCTGTAT-9" "GCTGCGAAGGTGCAAC-10" "CATCGAAAGTGTCCAT-10" "GAACGGATCTGATTCT-10" ...
#>   .. .. ..$ : chr [1:23749] "0610005C13Rik" "0610007P08Rik" "0610007P14Rik" "0610007P22Rik" ...
#>   ..@ exprs_som  : NULL
#>   ..@ rept_som   : NULL
#>   ..@ topology   : int [1:7, 1:7] 0 1 0 0 0 0 1 1 0 1 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:7] "AHF" "OFT" "RV" "LV" ...
#>   .. .. ..$ : chr [1:7] "AHF" "OFT" "RV" "LV" ...
#>   ..@ dist       : NULL
#>   ..@ map        : NULL
#>   ..@ map_method : NULL
#>   ..@ radius     : NULL
#>   ..@ lsteps     : NULL
#>   ..@ rept       : NULL
#>   ..@ bmu        : NULL
#>   ..@ bmu_method : NULL
#>   ..@ stochastic : NULL
#>   ..@ rmin       : NULL
#>   ..@ rmax       : NULL
#>   ..@ domain2name:'data.frame':  7 obs. of  2 variables:
#>   .. ..$ domain: int [1:7] 1 2 3 4 5 6 7
#>   .. ..$ name  : chr [1:7] "AHF" "OFT" "RV" "LV" ...
#>   ..@ gset       : chr [1:23749] "0610005C13Rik" "0610007P08Rik" "0610007P14Rik" "0610007P22Rik" ...
#>   ..@ ssets      : NULL
#>   ..@ asgmt      :'data.frame':  70 obs. of  2 variables:
#>   .. ..$ sample: chr [1:70] "TGCCCATGTGCTGTAT-9" "GCTGCGAAGGTGCAAC-10" "CATCGAAAGTGTCCAT-10" "GAACGGATCTGATTCT-10" ...
#>   .. ..$ domain: chr [1:70] "Atrial" "Atrial" "Atrial" "Atrial" ...
#>   ..@ score      : NULL
#>   ..@ summary    : NULL
#>   ..@ mcmc       : NULL
#>   ..@ seed       : NULL
#>   ..@ coef       : NULL
#>   ..@ nmin       : NULL
#>   ..@ nmax       : NULL
#>   ..@ swap       : NULL
#>   ..@ umap       : NULL

Preprocessing (Optional)

espresso provides simple preprocessing functions for log-scaling of expression profile and filtering out of low expressed and low variable genes. If the input expression profile need not be preprocessed, this procedure can be ignored.

Log-scaling

The input gene expression data can be scaled by log10 as preprocessing. The base can be changed by specifying base = "2" or base = "e".

# Log-scaling of the expression profile
# In this example, log-scaling is unnecessary since the expression values is already log-scaled.
# eobj <- logScale(eobj)

Gene filtering

The low expressed and low variable genes can be filtered out by the following command. As default setting, genes with expression level of higher than 1 in at least two cells and whose standard deviation of expression levels through all cells more than 0.05 can pass the filtering.

# Filter out genes with low expression and low variance
eobj <- filterGenes(eobj)
#> Filtering genes...
#> 19699 genes are filtered out (23749 -> 4050 genes).

The parameters can be changed by specifying the arguments ncell, expressed, and sd, i.e., the default setting is ncell = 2, expressed = 1.0, and sd = 0.05.

# Filter out genes with low expression and low variance
eobj <- filterGenes(eobj, ncell = 2, expressed = 1.0, sd = 0.05)

GraphSOM clustering

Before starting GraphSOM clustering, the parameters for GraphSOM clustering should be initialized.

eobj <- initGraphSOM(eobj)

In this case, all cell samples and genes are used for the clustering, so that it will take a long time to execute when the input gene expression profile is quite large.

Therefore, if downsampling is required, graphSOM can be executed with subsets of cell samples selected uniformly random from each domain, and repeats the clustering with different sets. The nsamples indicates the number of samples extracted from each domain, i.e., nsamples = 10 under a dataset with 4 domains means that 40 cell samples will be selected. If the total number of samples of a certain domain is less than the value specified nsamples, all samples of the domain are selected. The rept is the number of repeats for GraphSOM clustering with different sample sets.

In addition, the coef is the coefficient of ARI value for score function. The GraphSOM results are evaluated by prediction accuracy, ARI, and a score function (= accuracy + coef * ARI). The default is coef = 1.0.

eobj <- initGraphSOM(eobj, nsamples = 10, coef = 0.5, rept = 3)

If the sample set after downsampling needs to be fixed, set a value to the parameter seed.

Note

Even if the same random seed is given, the selected samples may differ depending on the version of R.

eobj <- initGraphSOM(eobj, nsamples = 10, coef = 0.5, rept = 3, seed = 0)

From the version 0.2.8, GraphSOM clustering introduces the cluster swap method by default. If the cluster swap method is not needed, set FALSE to the option swap of initGraphSOM function.

eobj <- initGraphSOM(eobj, nsamples = 10, coef = 0.5, rept = 3, swap = FALSE, seed = 0)

After initializing GraphSOM setting, GraphSOM can be executed. The most simple way to execute GraphSOM clustering under the parameter setting given by initGraphSOM is as follow.

eobj <- graphSOM(eobj)

As another option, graphSOM can be executed with a specific gene set by setting it to the parameter gset as a vector.

eobj <- graphSOM(eobj, gset = c("Actn2", "Hand2", "Vegfb", "Wnt5a"))
#> GraphSOM clustering
#> Evaluation
#> Complete

In addition to these, graphSOM has several parameters, so please refer to the manual for their details.

The graphSOM clustering results can be shown by the following.

eobj@summary
#> $mean_score
#> [1] 0.7524326
#> 
#> $var_score
#> [1] 0.0003480211
#> 
#> $max_score
#> [1] 0.7730137
#> 
#> $min_score
#> [1] 0.7366345
#> 
#> $mean_acc
#> [1] 0.6433402
#> 
#> $var_acc
#> [1] 0.0001070489
#> 
#> $max_acc
#> [1] 0.6546584
#> 
#> $min_acc
#> [1] 0.6343685
#> 
#> $mean_ari
#> [1] 0.2181847
#> 
#> $var_ari
#> [1] 0.0002766807
#> 
#> $max_ari
#> [1] 0.2367107
#> 
#> $min_ari
#> [1] 0.2045319

The results are mean values and mean variance of the score, the prediction accuracy, and ARI for the repeats specified in graphSOM function. The result of each repeat can also be shown.

eobj@score
#>            score       acc       ari
#> rept.1 0.7366345 0.6343685 0.2045319
#> rept.2 0.7730137 0.6546584 0.2367107
#> rept.3 0.7476495 0.6409938 0.2133114

The convergence curve can be shown. rept indicates the repeat numbers to be plotted.

plotConvCurve(eobj, rept = 1:2)

The multiple results are plotted at the same time by the following example code.

par(mfrow = c(2,2))
plotConvCurve(eobj)
dev.off()

Feature gene selection based on random forest

espresso provides a function for feature gene selection selectFeatures. This function is a rapper function of Boruta function of Boruta package, which is a feature gene selection method based on random forest. The usage of this function is very simple.

fgenes <- selectFeatures(eobj, maxRuns = 100)

fgenes
#>  [1] "Mylk3"         "Ncl"           "Ptma"          "Tnni1"        
#>  [5] "Actc1"         "Ak1"           "Apoe"          "Cryab"        
#>  [9] "Csrp2"         "Cst3"          "Fbxl22"        "Fbxo32"       
#> [13] "Hspb2"         "Myh7"          "Myl3"          "Myl4"         
#> [17] "Myl7"          "Pgam2"         "Pkm"           "Slc25a4"      
#> [21] "Tagln"         "Tmod1"         "Tnni3"         "Tpi1"         
#> [25] "Tpm1"          "Tuba1a"        "Actn2"         "Atp5c1"       
#> [29] "Hcfc1r1"       "Myl1"          "Myl6"          "Npm3"         
#> [33] "Tgfb1i1"       "Bnip3"         "Crip1"         "Gyg"          
#> [37] "Hmgb1"         "Dlk1"          "Hspe1"         "Ldha"         
#> [41] "Pdlim1"        "Pdlim5"        "Ptpla"         "Marcksl1"     
#> [45] "Naca"          "Rps7"          "Atp5o"         "Gm10075"      
#> [49] "Apobec2"       "Isl1"          "Lta4h"         "Actb"         
#> [53] "Atp2a2"        "Dcp1b"         "Itga6"         "Nr2f2"        
#> [57] "Anxa2"         "Cox6a2"        "Rassf5"        "Tdgf1"        
#> [61] "Wbp5"          "Suclg2"        "Tkt"           "Krt19"        
#> [65] "Lrrc39"        "Nop16"         "Atp5g1"        "Rpl7a"        
#> [69] "Rplp2"         "Cdh2"          "Hnrnpab"       "Fundc2"       
#> [73] "4833424O15Rik" "Ndufs6"

Please note that the result of Boruta depends on random seed. Thus, if the results need to be fixed, set a value to the parameter seed.

fgenes <- selectFeatures(eobj, maxRuns = 100, seed = 0)

Boruta classifies genes into the three categories, “Confirmed”, “Tentative”, and “Rejected”, according to their importance. This function returns a vector of “NOT Rejected” genes as features by default. If only “Confirmed” genes are required, set "c" to the parameter decision.

fgenes <- selectFeatures(eobj, maxRuns = 100, decision = "c", seed = 0)

In order to resolve genes left as “Tentative”, maxRuns may be increased, where the parameter maxRuns is the maximum number of importance source runs (refer the manual of Boruta package). Empirically, it is good to determine the parameters so that around 100 genes are selected.

Usage of Gene Ontology (GO) information (optional)

In order to execute graphSOM with functional gene sets already defined, espresso provides us with gene sets in GO database.

# Mouse gene sets (Ensembl release version 101)
data(mm_symbol_v101)

# Human gene sets (Ensembl release version 101)
data(hs_symbol_v101)

If another version, another species, or another gene identifier are required, the following command enable a user to create another gene set via R biomaRt package. The parameters to be set to species are gid according to biomaRt. Note that if the latest version is required, please do not specify version in the function.

# Get GO information
hs_symbol_v98 <- getGO(species = 'hsapiens', gid = 'hgnc_symbol', version = '98')

# Save 'hs_symbol_v98' as an R data file.
save(hs_symbol_v98, file = "hs_symbol_v98")

# Load 'hs_symbol_v98'
load(hs_symbol_v98)

The GO information are stored as a list, and each element of list has two vectors, term and genes.

mm_symbol_v101[["GO:0003209"]][["term"]]
#> [1] "cardiac atrium morphogenesis"

mm_symbol_v101[["GO:0003209"]][["genes"]]
#>  [1] "Acvr1"  "Bmp10"  "Bmpr2"  "Ccm2l"  "Ccn1"   "Cfc1"   "Dll4"   "Eng"   
#>  [9] "Gata4"  "Gja5"   "Heg1"   "Hey2"   "Isl1"   "Mesp1"  "Myh6"   "Nkx2-5"
#> [17] "Nog"    "Notch1" "Notch2" "Nsd2"   "Pitx2"  "Prox1"  "Shox2"  "Smo"   
#> [25] "Sos1"   "Sox4"   "Tbx20"  "Tbx5"   "Tgfb2"  "Tnnt2"  "Wnt2"   "Wnt5a" 
#> [33] "Zfpm1"

Thus, graphSOM execution using GO genes can be done by

eobj <- graphSOM(eobj, gset = mm_symbol_v101[["GO:0003209"]][["genes"]])

Gene set optimization by replica exchange Markov chain Monte Carlo method (RX-MCMC)

espresso optimizes gene sets by replica exchange Markov chain Monte Carlo method (RX-MCMC). The parameters itr, n_ex, and n_repl are the number of iterations of MCMC until each replica exchange, the number of times of exchanges, and the number of replicas, respectively. Thus, the total number of MCMC steps is itr * n_ex. If the user’s machine has multi-core processors, the MCMC computations of each replica are parallelized.

eobj <- initGraphSOM(eobj, nsamples = 10, rept = 1, coef = 0.5)
eobj <- rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4)

Although the number of cores is automatically detected, it can be specified by n_cl. In addition, as the default setting, MCMC optimizations are started with 3 genes randomly selected from the gene set specified by gset. However it can be also modified by n_ig. Note that the length of fgenes must be larger than n_ig.

eobj <- rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4, n_cl = 4, n_ig = 3)

The random seed can be fixed by the seed option in the rxmcmc function.

eobj <- rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4, n_cl = 4, n_ig = 3, seed = 0)

The results of RX-MCMC are stored in espresso objects and they can be accessed by @mcmc.

eobj@mcmc[["sampling"]][["repl.1"]]
#>         mean_score var_score max_score min_score  mean_acc var_acc   max_acc
#> step.1   0.7761541        NA 0.7761541 0.7761541 0.6360248      NA 0.6360248
#> step.2   0.9046352        NA 0.9046352 0.9046352 0.7064182      NA 0.7064182
#> step.3   1.0212497        NA 1.0212497 1.0212497 0.7726708      NA 0.7726708
#> step.4   1.0891609        NA 1.0891609 1.0891609 0.8053830      NA 0.8053830
#> step.5   1.0895137        NA 1.0895137 1.0895137 0.8033126      NA 0.8033126
#> step.6   1.1095087        NA 1.1095087 1.1095087 0.8132505      NA 0.8132505
#> step.7   1.1616804        NA 1.1616804 1.1616804 0.8414079      NA 0.8414079
#> step.8   1.1616804        NA 1.1616804 1.1616804 0.8414079      NA 0.8414079
#> step.9   1.1663870        NA 1.1663870 1.1663870 0.8451346      NA 0.8451346
#> step.10  1.1682801        NA 1.1682801 1.1682801 0.8476190      NA 0.8476190
#>           min_acc  mean_ari var_ari   max_ari   min_ari gene_size
#> step.1  0.6360248 0.2802585      NA 0.2802585 0.2802585         3
#> step.2  0.7064182 0.3964340      NA 0.3964340 0.3964340         4
#> step.3  0.7726708 0.4971578      NA 0.4971578 0.4971578         5
#> step.4  0.8053830 0.5675557      NA 0.5675557 0.5675557         6
#> step.5  0.8033126 0.5724020      NA 0.5724020 0.5724020         7
#> step.6  0.8132505 0.5925164      NA 0.5925164 0.5925164         8
#> step.7  0.8414079 0.6405451      NA 0.6405451 0.6405451         9
#> step.8  0.8414079 0.6405451      NA 0.6405451 0.6405451        10
#> step.9  0.8451346 0.6425049      NA 0.6425049 0.6425049        11
#> step.10 0.8476190 0.6413221      NA 0.6413221 0.6413221        12
eobj@mcmc[["max"]][["repl.1"]]
#>         mean_score var_score max_score min_score  mean_acc var_acc   max_acc
#> step.1   0.7761541        NA 0.7761541 0.7761541 0.6360248      NA 0.6360248
#> step.2   0.9046352        NA 0.9046352 0.9046352 0.7064182      NA 0.7064182
#> step.3   1.0212497        NA 1.0212497 1.0212497 0.7726708      NA 0.7726708
#> step.4   1.0891609        NA 1.0891609 1.0891609 0.8053830      NA 0.8053830
#> step.5   1.0895137        NA 1.0895137 1.0895137 0.8033126      NA 0.8033126
#> step.6   1.1095087        NA 1.1095087 1.1095087 0.8132505      NA 0.8132505
#> step.7   1.1616804        NA 1.1616804 1.1616804 0.8414079      NA 0.8414079
#> step.8   1.1616804        NA 1.1616804 1.1616804 0.8414079      NA 0.8414079
#> step.9   1.1663870        NA 1.1663870 1.1663870 0.8451346      NA 0.8451346
#> step.10  1.1682801        NA 1.1682801 1.1682801 0.8476190      NA 0.8476190
#>           min_acc  mean_ari var_ari   max_ari   min_ari gene_size
#> step.1  0.6360248 0.2802585      NA 0.2802585 0.2802585         3
#> step.2  0.7064182 0.3964340      NA 0.3964340 0.3964340         4
#> step.3  0.7726708 0.4971578      NA 0.4971578 0.4971578         5
#> step.4  0.8053830 0.5675557      NA 0.5675557 0.5675557         6
#> step.5  0.8033126 0.5724020      NA 0.5724020 0.5724020         7
#> step.6  0.8132505 0.5925164      NA 0.5925164 0.5925164         8
#> step.7  0.8414079 0.6405451      NA 0.6405451 0.6405451         9
#> step.8  0.8414079 0.6405451      NA 0.6405451 0.6405451         9
#> step.9  0.8451346 0.6425049      NA 0.6425049 0.6425049        11
#> step.10 0.8476190 0.6413221      NA 0.6413221 0.6413221        12
eobj@mcmc[["exchange"]]
#>         repl.1 repl.2 repl.3 repl.4
#> step.1   1.000  0.100   0.01  0.001
#> step.2   1.000  0.010   0.10  0.001
#> step.3   1.000  0.010   0.10  0.001
#> step.4   0.100  0.001   1.00  0.010
#> step.5   0.100  0.001   1.00  0.010
#> step.6   0.010  0.001   1.00  0.100
#> step.7   0.010  0.001   1.00  0.100
#> step.8   0.001  0.010   0.10  1.000
#> step.9   0.001  0.010   0.10  1.000
#> step.10  0.001  0.100   0.01  1.000
eobj@mcmc[["final_genes"]]
#> $repl.1
#>  [1] "Rassf5"   "Myl3"     "Atp5o"    "Apoe"     "Dlk1"     "Marcksl1"
#>  [7] "Fbxo32"   "Atp5c1"   "Ncl"      "Itga6"    "Tmod1"    "Tdgf1"   
#> 
#> $repl.2
#>  [1] "Wbp5"    "Myl3"    "Actn2"   "Ncl"     "Apoe"    "Fbxo32"  "Bnip3"  
#>  [8] "Slc25a4" "Atp5o"   "Tdgf1"   "Cst3"   
#> 
#> $repl.3
#>  [1] "Crip1"  "Hmgb1"  "Ptma"   "Apoe"   "Tdgf1"  "Bnip3"  "Tpi1"   "Fbxo32"
#>  [9] "Pdlim1" "Rps7"   "Naca"   "Tagln" 
#> 
#> $repl.4
#> [1] "Anxa2"  "Fbxl22" "Apoe"   "Dlk1"   "Myl1"   "Tuba1a" "Csrp2"  "Actn2"
eobj@mcmc[["best_repl"]]
#>   repl step
#> 1    3   10
eobj@mcmc[["best_genes"]]
#> $repl.3
#>  [1] "Crip1"  "Hmgb1"  "Ptma"   "Apoe"   "Tdgf1"  "Bnip3"  "Tpi1"   "Fbxo32"
#>  [9] "Pdlim1" "Rps7"   "Naca"   "Tagln"

The results of RX-MCMC can be also obtained as text files as follows, where dir is a path to an output directory.

writeMCMC(eobj, dir = "output")

The histories of RX-MCMC optimization and replica exchanges can be plotted by plotMCMC function.

plotMCMC(eobj)

Plot distance maps

In order to visualize the distance between domains and between cell samples before/after clustering for all replications, espresso provides plotDistMap. The usage is quite simple.

# size_d: The font and cell sizes for distance map of domains (default: NULL).
# size_s: The font and cell sizes for distance map of cell samples (default: 5).
plotDistMap(eobj, size_s = 5)

Compute and plot UMAP

espresso provides a function for dimension reduction method named Uniform Manifold Approximation and Projection (UMAP) to show the distribution of the cell samples in 2D or 3D space (the default space is 2D). To compute and plot UMAP, use runUMAP and plotUMAP functions, respectively. The UMAP result is stored @umap.

eobj <- runUMAP(eobj)
head(eobj@umap)
#>                           [,1]       [,2]
#> TGCCCATGTGCTGTAT-9  -0.0225598 -2.2024913
#> GCTGCGAAGGTGCAAC-10  0.4925257  2.6053568
#> CATCGAAAGTGTCCAT-10  0.4897042 -2.5706015
#> GAACGGATCTGATTCT-10  0.7458983 -2.0646923
#> ACTGCTCTCATTGCCC-10 -0.8827290  1.5327852
#> AAGGAGCTCATACGGT-10  2.3228327 -0.6908234
plotUMAP(eobj, file = 'umap_2D.png')
#> quartz_off_screen 
#>                 2
UMAP

UMAP

This function imports uwot packages, thus all parameters for UMAP computation can be given according to the original package. When using custom settings for UMAP, give the UMAP parameters umap_param as a list to runUMAP by the following way. For the details of the parameters, refer to uwot::umap function. The feature genes can be given by gset parameter while UMAP is performed with genes stored in @gset if gset = NULL.

eobj <- runUMAP(eobj, gset = eobj@mcmc[["best_genes"]][[1]],
                umap_param = list("n_neighbors" = 15, "n_components" = 3, "n_epochs" = 1000))
head(eobj@umap)
#>                           [,1]       [,2]        [,3]
#> TGCCCATGTGCTGTAT-9  -0.6878206 -0.1425771  0.65135581
#> GCTGCGAAGGTGCAAC-10 -0.4250149 -0.2897149  0.69028134
#> CATCGAAAGTGTCCAT-10 -0.8812406  0.2021387  0.24999088
#> GAACGGATCTGATTCT-10 -0.7272825  0.4317129 -0.03455686
#> ACTGCTCTCATTGCCC-10 -0.6010484 -0.3180909  1.52406007
#> AAGGAGCTCATACGGT-10 -1.1745350  0.3119029 -0.00599524
# size: the size for plotted points.
# magnify: Multiplicative factor to apply to size of window when producing legend (default: 1).
plotUMAP(eobj, size = 1.5, magnify = 1, file = "umap_3D.png") 
UMAP

UMAP

Although the plotUMAP imports rgl package, scatterplot3d package can be used instead of rgl by rgl = FALSE (default: TRUE) when rgl is not available.

plotUMAP(eobj, file = "umap_3D_scatterplot.png", rgl = FALSE)
#> quartz_off_screen 
#>                 2
UMAP

UMAP

The export format is automatically detected from its extension (Supported: png, pdf, eps)

plotUMAP(eobj, file = 'umap.pdf')

The RGL window can be closed after plotting by rglclose = TRUE.

plotUMAP(eobj, file = 'umap.pdf', rglclose = TRUE)

The file saving can be suppress by save = FALSE. This is useful when you want to use png(), pdf(), and postscript() functions directly in order to specify their parameters (e.g., width, height, and so on) manually.

pdf(file = 'umap.pdf', width = 6, height = 6)
plotUMAP(eobj, save = FALSE)
dev.off()

Note

When the exported format is ‘pdf’ or ‘eps’, and rgl = TRUE, the legend information is not printed due to the design of rgl package. In addition, it often takes much time to draw UMAP plot as a PDF file when the number of samples is large. In such case, please output as a PNG file.

Other options of UMAP plotting

Plotting colors can be changed by any colors according to the domains.

domain_colors <- list(AHF = "orange",
                      Atrial = "pink",
                      LV = "green",
                      OFT = "lightblue",
                      pSHF = "purple",
                      RV = "blue",
                      SV = "plum"
                      )
plotUMAP(eobj, domcol = domain_colors, file = "umap_3D_color1.png", size = 1.5)
UMAP

UMAP

Color coding in groups different from the domains is possible with the color_by_group and grpcol options. The color_by_group must be a data.frame of two columns with a header given by the following format, where the header must be ‘sample’ and ‘group’.

sample group
Cell 1 Group 1
Cell 2 Group 2
Cell 3 Group 1
Cell 4 Group 4

The color of each group can be specified by grpcol of the same format as domcol. The following is an example.

path_group <- system.file("extdata", "group_soysa19_e825.txt", package = "espresso")
group <- read.table(path_group, header = T, as.is = TRUE)
head(group)
#>                sample group
#> 1  TGCCCATGTGCTGTAT-9   FHF
#> 2 GCTGCGAAGGTGCAAC-10   FHF
#> 3 CATCGAAAGTGTCCAT-10   FHF
#> 4 GAACGGATCTGATTCT-10   FHF
#> 5 ACTGCTCTCATTGCCC-10   FHF
#> 6 AAGGAGCTCATACGGT-10   FHF
group_colors <- list(FHF = "orange", SHF = "blue")
plotUMAP(eobj, color_by_group = group, grpcol = group_colors, size = 1.5, file = "umap_3D_color2.png")
UMAP

UMAP

In addition, the coloring of each sample can also be based on the expression levels of a gene rather than the domains by specifying a gene name to the argument color_by_gene. In such case, the domcol is ignored.

plotUMAP(eobj, color_by_gene = "Actn2", file = "umap_3D_color3.png", size = 1.5)
UMAP

UMAP

If movie = TRUE, a movie of the resulting UMAP is generated. This option is valid when rgl = TRUE.

plotUMAP(eobj, movie = TRUE)

The above script uses all cell samples for UMAP plotting. If you want to plot UMAP with downsampled sets, the following script can work.

eobj.tmp <- eobj
samples <- eobj.tmp@ssets[["set.1"]]
eobj.tmp@exprs <- eobj.tmp@exprs[match(samples, rownames(eobj.tmp@exprs)), , drop = FALSE]
eobj.tmp <- runUMAP(eobj.tmp, gset = eobj.tmp@mcmc[["best_genes"]][[1]], 
                    umap_param = list("n_neighbors" = 5, "n_components" = 3, "n_epochs" = 1000))
plotUMAP(eobj.tmp, movie = TRUE, domcol = domain_colors)

UMAP is even more effective for overviewing distribution of cell samples when the number of cell samples is large. The following figure is UMAP using GO:0003209 for 3,331 cell samples obtained from GSE126128_E775E825_WTKO_10X.csv.gz.

UMAP

UMAP

Appendix

Run graphSOM and rxmcmc of previous version (0)

In espresso of version 0, the reproducibility between the results of graphSOM and rxmcmc was not guaranteed, and a random number generation method different from version 1.0.0 or later was employed. The followings are examples of the way to run graphSOM and rxmcmc of version 0.

graphSOM(eobj, gset = fgenes, seed = 0, version = "0")
rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4, n_cl = 4, seed = 0, version = "0")

Get gene description via biomaRt

The gene description can be obtained via biomaRt package as followings:

library(biomaRt)
fgenes <- c("Actn2", "Hand2", "Vegfb", "Wnt5a")
ds <- useEnsembl(biomart = 'ensembl', dataset = 'mmusculus_gene_ensembl')
ann <- getBM(attributes = c("mgi_symbol", "description"), 
             mart = ds, filters = "mgi_symbol", values = fgenes)

Note

If the version of R is 3.6 or later, the above script may not work depending on the version of biomaRt (e.g., >= 2.42.0). In that case, add the option useCache = False to the getBM function.

ann <- getBM(attributes = c("mgi_symbol", "description"), 
             mart = ds, filters = "mgi_symbol", values = fgenes, useCache = FALSE)


Citing espresso

To cite your use of the espresso software, please reference the following paper:

  1. Tomoya Mori, Toshiro Takase, Kuan-Chun Lan, Junko Yamane, Cantas Alev, Kenji Osafune, Jun Yamashita, and Wataru Fujibuchi. eSPRESSO: a spatial self-organizing-map clustering method for single-cell transcriptomes of various tissue structure using graph-based networks. bioRxiv. doi: http://doi.org/10.1101/2020.12.31.424948.