espresso is a package for domain estimation of cells only from gene expression profile based on graph-structured stochastic self-organizing map (GraphSOM) and Markov chain Monte Carlo (MCMC) method.

Usage

Data format

espresso requires three inputs, i.e., expression profile, domain topology, and domain assignment information for cell samples.

Expression profile exprs
The expression profile is given as a numeric matrix whose rows and columns indicate cell samples and genes, respectively.

	Gene 1	Gene 2	Gene 3	…
Cell 1	0.15	0.28	1.33	…
Cell 2	1.08	0.62	0.01	…
Cell 3	0.74	0.42	0.87	…
…	…	…	…	…

Domain topology topology
The domain topology is given as an adjacent matrix whose rows and columns indicate domains. In this matrix, 1 and 0 mean that corresponding domains are connected (i.e., adjacent each other in living a tissue or an organ) and unconnected, respectively. Note that this matrix must be symmetric and its diagonals must be filled by 0 since self-loops are not assumed in espresso.

	Domain 1	Domain 2	Domain 3	Domain 4
Domain 1	0	1	1	0
Domain 2	1	0	1	1
Domain 3	1	1	0	1
Domain 4	0	1	1	0

Domain assignment asgmt
The Domain assignment is given as a data.frame with two columns which indicate cell samples and corresponding domains.

sample	domain
Cell 1	Domain 1
Cell 2	Domain 2
Cell 3	Domain 1
Cell 4	Domain 4
…	…

espresso contains example inputs of raw datasets for the above three inputs: “exprs_soysa19_e825.txt”, “topology_soysa19_e825.txt”, and “asgmt_soysa19_e825.txt”, which can be loaded by the following commands. The expression profile “exprs_soysa19_e825.txt” was created from GSE126128_E775E825_WTKO_10X.csv.gz by random selection of 10 single-cells from each domain i.e., 70 cell-samples in total.

# Get paths of input files
path_exprs <- system.file("extdata", "exprs_soysa19_e825.txt", package = "espresso")
path_topology <- system.file("extdata", "topology_soysa19_e825.txt", package = "espresso")
path_asgmt <- system.file("extdata", "asgmt_soysa19_e825.txt", package = "espresso")

# Load data
exprs <- t(read.table(path_exprs, header = TRUE, row.names = 1, sep = "\t", 
                      as.is = TRUE, check.names = FALSE))
topology <- as.matrix(read.table(path_topology, header = T, row.names = 1))
asgmt <- read.table(path_asgmt, header = T, as.is = TRUE)

# Show the contents of the objects
exprs[1:5, 1:5]
#>                     0610005C13Rik 0610007P08Rik 0610007P14Rik 0610007P22Rik
#> TGCCCATGTGCTGTAT-9              0     0.0000000     0.0000000     0.0000000
#> GCTGCGAAGGTGCAAC-10             0     0.0000000     0.5096020     0.8455493
#> CATCGAAAGTGTCCAT-10             0     0.0000000     0.0000000     0.4440263
#> GAACGGATCTGATTCT-10             0     0.3703834     0.6400520     0.3703834
#> ACTGCTCTCATTGCCC-10             0     0.0000000     0.6228278     0.8312940
#>                     0610008F07Rik
#> TGCCCATGTGCTGTAT-9              0
#> GCTGCGAAGGTGCAAC-10             0
#> CATCGAAAGTGTCCAT-10             0
#> GAACGGATCTGATTCT-10             0
#> ACTGCTCTCATTGCCC-10             0
topology
#>        AHF OFT RV LV Atrial SV pSHF
#> AHF      0   1  0  0      0  0    1
#> OFT      1   0  1  0      0  0    0
#> RV       0   1  0  1      0  0    0
#> LV       0   0  1  0      1  0    0
#> Atrial   0   0  0  1      0  1    0
#> SV       0   0  0  0      1  0    1
#> pSHF     1   0  0  0      0  1    0
head(asgmt)
#>                sample domain
#> 1  TGCCCATGTGCTGTAT-9 Atrial
#> 2 GCTGCGAAGGTGCAAC-10 Atrial
#> 3 CATCGAAAGTGTCCAT-10 Atrial
#> 4 GAACGGATCTGATTCT-10 Atrial
#> 5 ACTGCTCTCATTGCCC-10 Atrial
#> 6 AAGGAGCTCATACGGT-10 Atrial

Initialization of eSPRESSO object

Before executing GraphSOM clustering, espresso object should be prepared. The variables exprs, topology, and asgmt are inputs explained in Data format. initEsp is a function to initialize and return an espresso object.

eobj <- initEsp(exprs, topology, asgmt)

The espresso object is defined as S4 class, and its slots can be accessed by @ symbol.

str(eobj)
#> Formal class 'Espresso' [package "espresso"] with 28 slots
#>   ..@ exprs      : num [1:70, 1:23749] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:70] "TGCCCATGTGCTGTAT-9" "GCTGCGAAGGTGCAAC-10" "CATCGAAAGTGTCCAT-10" "GAACGGATCTGATTCT-10" ...
#>   .. .. ..$ : chr [1:23749] "0610005C13Rik" "0610007P08Rik" "0610007P14Rik" "0610007P22Rik" ...
#>   ..@ exprs_som  : NULL
#>   ..@ rept_som   : NULL
#>   ..@ topology   : int [1:7, 1:7] 0 1 0 0 0 0 1 1 0 1 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:7] "AHF" "OFT" "RV" "LV" ...
#>   .. .. ..$ : chr [1:7] "AHF" "OFT" "RV" "LV" ...
#>   ..@ dist       : NULL
#>   ..@ map        : NULL
#>   ..@ map_method : NULL
#>   ..@ radius     : NULL
#>   ..@ lsteps     : NULL
#>   ..@ rept       : NULL
#>   ..@ bmu        : NULL
#>   ..@ bmu_method : NULL
#>   ..@ stochastic : NULL
#>   ..@ rmin       : NULL
#>   ..@ rmax       : NULL
#>   ..@ domain2name:'data.frame':  7 obs. of  2 variables:
#>   .. ..$ domain: int [1:7] 1 2 3 4 5 6 7
#>   .. ..$ name  : chr [1:7] "AHF" "OFT" "RV" "LV" ...
#>   ..@ gset       : chr [1:23749] "0610005C13Rik" "0610007P08Rik" "0610007P14Rik" "0610007P22Rik" ...
#>   ..@ ssets      : NULL
#>   ..@ asgmt      :'data.frame':  70 obs. of  2 variables:
#>   .. ..$ sample: chr [1:70] "TGCCCATGTGCTGTAT-9" "GCTGCGAAGGTGCAAC-10" "CATCGAAAGTGTCCAT-10" "GAACGGATCTGATTCT-10" ...
#>   .. ..$ domain: chr [1:70] "Atrial" "Atrial" "Atrial" "Atrial" ...
#>   ..@ score      : NULL
#>   ..@ summary    : NULL
#>   ..@ mcmc       : NULL
#>   ..@ seed       : NULL
#>   ..@ coef       : NULL
#>   ..@ nmin       : NULL
#>   ..@ nmax       : NULL
#>   ..@ swap       : NULL
#>   ..@ umap       : NULL

Preprocessing (Optional)

espresso provides simple preprocessing functions for log-scaling of expression profile and filtering out of low expressed and low variable genes. If the input expression profile need not be preprocessed, this procedure can be ignored.

Log-scaling

The input gene expression data can be scaled by log10 as preprocessing. The base can be changed by specifying base = "2" or base = "e".

# Log-scaling of the expression profile
# In this example, log-scaling is unnecessary since the expression values is already log-scaled.
# eobj <- logScale(eobj)

Gene filtering

The low expressed and low variable genes can be filtered out by the following command. As default setting, genes with expression level of higher than 1 in at least two cells and whose standard deviation of expression levels through all cells more than 0.05 can pass the filtering.

# Filter out genes with low expression and low variance
eobj <- filterGenes(eobj)
#> Filtering genes...
#> 19699 genes are filtered out (23749 -> 4050 genes).

The parameters can be changed by specifying the arguments ncell, expressed, and sd, i.e., the default setting is ncell = 2, expressed = 1.0, and sd = 0.05.

# Filter out genes with low expression and low variance
eobj <- filterGenes(eobj, ncell = 2, expressed = 1.0, sd = 0.05)

GraphSOM clustering

Before starting GraphSOM clustering, the parameters for GraphSOM clustering should be initialized.

eobj <- initGraphSOM(eobj)

In this case, all cell samples and genes are used for the clustering, so that it will take a long time to execute when the input gene expression profile is quite large.

Therefore, if downsampling is required, graphSOM can be executed with subsets of cell samples selected uniformly random from each domain, and repeats the clustering with different sets. The nsamples indicates the number of samples extracted from each domain, i.e., nsamples = 10 under a dataset with 4 domains means that 40 cell samples will be selected. If the total number of samples of a certain domain is less than the value specified nsamples, all samples of the domain are selected. The rept is the number of repeats for GraphSOM clustering with different sample sets.

In addition, the coef is the coefficient of ARI value for score function. The GraphSOM results are evaluated by prediction accuracy, ARI, and a score function (= accuracy + coef * ARI). The default is coef = 1.0.

eobj <- initGraphSOM(eobj, nsamples = 10, coef = 0.5, rept = 3)

If the sample set after downsampling needs to be fixed, set a value to the parameter seed.

Note

Even if the same random seed is given, the selected samples may differ depending on the version of R.

eobj <- initGraphSOM(eobj, nsamples = 10, coef = 0.5, rept = 3, seed = 0)

From the version 0.2.8, GraphSOM clustering introduces the cluster swap method by default. If the cluster swap method is not needed, set FALSE to the option swap of initGraphSOM function.

eobj <- initGraphSOM(eobj, nsamples = 10, coef = 0.5, rept = 3, swap = FALSE, seed = 0)

After initializing GraphSOM setting, GraphSOM can be executed. The most simple way to execute GraphSOM clustering under the parameter setting given by initGraphSOM is as follow.

eobj <- graphSOM(eobj)

As another option, graphSOM can be executed with a specific gene set by setting it to the parameter gset as a vector.

eobj <- graphSOM(eobj, gset = c("Actn2", "Hand2", "Vegfb", "Wnt5a"))
#> GraphSOM clustering
#> Evaluation
#> Complete

In addition to these, graphSOM has several parameters, so please refer to the manual for their details.

The graphSOM clustering results can be shown by the following.

eobj@summary
#> $mean_score
#> [1] 0.7524326
#> 
#> $var_score
#> [1] 0.0003480211
#> 
#> $max_score
#> [1] 0.7730137
#> 
#> $min_score
#> [1] 0.7366345
#> 
#> $mean_acc
#> [1] 0.6433402
#> 
#> $var_acc
#> [1] 0.0001070489
#> 
#> $max_acc
#> [1] 0.6546584
#> 
#> $min_acc
#> [1] 0.6343685
#> 
#> $mean_ari
#> [1] 0.2181847
#> 
#> $var_ari
#> [1] 0.0002766807
#> 
#> $max_ari
#> [1] 0.2367107
#> 
#> $min_ari
#> [1] 0.2045319

The results are mean values and mean variance of the score, the prediction accuracy, and ARI for the repeats specified in graphSOM function. The result of each repeat can also be shown.

eobj@score
#>            score       acc       ari
#> rept.1 0.7366345 0.6343685 0.2045319
#> rept.2 0.7730137 0.6546584 0.2367107
#> rept.3 0.7476495 0.6409938 0.2133114

The convergence curve can be shown. rept indicates the repeat numbers to be plotted.

plotConvCurve(eobj, rept = 1:2)

The multiple results are plotted at the same time by the following example code.

par(mfrow = c(2,2))
plotConvCurve(eobj)
dev.off()

Feature gene selection based on random forest

espresso provides a function for feature gene selection selectFeatures. This function is a rapper function of Boruta function of Boruta package, which is a feature gene selection method based on random forest. The usage of this function is very simple.

fgenes <- selectFeatures(eobj, maxRuns = 100)

fgenes
#>  [1] "Mylk3"         "Ncl"           "Ptma"          "Tnni1"        
#>  [5] "Actc1"         "Ak1"           "Apoe"          "Cryab"        
#>  [9] "Csrp2"         "Cst3"          "Fbxl22"        "Fbxo32"       
#> [13] "Hspb2"         "Myh7"          "Myl3"          "Myl4"         
#> [17] "Myl7"          "Pgam2"         "Pkm"           "Slc25a4"      
#> [21] "Tagln"         "Tmod1"         "Tnni3"         "Tpi1"         
#> [25] "Tpm1"          "Tuba1a"        "Actn2"         "Atp5c1"       
#> [29] "Hcfc1r1"       "Myl1"          "Myl6"          "Npm3"         
#> [33] "Tgfb1i1"       "Bnip3"         "Crip1"         "Gyg"          
#> [37] "Hmgb1"         "Dlk1"          "Hspe1"         "Ldha"         
#> [41] "Pdlim1"        "Pdlim5"        "Ptpla"         "Marcksl1"     
#> [45] "Naca"          "Rps7"          "Atp5o"         "Gm10075"      
#> [49] "Apobec2"       "Isl1"          "Lta4h"         "Actb"         
#> [53] "Atp2a2"        "Dcp1b"         "Itga6"         "Nr2f2"        
#> [57] "Anxa2"         "Cox6a2"        "Rassf5"        "Tdgf1"        
#> [61] "Wbp5"          "Suclg2"        "Tkt"           "Krt19"        
#> [65] "Lrrc39"        "Nop16"         "Atp5g1"        "Rpl7a"        
#> [69] "Rplp2"         "Cdh2"          "Hnrnpab"       "Fundc2"       
#> [73] "4833424O15Rik" "Ndufs6"

Please note that the result of Boruta depends on random seed. Thus, if the results need to be fixed, set a value to the parameter seed.

fgenes <- selectFeatures(eobj, maxRuns = 100, seed = 0)

Boruta classifies genes into the three categories, “Confirmed”, “Tentative”, and “Rejected”, according to their importance. This function returns a vector of “NOT Rejected” genes as features by default. If only “Confirmed” genes are required, set "c" to the parameter decision.

fgenes <- selectFeatures(eobj, maxRuns = 100, decision = "c", seed = 0)

In order to resolve genes left as “Tentative”, maxRuns may be increased, where the parameter maxRuns is the maximum number of importance source runs (refer the manual of Boruta package). Empirically, it is good to determine the parameters so that around 100 genes are selected.

Usage of Gene Ontology (GO) information (optional)

In order to execute graphSOM with functional gene sets already defined, espresso provides us with gene sets in GO database.

# Mouse gene sets (Ensembl release version 101)
data(mm_symbol_v101)

# Human gene sets (Ensembl release version 101)
data(hs_symbol_v101)

If another version, another species, or another gene identifier are required, the following command enable a user to create another gene set via R biomaRt package. The parameters to be set to species are gid according to biomaRt. Note that if the latest version is required, please do not specify version in the function.

# Get GO information
hs_symbol_v98 <- getGO(species = 'hsapiens', gid = 'hgnc_symbol', version = '98')

# Save 'hs_symbol_v98' as an R data file.
save(hs_symbol_v98, file = "hs_symbol_v98")

# Load 'hs_symbol_v98'
load(hs_symbol_v98)

The GO information are stored as a list, and each element of list has two vectors, term and genes.

mm_symbol_v101[["GO:0003209"]][["term"]]
#> [1] "cardiac atrium morphogenesis"

mm_symbol_v101[["GO:0003209"]][["genes"]]
#>  [1] "Acvr1"  "Bmp10"  "Bmpr2"  "Ccm2l"  "Ccn1"   "Cfc1"   "Dll4"   "Eng"   
#>  [9] "Gata4"  "Gja5"   "Heg1"   "Hey2"   "Isl1"   "Mesp1"  "Myh6"   "Nkx2-5"
#> [17] "Nog"    "Notch1" "Notch2" "Nsd2"   "Pitx2"  "Prox1"  "Shox2"  "Smo"   
#> [25] "Sos1"   "Sox4"   "Tbx20"  "Tbx5"   "Tgfb2"  "Tnnt2"  "Wnt2"   "Wnt5a" 
#> [33] "Zfpm1"

Thus, graphSOM execution using GO genes can be done by

eobj <- graphSOM(eobj, gset = mm_symbol_v101[["GO:0003209"]][["genes"]])

Gene set optimization by replica exchange Markov chain Monte Carlo method (RX-MCMC)

espresso optimizes gene sets by replica exchange Markov chain Monte Carlo method (RX-MCMC). The parameters itr, n_ex, and n_repl are the number of iterations of MCMC until each replica exchange, the number of times of exchanges, and the number of replicas, respectively. Thus, the total number of MCMC steps is itr * n_ex. If the user’s machine has multi-core processors, the MCMC computations of each replica are parallelized.

eobj <- initGraphSOM(eobj, nsamples = 10, rept = 1, coef = 0.5)
eobj <- rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4)

Although the number of cores is automatically detected, it can be specified by n_cl. In addition, as the default setting, MCMC optimizations are started with 3 genes randomly selected from the gene set specified by gset. However it can be also modified by n_ig. Note that the length of fgenes must be larger than n_ig.

eobj <- rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4, n_cl = 4, n_ig = 3)

The random seed can be fixed by the seed option in the rxmcmc function.

eobj <- rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4, n_cl = 4, n_ig = 3, seed = 0)

The results of RX-MCMC are stored in espresso objects and they can be accessed by @mcmc.

eobj@mcmc[["sampling"]]: History of sampling
eobj@mcmc[["max"]]: History of the samples with the best score
eobj@mcmc[["exchange"]]: History of replica exchange
eobj@mcmc[["final_genes"]]: The final genes for each replicas
eobj@mcmc[["best_repl"]]: Information of the best replicas
eobj@mcmc[["best_genes"]]: The best genes

eobj@mcmc[["sampling"]][["repl.1"]]
#>         mean_score var_score max_score min_score  mean_acc var_acc   max_acc
#> step.1   0.7761541        NA 0.7761541 0.7761541 0.6360248      NA 0.6360248
#> step.2   0.9046352        NA 0.9046352 0.9046352 0.7064182      NA 0.7064182
#> step.3   1.0212497        NA 1.0212497 1.0212497 0.7726708      NA 0.7726708
#> step.4   1.0891609        NA 1.0891609 1.0891609 0.8053830      NA 0.8053830
#> step.5   1.0895137        NA 1.0895137 1.0895137 0.8033126      NA 0.8033126
#> step.6   1.1095087        NA 1.1095087 1.1095087 0.8132505      NA 0.8132505
#> step.7   1.1616804        NA 1.1616804 1.1616804 0.8414079      NA 0.8414079
#> step.8   1.1616804        NA 1.1616804 1.1616804 0.8414079      NA 0.8414079
#> step.9   1.1663870        NA 1.1663870 1.1663870 0.8451346      NA 0.8451346
#> step.10  1.1682801        NA 1.1682801 1.1682801 0.8476190      NA 0.8476190
#>           min_acc  mean_ari var_ari   max_ari   min_ari gene_size
#> step.1  0.6360248 0.2802585      NA 0.2802585 0.2802585         3
#> step.2  0.7064182 0.3964340      NA 0.3964340 0.3964340         4
#> step.3  0.7726708 0.4971578      NA 0.4971578 0.4971578         5
#> step.4  0.8053830 0.5675557      NA 0.5675557 0.5675557         6
#> step.5  0.8033126 0.5724020      NA 0.5724020 0.5724020         7
#> step.6  0.8132505 0.5925164      NA 0.5925164 0.5925164         8
#> step.7  0.8414079 0.6405451      NA 0.6405451 0.6405451         9
#> step.8  0.8414079 0.6405451      NA 0.6405451 0.6405451        10
#> step.9  0.8451346 0.6425049      NA 0.6425049 0.6425049        11
#> step.10 0.8476190 0.6413221      NA 0.6413221 0.6413221        12
eobj@mcmc[["max"]][["repl.1"]]
#>         mean_score var_score max_score min_score  mean_acc var_acc   max_acc
#> step.1   0.7761541        NA 0.7761541 0.7761541 0.6360248      NA 0.6360248
#> step.2   0.9046352        NA 0.9046352 0.9046352 0.7064182      NA 0.7064182
#> step.3   1.0212497        NA 1.0212497 1.0212497 0.7726708      NA 0.7726708
#> step.4   1.0891609        NA 1.0891609 1.0891609 0.8053830      NA 0.8053830
#> step.5   1.0895137        NA 1.0895137 1.0895137 0.8033126      NA 0.8033126
#> step.6   1.1095087        NA 1.1095087 1.1095087 0.8132505      NA 0.8132505
#> step.7   1.1616804        NA 1.1616804 1.1616804 0.8414079      NA 0.8414079
#> step.8   1.1616804        NA 1.1616804 1.1616804 0.8414079      NA 0.8414079
#> step.9   1.1663870        NA 1.1663870 1.1663870 0.8451346      NA 0.8451346
#> step.10  1.1682801        NA 1.1682801 1.1682801 0.8476190      NA 0.8476190
#>           min_acc  mean_ari var_ari   max_ari   min_ari gene_size
#> step.1  0.6360248 0.2802585      NA 0.2802585 0.2802585         3
#> step.2  0.7064182 0.3964340      NA 0.3964340 0.3964340         4
#> step.3  0.7726708 0.4971578      NA 0.4971578 0.4971578         5
#> step.4  0.8053830 0.5675557      NA 0.5675557 0.5675557         6
#> step.5  0.8033126 0.5724020      NA 0.5724020 0.5724020         7
#> step.6  0.8132505 0.5925164      NA 0.5925164 0.5925164         8
#> step.7  0.8414079 0.6405451      NA 0.6405451 0.6405451         9
#> step.8  0.8414079 0.6405451      NA 0.6405451 0.6405451         9
#> step.9  0.8451346 0.6425049      NA 0.6425049 0.6425049        11
#> step.10 0.8476190 0.6413221      NA 0.6413221 0.6413221        12
eobj@mcmc[["exchange"]]
#>         repl.1 repl.2 repl.3 repl.4
#> step.1   1.000  0.100   0.01  0.001
#> step.2   1.000  0.010   0.10  0.001
#> step.3   1.000  0.010   0.10  0.001
#> step.4   0.100  0.001   1.00  0.010
#> step.5   0.100  0.001   1.00  0.010
#> step.6   0.010  0.001   1.00  0.100
#> step.7   0.010  0.001   1.00  0.100
#> step.8   0.001  0.010   0.10  1.000
#> step.9   0.001  0.010   0.10  1.000
#> step.10  0.001  0.100   0.01  1.000
eobj@mcmc[["final_genes"]]
#> $repl.1
#>  [1] "Rassf5"   "Myl3"     "Atp5o"    "Apoe"     "Dlk1"     "Marcksl1"
#>  [7] "Fbxo32"   "Atp5c1"   "Ncl"      "Itga6"    "Tmod1"    "Tdgf1"   
#> 
#> $repl.2
#>  [1] "Wbp5"    "Myl3"    "Actn2"   "Ncl"     "Apoe"    "Fbxo32"  "Bnip3"  
#>  [8] "Slc25a4" "Atp5o"   "Tdgf1"   "Cst3"   
#> 
#> $repl.3
#>  [1] "Crip1"  "Hmgb1"  "Ptma"   "Apoe"   "Tdgf1"  "Bnip3"  "Tpi1"   "Fbxo32"
#>  [9] "Pdlim1" "Rps7"   "Naca"   "Tagln" 
#> 
#> $repl.4
#> [1] "Anxa2"  "Fbxl22" "Apoe"   "Dlk1"   "Myl1"   "Tuba1a" "Csrp2"  "Actn2"
eobj@mcmc[["best_repl"]]
#>   repl step
#> 1    3   10
eobj@mcmc[["best_genes"]]
#> $repl.3
#>  [1] "Crip1"  "Hmgb1"  "Ptma"   "Apoe"   "Tdgf1"  "Bnip3"  "Tpi1"   "Fbxo32"
#>  [9] "Pdlim1" "Rps7"   "Naca"   "Tagln"

The results of RX-MCMC can be also obtained as text files as follows, where dir is a path to an output directory.

writeMCMC(eobj, dir = "output")

The histories of RX-MCMC optimization and replica exchanges can be plotted by plotMCMC function.

plotMCMC(eobj)

Plot distance maps

In order to visualize the distance between domains and between cell samples before/after clustering for all replications, espresso provides plotDistMap. The usage is quite simple.

# size_d: The font and cell sizes for distance map of domains (default: NULL).
# size_s: The font and cell sizes for distance map of cell samples (default: 5).
plotDistMap(eobj, size_s = 5)

Compute and plot UMAP

espresso provides a function for dimension reduction method named Uniform Manifold Approximation and Projection (UMAP) to show the distribution of the cell samples in 2D or 3D space (the default space is 2D). To compute and plot UMAP, use runUMAP and plotUMAP functions, respectively. The UMAP result is stored @umap.

eobj <- runUMAP(eobj)
head(eobj@umap)
#>                           [,1]       [,2]
#> TGCCCATGTGCTGTAT-9  -0.0225598 -2.2024913
#> GCTGCGAAGGTGCAAC-10  0.4925257  2.6053568
#> CATCGAAAGTGTCCAT-10  0.4897042 -2.5706015
#> GAACGGATCTGATTCT-10  0.7458983 -2.0646923
#> ACTGCTCTCATTGCCC-10 -0.8827290  1.5327852
#> AAGGAGCTCATACGGT-10  2.3228327 -0.6908234
plotUMAP(eobj, file = 'umap_2D.png')
#> quartz_off_screen 
#>                 2

UMAP

This function imports uwot packages, thus all parameters for UMAP computation can be given according to the original package. When using custom settings for UMAP, give the UMAP parameters umap_param as a list to runUMAP by the following way. For the details of the parameters, refer to uwot::umap function. The feature genes can be given by gset parameter while UMAP is performed with genes stored in @gset if gset = NULL.

eobj <- runUMAP(eobj, gset = eobj@mcmc[["best_genes"]][[1]],
                umap_param = list("n_neighbors" = 15, "n_components" = 3, "n_epochs" = 1000))
head(eobj@umap)
#>                           [,1]       [,2]        [,3]
#> TGCCCATGTGCTGTAT-9  -0.6878206 -0.1425771  0.65135581
#> GCTGCGAAGGTGCAAC-10 -0.4250149 -0.2897149  0.69028134
#> CATCGAAAGTGTCCAT-10 -0.8812406  0.2021387  0.24999088
#> GAACGGATCTGATTCT-10 -0.7272825  0.4317129 -0.03455686
#> ACTGCTCTCATTGCCC-10 -0.6010484 -0.3180909  1.52406007
#> AAGGAGCTCATACGGT-10 -1.1745350  0.3119029 -0.00599524
# size: the size for plotted points.
# magnify: Multiplicative factor to apply to size of window when producing legend (default: 1).
plotUMAP(eobj, size = 1.5, magnify = 1, file = "umap_3D.png")

UMAP

Although the plotUMAP imports rgl package, scatterplot3d package can be used instead of rgl by rgl = FALSE (default: TRUE) when rgl is not available.

plotUMAP(eobj, file = "umap_3D_scatterplot.png", rgl = FALSE)
#> quartz_off_screen 
#>                 2

UMAP

The export format is automatically detected from its extension (Supported: png, pdf, eps)

plotUMAP(eobj, file = 'umap.pdf')

The RGL window can be closed after plotting by rglclose = TRUE.

plotUMAP(eobj, file = 'umap.pdf', rglclose = TRUE)

The file saving can be suppress by save = FALSE. This is useful when you want to use png(), pdf(), and postscript() functions directly in order to specify their parameters (e.g., width, height, and so on) manually.

pdf(file = 'umap.pdf', width = 6, height = 6)
plotUMAP(eobj, save = FALSE)
dev.off()

Note

When the exported format is ‘pdf’ or ‘eps’, and rgl = TRUE, the legend information is not printed due to the design of rgl package. In addition, it often takes much time to draw UMAP plot as a PDF file when the number of samples is large. In such case, please output as a PNG file.

Other options of UMAP plotting

Plotting colors can be changed by any colors according to the domains.

domain_colors <- list(AHF = "orange",
                      Atrial = "pink",
                      LV = "green",
                      OFT = "lightblue",
                      pSHF = "purple",
                      RV = "blue",
                      SV = "plum"
                      )
plotUMAP(eobj, domcol = domain_colors, file = "umap_3D_color1.png", size = 1.5)

UMAP

Color coding in groups different from the domains is possible with the color_by_group and grpcol options. The color_by_group must be a data.frame of two columns with a header given by the following format, where the header must be ‘sample’ and ‘group’.

sample	group
Cell 1	Group 1
Cell 2	Group 2
Cell 3	Group 1
Cell 4	Group 4
…	…

The color of each group can be specified by grpcol of the same format as domcol. The following is an example.

path_group <- system.file("extdata", "group_soysa19_e825.txt", package = "espresso")
group <- read.table(path_group, header = T, as.is = TRUE)
head(group)
#>                sample group
#> 1  TGCCCATGTGCTGTAT-9   FHF
#> 2 GCTGCGAAGGTGCAAC-10   FHF
#> 3 CATCGAAAGTGTCCAT-10   FHF
#> 4 GAACGGATCTGATTCT-10   FHF
#> 5 ACTGCTCTCATTGCCC-10   FHF
#> 6 AAGGAGCTCATACGGT-10   FHF
group_colors <- list(FHF = "orange", SHF = "blue")
plotUMAP(eobj, color_by_group = group, grpcol = group_colors, size = 1.5, file = "umap_3D_color2.png")

UMAP

In addition, the coloring of each sample can also be based on the expression levels of a gene rather than the domains by specifying a gene name to the argument color_by_gene. In such case, the domcol is ignored.

plotUMAP(eobj, color_by_gene = "Actn2", file = "umap_3D_color3.png", size = 1.5)

UMAP

If movie = TRUE, a movie of the resulting UMAP is generated. This option is valid when rgl = TRUE.

plotUMAP(eobj, movie = TRUE)

The above script uses all cell samples for UMAP plotting. If you want to plot UMAP with downsampled sets, the following script can work.

eobj.tmp <- eobj
samples <- eobj.tmp@ssets[["set.1"]]
eobj.tmp@exprs <- eobj.tmp@exprs[match(samples, rownames(eobj.tmp@exprs)), , drop = FALSE]
eobj.tmp <- runUMAP(eobj.tmp, gset = eobj.tmp@mcmc[["best_genes"]][[1]], 
                    umap_param = list("n_neighbors" = 5, "n_components" = 3, "n_epochs" = 1000))
plotUMAP(eobj.tmp, movie = TRUE, domcol = domain_colors)

UMAP is even more effective for overviewing distribution of cell samples when the number of cell samples is large. The following figure is UMAP using GO:0003209 for 3,331 cell samples obtained from GSE126128_E775E825_WTKO_10X.csv.gz.

UMAP

Appendix

Run `graphSOM` and `rxmcmc` of previous version (0)

In espresso of version 0, the reproducibility between the results of graphSOM and rxmcmc was not guaranteed, and a random number generation method different from version 1.0.0 or later was employed. The followings are examples of the way to run graphSOM and rxmcmc of version 0.

graphSOM(eobj, gset = fgenes, seed = 0, version = "0")

rxmcmc(eobj, gset = fgenes, itr = 2, n_ex = 5, n_repl = 4, n_cl = 4, seed = 0, version = "0")

Get gene description via biomaRt

The gene description can be obtained via biomaRt package as followings:

library(biomaRt)
fgenes <- c("Actn2", "Hand2", "Vegfb", "Wnt5a")
ds <- useEnsembl(biomart = 'ensembl', dataset = 'mmusculus_gene_ensembl')
ann <- getBM(attributes = c("mgi_symbol", "description"), 
             mart = ds, filters = "mgi_symbol", values = fgenes)

Note

If the version of R is 3.6 or later, the above script may not work depending on the version of biomaRt (e.g., >= 2.42.0). In that case, add the option useCache = False to the getBM function.

ann <- getBM(attributes = c("mgi_symbol", "description"), 
             mart = ds, filters = "mgi_symbol", values = fgenes, useCache = FALSE)

espresso

31-Mar-2022: espresso 1.0.1 released.

Table of Contents

Installation

Confirmed version

Quick example

Usage

Data format

Initialization of eSPRESSO object

Preprocessing (Optional)

Log-scaling

Gene filtering

GraphSOM clustering

Feature gene selection based on random forest

Usage of Gene Ontology (GO) information (optional)

Gene set optimization by replica exchange Markov chain Monte Carlo method (RX-MCMC)

Plot distance maps

Compute and plot UMAP

Other options of UMAP plotting

Appendix

Run `graphSOM` and `rxmcmc` of previous version (0)

Get gene description via biomaRt

Citing espresso

espresso

31-Mar-2022: espresso 1.0.1 released.

Table of Contents

Installation

Confirmed version

Quick example

Usage

Data format

Initialization of eSPRESSO object

Preprocessing (Optional)

Log-scaling

Gene filtering

GraphSOM clustering

Feature gene selection based on random forest

Usage of Gene Ontology (GO) information (optional)

Gene set optimization by replica exchange Markov chain Monte Carlo method (RX-MCMC)

Plot distance maps

Compute and plot UMAP

Other options of UMAP plotting

Appendix

Run graphSOM and rxmcmc of previous version (0)

Get gene description via biomaRt

Citing espresso

Run `graphSOM` and `rxmcmc` of previous version (0)