Using scTour to infer cellular dynamics

For version 1.0.0

This tutorial is for version 1.0.0 where prediction functionalities have been separated from inference. Please follow Installation to install the latest version (1.0.0).

Dataset

This notebook uses the dataset from the developing human cortex (Trevino et al., 2021, Cell) to show the basic inference functionalities of scTour. This data (‘EX_development_human_cortex_10X.h5ad’) can be downloaded from here.

[1]:

import sctour as sct
import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

[2]:

adata = sc.read('../../../../EX_development_data/EX_development_human_cortex_10X.h5ad')
adata.shape

[2]:

(36318, 19073)

This dataset has substantial batch effect as shown below.

[3]:

sc.pl.umap(adata, color=['celltype', 'Sample batch'], legend_loc='on data')

../_images/notebook_scTour_inference_basic_8_0.png

Model training

Count the number of genes detected in each cell. This step is required before the scTour model training.

[4]:

sc.pp.calculate_qc_metrics(adata, percent_top=None, log1p=False, inplace=True)

Select highly variable genes.

[5]:

sc.pp.highly_variable_genes(adata, flavor='seurat_v3', n_top_genes=1000, subset=True)

/home/ql312/software/anaconda3/lib/python3.8/site-packages/scanpy/preprocessing/_highly_variable_genes.py:144: FutureWarning: Slicing a positional slice with .loc is not supported, and will raise TypeError in a future version.  Use .loc with labels or .iloc with positions instead.
  df.loc[: int(n_top_genes), 'highly_variable'] = True

Now we are ready to train the scTour model. The default loss_mode is negative binomial conditioned likelihood (nb), which requires raw UMI counts (stored in .X of the AnnData) as input. By default, the percentage of cells used to train the model is set to 0.9 when the total number of cells is less than 10,000 and 0.2 when greater than 10,000. Users can adjust the percentage by using the parameter percent (for example percent=0.6).

There are two parameters alpha_recon_lec and alpha_recon_lode for balancing the reconstruction errors from encoder-derived latent space and from ODE-solver-derived latent space. A small alpha_recon_lec (with a small weight assigned to the reconstruction error from encoder-derived latent space) tends to derive latent representations that order the cells based on their pseudotime and thus cannot well separate cell types with similar developmental orders. The default values for these two parameters are 0.5. Users can adjust them depending on datasets.

[6]:

tnode = sct.train.Trainer(adata, loss_mode='nb', alpha_recon_lec=0.5, alpha_recon_lode=0.5)
tnode.train()

Epoch 400: 100%|██████████| 400/400 [20:20<00:00,  3.04s/epoch, train_loss=269, val_loss=265]

Infer cellular dynamics

pseudotime

Infer the developmental pseudotime based on the trained model.

[7]:

adata.obs['ptime'] = tnode.get_time()

The pseudotime might be returned in reverse order due to the two possible directions for integration when solving an ODE. If this happens, please refer to tutorial scTour inference – Post-inference adjustment for how to correct the ordering.

latent space

Infer the latent representations. The two parameters alpha_z and alpha_predz adjust the weights given to the latent space from variational inference and that from ODE solver. Larger alpha_z skews the latent space towards the intrinsic transcriptomic structure while larger alpha_predz is more representative of the extrinsic pseudotime ordering. Users can adjust the two parameters according to their purposes.

[8]:

#zs represents the latent z from variational inference, and pred_zs represents the latent z from ODE solver
#mix_zs represents the weighted combination of the two, which is used for downstream analysis
mix_zs, zs, pred_zs = tnode.get_latentsp(alpha_z=0.5, alpha_predz=0.5)
adata.obsm['X_TNODE'] = mix_zs

vector field

Infer the transcriptomic vector field.

[9]:

adata.obsm['X_VF'] = tnode.get_vector_field(adata.obs['ptime'].values, adata.obsm['X_TNODE'])

Visualization

Generate a UMAP embedding based on the inferred latent space. Optionally, you can order the cells according to their pseudotime before this step, which may yield a better trajectory for some datasets. (You can also use your own UMAP to visualize the cellular dynamics inferred.)

[10]:

adata = adata[np.argsort(adata.obs['ptime'].values), :]
sc.pp.neighbors(adata, use_rep='X_TNODE', n_neighbors=15)
sc.tl.umap(adata, min_dist=0.1)

Visualize the clusters, sample batches, pseudotime and vector field on the UMAP. When visualizing the vector field, information from the estimated pseudotime can be optionally incorporated by providing the parameter t_key (for example t_key='ptime' where ptime is the pseudotime stored in .obs).

As shown below, the inference by scTour is largely unaffected by batch effects.

[11]:

fig, axs = plt.subplots(ncols=2, nrows=2, figsize=(10, 10))
sc.pl.umap(adata, color='celltype', ax=axs[0, 0], legend_loc='on data', show=False, frameon=False)
sc.pl.umap(adata, color='Sample batch', ax=axs[0, 1], show=False, frameon=False)
sc.pl.umap(adata, color='ptime', ax=axs[1, 0], show=False, frameon=False)
sct.vf.plot_vector_field(adata, zs_key='X_TNODE', vf_key='X_VF', use_rep_neigh='X_TNODE', color='celltype', show=False, ax=axs[1, 1], legend_loc='none', frameon=False, size=100, alpha=0.2)
plt.show()

../_images/notebook_scTour_inference_basic_31_0.png

Robustness to random initializations

For some datasets, the inferred cellular dynamics by scTour may vary when the model is initialized differently. To check the robustness of the cellular dynamics inferred to random initializations, you can run the training process 10 times by providing different seeds for generating random numbers using the parameter random_state in sct.train.Trainer. You can then check the pseudotime, vector field and latent representations derived from each of these 10 models to see whether the inference is stable or not. If not, you can possibly choose the one with the highest frequency.