Unsupervised surgery pipeline with SCVI

[1]:
import os
os.chdir('../')
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
[2]:
import scanpy as sc
import torch
import scarches as sca
from scarches.dataset.trvae.data_handling import remove_sparsity
import matplotlib.pyplot as plt
import numpy as np
import gdown
[3]:
sc.settings.set_figure_params(dpi=200, frameon=False, figsize=(4, 4))
torch.set_printoptions(precision=3, sci_mode=False, edgeitems=7)

Set relevant anndata.obs labels and training length

Here we use the CelSeq2 and SS2 studies as query data and the other three studies as the reference atlas. We strongly suggest using early stopping to avoid over-fitting. For SCVI, the best early stopping criterion is the 'elbo'.

[4]:
condition_key = 'study'
cell_type_key = 'cell_type'
target_conditions = ['Pancreas CelSeq2', 'Pancreas SS2']


vae_epochs = 500
surgery_epochs = 500

early_stopping_kwargs = {
    "early_stopping_metric": "elbo",   # monitor the ELBO during training
    "save_best_state_metric": "elbo",  # keep the model state with the best ELBO
    "patience": 10,                    # stop after 10 epochs without improvement
    "threshold": 0,
    "reduce_lr_on_plateau": True,      # lower the learning rate when the ELBO plateaus
    "lr_patience": 8,
    "lr_factor": 0.1,
}

Download the dataset and split it into reference and query datasets

[5]:
url = 'https://drive.google.com/uc?id=1ehxgfHTsMZXy6YzlFKGJOsBKQ5rrvMnd'
output = 'pancreas.h5ad'
gdown.download(url, output, quiet=False)
Downloading...
From: https://drive.google.com/uc?id=1ehxgfHTsMZXy6YzlFKGJOsBKQ5rrvMnd
To: C:\Users\sergei.rybakov\projects\notebooks\pancreas.h5ad
126MB [00:40, 3.14MB/s]
[5]:
'pancreas.h5ad'
[6]:
adata_all = sc.read('pancreas.h5ad')
[7]:
adata = adata_all.raw.to_adata()
adata = remove_sparsity(adata)
source_adata = adata[~adata.obs[condition_key].isin(target_conditions)].copy()
target_adata = adata[adata.obs[condition_key].isin(target_conditions)].copy()
[8]:
source_adata
[8]:
AnnData object with n_obs × n_vars = 10294 × 1000
    obs: 'batch', 'study', 'cell_type', 'size_factors'
[9]:
target_adata
[9]:
AnnData object with n_obs × n_vars = 5387 × 1000
    obs: 'batch', 'study', 'cell_type', 'size_factors'
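As a quick sanity check (this step is not part of the original notebook), we can confirm that the two query studies ended up in target_adata and the remaining three studies in source_adata:

# Optional check: list which studies landed in each split.
print(source_adata.obs[condition_key].value_counts())
print(target_adata.obs[condition_key].value_counts())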

Create SCVI model and train it on reference dataset

[10]:
sca.dataset.setup_anndata(source_adata, batch_key=condition_key)
INFO      Using batches from adata.obs["study"]
INFO      No label_key inputted, assuming all cells have same label
INFO      Using data from adata.X
INFO      Computing library size prior per batch
INFO      Successfully registered anndata object containing 10294 cells, 1000 vars, 3
          batches, 1 labels, and 0 proteins. Also registered 0 extra categorical covariates
          and 0 extra continuous covariates.
INFO      Please do not further modify adata until model is trained.

The parameters chosen here have proved to work best for surgery with SCVI.

[12]:
vae = sca.models.SCVI(
    source_adata,
    n_layers=2,
    encode_covariates=True,          # also feed the batch covariate to the encoder
    deeply_inject_covariates=False,  # concatenate covariates only to the first layer, not to every hidden layer
    use_layer_norm="both",
    use_batch_norm="none",
)
[13]:
vae.train(n_epochs=vae_epochs, frequency=1, early_stopping_kwargs=early_stopping_kwargs)
INFO      Training for 500 epochs
INFO      KL warmup for 400 epochs
Training...:  19%|█████████████▏                                                      | 97/500 [03:20<18:24,  2.74s/it]INFO      Reducing LR on epoch 97.
Training...:  24%|████████████████▏                                                  | 121/500 [04:26<17:22,  2.75s/it]INFO      Reducing LR on epoch 121.
Training...:  29%|███████████████████▋                                               | 147/500 [05:38<16:11,  2.75s/it]INFO      Reducing LR on epoch 147.
Training...:  31%|████████████████████▉                                              | 156/500 [06:02<15:47,  2.76s/it]INFO      Reducing LR on epoch 156.
Training...:  33%|██████████████████████▍                                            | 167/500 [06:33<15:15,  2.75s/it]INFO      Reducing LR on epoch 167.
Training...:  34%|██████████████████████▋                                            | 169/500 [06:38<15:09,  2.75s/it]INFO
          Stopping early: no improvement of more than 0 nats in 10 epochs
INFO      If the early stopping criterion is too strong, please instantiate it with different
          parameters in the train method.
Training...:  34%|██████████████████████▋                                            | 169/500 [06:41<13:06,  2.37s/it]
INFO      Training is still in warming up phase. If your applications rely on the posterior
          quality, consider training for more epochs or reducing the kl warmup.
INFO      Training time:  265 s. / 500 epochs

Create an AnnData object of the latent representation and compute UMAP

[14]:
reference_latent = sc.AnnData(vae.get_latent_representation())
reference_latent.obs["cell_type"] = source_adata.obs[cell_type_key].tolist()
reference_latent.obs["batch"] = source_adata.obs[condition_key].tolist()
[15]:
sc.pp.neighbors(reference_latent, n_neighbors=8)
sc.tl.leiden(reference_latent)
sc.tl.umap(reference_latent)
sc.pl.umap(reference_latent,
           color=['batch', 'cell_type'],
           frameon=False,
           wspace=0.6,
           )
... storing 'cell_type' as categorical
... storing 'batch' as categorical
_images/scvi_surgery_pipeline_19_1.png

After pretraining, the model can be saved for later use

[16]:
ref_path = 'ref_model/'
vae.save(ref_path, overwrite=True)
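If you want to come back to this checkpoint in a later session, the saved reference model can be restored from disk before performing surgery. A minimal sketch, assuming the load classmethod of the installed scArches/scvi version; the exact argument order of load may differ between versions, so check your installation:

# Hedged sketch: restore the pretrained reference model from 'ref_model/'.
# Verify the signature of `load` in your installed scArches/scvi version.
vae = sca.models.SCVI.load(ref_path, source_adata)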

Perform surgery on reference model and train on query dataset

[17]:
model = sca.models.SCVI.load_query_data(
    target_adata,
    ref_path,
    freeze_dropout=True,
)
INFO      .obs[_scvi_labels] not found in target, assuming every cell is same category
INFO      Using data from adata.X
INFO      Computing library size prior per batch
INFO      Registered keys:['X', 'batch_indices', 'local_l_mean', 'local_l_var', 'labels']
INFO      Successfully registered anndata object containing 5387 cells, 1000 vars, 5 batches,
          1 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0
          extra continuous covariates.
WARNING   Make sure the registered X field in anndata contains unnormalized count data.
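The warning above matters: scVI models unnormalized count data. As an optional sanity check (a rough sketch, not part of the original pipeline), the query matrix can be inspected before training:

# Rough check that adata.X looks like raw counts:
# values should be non-negative and, for UMI data, essentially integer-valued.
x = target_adata.X
print("min:", x.min(), "max:", x.max(),
      "fraction of integer entries:", float(np.mean(np.mod(x, 1) == 0)))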
[18]:
model.train(n_epochs=surgery_epochs, frequency=1, early_stopping_kwargs=early_stopping_kwargs, weight_decay=0)
INFO      Training for 500 epochs
INFO      KL warmup for 400 epochs
Training...:  11%|███████▍                                                            | 55/500 [01:07<09:02,  1.22s/it]INFO      Reducing LR on epoch 55.
Training...:  14%|█████████▌                                                          | 70/500 [01:25<08:42,  1.21s/it]INFO      Reducing LR on epoch 70.
Training...:  14%|█████████▊                                                          | 72/500 [01:27<08:40,  1.22s/it]INFO
          Stopping early: no improvement of more than 0 nats in 10 epochs
INFO      If the early stopping criterion is too strong, please instantiate it with different
          parameters in the train method.
Training...:  14%|█████████▊                                                          | 72/500 [01:28<08:48,  1.24s/it]
INFO      Training is still in warming up phase. If your applications rely on the posterior
          quality, consider training for more epochs or reducing the kl warmup.
INFO      Training time:  53 s. / 500 epochs
[19]:
query_latent = sc.AnnData(model.get_latent_representation())
query_latent.obs['cell_type'] = target_adata.obs[cell_type_key].tolist()
query_latent.obs['batch'] = target_adata.obs[condition_key].tolist()
WARNING   Make sure the registered X field in anndata contains unnormalized count data.
[20]:
sc.pp.neighbors(query_latent)
sc.tl.leiden(query_latent)
sc.tl.umap(query_latent)
plt.figure()
sc.pl.umap(
    query_latent,
    color=["batch", "cell_type"],
    frameon=False,
    wspace=0.6,
)
... storing 'cell_type' as categorical
... storing 'batch' as categorical
<Figure size 320x320 with 0 Axes>
_images/scvi_surgery_pipeline_26_2.png
[21]:
surgery_path = 'surgery_model'
model.save(surgery_path, overwrite=True)

Get the latent representation of the combined reference + query dataset and compute UMAP

[22]:
adata_full = source_adata.concatenate(target_adata)
full_latent = sc.AnnData(model.get_latent_representation(adata=adata_full))
full_latent.obs['cell_type'] = adata_full.obs[cell_type_key].tolist()
full_latent.obs['batch'] = adata_full.obs[condition_key].tolist()
INFO      Input adata not setup with scvi. attempting to transfer anndata setup
INFO      Using data from adata.X
INFO      Computing library size prior per batch
INFO      Registered keys:['X', 'batch_indices', 'local_l_mean', 'local_l_var', 'labels']
INFO      Successfully registered anndata object containing 15681 cells, 1000 vars, 5
          batches, 1 labels, and 0 proteins. Also registered 0 extra categorical covariates
          and 0 extra continuous covariates.
[23]:
sc.pp.neighbors(full_latent)
sc.tl.leiden(full_latent)
sc.tl.umap(full_latent)
plt.figure()
sc.pl.umap(
    full_latent,
    color=["batch", "cell_type"],
    frameon=False,
    wspace=0.6,
)
... storing 'cell_type' as categorical
... storing 'batch' as categorical
<Figure size 320x320 with 0 Axes>
_images/scvi_surgery_pipeline_30_2.png
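
To see how well the query cells mix with the reference in this joint embedding, it can help to colour the UMAP by the origin of each cell. A small sketch; the 'ref_query' column is our own helper, not something created by the pipeline:

# Hypothetical helper column: mark each cell as Reference or Query by its study.
full_latent.obs['ref_query'] = [
    'Query' if study in target_conditions else 'Reference'
    for study in adata_full.obs[condition_key]
]
sc.pl.umap(full_latent, color=['ref_query'], frameon=False)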