Commit f900424: Notebooks pertaining to GSPA submission

Aarthi Venkat committed Jun 24, 2024
Showing 63 changed files with 21,843 additions and 0 deletions.
18 changes: 18 additions & 0 deletions README.md
# Notebooks for analysis from "Mapping the gene space at single-cell resolution with gene signal pattern analysis"

Gene Signal Pattern Analysis is a Python package for mapping the gene space from single-cell data. For a detailed explanation of GSPA and potential downstream applications, see:

Mapping the gene space at single-cell resolution with Gene Signal Pattern Analysis. Aarthi Venkat, Sam Leone, Scott E. Youlten, Eric Fagerberg, John Attanasio, Nikhil S. Joshi, Michael Perlmutter, Smita Krishnaswamy.

By considering gene expression values as signals on the cell-cell graph, GSPA enables complex analyses of gene-gene relationships, including gene cluster analysis, cell-cell communication, and patient manifold learning from gene-gene graphs.
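As a quick orientation, the sketch below follows the GSPA API as it is exercised in `analysis_runtime/evaluate_runtime.py`; the data path and the cells-by-genes orientation of the count matrix are illustrative assumptions:

```python
import numpy as np
import gspa

# Illustrative input: a cells x genes count matrix from the simulated data
counts = np.load('../data/splatter_simulated_data.npz')['data']

gspa_op = gspa.GSPA(qr_decompose=False)  # True selects the reduced GSPA_QR variant
gspa_op.construct_graph(counts)          # build the cell-cell graph
gspa_op.build_diffusion_operator()
gspa_op.build_wavelet_dictionary()

# Each gene's expression vector is treated as a signal on the cell-cell graph
gene_ae, gene_pc = gspa_op.get_gene_embeddings(counts.T)
gene_localization = gspa_op.calculate_localization()
```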

See the following directories to generate results:

- `analysis_batch_correction` for Extended Data Figure 1
- `analysis_wavelets` for Extended Data Figures 3 and 4
- `analysis_coexpression` for Figure 2, Extended Data Figures 5, 6, 7, and 10, and Extended Data Table 1
- `analysis_localization` for Figure 3, Extended Data Figures 9 and 10, and Extended Data Table 1
- `analysis_tcells` for Figure 4, Extended Data Figures 11 and 12, and Extended Data Table 2
- `analysis_cellcomm` for Figure 5, Extended Data Figures 13 and 14, and Extended Data Table 3
- `analysis_spatial` for Figure 6, Extended Data Figure 15, and Extended Data Table 4
- `analysis_patient` for Figure 7 and Extended Data Table 5
691 changes: 691 additions & 0 deletions analysis_batch_correction/1.0_batch_correction.ipynb

426 changes: 426 additions & 0 deletions analysis_cellcomm/1.0_learn_LR_representation.ipynb

735 changes: 735 additions & 0 deletions analysis_cellcomm/2.0_characterize_LR_embeddings.ipynb

338 changes: 338 additions & 0 deletions analysis_cellcomm/3.0_compare_to_cellphonedb.ipynb

251 changes: 251 additions & 0 deletions analysis_cellcomm/4.0_get_cell_type_specificity.ipynb

218 changes: 218 additions & 0 deletions analysis_coexpression/0.0_visualize_cell_trajectories.ipynb

167 changes: 167 additions & 0 deletions analysis_coexpression/1.0_visualize_performance_comparison.ipynb

491 changes: 491 additions & 0 deletions analysis_coexpression/1.1_calculate_gene_num_cells.ipynb

478 changes: 478 additions & 0 deletions analysis_coexpression/1.2_calculate_gene_clusters.ipynb

139 changes: 139 additions & 0 deletions analysis_coexpression/1.3_calculate_gene_trajectories.ipynb

279 changes: 279 additions & 0 deletions analysis_coexpression/1.4_calculate_gene_archetypes.ipynb

502 changes: 502 additions & 0 deletions analysis_coexpression/2.0_calculate_branch_GSPA_embeddings.ipynb

330 changes: 330 additions & 0 deletions analysis_coexpression/4.0_analyze_PBMC.ipynb

376 changes: 376 additions & 0 deletions analysis_coexpression/4.1_analyze_EB.ipynb

113 changes: 113 additions & 0 deletions analysis_coexpression/evaluate_coexpression.py
"""
Evaluate gene embeddings for GSPA and comparisons based on anti-correlation between gene embedding distance and true gene-gene coexpression. Stratified for library size (due to differences in correlation based on library size) and true coexpression (due to large imbalance and majority of gene pairs not coexpressed).
"""

import numpy as np
from collections import defaultdict
import sys
from scipy.stats import spearmanr

model = sys.argv[1]
dataset = sys.argv[2]

if dataset == '2_branches':
    datafile = 'splatter_simulated_data_2_branches.npz'
    extension = '_2_branches'
elif dataset == '3_branches':
    datafile = 'splatter_simulated_data_3_branches.npz'
    extension = '_3_branches'
elif dataset == 'linear':
    datafile = 'splatter_simulated_data.npz'
    extension = ''
else:
    sys.exit('Dataset choice not in [linear 2_branches 3_branches]')

# confirm model choice
if model not in ['Eigenscore', 'GFMMD', 'Signals', 'DiffusionEMD', 'GSPA', 'GSPA_QR', 'MAGIC', 'Node2Vec_Gcell', 'GAE_noatt_Gcell', 'GAE_att_Gcell', 'Node2Vec_Ggene', 'GAE_noatt_Ggene', 'GAE_att_Ggene', 'SIMBA', 'siVAE']:
    sys.exit('Model choice not in [Eigenscore GFMMD Signals DiffusionEMD GSPA GSPA_QR MAGIC Node2Vec_Gcell GAE_noatt_Gcell GAE_att_Gcell Node2Vec_Ggene GAE_noatt_Ggene GAE_att_Ggene SIMBA siVAE]')

trajectory_data = np.load(f'../data/{datafile}')
data = trajectory_data['data']
true_counts = trajectory_data['true_counts']
true_lib_size = true_counts.T.sum(axis=1)  # total true counts per gene (library-size proxy)

# get coexpression embeddings
coexpression_results = {}

for run_id in [7, 8, 9]:
    run = f'results/{model}/{run_id}_results{extension}.npz'
    res = np.load(run, allow_pickle=True)
    name = res['config'][()]['save_as']
    coexpression_results[name] = res['signal_embedding']

print('Stratify Spearman correlation...')
spearman_res = spearmanr(true_counts)
np.fill_diagonal(spearman_res.correlation, 0)
corr_bins = np.linspace(spearman_res.correlation.min(), spearman_res.correlation.max(), 4)

min_bin_size = float('inf')
for i, corr in enumerate(corr_bins):
    if i == 0: continue
    choices = np.array(np.where((spearman_res.correlation > corr_bins[i-1]) & (spearman_res.correlation < corr) & (spearman_res.correlation != 0))).T
    if choices.shape[0] < min_bin_size:
        min_bin_size = choices.shape[0]

spearmans = defaultdict(list)

print('Stratify library size...')
min_lib_size = float('inf')
for i, corr in enumerate(corr_bins):
    if i == 0: continue

    ## res.correlation does not equal zero, excluding self edges
    choices = np.array(np.where((spearman_res.correlation > corr_bins[i-1]) & (spearman_res.correlation < corr) & (spearman_res.correlation != 0))).T

    lib_size_mean_per_pair = np.vstack((true_lib_size[choices[:, 0]], true_lib_size[choices[:, 1]])).mean(axis=0)
    lib_size = np.linspace(lib_size_mean_per_pair.min(), lib_size_mean_per_pair.max(), 3)

    for j, bin in enumerate(lib_size):
        if j == 0: continue
        choices_ = np.array(np.where((lib_size_mean_per_pair > lib_size[j-1]) & (lib_size_mean_per_pair < bin))).T
        if choices_.shape[0] < min_lib_size:
            min_lib_size = choices_.shape[0]

print('Sample by stratification...')
samples = []
for i, corr in enumerate(corr_bins):

    choices_bin = []
    if i == 0: continue

    ## res.correlation does not equal zero, excluding self edges
    choices = np.array(np.where((spearman_res.correlation > corr_bins[i-1]) & (spearman_res.correlation < corr) & (spearman_res.correlation != 0))).T

    lib_size_mean_per_pair = np.vstack((true_lib_size[choices[:, 0]], true_lib_size[choices[:, 1]])).mean(axis=0)
    lib_size = np.linspace(lib_size_mean_per_pair.min(), lib_size_mean_per_pair.max(), 3)

    for j, bin in enumerate(lib_size):
        if j == 0: continue
        choices_ = np.array(np.where((lib_size_mean_per_pair > lib_size[j-1]) & (lib_size_mean_per_pair < bin))).T
        choices_bin.append(choices_[np.random.choice(choices_.shape[0], size=min_lib_size, replace=False)])

    samples.append(choices[np.vstack(choices_bin).flatten()])

samples = np.vstack(samples)

f = open(f"results/{model}/spearmanr{extension}_demap_789.txt", "w")

for (name, embedding) in coexpression_results.items():
    X = []
    y = []

    for (a, b) in samples:
        X.append(np.linalg.norm(embedding[a] - embedding[b]))
        y.append(spearman_res.correlation[a][b])

    X = np.array(X)
    y = np.array(y)

    spearmans[name].append(spearmanr(X, y).correlation)

    f.write(f'{name} Spearman {np.median(spearmans[name])}\n')

f.close()
6 changes: 6 additions & 0 deletions analysis_coexpression/perform_comparison.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
for model in Eigenscore GFMMD Signals DiffusionEMD GSPA GSPA_QR MAGIC Node2Vec_Ggene GAE_noatt_Ggene GAE_att_Ggene SIMBA siVAE; do
    echo ${model}
    for dataset in linear 2_branches 3_branches; do
        python evaluate_coexpression.py ${model} ${dataset}
    done
done
157 changes: 157 additions & 0 deletions analysis_localization/1.0_visualize_performance_comparison.ipynb

612 changes: 612 additions & 0 deletions analysis_localization/2.0_feature_select_PBMC.ipynb

2,231 changes: 2,231 additions & 0 deletions analysis_localization/2.1_feature_select_EB.ipynb

40 changes: 40 additions & 0 deletions analysis_localization/evaluate_localization.py
import numpy as np
import sys
from scipy.stats import spearmanr

model = sys.argv[1]
dataset = sys.argv[2]

if dataset == '2_branches':
    extension = '_2_branches'
elif dataset == 'sparse_branches':
    extension = '_sparse_branches'
elif dataset == '3_branches':
    extension = '_3_branches'
elif dataset == 'linear':
    extension = ''
else:
    sys.exit('Dataset choice not in [linear 2_branches 3_branches sparse_branches]')

# confirm model choice
if model not in ['SIMBA', 'siVAE', 'Eigenscore', 'GFMMD', 'Signals', 'DiffusionEMD', 'GSPA', 'GSPA_QR', 'MAGIC', 'Node2Vec_Ggene', 'GAE_noatt_Ggene', 'GAE_att_Ggene']:
    sys.exit('Model choice not in [SIMBA siVAE Eigenscore GFMMD Signals DiffusionEMD GSPA GSPA_QR MAGIC Node2Vec_Ggene GAE_noatt_Ggene GAE_att_Ggene]')

labels_y = np.load(f'../data/localization_signals{extension}.npz')['spread']

# get embeddings
localization_scores = {}

for run_id in [7, 8, 9]:
    run = f'results/{model}/{run_id}_results{extension}.npz'
    res = np.load(run, allow_pickle=True)
    name = res['config'][()]['save_as']
    localization_scores[name] = res['localization_score']

f = open(f'results/{model}/spearmanr{extension}_789.txt', 'w')

spearmans = {}
for (name, score) in localization_scores.items():
    spearmans[name] = spearmanr(score, labels_y).correlation
    f.write(f'{name} Spearman {spearmans[name]}\n')

f.close()
6 changes: 6 additions & 0 deletions analysis_localization/perform_comparison.sh
for model in GFMMD Eigenscore GAE_att_Ggene Signals GSPA GSPA_QR MAGIC DiffusionEMD Node2Vec_Ggene GAE_noatt_Ggene SIMBA siVAE; do
    echo ${model}
    for dataset in linear 2_branches 3_branches; do
        python evaluate_localization.py ${model} ${dataset}
    done
done
415 changes: 415 additions & 0 deletions analysis_runtime/1.0_visualize_runtime.ipynb

30 changes: 30 additions & 0 deletions analysis_runtime/evaluate_runtime.py
import sys
import numpy as np
import pandas as pd
import time
import gspa

count = int(sys.argv[1])
qr_decompose = sys.argv[2] == 'True'

data = np.load('../data/large_splatter_simulated_data.npz')
counts = pd.DataFrame(data['data']).sample(count).values  # subsample `count` cells

start = time.time()
gspa_op = gspa.GSPA(qr_decompose=qr_decompose)
gspa_op.construct_graph(counts)
end = time.time()
print('Construct graph', end - start)

start = time.time()
gspa_op.build_diffusion_operator()
gspa_op.build_wavelet_dictionary()
end = time.time()
print('Wavelet dictionary', end - start)

start = time.time()
gene_signals = counts.T  # embed all measured genes
gene_ae, gene_pc = gspa_op.get_gene_embeddings(gene_signals)
gene_localization = gspa_op.calculate_localization()
end = time.time()
print('Gene embedding', end - start)
19 changes: 19 additions & 0 deletions analysis_runtime/perform_comparison.sh
#!/bin/bash

#SBATCH --job-name=runtime_fast
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --time=24:00:00
#SBATCH --partition=scavenge
#SBATCH --mem-per-cpu=100G
#SBATCH --mail-type=ALL

module load miniconda
conda activate gspa

for count in 100 1000 5000 25000 50000 100000; do
    for reduced in False True; do
        echo ${count} ${reduced}
        python evaluate_runtime.py ${count} ${reduced}
    done
done