SpaceCat allows you to easily generate hundreds of distinct features from multiplexed imaging data to enable easy downstream quantification. These features are all interpretable and easy to caluclate, and don't require complex postprocessing or heavy computational resources!
All that you need to get started is a segmentation mask (identifying the location of each cell) and a cell table (which assigns each cell to a cell type). If you haven't segmented and classified your cells yet, there are many options for segmenting cells in multiplexed imaging data, for example Mesmer, CellPose, or StarDist. Similarly, if you're looking for a good option for assigning cell types, a couple options to look at are Nimbus, Pixie, or Stellar. However, no matter which algorithms you choose for segmentation and cell assignment, you can still run SpaceCat on your data! You should pick the approach you find to generate the most accurate results, as that will directly influence the quality of the features that are extracted.
In order to take advantage of all the features in SpaceCat, we recommend you generate a hierarchy of cell type assignments. This will allow you to tailor the level of granularity of the cell types you use for different features. For example, you might want to look more broadly at total T cell abundance for some features, whereas in others it would be more beneficial to look specifically at exhausted CD8 T cells. SpaceCat cannot generate these hierarchies for you; it's up to you to decide how to group your cells together. However, once you've generated this grouping, you are then ready to run SpaceCat!
We have provided an example dataset to help you understand what your data should look like before utilizing SpaceCat. If you want to test out our tool, you can quickly run SpaceCat using the code below and the example files!
You can clone the git repository and install the necessary dependencies using the provided yml file. First clone the repository with
git clone https://github.com/angelolab/SpaceCat
Then change the working directory and create the conda environment using the provided yml file.
cd SpaceCat
conda env create -f environment.yml
Once the environment is created, you can activate it with
conda activate spacecat_env
We are currently working on making SpaceCat pip installable!
Your cell table will require a minimum amount of data to generate fundamental features. You can refer to cell_table.csv
in the example data for reference.
- The image name column identifies specific regions of tissue in your data, denoted by
fov
in the example cell table. - The segmentation label column identifies the label for each cell in the image, denoted by
label
in the example cell table. - The cell area column identifies the area of each cell, denoted by
area
in the example cell table. - The two centroid columns should detail the x and y location of each cell in the image, denoted by
centroid-0
andcentroid-1
in the example cell table. - At least one cluster assignment column needs to specify each cell type in order to generate SpaceCat features; however we recommend you include two or more levels of granularity.
These columns are denoted by
cell_meta_cluster
,cell_cluster
, andcell_cluster_broad
in the example cell table. - Columns detailing the normalized signal intensity in each cell are included in the example cell table, so that functional marker features to be generated. You may include all markers in the initial cell table conversion, and filter which functional markers you would specifically like to include as features in the preprocessing section below.
The metadata file should contain any image level and tissue specific information.
The one required column is image column, named the same as in the cell table, so that the metadata can be merged into the cell table and stored in the anndata object created in the next step.
The example metadata.csv contains the image column fov
, a column denoting which tissue sample an image belongs to Tissue_ID
,
a column denoting which patient the tissue belongs to Patient_ID
, information on which timepoint the sample was collected at Timepoint
, and
the location of the sample Localization
.
If your single cell table is in csv format, it can be quickly converted to anndata using the below code.
Step 1: Read in the data files.
import os
import anndata
import pandas as pd
# read in data
data_dir = '/Documents/SpaceCat/data'
cell_table = pd.read_csv(os.path.join(data_dir, 'cell_table.csv'))
Step 2: Provide the necessary input variables.
The required variables are:
markers
: list of columns in your cell table representing marker intensitycentroid_cols
: list of two columns that denote the centroid values of each cellcell_data_cols
: list including the necessary columns above, as well as any other columns from the table containing cell level features you would like to include
# define column groupings
markers = ['Au', 'CD11c', 'CD14', 'CD163', 'CD20', 'CD3', 'CD31', 'CD38', 'CD4', 'CD45', 'CD45RB', 'CD45RO', 'CD56', 'CD57',
'CD68', 'CD69', 'CD8', 'CK17', 'Calprotectin', 'ChyTr', 'Collagen1', 'ECAD', 'FAP', 'FOXP3', 'Fe', 'Fibronectin',
'GLUT1', 'H3K27me3', 'H3K9ac', 'HLA1', 'HLADR', 'IDO', 'Ki67', 'LAG3', 'Noodle', 'PD1', 'PDL1', 'SMA', 'TBET',
'TCF1', 'TIM3', 'Vim']
centroid_cols = ['centroid-0', 'centroid-1']
cell_data_cols = ['fov', 'label', 'cell_meta_cluster', 'cell_cluster', 'cell_cluster_broad',
'compartment_example', 'compartment_area_example', 'area', 'major_axis_length', 'cell_size',
'distance_to__B', 'distance_to__Cancer', 'distance_to__Granulocyte', 'distance_to__Mono_Mac',
'distance_to__NK', 'distance_to__Other', 'distance_to__Structural', 'distance_to__T']
Step 3: Create the anndata object and save.
# create anndata from table subsetted for marker info, which will be stored in adata.X
adata = anndata.AnnData(cell_table.loc[:, markers])
# store all other cell data in adata.obs
adata.obs = cell_table.loc[:, cell_data_cols, ]
adata.obs_names = [str(i) for i in adata.obs_names]
# store cell centroid data in adata.obsm
adata.obsm['spatial'] = cell_table.loc[:, centroid_cols].values
# save the anndata object
os.makedirs(os.path.join(data_dir, 'adata'), exist_ok=True)
adata.write_h5ad(os.path.join(data_dir, 'adata', 'adata.h5ad'))
Once your data is stored in an anndata object, simply read in the single cell data with
adata = anndata.read_h5ad(os.path.join(data_dir, 'adata', 'adata.h5ad'))
Your single cell data will require some preprocessing before SpaceCat can generate features. This preprocessed anndata output will only need to be generated once and saved. Preprocessing consists of two steps, checking functional marker positivity within cells and assigning each cell to a compartment region based on any provided masks.
Here you can provide appropriate thresholds to indicate whether a cell is positive for each marker. This information will be used to generate functional marker features in SpaceCat.
If you have not identified thresholds for your markers, simply provide None
instead of a value and the threshold will be automatically
set at one standard deviation above the median marker intensity. Example shown below.
functional_marker_thresholds = [['Ki67', 0.002], ['CD38', 0.004], ['CD45RB', 0.001], ['CD45RO', 0.002],
['CD57', 0.002], ['CD69', 0.002], ['GLUT1', 0.002], ['IDO', 0.001],
['LAG3', 0.002], ['PD1', 0.0005], ['PDL1', 0.001],
['HLA1', 0.001], ['HLADR', 0.001], ['TBET', 0.0015], ['TCF1', 0.001],
['TIM3', 0.001], ['Vim', 0.002], ['Fe', 0.1]]
# functional_marker_thresholds = [['Ki67', None], ['CD38', 0.004], ...]
Provided compartment masks must be individual binary tiffs per compartment, with the file name indicating the compartment name you would like to use for feature generation; example compartment masks can be found in data/compartment_masks. You can use the Generalized Masking script in ark-analysis to create channel and/or cell masks for your data. We recommend you perform additional processing to ensure your compartment masks do not overlap.
The required variables are:
image_key
,seg_label_key
as described aboveseg_dir
: the directory where the segmentation masks for each image are storedmask_dir
: the directory where the various compartment masks are stored in a single folder for each imageseg_mask_substr
: the substring indicating the file is a segmentation mask, in the example data note the files are of the form 'TONIC_TMA4_R9C6_whole_cell.tiff', where '_whole_cell' is theseg_mask_substr
- if your segmentation mask file names are simply the image names and have no suffix, set
seg_mask_substr=None
- if your segmentation mask file names are simply the image names and have no suffix, set
If you do not have compartment masks set the seg_dir
and mask_dir
variables to None
, so cell to
compartment assignment can be skipped.
Then, you can run the preprocessing function and save the resulting anndata.
from SpaceCat.preprocess import preprocess_table
adata_processed = preprocess_table(adata, functional_marker_thresholds, image_key='fov',
seg_label_key='label', seg_dir=seg_dir, mask_dir=mask_dir,
seg_mask_substr='_whole_cell')
adata_processed.write_h5ad(os.path.join(data_dir, 'adata', 'adata_processed.h5ad'))
Once we have the appropriate input, we can set the required parameters and generate some features!
The required variables are:
image_key
,seg_label_key
,cell_area_key
,cluster_key
as described abovefunctional_feature_level
: which of the cluster levels to generate functional features fordiversity_feature_level
: which of the cluster levels to generate cell diversity features forpixel_radius
: radius in pixels which will be used to define the neighbors of each cell, for the example data 50 pixels (~20 microns) is used
Optional variables are:
compartment_key
: column name in .obs which contains cell assignments to various region types in the image, will be 'compartment' if cells were assigned using the preprocessing stepcompartment_area_key
: column name in .obs which stores the compartment areas, will be 'compartment_area' if cells were assigned using the preprocessing stepspecified_ratios_cluster_key
: the cluster level you wish to compute specific cell ratios for, in the example below:'cell_cluster'
- ratios are already calculated across all pairwise combinations of your most broad cell cluster, this is an optional additional ratio specification
specified_ratios
: a list of cell type pairs to compute ratios for, all specified cell types must belong to thespecified_ratios_cluster_key
classification, see below for an exampleper_cell_stats
: list of specifications so SpaceCat can pull additional features from the cell table, there are 3 required inputs for each cell stat list:- 1: the category name you would like to give the set of features, in the example below:
'morphology'
- 2: the cell cluster level to calculate this statistic at, in the example below:
'cell_cluster'
- 3: the list of columns in the cell table for each statistic you would like to include, in the example below:
['area', 'major_axis_length']
- 1: the category name you would like to give the set of features, in the example below:
per_img_stats
: list of specifications so SpaceCat can include any additional image level features, there are 2 required inputs for each image stat list:- 1: the category name you would like to give the set of features, in the example below:
'fiber'
or'mixing_score'
- 2: the dataframe containing the image level stats, one column must be the
image_key
, while the other columns will indicate individual feature names to be included
- 1: the category name you would like to give the set of features, in the example below:
When provided with compartment information, SpaceCat will calculate region specific features for your data, as well as at the image level.
If you do not have compartment assignments and areas for each cell, set both of these variables to None
to direct
SpaceCat to compute only the image level features.
If you do not have an additional per_cell_stats
or per_img_stats
then you can exclude these from the run_spacecat()
function call. You can also exclude the specified_ratios_cluster_key
and specified_ratios
variables if you are not interested in this feature.
from SpaceCat.features import SpaceCat
# Initialize the class
adata_processed = anndata.read_h5ad(os.path.join(data_dir, 'adata', 'adata_processed.h5ad'))
features = SpaceCat(adata_processed, image_key='fov', seg_label_key='label', cell_area_key='area',
cluster_key=['cell_cluster', 'cell_cluster_broad'],
compartment_key='compartment', compartment_area_key='compartment_area')
# read in image level dataframes
fiber_df = pd.read_csv(os.path.join(data_dir, 'fiber_stats_table.csv'))
fiber_tile_df = pd.read_csv(os.path.join(data_dir, 'fiber_stats_per_tile.csv'))
mixing_df = pd.read_csv(os.path.join(data_dir, 'mixing_scores.csv'))
kmeans_img_proportions = pd.read_csv(os.path.join(data_dir, 'neighborhood_image_proportions.csv'))
kmeans_compartment_proportions = pd.read_csv(os.path.join(data_dir, 'neighborhood_compartment_proportions.csv'))
# specify cell type pairs to compute a ratio for
ratio_pairings = [('CD8T', 'CD4T'), ('CD4T', 'Treg'), ('CD8T', 'Treg'), ('CD68_Mac', 'CD163_Mac')]
# specify addtional per cell and per image stats
per_cell_stats=[
['morphology', 'cell_cluster', ['area', 'major_axis_length']],
['linear_distance', 'cell_cluster_broad', ['distance_to__B', 'distance_to__Cancer', 'distance_to__Granulocyte', 'distance_to__Mono_Mac',
'distance_to__NK', 'distance_to__Other', 'distance_to__Structural', 'distance_to__T']]
]
per_img_stats=[
['fiber', fiber_df],
['fiber', fiber_tile_df],
['mixing_score', mixing_df],
['kmeans_cluster', kmeans_img_proportions],
['kmeans_cluster', kmeans_compartment_proportions]
]
# Generate features and save anndata
adata_processed = features.run_spacecat(functional_feature_level='cell_cluster', diversity_feature_level='cell_cluster', pixel_radius=50,
specified_ratios_cluster_key='cell_cluster', specified_ratios=ratio_pairings,
per_cell_stats=per_cell_stats, per_img_stats=per_img_stats)
adata_processed.write_h5ad(os.path.join(data_dir, 'adata', 'adata_processed.h5ad'))
The output feature tables are stored in adata_processed.uns, and can be saved out to csvs if preferred.
# Save finalized tables to csv
os.makedirs(os.path.join(data_dir, 'SpaceCat'), exist_ok=True)
adata_processed.uns['combined_feature_data'].to_csv(os.path.join(data_dir, 'SpaceCat', 'combined_feature_data.csv'), index=False)
adata_processed.uns['combined_feature_data_filtered'].to_csv(os.path.join(data_dir, 'SpaceCat', 'combined_feature_data_filtered.csv'), index=False)
adata_processed.uns['feature_metadata'].to_csv(os.path.join(data_dir, 'SpaceCat', 'feature_metadata.csv'), index=False)
adata_processed.uns['excluded_features'].to_csv(os.path.join(data_dir, 'SpaceCat', 'excluded_features.csv'), index=False)
If you would like to merge your metadata into any of the above tables, you can do so using the shared image key
metadata = pd.read_csv(os.path.join(data_dir, 'metadata.csv'))
features_filtered = pd.read_csv(os.path.join(data_dir, 'SpaceCat', 'combined_feature_data_filtered.csv'))
# merge metadata into output table
merge_table = pd.merge(features_filtered, metadata, on=['fov'])
You can also add the metadata in your anndata object and save with
adata_processed.uns['metadata'] = metadata
adata_processed.write_h5ad(os.path.join(data_dir, 'adata', 'adata_processed.h5ad'))
All features are computed separately in each image. In addition, if you provided optional compartment assignments, the features will also be computed within each compartment.
density
: The number of cells divided by the area of the region.density_ratio
: The ratio between the densities of cell types. This is done for every pairwise combination of cells at the broadest level of clustering.functional_marker
: For each cell type/functional marker combination, the proportion of cells positive for that marker, using the supplied marker-specific thresholds to determine positivity.cell_diversity
: This diversity feature is based on cell proximity. For each cell in the image, the proportions of each cell type within a specified pixel radius was computed. Then the Shannon diversity index was calculated on these proportions.region_diversity
: This diversity feature is based on cell abundance. For the broadest cell cluster level, the proportion of cells of each lower cell type was extracted. We then compute the diversity of cell types within each broad category using the Shannon Diversity index.- This feature was computed for cells at a broad level of clustering that were composed of at least two distinct lower cell types.
density_proportion
: For each lower level cell type in a given broader cell type category, the proportion of the number of broader cells that the lower cell type represented was calculated.- This feature was computed for cells at a broad level of clustering that were composed of at least two distinct lower cell types.
Coming soon:
kmeans_cluster
: Using k-means clustering to define cell neighborhoods in each image, we then calculated the proportion of cells belonging to each of the identified clusters across the region.