
Can we use MELD on single replicate #56

Open
Rohit-Satyam opened this issue Jan 24, 2022 · 8 comments

Comments

@Rohit-Satyam

Rohit-Satyam commented Jan 24, 2022

Hi Developers!

We have single-cell data for two time points, each containing a control and a knockout sample (t1_ctr, t1_ko, t2_ctr, t2_ko). We wish to find out whether the knockout cells are arrested in the lifecycle more at time T1 or at time T2. Is this achievable given that we have only one replicate per time point?

Also, regarding the following code chunk in the tutorial:

with Parallel(n_jobs=36) as p:
    for knn in knn_range:
        # doing this outside the parallel loop because building the graph takes the longest
        benchmarker.fit_graph(adata.X, knn=knn)
        print(knn)
        curr_results = p(delayed(simulate_pdf_calculate_likelihood)(benchmarker, seed, beta) \
                                       for seed in range(25) for beta in beta_range)
        curr_results = pd.DataFrame(curr_results, columns = ['MSE', 'seed', 'beta', 'knn'])
        results.append(curr_mse)

What is adata.X? It is giving me an error.

@dburkhardt
Member

Hi Rohit! You can run MELD on each pair of samples independently (i.e. t1_ctr vs t1_ko; t2_ctr vs t2_ko), then compare the magnitudes of the MELD likelihoods if they were calculated on the same graph. To know whether you can build a single graph, run PHATE on the datasets and examine the amount of overlap between the pairs of samples. If they are totally overlapping, then you're good to go with one graph. Otherwise, I would run MELD start to finish separately on each pair of samples.

adata here is an Annotated Data Matrix (AnnData). https://anndata.readthedocs.io/en/latest/

Please let me know if this is helpful or if you still have questions.
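For anyone coming from R, here is a minimal numpy-only sketch of the kind of matrix that goes into adata.X (the cells-by-genes expression matrix of an AnnData object). As noted later in this thread, the tutorial expects library-size-normalized values there rather than raw counts; the variable names below are illustrative, not from the MELD codebase.

```python
import numpy as np

# hypothetical raw counts: 100 cells x 50 genes
rng = np.random.default_rng(0)
raw_counts = rng.poisson(2.0, size=(100, 50)).astype(float)

# library-size normalization: rescale each cell so its total count
# equals the median total count across cells
totals = raw_counts.sum(axis=1, keepdims=True)
libnorm = raw_counts / totals * np.median(totals)

# `libnorm` is the kind of matrix that would live in adata.X, e.g.
#   adata = anndata.AnnData(X=libnorm)
#   benchmarker.fit_graph(adata.X, knn=knn)
```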

@Rohit-Satyam
Author

Rohit-Satyam commented Jan 25, 2022

Below is my PHATE scatter plot (default parameters) from running PHATE on all 4 samples together as a single matrix. Is there a measure of overlap we can use to determine whether we can build a single graph, or does it rely on visual inspection of the PHATE plot?

Also, for the AnnData (adata.X) assay, should I use the raw counts (data) or the square-root-transformed data (data_sqrt)? I mostly do single-cell analysis in R, so AnnData is new to me. The documentation here says it has to be just a data matrix.

Edit 1: It should be mentioned in the documentation that adata.X must be a library-size-normalized matrix, not the raw counts. I found this in an issue here.

[image: PHATE scatter plot of the 4 samples]

@dburkhardt
Member

This looks pretty clearly like there's minimal overlap between the T1 and T2 datasets, so I would analyze them separately.

@stanleyjs
Contributor

Hi Rohit,

I am a co-author on MELD. I want to chime in on your question about quantifying the batch effect.
I agree that by visual inspection alone it is clear that T1 and T2 are not well-mixed.

However, there are ways to quantify this without visual inspection. A very simple metric would be a multinomial test. First, treat the probability of a cell belonging to batch t1 or t2 as a multinomial with a single count. For example, say you have 400 t1 cells and 600 t2 cells. Then, for a given cell, if the batches are distributed uniformly, the null hypothesis is that p(t1) = 0.4 and p(t2) = 0.6. Call this the global distribution.

Next, for each cell, you essentially check whether its local neighborhood of k cells resembles that global multinomial: how close are the local batch distributions to the global distribution? This quantifies how "non-uniform" each cell's neighborhood is relative to the global frequency of the batches. For example, if the global probabilities are p(t1) = 0.4 and p(t2) = 0.6, but the local probabilities for a neighborhood of radius 25 around some cell c are p(t1 | c25) = 0.9 and p(t2 | c25) = 0.1, then t1 is clearly enriched around c and the local distribution is very different from the global null.

If you want a total metric of non-uniformity based on the local cell distributions, take some aggregation of the test statistic over the cells, such as the mean, AUC, etc.

The precise test you need is the multinomial test: https://en.wikipedia.org/wiki/Multinomial_test. You can use Pearson's chi-squared test for this.

I think that this or a very similar approach was demonstrated for biological purposes in "A test metric for assessing single-cell RNA-seq batch correction" by Büttner et al. (2019) in Nature Methods: https://www.nature.com/articles/s41592-018-0254-1
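The per-cell test described above can be sketched with scipy and scikit-learn; the embedding `emb`, the label array `batches`, and the function name are all illustrative stand-ins, not part of MELD.

```python
import numpy as np
from scipy.stats import chisquare
from sklearn.neighbors import NearestNeighbors

def local_batch_uniformity(emb, batches, k=25):
    """Chi-squared p-value per cell comparing its k-NN batch composition
    to the global batch frequencies (small p = locally non-uniform)."""
    batches = np.asarray(batches)
    labels, global_counts = np.unique(batches, return_counts=True)
    global_freq = global_counts / global_counts.sum()

    # neighborhood of k cells around each cell in the embedding
    nn = NearestNeighbors(n_neighbors=k).fit(emb)
    _, idx = nn.kneighbors(emb)

    pvals = np.empty(len(batches))
    for i, neigh in enumerate(idx):
        local_counts = np.array([(batches[neigh] == l).sum() for l in labels])
        expected = global_freq * k  # expected counts under the global null
        pvals[i] = chisquare(local_counts, expected).pvalue
    return pvals
```

Aggregating these p-values (or the underlying statistics) over all cells gives the single "non-uniformity" score mentioned above.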

Jay

@Rohit-Satyam
Author

Rohit-Satyam commented Jan 28, 2022

@dburkhardt I split my assay matrix by time points T1 and T2 and was planning to run the benchmark for parameter search separately on each. However, I got an error: curr_mse is not defined. Is this a typo in the tutorial?

Edit 1: Okay, I got it. It should be curr_results, as discussed here.
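For anyone else hitting this, here is a self-contained toy mirroring the tutorial's parallel sweep pattern with that last line fixed; `fake_benchmark` is a purely illustrative stand-in for `simulate_pdf_calculate_likelihood`, and the ranges are shrunk so it runs quickly.

```python
from joblib import Parallel, delayed
import pandas as pd

def fake_benchmark(seed, beta):
    # stand-in for simulate_pdf_calculate_likelihood;
    # returns one row of (MSE, seed, beta, knn)
    return (seed * beta, seed, beta, 5)

beta_range = [1, 2]
results = []
with Parallel(n_jobs=2) as p:
    for knn in [5]:
        curr_results = p(delayed(fake_benchmark)(seed, beta)
                         for seed in range(3) for beta in beta_range)
        curr_results = pd.DataFrame(curr_results,
                                    columns=['MSE', 'seed', 'beta', 'knn'])
        results.append(curr_results)  # fixed: the tutorial had `curr_mse` here
results = pd.concat(results, ignore_index=True)
```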

@Rohit-Satyam
Author

Rohit-Satyam commented Jan 30, 2022

I don't understand why the likelihood estimates for both control and knockout are all 1. Is it because of a lack of replicates? I will reiterate how I performed the analysis.

  1. We have single-cell RNA-seq data obtained from the lifecycle of a pathogen. The cells were harvested at two time points in the lifecycle, T1 and T2 (each time point contains a control and a knockout sample). Thus we have 4 samples.
  2. I performed the preprocessing using DropletUtils and Seurat, and performed DE analysis thereafter. I was then asked to find out whether the knockout cells are arrested in the lifecycle more at time T1 or at time T2.
  3. I therefore wish to use MELD to obtain likelihood estimates. I split the assay and metadata into T1_data and T2_data, then ran PHATE, the parameter search, and MELD separately. I ran MELD on the raw counts matrix (not on the sqrt-normalized one), but I get likelihood estimates of 1 for both control and knockout.

Is there another metric I can use to quantify and say that the cells at time T1 are more likely to get arrested because of the knockout than at time T2?

[images: MELD likelihood estimates for control and knockout]

@Rohit-Satyam
Author

Rohit-Satyam commented Feb 17, 2022

Hi! I was able to resolve the above problem now that I have two biological replicates for T1 (we have restricted our analysis to T1 only). I have the following queries:

EDIT 1: With a control-knockout design, we can justify reassigning some knockout cells as transcriptionally control cells (because some of the cells might not have undergone a successful knockout). However, how can we justify cells in the control sample that are assigned knockout sample labels by MELD (they should be 100 percent assigned to control tags)?

  1. Now that I have average likelihood estimates for both control and knockout samples, can I assign new sample labels to each cell based on these probabilities and then check how many cells get new sample labels, as a QC measure of which replicate is good? For example, something like this:
    [image]

It seems rep2 has the fewest cells getting new sample labels (13%; the likelihood estimates agree with the original sample labels) compared to rep1 (21%). But how do I decide sample tags for marginal cases, such as a cell with T1_ctrl.avg = 0.524 and T1_ko.avg = 0.476? Currently, for such cells, I assign the sample tag T1_ctrl.
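One way to handle those marginal cases (my own suggestion, not something from the MELD docs) is an explicit "ambiguous" band instead of a hard 0.5 cutoff. The column names below match the ones used above, but the function itself is hypothetical:

```python
import pandas as pd

def reassign_labels(df, margin=0.1):
    """df has columns 'T1_ctrl.avg' and 'T1_ko.avg' (likelihoods summing to 1).
    Cells whose two likelihoods differ by less than `margin` are tagged
    'ambiguous' rather than being forced into either sample."""
    diff = df["T1_ctrl.avg"] - df["T1_ko.avg"]
    new = pd.Series("ambiguous", index=df.index)
    new[diff >= margin] = "T1_ctrl"
    new[diff <= -margin] = "T1_ko"
    return new
```

Under this scheme a cell with T1_ctrl.avg = 0.524 vs T1_ko.avg = 0.476 would be reported as ambiguous instead of silently counted as control.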

  2. Also, in the tutorial you say:

"I've already looked through the clusters and determined which number of clusters looks best to me."

How did you shortlist which number of subclusters looks best in each cluster? For instance, I have the following clusters shown below. The first plot in each row represents the average likelihood estimate of T1_ko. Is the following choice correct?

picked_clusters = {1: 2, 2: 3, 3: 5}

Also, I am a bit lost about the choice of the following parameters in the code:

## code block 41
## How do I decide the n_clusters parameter? Can I use the default of 10?
meld.VertexFrequencyCluster(n_clusters=3)

## code block 42
## Since I am unable to view the contents of vfc_op_per_cluster, I wish to
## understand what code block 42 does; the loop is given below

for n in [2, 3, 4, 5]:             ## why 2, 3, 4, 5?
    clusters_by_n[n] = curr_vfc.predict(n)
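The tutorial doesn't give a rule for picking n_clusters. One generic heuristic (not MELD-specific, just a common clustering diagnostic) is a silhouette sweep over the same candidate values [2, 3, 4, 5] as the loop above; the function and variable names here are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_n_clusters(X, candidates=(2, 3, 4, 5), seed=0):
    """Cluster X for each candidate k and score the partition with the
    mean silhouette; return the best k and the full score table."""
    scores = {}
    for n in candidates:
        labels = KMeans(n_clusters=n, random_state=seed, n_init=10).fit_predict(X)
        scores[n] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```

This uses plain KMeans only to illustrate the scoring idea; you could score the partitions produced by `curr_vfc.predict(n)` the same way.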

[image: subcluster plots, one row per cluster]

@Rohit-Satyam
Author

If you are planning to organize a workshop on MELD, I would like to join. I have been watching the tutorials already on YouTube and I have many questions. How can I join the Slack channel of the old workshop, where other MELD users can help me out?
