Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Relative Log Expression (RLE) Plots #348

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

jonathjd
Copy link

@jonathjd jonathjd commented Dec 7, 2024

Reference Issue or PRs

Issue Reference: #320

What does your PR implement? Be specific.

The Relative Log Expression (RLE) is a useful diagnostic plot to visualize the differences in count distributions between samples. The x-axis is each sample from the count matrix and the y-axis is the log difference between each gene and the median expression of that gene across all samples.

Given:

gene_ij is the expression of gene j in sample i
median_j is the median expression of gene j across all samples
The RLE for gene j in sample i is calculated as:

RLE_ij = Log2(gene_ij/median_j)

Where:

gene_{ij} is the count of gene j in sample i
median_j is the median count of gene j across all samples

This issue takes in the raw counts self.X, a normalize boolean, design_matrix.index for the sample_ids, and a save_path and produces an RLE plot.

rle_plot

The normalize boolean is set to False by default but can be set to True to normalize the raw counts before plotting

nlog_rle_plot

This example was produced with the synthetic data in the ./datasets/ dir

Copy link
Collaborator

@BorisMuzellec BorisMuzellec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this PR @jonathjd !

I just have one suggestion to make: unless I missed something,I think we should avoid re-computing size factors when we already have them.

Happy to merge once this is fixed!

Comment on lines 1642 to 1646
if normalize:
print("Plotting normalized RLE plot...")
geometric_mean = np.exp(np.mean(np.log(count_matrix + 1), axis=0))
size_factors = np.median(count_matrix / geometric_mean, axis=1)
count_matrix = count_matrix / size_factors[:, np.newaxis]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is recomputing median-of-ratios size factors, right? (up to the +1 in the log)

In that case, it might be better to add a size_factors argument to make_rle_plot.

From there, in dds.plot_rle, if normalize=True, we would first check whether size factors (self.obsm["size_factors"]) were already computed. If so, we pass them directly. If not, we call self.fit_size_factors first.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @BorisMuzellec ,

For some reason when I use the size factors computed in dds.fit_size_factors() I do not get an RLE plot with the sample medians centered around 0.

image

But when I compute the sizefactors internally I do get an RLE plot with the sample medians centered around 0

image

Do you know what could be causing this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we would expect the sample medians to be zero given that RLE plot subtracts the gene medians. Also, I would be wary of making any conclusion from the test data.

That being said what you wrote is what is implemented in R's plotRLE method (https://rdrr.io/github/davismcc/scater/src/R/plotRLE.R). If it's standard, I'm happy to keep it as is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonathjd I think we may still have some consistency issues though: if we're using log(counts + 1) for size factors, shouldn't we also do the same everywhere (gene medians, and plotting)

Copy link
Author

@jonathjd jonathjd Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BorisMuzellec Happy holidays! Sorry for the delay, I was on vacation. It's probably a good idea to add the psuedocount to the gene_median calculations so we don't divide by a gene with 0 counts across all samples. I can add that now.

I added the psuedocount and ran some tests and the psuedocount alters the gene medians distance from the sample medians and skews the plot significantly.

pydeseq2/dds.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants