-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Relative Log Expression (RLE) Plots #348
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for this PR @jonathjd !
I just have one suggestion to make: unless I missed something,I think we should avoid re-computing size factors when we already have them.
Happy to merge once this is fixed!
pydeseq2/utils.py
Outdated
if normalize: | ||
print("Plotting normalized RLE plot...") | ||
geometric_mean = np.exp(np.mean(np.log(count_matrix + 1), axis=0)) | ||
size_factors = np.median(count_matrix / geometric_mean, axis=1) | ||
count_matrix = count_matrix / size_factors[:, np.newaxis] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is recomputing median-of-ratios size factors, right? (up to the +1 in the log)
In that case, it might be better to add a size_factors
argument to make_rle_plot
.
From there, in dds.plot_rle
, if normalize=True
, we would first check whether size factors (self.obsm["size_factors"]
) were already computed. If so, we pass them directly. If not, we call self.fit_size_factors
first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @BorisMuzellec ,
For some reason when I use the size factors computed in dds.fit_size_factors() I do not get an RLE plot with the sample medians centered around 0.
But when I compute the sizefactors internally I do get an RLE plot with the sample medians centered around 0
Do you know what could be causing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we would expect the sample medians to be zero given that RLE plot subtracts the gene medians. Also, I would be wary of making any conclusion from the test data.
That being said what you wrote is what is implemented in R's plotRLE
method (https://rdrr.io/github/davismcc/scater/src/R/plotRLE.R). If it's standard, I'm happy to keep it as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonathjd I think we may still have some consistency issues though: if we're using log(counts + 1)
for size factors, shouldn't we also do the same everywhere (gene medians, and plotting)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BorisMuzellec Happy holidays! Sorry for the delay, I was on vacation. It's probably a good idea to add the psuedocount to the gene_median
calculations so we don't divide by a gene with 0 counts across all samples. I can add that now.
I added the psuedocount and ran some tests and the psuedocount alters the gene medians distance from the sample medians and skews the plot significantly.
Reference Issue or PRs
Issue Reference: #320
What does your PR implement? Be specific.
The Relative Log Expression (RLE) is a useful diagnostic plot to visualize the differences in count distributions between samples. The x-axis is each sample from the count matrix and the y-axis is the log difference between each gene and the median expression of that gene across all samples.
Given:
gene_ij
is the expression of genej
in samplei
median_j
is the median expression of genej
across all samplesThe RLE for gene
j
in samplei
is calculated as:Where:
gene_{ij}
is the count of genej
in samplei
median_j
is the median count of gene j across all samplesThis issue takes in the raw counts
self.X
, anormalize
boolean,design_matrix.index
for the sample_ids, and asave_path
and produces an RLE plot.The normalize boolean is set to
False
by default but can be set toTrue
to normalize the raw counts before plottingThis example was produced with the synthetic data in the
./datasets/
dir