taxonomy_mapping() does not find taxonomy column names when using latest docker image #30

Open
meghanaturner opened this issue Jun 26, 2023 · 12 comments

Comments

@meghanaturner

Using the latest scrattch-mapping docker image release leads to an error on line 22 of R/taxonomy_mapping() because colnames(AIT.anndata$uns$clusterInfo) returns NULL.

This issue can be fixed by switching back to the 0.16 version of the docker image. Using 0.16 and the exact same taxonomy h5ad file (//allen/programs/celltypes/workgroups/rnaseqanalysis/mFISH/meghanturner/brain3_mapping/taxonomies/AIT17.0.logCPM.sampled100_MERSCOPE_BRAIN3_GENES_dense_for_mapping.h5ad), colnames(AIT.anndata$uns$clusterInfo) returns the expected list of column names and the code runs as it should.

@berl @egelfan2

@UCDNJJ
Collaborator

UCDNJJ commented Jun 26, 2023

Thanks for reporting this error!

Starting with the latest scrattch-mapping docker (bicore/scrattch_mapping:latest), I wasn't able to recreate this issue from our test cases, so you've found some fun edge case. To be complete, I also tried loading AIT17.0.logCPM.sampled100_MERSCOPE_BRAIN3_GENES_dense_for_mapping.h5ad directly with anndata::read_h5ad() and found the column names available for AIT.anndata$uns$clusterInfo.

I'll need some additional info to figure out what's going on. A few questions:

  • Are you using anndata::read_h5ad() or scrattch_mapping::loadTaxonomy() to load the taxonomy into R?
  • Can you share the directory containing all the taxonomy files that should have been created with scrattch_mapping::buildTaxonomy()?
  • Do you mind sharing your script that works only under scrattch_mapping version 0.16?
  • Can you share the error report?

@berl

berl commented Jun 27, 2023

FYI @scseeman

@meghanaturner
Author

  • I'm using anndata::read_h5ad() to load the taxonomy
  • The taxonomy file wasn't built directly with scrattch_mapping::buildTaxonomy()
  • /allen/programs/celltypes/workgroups/rnaseqanalysis/mFISH/meghanturner/brain3_mapping/scrattch-mapping_batch.R
  • Error: Error in taxonomy_mapping(AIT.anndata = taxonomy_anndata, query.data = query_data, : Not all label.cols exists in AIT.anndata$uns$clusterInfo

That's interesting that you can find the column names when you load with anndata::read_h5ad() in the latest. For me, there's different behavior in how the column names are accessible between the two versions.

In the latest version:

  • colnames(anndata$uns$clusterInfo), which is called in line 22 of taxonomy_mapping, returns NULL
  • whereas, anndata$uns$clusterInfo$columns returns the expected column names:

[1] "sample_id" "cl" "cluster_label"
[4] "Level2_id_label" "Level1_id_label" "supertype_id_label"
[7] "class_id_label" "nt_type_label" "cluster_id.AIT16"
[10] "library_prep" "gene.counts.0" "doublet_score"
[13] "roi" "umi.counts" "qc.score"
[16] "method" "region_label" "region_id"
[19] "sex" "external_donor_name" "age"
[22] "platform" "knn.dist" "knn.dist.z"
[25] "medical_conditions" "broad_region" "cluster_id"
[28] "neighborhood" "batch"

  • class(taxonomy_anndata$uns$clusterInfo) returns:

[1] "pandas.core.frame.DataFrame" "pandas.core.generic.NDFrame"
[3] "pandas.core.base.PandasObject" "pandas.core.accessor.DirNamesMixin"
[5] "pandas.core.indexing.IndexingMixin" "pandas.core.arraylike.OpsMixin"
[7] "python.builtin.object"

In 0.16 the opposite is true:

  • anndata$uns$clusterInfo$columns returns NULL
  • colnames(anndata$uns$clusterInfo), which is called in line 22 of taxonomy_mapping, returns the expected column names:

[1] "sample_id" "cl" "cluster_label"
[4] "Level2_id_label" "Level1_id_label" "supertype_id_label"
[7] "class_id_label" "nt_type_label" "cluster_id.AIT16"
[10] "library_prep" "gene.counts.0" "doublet_score"
[13] "roi" "umi.counts" "qc.score"
[16] "method" "region_label" "region_id"
[19] "sex" "external_donor_name" "age"
[22] "platform" "knn.dist" "knn.dist.z"
[25] "medical_conditions" "broad_region" "cluster_id"
[28] "neighborhood" "batch"

  • class(taxonomy_anndata$uns$clusterInfo) returns:

[1] "data.frame"
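
The difference described above (colnames() works on a data.frame, while $columns works on the unconverted pandas object) can be bridged with a small accessor. This is a minimal sketch only: `cluster_info_colnames` is a hypothetical helper, not a scrattch.mapping function, and the toy objects below merely stand in for the two classes reported in this comment so that the code runs without Python or an h5ad file.

```r
# Hypothetical helper (not part of scrattch.mapping): return the column names
# of clusterInfo whether it arrived as an R data.frame or as an unconverted
# pandas DataFrame reference.
cluster_info_colnames <- function(cluster_info) {
  if (is.data.frame(cluster_info)) {
    # Behaviour reported for the 0.16 image
    return(colnames(cluster_info))
  }
  if (inherits(cluster_info, "pandas.core.frame.DataFrame")) {
    # Behaviour reported for the latest image: colnames() is NULL,
    # but $columns returns the names
    return(as.character(unlist(cluster_info$columns)))
  }
  stop("Unexpected class for clusterInfo: ",
       paste(class(cluster_info), collapse = ", "))
}

# Toy stand-ins for the two cases (assumptions for illustration only)
as_data_frame <- data.frame(cl = 1:2, cluster_label = c("a", "b"))
as_pandas_like <- structure(
  list(columns = c("cl", "cluster_label")),
  class = "pandas.core.frame.DataFrame"
)

cluster_info_colnames(as_data_frame)
cluster_info_colnames(as_pandas_like)
```

With a real object, the call would be `cluster_info_colnames(taxonomy_anndata$uns$clusterInfo)`; whether `$columns` always converts to an R character vector in every reticulate version is an assumption based on the output shown above.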

@UCDNJJ
Collaborator

UCDNJJ commented Jun 27, 2023

I was able to consistently retrieve anndata$uns$clusterInfo as a data.frame under both scrattch_mapping versions. This has to be an environment issue that's leading to you seeing pandas.core.frame.DataFrame under the latest docker. One last ask: can you return the sessionInfo() for each scrattch_mapping docker when you are running it?

Somehow the version of anndata (R library) was downgraded in the latest scrattch mapping docker. I suspect this is the culprit:

bicore/scrattch_mapping:latest -- anndata_0.7.5.3
bicore/scrattch_mapping:0.16 -- anndata_0.7.5.6
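
A quick way to check which side of this difference a given container is on is to compare the installed anndata version against the one from the image that worked. The sketch below assumes, as suspected above but not confirmed, that 0.7.5.6 is a sufficient minimum; `version_at_least` is a hypothetical helper name.

```r
# Version reported for the image where colnames() worked (0.16).
known_good <- "0.7.5.6"

# Hypothetical helper: TRUE when `installed` is at least `minimum`.
version_at_least <- function(installed, minimum) {
  package_version(installed) >= package_version(minimum)
}

# The two versions quoted above
version_at_least("0.7.5.3", known_good)  # latest image
version_at_least("0.7.5.6", known_good)  # 0.16 image

# Inside a container, compare the installed package instead.
# system.file() checks for the package without loading it.
if (nzchar(system.file(package = "anndata"))) {
  installed <- as.character(utils::packageVersion("anndata"))
  if (!version_at_least(installed, known_good)) {
    warning("anndata ", installed, " is older than ", known_good,
            "; clusterInfo may not convert to a data.frame")
  }
}
```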

@scseeman

@meghanaturner @UCDNJJ a couple of weeks ago I was having issues with the :latest docker loading at all. I talked with Anish about it and realized that I never actually heard if it got fixed. I've been using the singularity file /allen/programs/celltypes/workgroups/rnaseqanalysis/bicore/singularity/scrattch_mapping_0.2.sif directly instead of the docker, and that has worked fine.

@meghanaturner
Author

Indeed, 0.16 has anndata="0.7.5.6" and latest has anndata="0.7.5.3"

It looks like R was also downgraded:

0.16 sessionInfo():

R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_4.2.2

latest sessionInfo():

R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_4.2.0

@UCDNJJ
Collaborator

UCDNJJ commented Jul 27, 2023

Hi @meghanaturner, when you have time, can you check that this issue is resolved when using this docker image: docker://njjai/scrattch_mapping:0.4? This docker image has the most up-to-date versions of the anndata and R packages that the previous latest image was supposed to contain.

  • singularity shell --cleanenv docker://njjai/scrattch_mapping:0.4

Forewarning, quite a few changes exist in this new update. So if you hit an error let us know.

@meghanaturner
Author

Hi @UCDNJJ, this docker image seems to have fixed the original issue I reported, where the column names weren't found.

@meghanaturner
Author

@UCDNJJ However, read_h5ad() no longer reads in sparse matrices as the data type that scrattch-mapping is expecting to find. This problem spontaneously showed up in docker://bicore/scrattch_mapping:0.16 and docker://bicore/scrattch_mapping:latest a couple weeks ago, and does not appear to be fixed by docker://njjai/scrattch_mapping:0.4.

"Error caught for Correlation mapping."
<simpleError in validObject(.Object): invalid class “dgCMatrix” object: 'Dim' slot does not have length 2>
Error in `rownames<-`(`*tmp*`, value = colnames(query.data)) :
attempt to set 'rownames' on an object with no dimensions
Calls: taxonomy_mapping -> rownames<-

The same error is thrown for a dgCMatrix. The workaround is to only use taxonomy and spatial anndata objects where X is a dense matrix.
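
The dense-matrix workaround can be sketched as a guard applied to X before mapping. This is an illustration under assumptions: `densify_if_sparse` and its `max_cells` limit are hypothetical, the toy matrix is built with the Matrix package rather than read from an h5ad file, and densifying is only practical when the full matrix fits in memory.

```r
library(Matrix)

# Hypothetical helper: return a base R dense matrix regardless of how the
# input is stored. `max_cells` is an arbitrary safety limit on the number
# of matrix entries, chosen here only for illustration.
densify_if_sparse <- function(x, max_cells = 1e8) {
  if (methods::is(x, "sparseMatrix")) {
    n_cells <- prod(as.numeric(dim(x)))
    if (n_cells > max_cells) {
      stop("Refusing to densify ", n_cells, " entries; raise max_cells to override")
    }
    x <- as.matrix(x)
  }
  x
}

# Toy sparse matrix (3 cells x 3 genes) standing in for anndata$X
toy <- sparseMatrix(
  i = c(1, 2, 3), j = c(1, 3, 2), x = c(5, 1, 2),
  dims = c(3, 3),
  dimnames = list(paste0("cell", 1:3), paste0("gene", 1:3))
)

dense <- densify_if_sparse(toy)
class(dense)
```

Row and column names survive the conversion, which matters because the error above is raised when taxonomy_mapping tries to set rownames.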

As an alternative to read_h5ad, I tried using

loadTaxonomy(taxonomyDir = "//allen/programs/celltypes/workgroups/rnaseqanalysis/mFISH/meghanturner/brain3_mapping/AIT17.0.logCPM.sampled100_MERSCOPE_BRAIN3_GENES_cscSparseX.h5ad",
anndata_file = "AIT17.0.logCPM.sampled100_MERSCOPE_BRAIN3_GENES_cscSparseX.h5ad")

but despite the documentation for the taxonomyDir argument suggesting that it supports direct h5ad files that aren't part of a shiny taxonomy folder, it errors out with: Required files to load Allen Institute taxonomy are missing.

I saw that you split off scrattch-taxonomy, including loadTaxonomy(), from scrattch-mapping into its own repo. Should I raise this issue over there?

@UCDNJJ
Collaborator

UCDNJJ commented Aug 1, 2023

Interesting, we definitely don't want to be using dense matrices all the time! Let's leave this issue here for now.

We need to do a better job with documentation but you should always use loadTaxonomy() since we do some work in that function to make sure the anndata object is initialized for mapping. The anndata_file argument assumes an .h5ad file that was generated with buildTaxonomy() which is why you are seeing that error about missing required files.

I took a quick look and AIT17.0.logCPM.sampled100_MERSCOPE_BRAIN3_GENES_cscSparseX.h5ad doesn't appear to have been set up with buildTaxonomy(), so this .h5ad will not work with scrattch.mapping. I would suggest running buildTaxonomy() using the count matrix and metadata from that object. You can follow the steps in this tutorial: build_taxonomy

Also, can you see if you can run this tutorial without error: mapping

@meghanaturner
Author

In attempting to follow the build_taxonomy tutorial, I am unable to load the counts matrix from the taxonomy I'm using into R.

I am not familiar with R, so I'm not sure what R's anndata package is expecting to find in an ad.X stored as a CSR sparse matrix. And the tutorial does not provide any suggestions of how to read in counts matrices from other h5ad files (it just does library(tasic2016data); taxonomy.counts = tasic_2016_counts)

# import libraries
library(scrattch.mapping)
library(umap)

# taxonomy I want to use for mapping my spatial data
taxonomy_h5ad_path = "//allen/programs/celltypes/workgroups/rnaseqanalysis/shiny/Taxonomies/AIT17.0_mouse/Prepare/AIT17.0.logCPM.sampled100.h5ad"

# Load taxonomy anndata file
taxonomy_anndata = read_h5ad(taxonomy_h5ad_path )

# Load the count data
taxonomy.counts = taxonomy_anndata$X   # ***this line fails***

Error:

Error in py_ref_to_r(x) : negative length vectors are not allowed
Calls: <Anonymous> ... py_to_r.numpy.ndarray -> NextMethod -> py_to_r.default -> py_ref_to_r

@UCDNJJ
Collaborator

UCDNJJ commented Aug 1, 2023

So I also tried running through your code both in a separate R environment and within the scrattch.mapping docker. Both produced the same error.

Could this error be arising due to the dataset size, or some change in the .h5ad file that happened a few weeks ago?

I can use the same approach you shared with a dataset of ~340k cells and ~22k genes: /allen/programs/celltypes/workgroups/rnaseqanalysis/shiny/10x_seq/NHP_BG_AIT_115/NHP_BG_AIT115_complete.h5ad. R successfully reads in the anndata$X as a dgR sparse matrix.
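
One way to probe the dataset-size hypothesis is to check whether X, if it were converted as a dense array, would have more entries than a 32-bit signed integer can count. This is only a diagnostic sketch and not a confirmed cause: the traceback above does pass through py_to_r.numpy.ndarray, but whether an overflow of this kind produces the "negative length vectors" error is an assumption. Both helper names are hypothetical.

```r
# Hypothetical diagnostic: would a dense n_obs x n_vars matrix have more
# entries than the largest 32-bit signed integer?
exceeds_int_length <- function(n_obs, n_vars) {
  # as.numeric() avoids integer overflow in the multiplication itself
  as.numeric(n_obs) * as.numeric(n_vars) > .Machine$integer.max
}

# Rough size in GiB if every entry were stored as an 8-byte double
dense_gib <- function(n_obs, n_vars) {
  as.numeric(n_obs) * as.numeric(n_vars) * 8 / 1024^3
}

# Approximate dimensions quoted above for the NHP dataset
exceeds_int_length(340000, 22000)
dense_gib(340000, 22000)
```

A dataset of that shape is manageable only because it stays sparse; for the failing file, comparing `dim()` of the object against this limit, and checking whether X is stored dense or sparse on disk, would help confirm or rule out this explanation.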


4 participants