Generate Python based anndata testfiles #170

Closed
wants to merge 31 commits into from

Conversation

LouiseDck
Collaborator

Instead of relying on round-trip tests, we also want to be able to read in all possible Python arrays, matrices and dataframes.

@LouiseDck LouiseDck self-assigned this Jul 3, 2024
@LouiseDck LouiseDck force-pushed the dataset-generator branch from 71a4291 to b4d8dfd Compare July 4, 2024 18:23
@rcannood
Collaborator

rcannood commented Jul 8, 2024

> Instead of relying on round-trip tests, we also want to be able to read in all possible Python arrays, matrices and dataframes.

I think we should still do round-trip tests. The difference is that our current round-trip tests:

  • use R functions to generate data
  • use CRAN anndata to move R data into Python with reticulate
  • use Python anndata to write data to disk
  • use anndataR to read data from disk
  • compare original data to resulting data

(the reverse round-trip is also tested)

While this PR will enable tests that:

  • use reticulate to call Python functions which generate data
  • use Python anndata to write data to disk
  • use anndataR to read data from disk
  • compare original data to resulting data

What I'm currently wondering about is how to compare the original data in Python to the resulting data in R. Do we introduce a simple JSON format to store which assertions to make (i.e. let the python function write out which assertions R should be making to check whether the anndata was created correctly)? Or do we let the Python functions generate data in a predictable manner so that we can reproduce the same results in R?
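To make the first option more concrete, here is a minimal sketch of what a JSON assertion file could look like. Everything in it is an assumption for illustration: the function names, the manifest layout and the keys (`element`, `shape`, `dtype`, `cells`) are hypothetical and are not part of anndata or anndataR. It uses plain nested lists rather than a real AnnData object so that it runs with the standard library only.

```python
import json
import os
import tempfile


def generate_matrix(n_obs, n_vars):
    """Hypothetical generator: a small dense matrix as nested lists."""
    return [[float(i * n_vars + j) for j in range(n_vars)] for i in range(n_obs)]


def build_assertions(name, matrix):
    """Describe what the R side should check for one generated element."""
    return {
        "element": name,
        "shape": [len(matrix), len(matrix[0])],
        "dtype": "float64",
        # Spot-check a few cells instead of serialising the whole matrix.
        # Indices are 0-based here; the R reader would need to add 1.
        "cells": [
            {"row": 0, "col": 0, "value": matrix[0][0]},
            {
                "row": len(matrix) - 1,
                "col": len(matrix[0]) - 1,
                "value": matrix[-1][-1],
            },
        ],
    }


def write_manifest(path, assertions):
    """Write the list of assertions next to the generated data file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"assertions": assertions}, handle, indent=2)


matrix = generate_matrix(3, 4)
manifest_path = os.path.join(tempfile.mkdtemp(), "assertions.json")
write_manifest(manifest_path, [build_assertions("X", matrix)])

# Read the manifest back, as the R test would do with a JSON parser.
with open(manifest_path, encoding="utf-8") as handle:
    manifest = json.load(handle)
```

The R test would then loop over `manifest$assertions` and turn each entry into an expectation. The downside is that the manifest format becomes one more thing to maintain and keep in sync on both sides.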

@LouiseDck
Collaborator Author

> I think we should still do round-trip tests. The difference is that our current round-trip tests:
>
> • use R functions to generate data
> • use CRAN anndata to move R data into Python with reticulate
> • use Python anndata to write data to disk
> • use anndataR to read data from disk
> • compare original data to resulting data

I agree.

> What I'm currently wondering about is how to compare the original data in Python to the resulting data in R. Do we introduce a simple JSON format to store which assertions to make (i.e. let the python function write out which assertions R should be making to check whether the anndata was created correctly)? Or do we let the Python functions generate data in a predictable manner so that we can reproduce the same results in R?

After discussion, we settled on generating the same dataset in both Python and R, reading both datasets in, and reporting any differences between them using h5diff.
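As a rough illustration of why predictable generation helps, the sketch below fills a matrix from a closed-form rule, so the same values can be rebuilt on the R side without sharing a random seed. The function names are hypothetical, and the in-memory comparison only stands in for the idea of listing differences; the actual comparison discussed here is done on the written files with h5diff.

```python
def deterministic_matrix(n_obs, n_vars):
    """Cell (i, j) holds i * n_vars + j, so no random seed needs to be shared.

    An R counterpart could be:
    matrix(seq_len(n_obs * n_vars) - 1L, nrow = n_obs, byrow = TRUE)
    """
    return [[i * n_vars + j for j in range(n_vars)] for i in range(n_obs)]


def diff_matrices(expected, actual):
    """Return (row, col, expected, actual) for every mismatching cell."""
    same_rows = len(expected) == len(actual)
    if not same_rows or any(len(e) != len(a) for e, a in zip(expected, actual)):
        raise ValueError("shape mismatch")
    return [
        (i, j, e, a)
        for i, (row_e, row_a) in enumerate(zip(expected, actual))
        for j, (e, a) in enumerate(zip(row_e, row_a))
        if e != a
    ]


expected = deterministic_matrix(2, 3)
actual = [row[:] for row in expected]
actual[1][2] = -1  # simulate a value that changed on its way through disk
differences = diff_matrices(expected, actual)
```

With this approach nothing beyond the generation rule has to be shared between the two languages, at the cost of having to implement each generator twice.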

@LouiseDck
Collaborator Author

For documentation purposes: the Python anndata generator functions have moved to a separate package, dummy-anndata.

rcannood and others added 12 commits October 3, 2024 20:48
* re-enable matrices with NAs tests in X and layers

* one more

* Ensure that matrices are never written as nullables
Take care when using reticulate for testing: before writing, convert NA to NaN

* Fix Windows writing NA error

* Remove commented code since no longer needed

* remove commented code (no longer needed)

---------

Co-authored-by: Louise Deconinck <[email protected]>
* Update write_h5ad_categorical

* fix styling

* Update write_h5ad_categorical

* Adjust H5AD categorical write test

* Add write_h5ad_attributes function

Replace repeated code in individual writers

* ignore cyclomatic complexity warning for `write_h5ad_element` warning

* formatting changes

* in write_h5ad_attributes, allow file to be an open hdf5 file

* Add write_h5ad_boolean_attribute()

* Add write_h5ad_boolean_array()

Helper function for writing ENUM boolean arrays

* Remove compression argument from write_h5ad_boolean_array

Don't think it is possible to write compressed data using the workaround
and ENUM format should be fairly space efficient anyway

* Correctly read categorical levels from H5AD

Fixes array when there are more levels than values

* Fix writing scalar H5AD attributes

Correctly check the is_scalar argument

* add lintr exceptions

* fix nolint

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>
* port rownames-related changes from #166 and #169

* run styler

* fix test

* style

* style

* fix docs

* fix documentation

* simplify helper functions

* simplify test

* add more documentation to AnnData

* fix docs
* Update write_h5ad_categorical

* fix styling

* Update write_h5ad_categorical

* Adjust H5AD categorical write test

* Add write_h5ad_attributes function

Replace repeated code in individual writers

* ignore cyclomatic complexity warning for `write_h5ad_element` warning

* formatting changes

* in write_h5ad_attributes, allow file to be an open hdf5 file

* wip

* wip

* substitute mentions of rhdf5 with hdf5r

* strip obs_names and var_names from framework

* update

* fix tests and finalize

* remove mentions of obs_names and var_names in the constructor

* make sure filenames are always unique

* add mode to various functions

* manually close anndatas in tests (where needed)

* only close when pointer is valid

* move match

* use $close() instead of $close_all()

* switch to different branch

* simplify test

* gc after closing the adata in write_h5ad

* guess the dtype and the space

* update docs

* use hhoeflin's remote

* bugfix in hdf5r has been released

* update: nevermind, the fix wasn't included in the release yet

* minor fixes

* bump version number

* remove remotes

* remove references to rhdf5

* fix attributes

* style

* fix write h5ad helpers

* fix unit tests

* fix linting issues

* move hdf5 helpers

* reuse existing functionality

* add test (this seems to have been fixed at some point)

* improve guessing of dtype when storing a logical vector

* fix styling

* reenable more tests

---------

Co-authored-by: Luke Zappia <[email protected]>
* Tidy user interface

Co-authored-by: Luke Zappia <[email protected]>

* Update docs

Co-authored-by: Luke Zappia <[email protected]>

* update docs

* fix linting issues

---------

Co-authored-by: Luke Zappia <[email protected]>
@LouiseDck
Collaborator Author

Superseded by #207 and dummy-anndata

@LouiseDck LouiseDck closed this Dec 12, 2024
3 participants