0.5.0

@aaronwolen aaronwolen released this 12 Aug 21:04
3ef1772

Notable

This release includes a new version of TileDB-VCF's schema for representing genomic data, which now indexes variants by start position. Together with numerous improvements made to the ingestion algorithm, TileDB-VCF now supports overlapping variants.

Note: Arrays created using previous schemas are still fully supported.

Core

Added

  • C API methods for querying and counting fmt/info attributes (#115).
  • Install option to ignore system installs of TileDB and force build the pinned version of TileDB as an external project (#138).

Changed

  • Updated TileDB to 2.0.8 (#126, #139, #141).
  • Updated schema (v3) to store variants by their start position (#105, #114).
  • Data types for commonly used genotype fields are now correctly defined (#112).
  • VCF records are now accessed internally using htslib's iterator during ingestion (#118).

Fixed

  • Don't access nodes in the record heap that have already been released (#120).
  • Fix segfault from reading an extra info value from the query result (#128).
  • Java API is now built and tested on CI (#135).
  • Don't take reference to query in reader futures (#137).
  • Fixed bug retrieving fixed-length attributes containing null values (#142, #143).
  • Fixed CI clang-format task (#126).

Spark

Added

  • New verbose option .option("verbose", true) for providing additional information when querying an array (#121).
  • Add spark task stage/ID to partition reader logs (#129).

Changed

  • Schema is now dynamic based on materialized attributes and available fmt/info fields (#115). Use select * to pull in all available attributes; df.schema() describes all possible fields.
  • Add pos as alias for the pos_start queryable attribute (#124).
  • The samples and sampleFile options are now mutually exclusive; an error is thrown if both options are passed (#132).

Python

Added

  • New attributes() method to retrieve all queryable attributes available in a dataset (#127).
  • ingest_samples() gains arguments for setting the location and amount of scratch space to use when ingesting samples from S3 (#119, #122).
  • Dataset class gained a verbose option that provides additional information when writing to or reading from an array (#121).

Changed

  • Renamed TileDBVCFDataset class to Dataset (#116).
  • Dataset.read()'s sample arguments (samples and samples_file) are now mutually exclusive; an error is thrown if both are defined (#134).
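
The mutual-exclusivity rule for the sample arguments can be pictured with a small stand-alone sketch. This is not the library's code: resolve_samples is a hypothetical helper, the error type is a guess, and the one-name-per-line file format is an assumption made for illustration.

```python
from pathlib import Path
from typing import List, Optional


def resolve_samples(samples: Optional[List[str]] = None,
                    samples_file: Optional[str] = None) -> Optional[List[str]]:
    """Hypothetical helper: return requested sample names, or None for 'no filter'."""
    if samples is not None and samples_file is not None:
        # Mirrors the documented rule: passing both arguments is an error.
        raise ValueError("'samples' and 'samples_file' are mutually exclusive")
    if samples_file is not None:
        # Assumption: one sample name per line; blank lines are skipped.
        lines = Path(samples_file).read_text().splitlines()
        return [line.strip() for line in lines if line.strip()]
    return samples
```

Failing early, before any data is read, keeps an ambiguous request from silently favoring one argument over the other.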

Fixed

  • Buffers are now refreshed when performing multiple reads with the same Dataset object (#133).

CLI

Changed

  • The --sample-names and --samples-file arguments are now optional. When both are omitted, all samples are exported by default. Previously one or the other was required.
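
As a rough illustration of the new default, the toy parser below accepts both flags as optional and falls back to every sample when neither is given. It is a sketch, not the actual CLI: select_samples and the all_samples argument are hypothetical, and the comma-separated and one-name-per-line input formats are assumptions.

```python
import argparse
from typing import List, Sequence


def select_samples(argv: Sequence[str], all_samples: List[str]) -> List[str]:
    """Toy stand-in for an export command's sample selection (not the real CLI)."""
    parser = argparse.ArgumentParser(prog="export-sketch")
    # Both flags are optional, matching the behavior described above.
    parser.add_argument("--sample-names", default=None)
    parser.add_argument("--samples-file", default=None)
    args = parser.parse_args(argv)

    if args.sample_names is not None:
        # Assumption: names are given as a comma-separated list.
        return [s for s in args.sample_names.split(",") if s]
    if args.samples_file is not None:
        # Assumption: one sample name per line.
        with open(args.samples_file) as fh:
            return [line.strip() for line in fh if line.strip()]
    # Neither flag given: export every sample in the dataset.
    return list(all_samples)
```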

Docker Images

Added

Changed

  • All images now use /data as their working directory rather than /tmp.
  • tiledbvcf-py can be used to execute a script or launch an interactive Python session.

Fixed

  • The environment variable AWS_EC2_METADATA_DISABLED is now set to avoid slowdowns when querying S3 arrays outside of EC2.
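
Outside the Docker images, the same variable can be set by hand before any S3 access. The sketch below assumes the process is not running on EC2 and that credentials come from somewhere other than the instance metadata service. The helper name is hypothetical, and the value true follows the AWS SDK convention; the release note does not state the exact value the images use.

```python
import os


def disable_ec2_metadata_lookup() -> str:
    """Set AWS_EC2_METADATA_DISABLED unless the caller already chose a value."""
    # "true" is the value the AWS SDKs recognize for skipping the metadata service.
    # setdefault leaves any existing value untouched and returns the effective one.
    return os.environ.setdefault("AWS_EC2_METADATA_DISABLED", "true")
```

Call it once at program start, before opening any dataset, so the setting is in place when the S3 client is first created.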