Release 0.5.0 · TileDB-Inc/TileDB-VCF

Notable

This release includes a new version of TileDB-VCF's schema for representing genomic data, which now indexes variants by start position. Combined with numerous improvements to the ingestion algorithm, this change allows TileDB-VCF to support overlapping variants.

Note: Arrays created using previous schemas are still fully supported.
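
To illustrate what supporting overlapping variants means when records are indexed by start position, here is a toy sketch in plain Python. It is not TileDB-VCF's actual query algorithm or schema; the `overlapping` function, the tuple layout, and the sample records are all hypothetical. Because a record that starts before the query region can still extend into it, the sketch widens the search window to the left by the length of the longest record.

```python
from bisect import bisect_left, bisect_right


def overlapping(variants, q_start, q_end):
    """Return IDs of variants whose [start, end] range intersects [q_start, q_end].

    `variants` is a list of (start, end, id) tuples sorted by start position.
    This is an illustrative helper, not part of TileDB-VCF.
    """
    starts = [start for start, _, _ in variants]
    # The longest record bounds how far left of q_start an overlapping
    # record can begin.
    max_len = max((end - start for start, end, _ in variants), default=0)
    lo = bisect_left(starts, q_start - max_len)
    hi = bisect_right(starts, q_end)
    # Keep only candidates that actually reach into the query region.
    return [vid for start, end, vid in variants[lo:hi] if end >= q_start]


# Hypothetical records: two SNPs and two deletions.
variants = [
    (100, 100, "snp_a"),
    (150, 400, "del_b"),
    (300, 300, "snp_c"),
    (500, 520, "del_d"),
]

print(overlapping(variants, 250, 350))
```

In this example `del_b` starts at 150, well before the queried region 250-350, but is still reported because it spans the region; a lookup restricted to start positions inside the region would miss it.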

Core

Added

C API methods for querying and counting fmt/info attributes (#115).
Install option to ignore system installs of TileDB and force build the pinned version of TileDB as an external project (#138).

Changed

Updated TileDB to 2.0.8 (#126, #139, #141).
Updated schema to v3, which stores variants by their start position (#105, #114).
Data types for commonly used genotype fields are now correctly defined (#112).
VCF records are now accessed internally using htslib's iterator during ingestion (#118).

Fixed

Don't access nodes in the record heap that have already been released (#120).
Fix segfault from reading an extra info value from the query result (#128).
Java API is now built and tested on CI (#135).
Don't take reference to query in reader futures (#137).
Fixed bug retrieving fixed-length attributes containing null values (#142, #143).
Fixed CI clang-format task (#126).

Spark

Added

New verbose option .option("verbose", true) for providing additional information when querying an array (#121).
Added Spark task stage/ID to partition reader logs (#129).

Changed

Schema is now dynamic based on materialized attributes and available fmt/info fields (#115). Use select * to pull in all available attributes; df.schema() describes all possible fields.
Add pos as alias for the pos_start queryable attribute (#124).
The samples and sampleFile options are now mutually exclusive; an error is thrown if both options are passed (#132).

Python

Added

New attributes() method to retrieve all queryable attributes available in a dataset (#127).
ingest_samples() gains arguments for setting the location and amount of scratch space to use when ingesting samples from S3 (#119, #122).
Dataset class gained a verbose option that provides additional information when writing to or reading from an array (#121).

Changed

Renamed TileDBVCFDataset class to Dataset (#116).
Dataset.read()'s sample arguments (samples and samples_file) are now mutually exclusive; an error is thrown if both are defined (#134).
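
The mutual-exclusion rule for samples and samples_file can be pictured with a small stand-alone sketch. This is not the library's code: the `resolve_sample_source` helper, its return values, and the use of `ValueError` are assumptions made for illustration, since the notes above only state that an error is thrown.

```python
def resolve_sample_source(samples=None, samples_file=None):
    """Hypothetical helper mirroring the rule: at most one sample source.

    Returns ("names", list_of_names), ("file", path), or None when neither
    argument is given. The exception type is an assumption, not TileDB-VCF's.
    """
    if samples is not None and samples_file is not None:
        raise ValueError("samples and samples_file are mutually exclusive")
    if samples is not None:
        return ("names", list(samples))
    if samples_file is not None:
        return ("file", samples_file)
    # Neither given: what happens next is left to the caller.
    return None


# Passing both sources triggers the error path.
try:
    resolve_sample_source(samples=["sampleA"], samples_file="samples.txt")
except ValueError as err:
    print(f"rejected: {err}")
```

Callers that previously passed both arguments need to pick one; catching the error as shown is one way to surface the problem early in a script.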

Fixed

Buffers are now refreshed when performing multiple reads with the same Dataset object (#133).

CLI

Changed

The --sample-names and --samples-file arguments are now optional. When omitted, all samples are exported by default. Previously one or the other was required.

Docker Images

Added

Improved documentation for the tiledbvcf-cli and tiledbvcf-py Docker images (#110).

Changed

All images now use /data as their working directory rather than /tmp.
tiledbvcf-py can be used to execute a script or launch an interactive Python session.

Fixed

The environment variable AWS_EC2_METADATA_DISABLED is now set to avoid slowdowns when querying S3 arrays outside of EC2.
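
Outside the Docker images, a similar effect can be approximated from a Python script by setting the variable in the process environment. This is a minimal sketch under two assumptions: that "true" is the value the images use (the notes above do not state it, though it is the value the AWS SDKs document), and that the variable is set before any S3-backed library initializes.

```python
import os

# Assumption: "true" disables the EC2 instance metadata lookup in the AWS SDK.
# setdefault leaves any value already exported by the user untouched.
os.environ.setdefault("AWS_EC2_METADATA_DISABLED", "true")

print(os.environ["AWS_EC2_METADATA_DISABLED"])
```

Using `setdefault` rather than a plain assignment keeps the script from overriding a deliberate setting, for example when the same code runs on an EC2 instance that relies on instance metadata for credentials.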
