0.5.0
Notable
This release includes a new version of TileDB-VCF's schema for representing genomic data, which now indexes variants by start position. Together with numerous improvements made to the ingestion algorithm, TileDB-VCF now supports overlapping variants.
Note: Arrays created using previous schemas are still fully supported.
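To illustrate why indexing by start position still requires care with overlapping variants, here is a minimal sketch. It is not TileDB-VCF's ingestion or query algorithm; the `Variant` type and `overlapping` helper are hypothetical names introduced only for this example. It shows that, with records sorted by start, a region query must also check each candidate's end position to catch long variants that begin before the region.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical record type: 1-based, inclusive coordinates."""
    start: int
    end: int
    name: str


def overlapping(variants: List[Variant], q_start: int, q_end: int) -> List[Variant]:
    """Return variants overlapping [q_start, q_end].

    `variants` must be sorted by start position.
    """
    starts = [v.start for v in variants]
    # Variants starting after q_end cannot overlap the query region.
    hi = bisect_right(starts, q_end)
    # Among the rest, keep those that extend into the region.
    return [v for v in variants[:hi] if v.end >= q_start]


variants = sorted([
    Variant(100, 100, "snv"),
    Variant(150, 400, "deletion"),
    Variant(200, 200, "snv_in_deletion"),
    Variant(500, 520, "insertion"),
])
hits = overlapping(variants, 190, 210)
print([v.name for v in hits])
```

The deletion starting at 150 is returned for the query 190-210 even though its start lies outside the region, which is the overlapping case the new schema is meant to handle.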
Core
Added
- C API methods for querying and counting `fmt`/`info` attributes (#115).
- Install option to ignore system installs of TileDB and force build the pinned version of TileDB as an external project (#138).
Changed
- Updated TileDB to 2.0.8 (#126, #139, #141).
- Updated schema (`v3`) to store variants by their start position (#105, #114).
- Data types for commonly used genotype fields are now correctly defined (#112).
- VCF records are now accessed internally using `htslib`'s iterator during ingestion (#118).
Fixed
- Don't access nodes in the record heap that have already been released (#120).
- Fix segfault from reading an extra `info` value from the query result (#128).
- Java API is now built and tested on CI (#135).
- Don't take reference to query in reader futures (#137).
- Fixed bug retrieving fixed-length attributes containing `null` values (#142, #143).
- Fixed CI `clang-format` task (#126).
Spark
Added
- New `verbose` option, `.option("verbose", true)`, for providing additional information when querying an array (#121).
- Add Spark task stage/ID to partition reader logs (#129).
Changed
- Schema is now dynamic based on materialized attributes and available `fmt`/`info` fields (#115). Use `select *` to pull in all available attributes; `df.schema()` will describe and show all possible fields.
- Add `pos` as alias for the `pos_start` queryable attribute (#124).
- The `samples` and `sampleFile` options are now mutually exclusive; an error is thrown if both options are passed (#132).
Python
Added
- New `attributes()` method to retrieve all queryable attributes available in a dataset (#127).
- `ingest_samples()` gains arguments for setting the location and amount of scratch space to use when ingesting samples from S3 (#119, #122).
- `Dataset` class gained a `verbose` option that provides additional information when writing to or reading from an array (#121).
Changed
- Renamed `TileDBVCFDataset` class to `Dataset` (#116).
- `Dataset.read()`'s sample arguments (`samples` and `samples_file`) are now mutually exclusive; an error is thrown if both are defined (#134).
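The mutually exclusive behaviour described above can be sketched as a small validation helper. This is an illustration only, not the library's implementation: `resolve_sample_selection` is a hypothetical function, and the error type and message are assumptions.

```python
from typing import List, Optional, Tuple


def resolve_sample_selection(
    samples: Optional[List[str]] = None,
    samples_file: Optional[str] = None,
) -> Tuple[str, object]:
    """Hypothetical helper: accept at most one way of naming samples."""
    if samples is not None and samples_file is not None:
        # Assumed error type; the changelog only says "an error is thrown".
        raise ValueError("Pass either 'samples' or 'samples_file', not both.")
    if samples is not None:
        return ("list", list(samples))
    if samples_file is not None:
        return ("file", samples_file)
    # Neither argument given: no explicit sample selection.
    return ("none", None)


print(resolve_sample_selection(samples=["HG00096", "HG00097"]))
print(resolve_sample_selection(samples_file="samples.txt"))
```

Rejecting the ambiguous call up front avoids silently preferring one argument over the other.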
Fixed
- Buffers are now refreshed when performing multiple reads with the same `Dataset` object (#133).
CLI
Changed
- The `--sample-names` and `--samples-file` arguments are now optional. When omitted, all samples are exported by default. Previously one or the other was required.
Docker Images
Added
- Improved documentation for the `tiledbvcf-cli` and `tiledbvcf-py` Docker images (#110).
Changed
- All images now use `/data` as their working directory rather than `/tmp`.
- `tiledbvcf-py` can be used to execute a script or launch an interactive Python session.
Fixed
- The environment variable `AWS_EC2_METADATA_DISABLED` is now set to avoid slowdowns when querying S3 arrays outside of EC2.
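When running outside the Docker images, the same variable can be set by hand. The sketch below assumes the value `"true"`, which is what the AWS SDKs document for this variable; the changelog does not state the value used in the images. The helper name `disable_ec2_metadata_lookup` is hypothetical.

```python
import os


def disable_ec2_metadata_lookup() -> None:
    """Ask the AWS SDK to skip the EC2 instance metadata lookup.

    Assumption: "true" is the value the AWS SDKs expect. This should run
    before any S3 client is created, otherwise it may have no effect.
    """
    os.environ["AWS_EC2_METADATA_DISABLED"] = "true"


disable_ec2_metadata_lookup()
print(os.environ["AWS_EC2_METADATA_DISABLED"])
```

Outside of EC2 the metadata endpoint is unreachable, so skipping the lookup avoids waiting for it to time out.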