v0.7.0
Notable
This release introduces a new schema (v4) for storing VCF data with TileDB. Variants are now stored in a 3D sparse array, indexed by contig, start_pos, and sample. The contig and sample dimensions are now of type TILEDB_STRING_ASCII and leverage optimizations in TileDB core for string coordinates.
Note: Arrays created using previous schemas are still fully supported.
Complete documentation about the new schema is available here.
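To make the indexing concrete, here is a minimal pure-Python sketch of the idea. It is not TileDB-VCF code and does not use the TileDB API: it models a sparse 3D array as a dict keyed by (contig, start_pos, sample), and the cell values, sample names, and the query helper are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical coordinate type: (contig, start_pos, sample).
Coord = Tuple[str, int, str]

# A sparse array only stores cells that exist, so a dict keyed by
# coordinate is a reasonable toy model. Values are made-up labels.
cells: Dict[Coord, str] = {
    ("chr1", 100, "sampleA"): "A>T",
    ("chr1", 250, "sampleB"): "G>C",
    ("chr2", 100, "sampleA"): "C>G",
}


def query(contig: str, start: int, end: int, samples: List[str]) -> List[Coord]:
    """Return coordinates on `contig` with start_pos in [start, end], limited to `samples`."""
    wanted = set(samples)
    return sorted(
        c for c in cells
        if c[0] == contig and start <= c[1] <= end and c[2] in wanted
    )


# Slice on all three dimensions at once.
hits = query("chr1", 0, 200, ["sampleA", "sampleB"])
```

Because contig and sample are string coordinates in this model, a query names them directly rather than going through a separate numeric lookup.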
A number of other noteworthy improvements are included in this release:
- New tiledbvcf utils subcommand for consolidating and vacuuming fragment metadata
- Removal of the registration phase, which is no longer needed 🎉
Core
Changed
- Switch to 3D array and row-major layout (#194)
- Use prebuilt artifacts by default for TileDB (#201)
- Create versioned methods for init_for_reads/next_read_batch (#206)
- Only fetch vcf headers on a new read sample batch (#207)
- Sort v4 regions in lexicographical order for writes (#208)
- Remove contig check from process_query_results_v4() (#210)
- htslib was updated to 1.10 (#230)
- Support TileDB 2.2 C++ result estimate API changes (#209)
- Handle all sample export for v4 with optimal range (#221)
- Don't batch v4 reads by samples (#215)
- V4 writes should split fragments by contig (#216)
- Update TileDB to v2.1.4 (#225)
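As an illustration of the lexicographical ordering mentioned in #208, the sketch below sorts regions by contig name as a string, then by position. It only shows what lexicographic order means for string contig names; the Region tuple and sort_regions helper are hypothetical and are not taken from the TileDB-VCF source.

```python
from typing import List, Tuple

# Hypothetical region type: (contig, start, end).
Region = Tuple[str, int, int]


def sort_regions(regions: List[Region]) -> List[Region]:
    # Tuples compare element by element: the contig is compared as a
    # string first, then start and end are compared numerically.
    return sorted(regions)


regions = [("chr2", 500, 600), ("chr10", 1, 50), ("chr1", 900, 950), ("chr1", 10, 20)]
ordered = sort_regions(regions)
```

Note that string comparison places chr10 before chr2, which differs from the natural numeric order of chromosomes.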
Fixed
- Add support for ASAN and fix leak in htslib plugin (#181)
- Fix leaking of hfile* (#197)
- Reduce the number of times queries open an array (#204)
- Optimize v2/v3 read queries (#205)
- Set the contig for ingestion when seeking in a VCF file, to avoid seeking to records beyond a contig's boundaries (#218)
Added
- Added a new memory budget parameter for setting the max size of the TileDB query buffer for write (#217)
- Add option to disable including vcf header stats (#213)
- The number of writes performed during each ingestion batch is now included in the verbose output (#219)
Python
Changed
Fixed
- Sort pandas dataframes for unordered comparison in unit tests (#223)