Releases: TileDB-Inc/TileDB-VCF
v0.6.1
This is a minor release that includes two export-related bug fixes. Please note, this is the last version of TileDB-VCF that will create datasets using the current v3 array schema. The next version (v0.7) will introduce a new schema (v4). If you are planning to perform a large ingestion in the near future, we recommend postponing until this new version is released in the coming weeks.
Core
Added
- Added support for printing array creation statistics from TileDB (#191)
  tiledbvcf create -u my-array --stats
Changed
- Updated TileDB to 2.1.3 (#192)
Fixed
v0.6.0
Core
Changed
- Updated TileDB to v2.1.2 (#187)
- Point ranges are now used for each sample ID to prevent specific queries from retrieving the entire sample dimension (#175)
- Refactor C++ unit tests using CATCH matcher so export results can be checked, regardless of their order (#180)
- CLI unit tests will now report differences where they exist (#177)
- Remove calls to deprecated max_buffer_elements() function (#179)
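To illustrate why the point-range change (#175) matters, here is a minimal sketch in plain Python. It does not use TileDB, and the sample IDs and helper names are hypothetical. It only compares how many sample IDs a single spanning range covers against one point range per requested sample.

```python
def spanning_range(sample_ids):
    """One inclusive range from the smallest to the largest requested ID."""
    return [(min(sample_ids), max(sample_ids))]


def point_ranges(sample_ids):
    """One inclusive [id, id] range per distinct requested sample ID."""
    return [(i, i) for i in sorted(set(sample_ids))]


def covered(ranges):
    """Number of sample IDs covered by a list of inclusive ranges."""
    return sum(hi - lo + 1 for lo, hi in ranges)


requested = [2, 900]  # hypothetical sample IDs
print(covered(spanning_range(requested)))  # 899
print(covered(point_ranges(requested)))    # 2
```

With widely separated sample IDs, the spanning range touches almost the whole sample dimension, while the point ranges touch only what was asked for.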
Fixed
- Updated the record intersection algorithm to ensure we only report a single record for each query region and VCF record (#172)
- Remote files are now downloaded to sample-specific temporary locations (#182)
- Every record now independently checks regions for intersections to avoid the possibility that region filtering might exclude a result (#171)
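The two intersection fixes above (#171, #172) can be pictured with a small sketch. This is not the TileDB-VCF implementation. It is a plain-Python illustration that assumes inclusive coordinates and uses hypothetical (contig, start, end) tuples for both regions and records. Each record is checked against every region independently, and the nested loop visits each (region, record) pair exactly once, so no pair is reported twice.

```python
def overlaps(region, record):
    """True if an inclusive region and an inclusive record share a position."""
    r_contig, r_start, r_end = region
    v_contig, v_start, v_end = record
    return r_contig == v_contig and v_start <= r_end and v_end >= r_start


def intersect(regions, records):
    """Return one (region, record) pair per overlapping combination."""
    results = []
    for record in records:
        # Every record checks every region; no region is skipped
        # because of an earlier match or an earlier filter decision.
        for region in regions:
            if overlaps(region, record):
                results.append((region, record))
    return results


regions = [("chr1", 100, 200), ("chr1", 150, 300)]
records = [("chr1", 180, 190), ("chr1", 250, 260), ("chr2", 180, 190)]
for region, record in intersect(regions, records):
    print(region, record)
```

In this example the first record falls inside both regions and is reported once for each, the second matches only the second region, and the chr2 record matches nothing.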
Python
Added
- Add C/Python APIs for retrieving schema version and sample names (#164)
v0.5.3
v0.5.2
Notable
This release introduces significant performance enhancements for exports using large (indexed) BED files.
Core
Added
- Support for reading indexed BED files in parallel (#162).
- BED file parsing times are now included in export's verbose output (#160).
Fixed
- Sort internal index by start position when processing v3 arrays (#163).
- Use htslib's default read capacity size (#161).
All Changes
- 5f870dd Update spark/java versions to 0.5.2
- 38043fd External CI script for collecting native libs
- 76c0735 Don't build native libs if build stage failed
- cfa8cab Switch github releases to drafts
- 95f7892 Use boolean vars in CI
- c554bc0 Add support for reading BED file in parallel.
- b7153c3 V3 Arrays should move regions based on start_pos
- 8e22e04 Use the default HTSLIB read capacity size
- 13cff2a Add timing for parsing and sorting bed file
0.5.1
Notable
Compressed BED files are now supported for exports.
Core
Added
Changed
- Docker images are now built in a separate CI stage (#158).
CLI
Added
- export gains a --sorted flag to skip BED file sorting when the file is already known to be sorted (#147)
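Before passing --sorted, it can be worth confirming that a BED file really is sorted. The sketch below is a hypothetical pre-check in plain Python, not part of TileDB-VCF. It assumes "sorted" means that each contig appears in one contiguous block and that start positions never decrease within a contig; the ordering TileDB-VCF itself expects is not stated in these notes.

```python
def bed_is_sorted(lines):
    """Check contig grouping and non-decreasing starts in BED-style lines."""
    seen_contigs = set()
    prev_contig = None
    prev_start = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] in ("track", "browser"):
            continue  # skip BED header lines
        contig, start = fields[0], int(fields[1])
        if contig != prev_contig:
            if contig in seen_contigs:
                return False  # contig reappears after another contig
            seen_contigs.add(contig)
        elif start < prev_start:
            return False  # start position went backwards within a contig
        prev_contig, prev_start = contig, start
    return True


print(bed_is_sorted(["chr1\t10\t20", "chr1\t15\t30", "chr2\t5\t9"]))
print(bed_is_sorted(["chr1\t50\t60", "chr1\t10\t20"]))
```

The function accepts any iterable of lines, so an open file handle can be passed directly.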
Python
Fixed
- Fix buffer memory leak (#152)
Java/Spark
Added
- Additional tests for decoding attributes from info/fmt byte blobs (#140)
Changed
- getRanges()'s option for specifying genomic regions has been renamed from ranges to regions in order to be consistent with the other APIs (#151).
All Changes
- 4bb70cd Add stage for building docker images
- df78833 Merge pull request #157 from TileDB-Inc/ss/log-on-error-java-native-loading
- ce46126 Log all errors for java native lib loading
- cc4b8a1 Merge pull request #156 from TileDB-Inc/aw/ch3038/fix-var-len-filters
- 9994517 Merge pull request #155 from TileDB-Inc/sethshelnutt/ch3038/incorrect-values-in-filter-field
- f5bdb58 fix pytest for varable length filter
- 36b09a3 Filter/Allele index should be based on list_offset
- 5fd6fe9 Add pytest for querying variable length filters
- 6eebc7d Use correct post-condition for limit_partitions
- e1022c7 Add docstring and test for read_dask(limit_partitions=) kwarg
- d58a9a3 Add limit_partitions kwarg option for read_dask/map_dask
- 9a92107 Fix buffer memory leak
- 7c949d9 Add support for regions spark option
- 04f4437 Merge pull request #149 from TileDB-Inc/aw/ch2942/v3-example-array
- 0c0275e Merge pull request #150 from TileDB-Inc/sethshelnutt/ch2994/setting-memory-budget-and-tbb-config-causes
- b9930e4 Delay creating contexts for read
- f4011c8 Pin badge to master branch
- 6269ee8 Don't use deprecated python class in README
- 0409ac7 Use URI for v3 of vcf-samples-20 array
- ad59d23 Added CI step that builds a cross-platform jar
- c401bc5 Make clang-format happy
- dab62cb Update CLI description for no-duplicates arg
- 68803e5 Add CLI flag to skip region sorting
- 7b93018 Update to Gradle 6.6
- a12a147 Reader validity test
- 2398f30 Add parsing of bedfile via htslib
- e504249 Report diffs in run-cli-tests.sh
- efef9e2 Enable trace logging in cli tests
- b31d17f Add htslib plugin for reading files via VFS
- a3d0569 Merge pull request #145 from TileDB-Inc/ss/spark-java-version-0.5.1-snapshot
- c70da3e Update spark/java version to 0.5.1-SNAPSHOT
- ebd7905 Add unit tests which compares VCFInfoFmtDecoder [ch2819]
0.5.0
Notable
This release includes a new version of TileDB-VCF's schema for representing genomic data, which now indexes variants by start position. Together with numerous improvements made to the ingestion algorithm, TileDB-VCF now supports overlapping variants.
Note: Arrays created using previous schemas are still fully supported.
Core
Added
- C API methods for querying and counting fmt/info attributes (#115).
- Install option to ignore system installs of TileDB and force build the pinned version of TileDB as an external project (#138).
Changed
- Updated TileDB to 2.0.8 (#126, #139, #141).
- Updated schema (v3) to store variants by their start position (#105, #114).
- Data types for commonly used genotype fields are now correctly defined (#112).
- VCF records are now accessed internally using htslib's iterator during ingestion (#118).
Fixed
- Don't access nodes in the record heap that have already been released (#120).
- Fix segfault from reading an extra info value from the query result (#128).
- Java API is now built and tested on CI (#135).
- Don't take reference to query in reader futures (#137).
- Fixed bug retrieving fixed-length attributes containing null values (#142, #143).
- Fixed CI clang-format task (#126).
Spark
Added
- New verbose option .option("verbose", true) for providing additional information when querying array (#121).
- Add spark task stage/ID to partition reader logs (#129).
Changed
- Schema is now dynamic based on materialized attributes and available fmt/info fields (#115). Use select * to pull in all available attributes; df.schema() will describe and show all possible fields.
- Add pos as alias for the pos_start queryable attribute (#124).
- The samples and sampleFile options are now mutually exclusive; an error is thrown if both options are passed (#132).
Python
Added
- New attributes() method to retrieve all queryable attributes available in a dataset (#127).
- ingest_samples() gains arguments for setting the location and amount of scratch space to use when ingesting samples from S3 (#119, #122).
- Dataset class gained a verbose option that provides additional information when writing to or reading from an array (#121).
Changed
- Renamed TileDBVCFDataset class to Dataset (#116).
- Dataset.read()'s sample arguments (samples and samples_file) are now mutually exclusive; an error is thrown if both are defined (#134).
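The mutual-exclusion rule for samples and samples_file can be sketched as a small validation helper. This is a hypothetical illustration in plain Python, not the library's code: the function name resolve_samples and the use of ValueError are assumptions, since these notes do not say which exception Dataset.read() raises.

```python
def resolve_samples(samples=None, samples_file=None):
    """Return a list of sample names, or None to mean 'all samples'.

    Hypothetical helper: rejects calls that pass both arguments.
    """
    if samples is not None and samples_file is not None:
        raise ValueError("Pass either 'samples' or 'samples_file', not both.")
    if samples_file is not None:
        # Assumes one sample name per line in the file.
        with open(samples_file) as fh:
            return [line.strip() for line in fh if line.strip()]
    return list(samples) if samples is not None else None


print(resolve_samples(samples=["HG00096", "HG00097"]))
print(resolve_samples())
```

Failing early on conflicting arguments avoids silently preferring one source of sample names over the other.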
Fixed
- Buffers are now refreshed when performing multiple reads with the same Dataset object (#133).
CLI
Changed
- The --sample-names and --samples-file arguments are now optional. When omitted, all samples are exported by default. Previously one or the other was required.
Docker Images
Added
- Improved documentation for the tiledbvcf-cli and tiledbvcf-py Docker images (#110).
Changed
- All images now use /data as their working directory rather than /tmp.
- tiledbvcf-py can be used to execute a script or launch an interactive Python session.
Fixed
- The environment variable AWS_EC2_METADATA_DISABLED is now set to avoid slowdowns when querying S3 arrays outside of EC2.
0.4.3
0.4.2
Changes include:
- Revert gradle git plugin for automatic versioning, some users had problems with it #93
- Fix cmake overrides for python setup #94
- Build shadowJar without classifier #95
- Don't include spark libraries in shadowJar #96
- Allow publishing spark shadowJar to maven #97
- Update python dockerfile #98
- Overhaul README #101
- Update TileDB to 2.0.3 in superbuild #102
0.4.1
This adds support for duplicates in the dataset. There is a new CLI flag, --no-duplicates, to disable this new behavior. There is also a corresponding Python dataset option.
Additional changes include:
0.4.0
This release updates to TileDB 2.0.0, which includes performance optimizations and memory improvements.
Changes include: