Skip to content

Latest commit

 

History

History
150 lines (106 loc) · 5.1 KB

README.md

File metadata and controls

150 lines (106 loc) · 5.1 KB

TileDB-VCF Docker Images

TileDB-VCF is a C++ library for efficient storage and retrieval of genomic variant-call data using TileDB Embedded.

Quick reference

Images

Pre-built images are available on Docker Hub.

Variants

Supported tags

  • latest: latest stable release (recommended)
  • dev: development version
  • v0.x.x for a specific version

Parameters

  • AWS_EC2_METADATA_DISABLED Disable the EC2 metadata service (default true). This avoids unnecessary API calls when querying S3-hosted arrays from non-EC2 clients. If you are on an EC2 instance this service can be enabled with -e AWS_EC2_METADATA_DISABLED=false.

Building locally

To build the Dockerfiles locally you need to clone the TileDB-VCF repository from GitHub and run docker build from within the local repo's root directory.

git clone https://github.com/TileDB-Inc/TileDB-VCF.git
cd TileDB-VCF

# build the cli image
docker build -f docker/Dockerfile-cli .

# build the python library
docker build -f docker/Dockerfile-py .

Usage

The following examples are meant to provide a quick overview of how to use these containers. See our docs for more comprehensive introductions to TileDB-VCF's functionality.

Here, we're using a publicly accessible TileDB-VCF dataset hosted on S3 that contains genome-wide variants for 20 synthetic samples. To pass in a local array or save exported files you will need to bind a local directory to the container's /data directory.

CLI

You can use the tiledbvcf-cli image to run any of the CLI's available commands.

List all samples in the dataset:

docker run --rm tiledb/tiledbvcf-cli list \
  --uri s3://tiledb-inc-demo-data/tiledbvcf-arrays/v4/vcf-samples-20

Create a table of all variants within a region of interest for sample v2-WpXCYApL

docker run --rm -v $PWD:/data tiledb/tiledbvcf-cli export \
  --uri s3://tiledb-inc-demo-data/tiledbvcf-arrays/v4/vcf-samples-20 \
  -Ot --tsv-fields "CHR,POS,REF,S:GT"
  --regions chr7:144000320-144008793 \
  --sample-names v2-WpXCYApL

To avoid permission issues when creating TileDB-VCF datasets with Docker you need to provide the current user ID and group ID to the container:

docker run --rm \
  -v $PWD:/data \
  -u "$(id -u):$(id -g)" \
  tiledb/tiledbvcf-cli \
  create -u test-array

User and group IDs should also be specified when ingesting data:

# create a sample file with S3 URIs for 10 example BCF files
printf "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G%i.bcf\n" $(seq 10) > samples.txt

docker run --rm \
  -v $PWD:/data \
  -u "$(id -u):$(id -g)" \
  tiledb/tiledbvcf-cli \
  store -u test-array -f samples.txt --scratch-mb 10 --verbose

Python

You can use the tiledbvcf-py container to execute an external script or launch an interactive Python session.

docker run -it --rm tiledb/tiledbvcf-py

The following script performs the same query as above but returns a pandas DataFrame:

import tiledbvcf

uri = "s3://tiledb-inc-demo-data/tiledbvcf-arrays/v4/vcf-samples-20"

# open the array in 'read' mode
ds = tiledbvcf.Dataset(uri, mode = "r")

ds.read(
  attrs=['sample_name', 'pos_start', 'pos_end', 'alleles'],
  regions=['chr7:144000320-144008793'],
  samples=['v2-WpXCYApL']
)

##     sample_name  pos_start    pos_end         alleles
## 0   v2-WpXCYApL  143999628  144000483  [G, <NON_REF>]
## 1   v2-WpXCYApL  144000484  144001253  [C, <NON_REF>]
## 2   v2-WpXCYApL  144001254  144001494  [G, <NON_REF>]
## 3   v2-WpXCYApL  144001495  144001735  [T, <NON_REF>]
## 4   v2-WpXCYApL  144001736  144001900  [C, <NON_REF>]
## ..          ...        ...        ...             ...
## 66  v2-WpXCYApL  144008445  144008467  [C, <NON_REF>]
## 67  v2-WpXCYApL  144008468  144008479  [T, <NON_REF>]
## 68  v2-WpXCYApL  144008480  144008690  [A, <NON_REF>]
## 69  v2-WpXCYApL  144008691  144008697  [A, <NON_REF>]
## 70  v2-WpXCYApL  144008698  144008846  [C, <NON_REF>]
##
## [71 rows x 4 columns]

Want to learn more?

Check out TileDB-VCF's docs for installation instructions, detailed descriptions of the data format and algorithms, and in-depth tutorials.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.