Introduce dataframe file type to replace / supplement beddb and bed2ddb #37

Open
pkerpedjiev opened this issue Nov 11, 2018 · 2 comments

@pkerpedjiev
Member

The goal of this issue is to discuss the introduction of a new file type that replaces the beddb and bed2ddb formats. This file type will be able to store any tabular data and serve as the backing store for gene annotations, bed-like regions, arbitrary points, etc.

Questions to address:

  1. Tile API: the current API takes a zoom level and a start and end position. This works today because all genomic data is converted to a linearized representation in which chromosomes are concatenated in a given chromosome order.

Dataframe-backed files will not be tied to a fixed linearization. A tile request will therefore need a chromosome order associated with it, to indicate which data should be retrieved between coordinates x0 and x1.
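To make the role of the chromosome order concrete, here is a minimal sketch of how a linearized range could be split back into per-chromosome ranges. It assumes that the second value in each `group_order` pair is the chromosome length and that coordinates are 0-based and half-open; neither is stated above. The helper names `linear_offsets` and `split_linear_range` are hypothetical.

```python
from itertools import accumulate


def linear_offsets(group_order):
    """Map each group name to (offset, size) in the linearized space.

    Assumes each entry is (name, length); that is a guess at the meaning
    of the tuples in the example API.
    """
    sizes = [size for _, size in group_order]
    starts = [0] + list(accumulate(sizes))[:-1]
    return {name: (start, size)
            for (name, size), start in zip(group_order, starts)}


def split_linear_range(x0, x1, group_order):
    """Split the half-open linear range [x0, x1) into (name, start, end) pieces."""
    pieces = []
    for name, (offset, size) in linear_offsets(group_order).items():
        lo = max(x0, offset)
        hi = min(x1, offset + size)
        if lo < hi:
            # Convert back to coordinates local to this group.
            pieces.append((name, lo - offset, hi - offset))
    return pieces


group_order = [('chr1', 1000), ('chr2', 5000), ('chrX', 4000), ('chrM', 3000)]
print(split_linear_range(500, 6500, group_order))
# [('chr1', 500, 1000), ('chr2', 0, 5000), ('chrX', 0, 500)]
```

With a different `group_order`, the same x0 and x1 select different rows, which is why the order would have to travel with the request.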

Example API:

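def get_1D_tile_data(
    filename='my_file.tsv',
    tile_position=(1, 0),  # presumably (zoom level, tile index)
    group_column=('chr',),
    position_columns=('start', 'end'),
    # chromosome order; the second value is presumably the chromosome length
    group_order=(('chr1', 1000), ('chr2', 5000), ('chrX', 4000), ('chrM', 3000)),
):
    """Return the rows of `filename` that fall within the requested tile."""
    raise NotImplementedError  # proposed signature only, no implementation yet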
  2. Column to use as the index: a dataframe may store its start and end positions in arbitrary columns. The request should indicate which columns hold the positions of the data.
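As a rough illustration of passing the position columns with the request, the sketch below filters rows from a small in-memory TSV whose positions are not in the leading columns. The function `filter_rows`, the sample data, and the use of a single string for `group_column` are all assumptions made for this example, not part of the proposal.

```python
import csv
import io


def filter_rows(rows, group, start, end,
                group_column='chr', position_columns=('start', 'end')):
    """Return the rows in `group` that overlap the half-open range [start, end)."""
    start_col, end_col = position_columns
    return [
        row for row in rows
        if row[group_column] == group
        and int(row[start_col]) < end
        and int(row[end_col]) > start
    ]


# Hypothetical file: the position columns come after name, chr and score.
tsv = (
    "name\tchr\tscore\tstart\tend\n"
    "a\tchr1\t5\t100\t200\n"
    "b\tchr1\t7\t300\t500\n"
    "c\tchr2\t1\t150\t250\n"
    "d\tchr1\t3\t400\t600\n"
)
rows = list(csv.DictReader(io.StringIO(tsv), delimiter='\t'))

hits = filter_rows(rows, 'chr1', 150, 400)
print([row['name'] for row in hits])
# ['a', 'b']
```

Because the columns are looked up by name, the same function would work on a file that calls them, say, `txStart` and `txEnd`, as long as the request names them.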

Use cases

  1. Replacing the current beddb and bed2ddb tile formats.

Performance

Filtering a 970K-line file takes about 200 ms. It may be possible to improve this through parallelization, sorting, indexing, or subdividing the file into sections (e.g. by chromosome).
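One way the sorting, indexing, and per-chromosome subdivision ideas could fit together is sketched below, using only the standard library. It is not a measurement and makes no claim about the resulting speed-up; `build_index` and `query` are hypothetical names, and coordinates are assumed to be integer and half-open.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(records):
    """Group (chrom, start, end) records by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end in records:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [s for s, _ in intervals]
        # The longest interval bounds how far left an overlap can begin.
        max_len = max(e - s for s, e in intervals)
        index[chrom] = (starts, intervals, max_len)
    return index


def query(index, chrom, q_start, q_end):
    """Return intervals on `chrom` that overlap the half-open [q_start, q_end)."""
    if chrom not in index:
        return []
    starts, intervals, max_len = index[chrom]
    # Any overlapping interval starts in [q_start - max_len + 1, q_end).
    lo = bisect_left(starts, q_start - max_len + 1)
    hi = bisect_left(starts, q_end)
    return [(s, e) for s, e in intervals[lo:hi] if e > q_start]


records = [('chr1', 100, 200), ('chr1', 300, 500), ('chr2', 150, 250),
           ('chr1', 400, 600), ('chr1', 0, 50)]
index = build_index(records)
print(query(index, 'chr1', 150, 400))
# [(100, 200), (300, 500)]
```

The binary search narrows the scan to a window of candidate rows instead of the whole file; a single very long interval widens that window, so this works best when interval lengths are fairly uniform.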


@alexpreynolds
Contributor

If you have sorted BED data and absToChr coordinates, you could use bedextract or tabix. Tabix might be better for one-off queries of individual ranges, while bedextract may do better with a query containing multiple ranges.

@nvictus
Member

nvictus commented Nov 28, 2018

This point came up on Slack: the replacement for beddb should be treatable both as an interval track and as a coverage/vector track. bigBed, for example, natively supports both.
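For illustration, a coverage/vector view can be derived from the same interval rows by counting overlaps per bin. The sketch below is a plain-Python assumption of what that might look like (the function name `coverage` and the binning scheme are hypothetical); it does not describe how bigBed computes its summaries.

```python
def coverage(intervals, region_start, region_end, bin_size):
    """Count, for each bin, how many intervals overlap it (half-open coordinates)."""
    n_bins = -(-(region_end - region_start) // bin_size)  # ceiling division
    counts = [0] * n_bins
    for start, end in intervals:
        # Clip the interval to the requested region.
        lo = max(start, region_start)
        hi = min(end, region_end)
        if lo >= hi:
            continue
        first = (lo - region_start) // bin_size
        last = (hi - 1 - region_start) // bin_size
        for b in range(first, last + 1):
            counts[b] += 1
    return counts


intervals = [(0, 30), (20, 50), (90, 100)]
print(coverage(intervals, 0, 100, 25))
# [2, 2, 0, 1]
```

The interval view returns the rows themselves, while the vector view returns one number per bin, so a single backing file could serve both track types.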
