Introduce dataframe file type to replace / supplement beddb and bed2ddb #37

pkerpedjiev · 2018-11-11T19:55:27Z

The goal of this PR is to discuss the introduction of a new file type that replaces the beddb and bed2ddb formats. This file type will be able to store any type of data and serve as the backing for gene annotations, bed-like regions, arbitrary points, etc.

Questions to address:

Tile API: the current API takes a zoom level, start and end position. It works right now because any genomic data is converted to a linearized representation where chromosomes are concatenated using a given chromosome order.

Dataframe-backed files will not have this limitation. The tile API will need a chromosome order associated with each request to indicate which data should be retrieved between coordinates x0 and x1.

Example API:

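def get_1D_tile_data(
    filename='my_file.tsv',             # path to the dataframe-backed file
    tile_position=[1, 0],               # presumably [zoom level, tile index]
    group_column=['chr'],               # column(s) identifying the group (chromosome)
    position_columns=['start', 'end'],  # columns holding the positions
    # presumably (name, length) pairs, listed in the order to concatenate them
    group_order=[('chr1', 1000), ('chr2', 5000), ('chrX', 4000), ('chrM', 3000)],
):
    ...  # implementation still to be decided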

Column to use as the index: A dataframe may have the start and end positions in arbitrary columns. The request should include an indicator of which columns to use for the positions of the data.
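
To make the two points above concrete, here is a rough sketch of how a tile request could be resolved against in-memory rows. It is not a proposed implementation: the function names are made up, rows are plain dicts rather than a real dataframe, `group_column` is a single column name, the numbers in `group_order` are assumed to be group lengths, and a tile at zoom level z is assumed to cover 1/2^z of the total linearized length.

```python
from itertools import accumulate


def group_offsets(group_order):
    """Return ({group name: absolute start offset}, total length)."""
    names = [name for name, _ in group_order]
    lengths = [length for _, length in group_order]
    # Each group starts where the previous groups end.
    starts = [0] + list(accumulate(lengths))[:-1]
    return dict(zip(names, starts)), sum(lengths)


def get_1d_tile_rows(rows, tile_position, group_column, position_columns, group_order):
    """Return the rows overlapping the tile (hypothetical helper, assumptions above)."""
    zoom, tile_index = tile_position
    offsets, total = group_offsets(group_order)
    # Assumption: zoom level z splits the linearized axis into 2**z equal tiles.
    tile_width = total / 2 ** zoom
    x0 = tile_index * tile_width
    x1 = x0 + tile_width
    start_col, end_col = position_columns
    hits = []
    for row in rows:
        offset = offsets.get(row[group_column])
        if offset is None:
            continue  # group not in the requested order, so it is not displayed
        abs_start = offset + row[start_col]
        abs_end = offset + row[end_col]
        # Keep rows whose interval overlaps the half-open tile range [x0, x1).
        if abs_start < x1 and abs_end > x0:
            hits.append(row)
    return hits
```

Because the offsets are computed per request, the same file could be served under different chromosome orders without being rewritten.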

Use cases

Replacing the current beddb and bed2ddb tile formats.

Performance

Filtering a 970K-line file takes about 200 ms. It may be possible to improve this through parallelization, sorting, indexing or subdividing the file into sections (e.g. chromosomes).
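
As a rough illustration of the sorting and subdividing ideas, the sketch below groups rows by chromosome, sorts each group by start position, and uses a binary search to discard every row that starts after the query range. The function and column names are placeholders, rows are plain dicts, and nothing here has been benchmarked against the 970K-line file.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(rows, group_column="chr", start_column="start"):
    """Subdivide rows by group and sort each group by start position."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group_column]].append(row)
    index = {}
    for group, group_rows in by_group.items():
        group_rows.sort(key=lambda r: r[start_column])
        starts = [r[start_column] for r in group_rows]
        index[group] = (starts, group_rows)
    return index


def query(index, group, x0, x1, end_column="end"):
    """Return rows in `group` overlapping the half-open range [x0, x1)."""
    starts, group_rows = index.get(group, ([], []))
    # Rows at or beyond this position start at or after x1 and cannot overlap.
    upper = bisect_left(starts, x1)
    # Rows before it still need their end checked against x0.
    return [r for r in group_rows[:upper] if r[end_column] > x0]
```

The end check is still a linear scan over the rows that start before x1, so this only helps for queries near the beginning of a group; an interval tree or a binned index would be needed to bound both sides.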

alexpreynolds · 2018-11-11T20:09:01Z

If you have sorted BED data and absToChr coordinates, you could use bedextract or tabix. Tabix might be better for one-off queries of individual ranges, while bedextract may do better with a query containing multiple ranges.
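
For reference, the conversion step could look something like the sketch below, which splits an absolute (linearized) range into per-chromosome ranges that could then be handed to either tool. This is an illustrative helper, not the existing absToChr function; it assumes `group_order` is a list of (name, length) pairs and that ranges are 0-based and half-open.

```python
def abs_range_to_chrom_ranges(x0, x1, group_order):
    """Split the absolute range [x0, x1) into (chrom, start, end) pieces."""
    ranges = []
    offset = 0  # absolute position where the current chromosome begins
    for name, length in group_order:
        lo = max(x0, offset)
        hi = min(x1, offset + length)
        if lo < hi:
            # Convert back to coordinates local to this chromosome.
            ranges.append((name, lo - offset, hi - offset))
        offset += length
    return ranges
```

A single tile can span several chromosomes, which is where a multi-range query becomes relevant.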

nvictus · 2018-11-28T22:47:27Z

This point came up on Slack: the replacement for beddb should be treatable both as an interval track and as a coverage/vector track. bigBed, for example, natively supports both.

pkerpedjiev assigned pkerpedjiev and unassigned pkerpedjiev Nov 11, 2018

flekschas added the feature label Nov 11, 2018
