Jouni Siren edited this page Nov 18, 2015 · 1 revision

Background

Building the BWT is a solved problem for sequence collections of up to hundreds of gigabytes in size. For larger collections, there are several issues to consider:

  • Construction time: As a rough guideline, an algorithm that indexes 1 MB/s is adequate for up to 100 gigabytes of data and remains somewhat useful up to 1 terabyte. Larger datasets require faster algorithms.
  • Construction space: When the datasets are much larger than the amount of memory available, we cannot afford to use even a single bit of working space per input character.
  • Available hardware: In a typical computer cluster, a single node has two CPUs (with up to tens of cores), tens to hundreds of gigabytes of local memory, a limited amount of local disk space, and large amounts of shared (but often slow) disk space. Some tools require fast GPUs or large amounts of fast disk space, which are generally not available.
  • Resource usage: Merging large BWTs is easy if we are willing to do a lot of redundant work on multiple nodes. Because computer clusters generally do not have large amounts of unused capacity, good tools should make efficient use of resources.

Algorithms and implementations

There are several BWT construction algorithms based on updating an existing BWT. Some algorithms extend sequences already in the collection, updating BWT(T[i+1,j]) to BWT(T[i,j]). Others insert new sequences into the collection, updating BWT(S) to BWT(S,T). In both cases, the algorithm can either apply batch updates to a static BWT representation or use a dynamic representation for the BWT. Algorithms with a static representation require more working space, while algorithms with a dynamic representation incur more space overhead for the BWT itself.

Some common implementations include:

  • BEETL: (extend, static) on disk
  • NVBIO: (insert, dynamic) using GPU
  • RLCSA: (insert, static)
  • ropebwt: (extend, static) or (insert by extending, dynamic)
  • ropebwt2: (insert by extending, dynamic)

As BWT-merge merges existing BWTs, it is (insert, static).

There are also other algorithms for building the BWT of large read collections. These are based on, for example, partitioning the suffixes, sorting each partition separately, and building the BWT directly.
