Skip to content
Jouni Siren edited this page Nov 18, 2015 · 1 revision

BWT-merge reads and writes many different BWT formats. This document describes how to add support for a hypothetical FooFormat. Remember to mention the format in README.md and on page BWT Formats.

  • The header has three fields:
    • uint64_t bases is the number of characters in the BWT.
    • uint64_t sequences is the number of sequences in the BWT.
    • uint64_t bytes is the number of bytes in the data section.
  • The data section is run-length encoded. The low 3 bits of each byte contain a comp value, while the 5 high-order bits are the length of the run.

formats.h

Each BWT format is represented as a struct with the following interface:

struct FooFormat
{
  static void read(std::ifstream& in, BlockArray& data, sdsl::int_vector<64>& counts);
  static void write(std::ofstream& out, const BlockArray& data, const NativeHeader& info);
  inline static AlphabeticOrder order() { return AO_SORTED; }

  const static std::string name;
  const static std::string tag;
};
  • read() reads the BWT from a file in FooFormat.
    • in is the input stream. It may change from std::ifstream& to std::istream& in the future.
    • The BWT must be written in the native format into data. Any existing contents of data must be discarded.
    • The number of occurrences of each comp value c in the BWT must be written into counts[c]. read() must create a new array first.
  • write() writes the BWT to a file in FooFormat.
    • out is the output stream. It may change from std::ofstream& to std::ostream& in the future.
    • data contains the BWT in the native format.
    • info is the header of the BWT in the native format.
  • order() returns the alphabetic order used in the format. Possible orders are AO_DEFAULT and AO_SORTED. If the format supports both orders, separate structs must be created for them.
  • name is the human-readable name of the format.
  • tag is the short name of the format used as e.g. a command line argument.

It is also a good habit to add a short description of the format in the comment section before the format interfaces.

formats.cpp

Define name and tag near the names and tags of other formats.

const std::string FooFormat::name = "Foo format";
const std::string FooFormat::tag = "foo";

Define the I/O interface:

void
FooFormat::read(std::ifstream& in, BlockArray& data, sdsl::int_vector<64>& counts)
{
  // Initialize the structures.
  data.clear();
  counts = sdsl::int_vector<64>(6, 0);

  // Read the header.
  uint64_t bases = 0, sequences = 0, bytes = 0;
  sdsl::read_member(bases, in);
  sdsl::read_member(sequences, in);
  sdsl::read_member(bytes, in);

  // Read the data.
  RunBuffer run_buffer;
  std::vector<uint8_t> buffer(MEGABYTE);
  for(size_type offset = 0; offset < bytes; offset += MEGABYTE)
  {
    size_type buffer_size = std::min(MEGABYTE, bytes - offset);
    in.read((char*)(buffer.data()), buffer_size);
    for(size_type i = 0; i < buffer_size; i++)
    {
      if(run_buffer.add(buffer[i] & 0x7, buffer[i] >> 3))
      {
        Run::write(data, run_buffer.run);
        counts[run_buffer.run.first] += run_buffer.run.second;
      }
    }
  }
  run_buffer.flush();
  Run::write(data, run_buffer.run);
  counts[run_buffer.run.first] += run_buffer.run.second;
}

void
FooFormat::write(std::ofstream& out, const BlockArray& data, const NativeHeader& info)
{
  // Count the number of bytes/runs.
  uint64_t bytes = 0;
  size_type rle_pos = 0;
  while(rle_pos < data.size())
  {
    range_type run = Run::read(data, rle_pos);
    bytes += (run.second + 30) / 31;
  }

  // Write the header.
  uint64_t bases = info.bases, sequences = info.sequences;
  sdsl::write_member(bases, out);
  sdsl::write_member(sequences, out);
  sdsl::write_member(bytes, out);

  // Write the BWT.
  rle_pos = 0;
  std::vector<uint8_t> buffer; buffer.reserve(MEGABYTE);
  while(rle_pos < data.size())
  {
    range_type run = Run::read(data, rle_pos);
    while(run.second > 0)
    {
      size_type run_length = std::min(run.second, (size_type)31);
      buffer.push_back(run.first | (run_length << 3));
      run.second -= run_length;
      if(buffer.size() >= MEGABYTE)
      {
        out.write((char*)(buffer.data()), buffer.size());
        buffer.clear();
      }
    }
  }
  out.write((char*)(buffer.data()), buffer.size());
}

Add the following line to formatExists():

 || (format == FooFormat::tag)

Add the following line to printFormats() under formats using sorted alphabet:

  printFormat<FooFormat>(stream);

fmi.cpp

Add the following lines to serialize():

  else if(format == FooFormat::tag)
  {
    fmi.serialize<FooFormat>(filename);
  }

Add the following lines to load():

  else if(format == FooFormat::tag)
  {
    fmi.load<FooFormat>(filename);
  }
Clone this wiki locally