-
Notifications
You must be signed in to change notification settings - Fork 3
New Formats
Jouni Siren edited this page Nov 18, 2015
·
1 revision
BWT-merge reads and writes many different BWT formats. This document describes how to add support for a hypothetical FooFormat
. Remember to mention the format in README.md and on page BWT Formats.
- The header has three fields:
-
uint64_t bases
is the number of characters in the BWT. -
uint64_t sequences
is the number of sequences in the BWT. -
uint64_t bytes
is the number of bytes in the data section.
-
- The data section is run-length encoded. The low 3 bits of each byte contain a comp value, while the 5 high-order bits are the length of the run.
Each BWT format is represented as a struct
with the following interface:
struct FooFormat
{
static void read(std::ifstream& in, BlockArray& data, sdsl::int_vector<64>& counts);
static void write(std::ofstream& out, const BlockArray& data, const NativeHeader& info);
inline static AlphabeticOrder order() { return AO_SORTED; }
const static std::string name;
const static std::string tag;
};
-
read()
reads the BWT from a file inFooFormat
.-
in
is the input stream. It may change fromstd::ifstream&
tostd::istream&
in the future. - The BWT must be written in the native format into
data
. Any existing contents ofdata
must be discarded. - The number of occurrences of each comp value
c
in the BWT must be written intocounts[c]
.read()
must create a new array first.
-
-
write()
writes the BWT to a file inFooFormat
.-
out
is the output stream. It may change fromstd::ofstream&
tostd::ostream&
in the future. -
data
contains the BWT in the native format. -
info
is the header of the BWT in the native format.
-
-
order()
returns the alphabetic order used in the format. Possible orders areAO_DEFAULT
andAO_SORTED
. If the format supports both orders, separatestruct
s must be created for them. -
name
is the human-readable name of the format. -
tag
is the short name of the format used as e.g. a command line argument.
It is also a good habit to add a short description of the format in the comment section before the format interfaces.
Define name
and tag
near the names and tags of other formats.
const std::string FooFormat::name = "Foo format";
const std::string FooFormat::tag = "foo";
Define the I/O interface:
void
FooFormat::read(std::ifstream& in, BlockArray& data, sdsl::int_vector<64>& counts)
{
// Initialize the structures.
data.clear();
counts = sdsl::int_vector<64>(6, 0);
// Read the header.
uint64_t bases = 0, sequences = 0, bytes = 0;
sdsl::read_member(bases, in);
sdsl::read_member(sequences, in);
sdsl::read_member(bytes, in);
// Read the data.
RunBuffer run_buffer;
std::vector<uint8_t> buffer(MEGABYTE);
for(size_type offset = 0; offset < bytes; offset += MEGABYTE)
{
size_type buffer_size = std::min(MEGABYTE, bytes - offset);
in.read((char*)(buffer.data()), buffer_size);
for(size_type i = 0; i < buffer_size; i++)
{
if(run_buffer.add(buffer[i] & 0x7, buffer[i] >> 3))
{
Run::write(data, run_buffer.run);
counts[run_buffer.run.first] += run_buffer.run.second;
}
}
}
run_buffer.flush();
Run::write(data, run_buffer.run);
counts[run_buffer.run.first] += run_buffer.run.second;
}
void
FooFormat::write(std::ofstream& out, const BlockArray& data, const NativeHeader& info)
{
// Count the number of bytes/runs.
uint64_t bytes = 0;
size_type rle_pos = 0;
while(rle_pos < data.size())
{
range_type run = Run::read(data, rle_pos);
bytes += (run.second + 30) / 31;
}
// Write the header.
uint64_t bases = info.bases, sequences = info.sequences;
sdsl::write_member(bases, out);
sdsl::write_member(sequences, out);
sdsl::write_member(bytes, out);
// Write the BWT.
rle_pos = 0;
std::vector<uint8_t> buffer; buffer.reserve(MEGABYTE);
while(rle_pos < data.size())
{
range_type run = Run::read(data, rle_pos);
while(run.second > 0)
{
size_type run_length = std::min(run.second, (size_type)31);
buffer.push_back(run.first | (run_length << 3));
run.second -= run_length;
if(buffer.size() >= MEGABYTE)
{
out.write((char*)(buffer.data()), buffer.size());
buffer.clear();
}
}
}
out.write((char*)(buffer.data()), buffer.size());
}
Add the following line to formatExists()
:
|| (format == FooFormat::tag)
Add the following line to printFormats()
under formats using sorted alphabet:
printFormat<FooFormat>(stream);
Add the following lines to serialize()
:
else if(format == FooFormat::tag)
{
fmi.serialize<FooFormat>(filename);
}
Add the following lines to load()
:
else if(format == FooFormat::tag)
{
fmi.load<FooFormat>(filename);
}