zzxx edited this page Oct 19, 2016 · 3 revisions

Introduction

husky::io::BinaryInputFormat provides husky::io::HDFSBinaryInputFormat and husky::io::NFSBinaryInputFormat to read binary files from HDFS and NFS, respectively. The data records in each binary file must be complete: a record must not be split into pieces and stored across different files. Each binary file is represented as a husky::base::BinStream, so the BinStream deserialization functions can be used to read data from it.

Usage

Use objects of husky::io::HDFSBinaryInputFormat or husky::io::NFSBinaryInputFormat directly:

// Option 1: HDFS
husky::io::HDFSBinaryInputFormat infmt;
infmt.set_input("/path/to/data/on/hdfs");
// Option 2: NFS (declare only one `infmt` in a given scope)
husky::io::NFSBinaryInputFormat infmt;
infmt.set_input("/path/to/data/on/nfs");

// Use a BinStream reference to read
husky::load(infmt, [](husky::base::BinStream& file) {
  // `size()` returns the number of bytes that remain to be read from the file
  while (file.size()) {
    int sz = husky::base::deser<int>(file);  // Read one int32: the number of values that follow
    for (int i = 0; sz-- > 0; )  // Read the next `sz` int32 values into `i`, one at a time
      file >> i;
  }
});
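
The loop above assumes that each record is an int32 count followed by that many int32 values. The following is a minimal, self-contained sketch of that layout. `ByteReader`, `append_record`, and `read_records` are hypothetical names used only for illustration and are not part of Husky; the sketch also assumes 4-byte integers in native byte order, which may not match how a real BinStream encodes data.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a file BinStream; NOT Husky's class.
class ByteReader {
 public:
  explicit ByteReader(std::vector<char> bytes) : bytes_(std::move(bytes)) {}

  // Number of bytes that have not been read yet.
  std::size_t size() const { return bytes_.size() - pos_; }

  // Read one int32 in native byte order (an assumption of this sketch).
  std::int32_t read_int32() {
    if (size() < sizeof(std::int32_t)) throw std::runtime_error("truncated record");
    std::int32_t value;
    std::memcpy(&value, bytes_.data() + pos_, sizeof(value));
    pos_ += sizeof(value);
    return value;
  }

 private:
  std::vector<char> bytes_;
  std::size_t pos_ = 0;
};

// Append one complete record: the element count, then the elements.
void append_record(std::vector<char>& out, const std::vector<std::int32_t>& values) {
  auto put = [&out](std::int32_t v) {
    const char* p = reinterpret_cast<const char*>(&v);
    out.insert(out.end(), p, p + sizeof(v));
  };
  put(static_cast<std::int32_t>(values.size()));
  for (std::int32_t v : values) put(v);
}

// Read every record back, mirroring the while/for loop shown above.
std::vector<std::vector<std::int32_t>> read_records(ByteReader& file) {
  std::vector<std::vector<std::int32_t>> records;
  while (file.size()) {
    const std::int32_t sz = file.read_int32();
    if (sz < 0) throw std::runtime_error("negative element count");
    std::vector<std::int32_t> record;
    for (std::int32_t i = 0; i < sz; ++i) record.push_back(file.read_int32());
    records.push_back(std::move(record));
  }
  return records;
}
```

Because every record is written whole into a single buffer, the reader never has to resume a record from another file, which is the completeness requirement stated in the introduction.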

Use the unified API husky::io::BinaryInputFormat. This is convenient because switching to another data storage system only requires a change to the config file:

// General form (declare only one `infmt` in a given scope)
husky::io::BinaryInputFormat infmt("protocol:///path/to/data");
// Read from NFS
husky::io::BinaryInputFormat infmt("nfs:///path/to/data/on/nfs");
// Read from HDFS
husky::io::BinaryInputFormat infmt("hdfs:///path/to/data/on/hdfs");

husky::load(infmt, [](husky::base::BinStream& file) { ... });

A regex filter can be used to ignore files that don't match a pattern. The filter is applied not just to the file name but to the file path relative to the base path. To add a regex filter, append a colon to the path, followed by the regular expression:

// Read all files on HDFS that are located in /A/B/C
// and whose names start with `part-` and end with a number.
husky::io::BinaryInputFormat infmt("hdfs:///A/B/C:part-\\d+");

// Read all files on NFS that are located in /A/B/C or its sub-directories
// and whose names start with `part-` and end with a number.
husky::io::BinaryInputFormat infmt("nfs:///A/B/C:.*/part-\\d+$");
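
To make the `path:regex` convention concrete, here is a rough standalone sketch that splits such a string and applies the filter to a list of candidate paths. `InputSpec`, `parse_spec`, and `filter_paths` are hypothetical names, not Husky functions. This page does not say whether Husky requires a full or a partial match, or whether the checked path starts with a slash; the sketch assumes full matching (`std::regex_match`) on relative paths without a leading slash, so its results may differ from Husky's.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper type; NOT part of Husky.
struct InputSpec {
  std::string protocol;  // e.g. "hdfs" or "nfs"
  std::string path;      // base path, e.g. "/A/B/C"
  std::string filter;    // regex after the colon; empty means no filter
};

// Split "protocol:///base/path:regex" into its three parts.
InputSpec parse_spec(const std::string& spec) {
  InputSpec out;
  std::string rest = spec;
  const std::size_t scheme_end = spec.find("://");
  if (scheme_end != std::string::npos) {
    out.protocol = spec.substr(0, scheme_end);
    rest = spec.substr(scheme_end + 3);
  }
  // The first colon after the protocol separates the path from the regex.
  const std::size_t colon = rest.find(':');
  out.path = rest.substr(0, colon);  // whole string when there is no colon
  if (colon != std::string::npos) out.filter = rest.substr(colon + 1);
  return out;
}

// Keep only the paths (relative to the base path) that fully match the filter.
// Full matching is an assumption of this sketch.
std::vector<std::string> filter_paths(const std::vector<std::string>& relative_paths,
                                      const std::string& filter) {
  if (filter.empty()) return relative_paths;
  const std::regex re(filter);
  std::vector<std::string> kept;
  for (const std::string& p : relative_paths) {
    if (std::regex_match(p, re)) kept.push_back(p);
  }
  return kept;
}
```

Under these assumptions, `parse_spec("hdfs:///A/B/C:part-\\d+")` yields the protocol `hdfs`, the path `/A/B/C`, and the filter `part-\d+`; that filter keeps `part-0001` and drops `part-0002.tmp`.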

Important Notes

  1. Do not serialize anything into a file BinStream. The behavior is undefined.
  2. Do not copy a file BinStream into a new BinStream. The behavior is undefined.