PartialCsvParser is a C++ CSV parser.
It is designed to parse a single CSV file in parallel.
PartialCsvParser is a single-header library released into the public domain.
Just copy PartialCsvParser.hpp into your include path and include it.
You can also git add the header file to your repository, and even modify it.
I appreciate your pull requests if you make some improvements :)
- Pretty good single-thread and multi-thread performance.
  - Graphs comparing sequential performance with other CSV parsers and evaluating scalability accompany this project. Check benchmark/ for a more detailed explanation of performance.
- Accepts CSV input from both files and memory.
- Simple interface that works with the STL (Standard Template Library).
- Column separator (',' by default) and line separator ('\n' by default) are customizable.
  - Also usable for TSV parsing.
- Parses CSV both with and without a header line.
- UTF-8 support.
- A range within a file can be specified to parse only part of a CSV file.
  - Data parallelism is easily achieved by creating threads with different ranges.
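To make the separator feature concrete, here is a minimal, library-independent sketch of splitting one line on a configurable column separator. `split_row` is a hypothetical helper written for this README, not part of PartialCsvParser's API, and it handles no quoting or escaping.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (NOT part of PartialCsvParser): splits one line into
// columns on a single-character separator. Quoting/escaping is not handled.
std::vector<std::string> split_row(const std::string& line,
                                   char column_separator = ',') {
    std::vector<std::string> columns;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type pos = line.find(column_separator, start);
        if (pos == std::string::npos) {
            columns.push_back(line.substr(start));  // last column (may be empty)
            break;
        }
        columns.push_back(line.substr(start, pos - start));
        start = pos + 1;  // continue after the separator
    }
    return columns;
}
```

Passing '\t' as the second argument gives TSV-style splitting, which is the idea behind the "also usable for TSV parsing" item above.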
Some examples are available in the example/ directory.
You can also build and run them quickly.
$ cd example/
$ cmake . && make
$ ./00_parse_with_1parser
example/00_parse_with_1parser.cpp
/**
 * Parses a CSV file and prints the contents.
 */
#include <PartialCsvParser.hpp>
#include <vector>
#include <string>
#include <iostream>

int main() {
  PCP::CsvConfig csv_config("english.csv");

  // parse header line
  std::vector<std::string> headers = csv_config.get_headers();

  // print headers
  std::cout << "Headers:" << std::endl;
  for (size_t i = 0; i < headers.size(); ++i)
    std::cout << headers[i] << "\t";
  std::cout << std::endl << std::endl;

  // instantiate parser
  PCP::PartialCsvParser parser(csv_config);  // parses whole body of CSV without range options.

  // parse & print body lines
  std::vector<std::string> row;
  while (!(row = parser.get_row()).empty()) {
    std::cout << "Got a row: ";
    for (size_t i = 0; i < row.size(); ++i)
      std::cout << row[i] << "\t";
    std::cout << std::endl;
  }

  return 0;
}
Output:
$ ./00_parse_with_1parser
Headers:
Country Name Style
Got a row: Japan Shonan Gold Fruit Beer
Got a row: Scotland Punk IPA IPA
Got a row: Germany Franziskaner Hefe-Weissbier
- Parsing only. Writing out a CSV file is not supported.
- Multi-byte line separators such as CRLF are not supported.
  - This may be easily fixed, thanks :D
- The enclosure character (typically '"') is not supported.
  - The following CSV file is recognized as having 2 rows and 2 columns, whereas it should have 1 row and 3 columns if '"' is treated as the enclosure character.

        aaa,"bbb
        ccccccc",ddd

  - The reason: Say you are a parser. Your range starts at the 3rd 'c'. You see '"' in front of you. Is that an open quote or a close quote? You cannot tell without parsing from the beginning of the file.
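The enclosure limitation can be illustrated with a small standalone sketch. It does not use PartialCsvParser; `count_rows` and `inside_quotes_at` are hypothetical helpers that only show why the answer changes once quotes are honored, and why the quote state at an arbitrary offset depends on everything before it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (NOT part of PartialCsvParser). Counts rows in `text`.
// When `honor_quotes` is true, a '\n' between a pair of '"' characters is
// kept as field content instead of ending the row.
std::size_t count_rows(const std::string& text, bool honor_quotes) {
    std::size_t rows = 0;
    bool inside_quotes = false;
    bool row_has_content = false;
    for (char c : text) {
        if (honor_quotes && c == '"') inside_quotes = !inside_quotes;
        if (c == '\n' && !inside_quotes) {
            ++rows;
            row_has_content = false;
        } else {
            row_has_content = true;
        }
    }
    return rows + (row_has_content ? 1 : 0);  // count a final row without '\n'
}

// Hypothetical helper. Reports whether `offset` lies inside a quoted field.
// The answer depends on every '"' in [0, offset), so the scan has to start
// at the beginning of the text.
bool inside_quotes_at(const std::string& text, std::size_t offset) {
    bool inside_quotes = false;
    for (std::size_t i = 0; i < offset && i < text.size(); ++i)
        if (text[i] == '"') inside_quotes = !inside_quotes;
    return inside_quotes;
}
```

For the two-line sample above, `count_rows(text, false)` sees two rows while `count_rows(text, true)` sees one. `inside_quotes_at` needs the whole prefix, which is exactly what a parser starting in the middle of a file does not have.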
- A reference manual powered by Doxygen is available.
PCP::PartialCsvParser::PartialCsvParser() takes 2 offsets: parse_from and parse_to.
If you have multiple threads and each of them holds a different [parse_from, parse_to] range, the CSV file is parsed in parallel.
It is assured that all lines of the CSV file are parsed exactly once if the [parse_from, parse_to] ranges of all parser instances cover [PCP::CsvConfig::body_offset(), PCP::CsvConfig::filesize() - 1] without gaps and overlaps (see the following diagram).
header1,header2 \n aaaaaaaa,bbbbbbbbbb \n ccccccccc,dddd \n
                   ^                                      ^
                   body_offset                 filesize - 1
                   <----------><-----------><------------->
                     parser1      parser2       parser3
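One way to produce such ranges is sketched below. `split_ranges` is a hypothetical helper, not something the library provides; it only does the arithmetic of cutting [body_offset, filesize - 1] into contiguous inclusive ranges. How each pair is handed to a parser (argument order, defaults) should be checked against the reference manual.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (NOT part of PartialCsvParser). Splits the inclusive
// byte range [body_offset, filesize - 1] into `n_parsers` contiguous
// inclusive ranges with no gaps and no overlaps.
std::vector<std::pair<std::size_t, std::size_t>>
split_ranges(std::size_t body_offset, std::size_t filesize,
             std::size_t n_parsers) {
    assert(n_parsers > 0 && body_offset < filesize);
    const std::size_t body_size = filesize - body_offset;
    assert(n_parsers <= body_size);  // at least one byte per range

    const std::size_t base = body_size / n_parsers;
    const std::size_t extra = body_size % n_parsers;  // first `extra` ranges get +1 byte

    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t from = body_offset;
    for (std::size_t i = 0; i < n_parsers; ++i) {
        const std::size_t len = base + (i < extra ? 1 : 0);
        ranges.emplace_back(from, from + len - 1);  // (parse_from, parse_to)
        from += len;  // next range starts right after this one
    }
    return ranges;
}
```

For a 51-byte file whose body starts at offset 16, three parsers would get [16, 27], [28, 39] and [40, 50]. The ranges do not need to line up with line boundaries, as the diagram shows.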
- Get Google Test.
$ wget https://googletest.googlecode.com/files/gtest-1.7.0.zip
$ unzip gtest-1.7.0.zip
$ mv gtest-1.7.0 /path/to/PartialCsvParser/contrib/gtest
- Build the test executables.
$ cd test
$ cmake . && make
- Run the unit tests and integration tests.
$ ./run_unit_test
$ ./run_integrated_test
Sho Nakatani, a.k.a. laysakura
This project is released into the public domain.
See UNLICENSE for a more detailed explanation.