Support multi line for CSV #1408

yw-yang · 2018-10-12T12:57:20Z

In some cases, a column value may contain a carriage return within double quotes, e.g.

"A", "B", "C
D"
"Q", "W", "E
R"
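To make the problem concrete, here is a minimal sketch using only Python's standard `csv` module (not SMV or Spark). It shows that the sample above is four physical lines but only two logical records. The variable names are illustrative, and `skipinitialspace=True` is needed only because the sample has a space after each comma.

```python
import csv
import io

# The sample from this issue: two records, each with a line break
# inside the last quoted field.
raw = '"A", "B", "C\nD"\n"Q", "W", "E\nR"\n'

# A purely line-oriented reader sees four physical lines.
physical_lines = raw.splitlines()

# A quote-aware parser keeps the embedded line breaks inside the field
# and sees two records.
records = list(csv.reader(io.StringIO(raw, newline=""), skipinitialspace=True))

print(len(physical_lines), len(records))
print(records)
```

Any reader that treats each physical line as one record will misparse this input, which is why a dedicated multi-line option is needed.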

Spark has supported the multiLine option since 2.2 (see https://issues.apache.org/jira/browse/SPARK-19610). Should SMV also add it to csvAttributes?

ninjapapa · 2018-10-12T14:01:02Z

This one is interesting. We need to understand how they implemented it. We tested the Spark CSV reader in the past and found its performance quite low compared with our version, so it may be worth benchmarking again.

ninjapapa · 2018-10-12T15:31:58Z

@yw-yang Before we add support for this, you can use SmvInputBase to call Spark's native CSV reader. You can refer to the SmvXmlFile implementation here:
https://github.com/TresAmigosSD/SMV/blob/master/src/main/python/smv/smvinput.py#L182

Basically, you can create a lib in your project and define a SmvSparkCsvFile class that extends SmvInputBase or SmvInputFromFile in the same way SmvXmlFile does. Then your project's input modules can extend SmvSparkCsvFile instead of SmvCsvFile.
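As a rough sketch of the Spark side of that workaround, the helper below reads a CSV through Spark's native reader with `multiLine` turned on. The function name `read_multiline_csv` and its parameters are hypothetical. The SMV wrapper class itself is not shown, since its exact hooks should be taken from the SmvXmlFile source linked above. The option names (`multiLine`, `header`, `quote`, `ignoreLeadingWhiteSpace`) are Spark CSV reader options; `multiLine` requires Spark 2.2 or later.

```python
def read_multiline_csv(spark, path, header=False):
    """Read a CSV whose quoted fields may contain line breaks.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    This helper is hypothetical and not part of SMV.
    """
    return (
        spark.read
        .option("multiLine", True)                # allow records to span lines
        .option("header", header)                 # whether the first record is a header
        .option("quote", '"')                     # quote character used by the file
        .option("ignoreLeadingWhiteSpace", True)  # tolerate a space after each comma
        .csv(path)
    )
```

A hypothetical SmvSparkCsvFile could call a helper like this from whatever method SmvXmlFile uses to build its DataFrame.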

AliTajeldin · 2018-12-12T02:01:58Z

Just FYI: they use a non-splittable stream reader (basically a binary file handler rather than the standard Hadoop text file reader) to stream the data. This means an entire CSV file has to be read by a single task, which is why this option is off by default.
See the WholeFileCSVDataSource class in the PR (https://github.com/apache/spark/pull/16976/files#diff-336e0745f97628c82024b2d731ac0166R175)
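To illustrate why such a file cannot simply be split at line boundaries, here is a simplified sketch using Python's standard `csv` module. It is not Spark or Hadoop code, and the split point is an assumption chosen for the demonstration. A parser that starts at the second physical line has no way to know it is inside a quoted field, so it emits a bogus record.

```python
import csv

# The sample from this issue as physical lines.
lines = ['"A", "B", "C\n', 'D"\n', '"Q", "W", "E\n', 'R"\n']

def parse(chunk):
    """Parse an iterable of physical lines with a quote-aware CSV reader."""
    return list(csv.reader(chunk, skipinitialspace=True))

# Reading the file as one stream yields the two intended records.
whole = parse(lines)

# Hypothetical split that begins at the second physical line: the first
# row it produces is a fragment of the previous record.
second_split = parse(lines[1:])

print(whole)
print(second_split)
```

Reading the whole file in one stream avoids this ambiguity, at the cost of parallelism within a single file.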
