Support multi line for CSV #1408

Open
yw-yang opened this issue Oct 12, 2018 · 3 comments

@yw-yang
Collaborator

yw-yang commented Oct 12, 2018

In some cases, a column value may contain a carriage return (line break) inside double quotes, e.g.

"A", "B", "C
D"
"Q", "W", "E
R"

Spark has supported the multiLine option since 2.2 (see https://issues.apache.org/jira/browse/SPARK-19610). Should SMV also add it to csvAttributes?
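
To make the problem concrete, here is a minimal sketch using only Python's standard `csv` module (not SMV or Spark). The helper names `parse_naive` and `parse_quote_aware` are made up for illustration. It compares splitting the sample above on every line break with a quote-aware parse; `skipinitialspace=True` is needed only because the sample has a space after each comma.

```python
import csv
import io

# The sample from this issue: two records, each with a line break in the third field.
SAMPLE = '"A", "B", "C\nD"\n"Q", "W", "E\nR"\n'


def parse_naive(text):
    """Treat every physical line as a record (breaks on quoted line breaks)."""
    return [line.split(",") for line in text.splitlines()]


def parse_quote_aware(text):
    """Let the csv module track quotes so quoted line breaks stay in the field."""
    reader = csv.reader(io.StringIO(text, newline=""), skipinitialspace=True)
    return list(reader)


naive_rows = parse_naive(SAMPLE)
aware_rows = parse_quote_aware(SAMPLE)

print(len(naive_rows))  # number of physical lines
print(aware_rows)       # logical records
```

The naive version sees four physical lines, while the quote-aware version returns the two intended records with the line break preserved inside the third field.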

@ninjapapa
Contributor

This one is interesting. We need to understand how they implement it. We tested the Spark CSV reader in the past and found its performance quite low compared with our version; it may be worth benchmarking again.

@ninjapapa
Contributor

@yw-yang Before we add support for this, you can use SmvInputBase to call Spark's native CSV reader. You can refer to the SmvXmlFile implementation here:
https://github.com/TresAmigosSD/SMV/blob/master/src/main/python/smv/smvinput.py#L182

Basically, you can create a lib in your project and define a SmvSparkCsvFile class that extends SmvInputBase or SmvInputFromFile in the same way SmvXmlFile does. Your project's input modules can then extend SmvSparkCsvFile instead of SmvCsvFile.
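
As a rough sketch of the reading step such a class could delegate to (assuming a Spark 2.2+ session is available): the function names `multiline_csv_options` and `read_multiline_csv` are hypothetical, and how the result is wired into SmvInputBase is deliberately left out, since that depends on the SMV version in use. Only Spark's `format`, `option`, and `load` reader calls and the `header`, `sep`, `quote`, and `multiLine` CSV options are relied on.

```python
def multiline_csv_options(header=False, delimiter=",", quote='"'):
    """Build Spark CSV reader options with multiLine turned on.

    Hypothetical helper; the option names are Spark's own.
    """
    return {
        "header": "true" if header else "false",
        "sep": delimiter,
        "quote": quote,
        "multiLine": "true",  # available since Spark 2.2 (SPARK-19610)
    }


def read_multiline_csv(spark, path, **kwargs):
    """Read `path` with Spark's native CSV reader instead of SMV's own.

    `spark` is expected to be a SparkSession; returns whatever load() returns
    (a DataFrame when a real session is passed).
    """
    reader = spark.read.format("csv")
    for key, value in multiline_csv_options(**kwargs).items():
        reader = reader.option(key, value)
    return reader.load(path)
```

A hypothetical SmvSparkCsvFile would call `read_multiline_csv` with its own session and file path, for example `read_multiline_csv(spark, "data/input.csv", header=True)`.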

@AliTajeldin
Contributor

Just FYI: they use a non-splittable stream reader (basically a binary file handler rather than the standard Hadoop text file reader) to stream the data. This means the entire CSV file has to be read by a single task, which is why this option is off by default.
See the WholeFileCSVDataSource class in the PR (https://github.com/apache/spark/pull/16976/files#diff-336e0745f97628c82024b2d731ac0166R175)
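
A small illustration of why the file cannot simply be split at line breaks (plain Python, not Spark's actual code; `record_boundaries` is a hypothetical name): whether a line break ends a record depends on every quote character before it, so a task starting in the middle of the file has no way to tell without scanning from the beginning.

```python
def record_boundaries(text, quote='"'):
    """Return offsets just past each line break that really ends a record.

    A line break inside an open quoted field is skipped. Doubled quotes ("")
    toggle the state twice, so they leave it unchanged.
    """
    offsets = []
    in_quotes = False
    for i, ch in enumerate(text):
        if ch == quote:
            in_quotes = not in_quotes
        elif ch == "\n" and not in_quotes:
            offsets.append(i + 1)
    return offsets


SAMPLE = '"A", "B", "C\nD"\n"Q", "W", "E\nR"\n'
all_line_breaks = [i + 1 for i, ch in enumerate(SAMPLE) if ch == "\n"]
safe = record_boundaries(SAMPLE)
```

For the sample in this issue, only half of the line breaks are real record boundaries, and finding them requires the quote state carried over from the start of the file.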
