Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workflow: CSV to parquet #841
Workflow: CSV to parquet #841
Changes from 29 commits
3a76080
5aedaca
e185e5d
95585d4
10a529b
4504020
96e66f5
0779b48
7fb6792
437eb0b
e4851df
2657885
6eb04d6
6f4c04e
222c695
fc40687
670e3cc
16b0277
b557bc5
ccedaf8
37120c3
b3cfbaa
ca7df93
cdca5de
6a93130
e2b070e
5b58925
a753aea
6e9fc7f
db73e85
ed1d9e3
e3145c6
19606da
e701cc1
530d35c
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes read_csv fall back to the python engine. 2 things:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it's quite likely the average user (including me) would not know this.
Therefore, IMO, I'd think it ought to stay this way so when it changes in the future, the bump in performance would be evident historically because it's what the average user might expect.
The second option I think would be a slightly hacky work-around that the benchmarks aren't designed for. My understanding is these workflows should represent somewhat realistic use cases and not as a means for maximizing performance itself. Could be mistaken though. cc @fjetter @jrbourbeau
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally fine by me if this is intended. Just wanted to check.
read_csv
raises a warning that tells you that it will use the Python engine if you execute this code. That's why I suggested switching it explicitly. There aren't any plans to support regex or multi char seps in the c engine and I doubt that there ever will be, so we don't have to plan for that scenarioThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a follow-up ticket, we could parametrize this test with comma-separated and tab-separated files if we care about the performance of the different engines.