Why use qsv instead of a "proper" python data analysis library like pandas? #15

jqnatividad · 2022-04-28T02:14:48Z

jqnatividad
Apr 28, 2022
Maintainer

The short answer is the need for blazing-fast performance, as it's essential to a Resource-first upload workflow.

qsv is simply much faster than anything we can write in native Python, even if we use a library like pandas.

With Python's infamous Global Interpreter Lock, we're limited in how much serious multithreading we can do. With its garbage collection, we also cannot minimize or optimize memory allocations the way Rust can with its much-heralded borrow checker. And as an interpreted language, it's just not fair to compare it with a statically typed, compiled language that even lets you exploit CPU-level features (SIMD extensions such as SSE and AVX) and add handwritten assembly to squeeze more performance from your code.

By some benchmarks, Rust is up to 300x faster than Python for the equivalent program.

Comparing with pandas directly, pandas took 12 seconds to calculate statistics that qsv computed in less than 0.16 seconds (75x faster) for a 124MB, 2.7M row CSV file.¹ Even with the qsv stats --everything option, which calculates more comprehensive statistics (mode, cardinality, median, nullcount, quartiles, skew, upper fence, lower fence), qsv took only 2 seconds (6x faster).

Also, pandas requires loading the entire CSV into memory, which can be a problem with large files. In contrast, qsv does streaming analysis for most commands, so its memory footprint stays constant even with arbitrarily large multi-gigabyte CSV files.²
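To make the streaming idea concrete, here is a minimal sketch of one-pass statistics in plain Python. It is not how qsv is implemented; the function name `streaming_stats` and the sample data are made up for illustration. It uses Welford's running update, so only a handful of numbers are held in memory no matter how many rows go by.

```python
import csv
import io
import math


def streaming_stats(rows, column):
    """One-pass count/min/max/mean/stddev for one column.

    `rows` is any iterable of dicts (e.g. csv.DictReader), so only the
    current row and a few running totals are kept in memory.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean (Welford)
    lo = math.inf
    hi = -math.inf
    for row in rows:
        raw = row.get(column)
        if raw is None or raw == "":
            continue  # skip empty cells
        x = float(raw)
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        lo = min(lo, x)
        hi = max(hi, x)
    if count == 0:
        return {"count": 0}
    return {
        "count": count,
        "min": lo,
        "max": hi,
        "mean": mean,
        "stddev": math.sqrt(m2 / count),  # population standard deviation
    }


# Hypothetical sample data; a real file would be opened with open(path, newline="").
sample = io.StringIO("id,value\n1,2\n2,4\n3,\n4,6\n")
result = streaming_stats(csv.DictReader(sample), "value")
print(result)
```

Statistics such as the median or quartiles generally can't be computed exactly from a few running totals like this, which fits the "for most commands" caveat above.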

Finally, qsv has its roots in xsv, which was written by Andrew Gallant (@BurntSushi) of ripgrep fame. If you're not familiar with ripgrep, you owe it to yourself to install it on your computer and start using it instead of grep.

If you've used Visual Studio Code and wondered how its "Find in Files" feature is so blazingly fast... Yep! It's calling a ripgrep binary from VS Code, much the same way we're calling qsv from DP+.
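For readers wondering what "calling a binary" from Python looks like in general, here is a rough sketch using the standard library's subprocess module. This is not DP+'s actual code: the helper names `build_stats_command` and `run_stats` are hypothetical, and it assumes a `qsv` executable is on the PATH. The only qsv specifics used are the `stats` command and the `--everything` option mentioned above.

```python
import shutil
import subprocess


def build_stats_command(csv_path, everything=False, binary="qsv"):
    """Build the argument list for `qsv stats` (hypothetical helper)."""
    cmd = [binary, "stats"]
    if everything:
        cmd.append("--everything")
    cmd.append(csv_path)
    return cmd


def run_stats(csv_path, everything=False):
    """Run qsv stats and return its CSV output as text.

    Returns None if no qsv binary is found on the PATH (an assumption
    of this sketch, not a statement about how DP+ locates qsv).
    """
    if shutil.which("qsv") is None:
        return None
    completed = subprocess.run(
        build_stats_command(csv_path, everything),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


print(build_stats_command("data.csv", everything=True))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.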

¹ qsv whirlwind tour

² Running qsv stats on a CSV export of all of NYC's 311 data from 2010 to Mar 2022 (27.8M rows, 16 GB) took 22.4 seconds, and its memory footprint remained the same, though it did pin all 16 logical processors near 100% utilization on my Ryzen 7 4800H laptop with 32 GB of memory and a 1 TB SSD.
