Why use qsv instead of a "proper" python data analysis library like pandas? #15
jqnatividad
started this conversation in
FAQ
We need blazing-fast performance because it's essential to a Resource-first upload workflow.
qsv is simply much faster than anything we can write in native Python, even if we use a library like pandas.
Python's infamous Global Interpreter Lock keeps us from doing any serious multithreading. With its garbage collection, we also cannot minimize or optimize memory allocations the way Rust can with its much-heralded borrow checker. And as an interpreted language, it's just not fair to compare it with a statically typed, compiled language that even lets you exploit CPU-level features (SIMD instruction sets such as SSE and AVX) and add handwritten assembly to squeeze more performance from your code.
By some benchmarks, Rust is up to 300x faster than Python for the equivalent program.
Comparing with pandas directly, pandas took 12 seconds to calculate statistics that qsv computed in less than 0.16 seconds (75x faster) for a 124MB, 2.7M row CSV file.[1] Even with the `qsv stats --everything` option, which calculates even more comprehensive statistics (mode, cardinality, median, nullcount, quartiles, skew, upper fence, lower fence), qsv took only 2 seconds (6x faster).
Also, pandas loads the entire CSV into memory by default, which can be a problem with large files. In contrast, qsv does streaming analysis for most commands, so its memory footprint is constant even with arbitrarily large multi-gigabyte CSV files.[2]
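To make the streaming point concrete, here is a minimal Python sketch of single-pass statistics over a CSV column. It is not qsv's actual implementation; the function name `streaming_stats` and the sample column `value` are made up for illustration. It uses Welford's online algorithm, so only a few running values are held in memory no matter how many rows go by.

```python
import csv
import io
import math


def streaming_stats(rows, column):
    """Single-pass count/mean/stddev/min/max/nullcount for one numeric column.

    `rows` is any iterable of dicts (for example csv.DictReader), so rows are
    consumed one at a time and never accumulated in a list.
    """
    count = 0          # non-empty values seen so far
    nullcount = 0      # empty or missing cells
    mean = 0.0         # running mean (Welford)
    m2 = 0.0           # running sum of squared deviations (Welford)
    lo = math.inf
    hi = -math.inf

    for row in rows:
        raw = row.get(column)
        if raw is None or raw.strip() == "":
            nullcount += 1
            continue
        x = float(raw)
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        lo = min(lo, x)
        hi = max(hi, x)

    # Sample standard deviation; defined only for two or more values.
    stddev = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return {
        "count": count,
        "nullcount": nullcount,
        "mean": mean if count else None,
        "stddev": stddev,
        "min": lo if count else None,
        "max": hi if count else None,
    }


# Usage with an in-memory sample; for a real file, pass csv.DictReader(open(path, newline="")).
sample = io.StringIO("id,value\n1,2\n2,4\n3,\n4,6\n")
result = streaming_stats(csv.DictReader(sample), "value")
```

Statistics such as median, mode, and quartiles cannot be computed exactly this way because they need to see all values at once, which is one plausible reason the more comprehensive statistics cost more time.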
Finally, qsv has its roots in xsv, which was written by Andrew Gallant (@BurntSushi) of ripgrep fame. If you're not familiar with ripgrep, you owe it to yourself to install it on your computer and start using it instead of grep.
If you've used Visual Studio Code and wonder how its "Find in Files" feature is so blazingly fast... Yep! It's calling a ripgrep binary from VSC, much the same way we're calling qsv from DP+.
Footnotes
[1] qsv whirlwind tour
[2] Running `qsv stats` on a CSV export of all of NYC's 311 data from 2010 to Mar 2022 (27.8M rows, 16 GB) took 22.4 seconds, and its memory footprint remained the same, though it did pin all 16 logical processors near 100% utilization on my Ryzen 7 4800H laptop with 32 GB of memory and a 1 TB SSD.