
Explore the possibility of removing the Spark and Hadoop dependencies #17

Open
kpto opened this issue Oct 25, 2024 · 2 comments
Labels: enhancement (New feature or request)

@kpto
Collaborator

kpto commented Oct 25, 2024

Currently the project uses pyspark, which depends on Spark and Hadoop, dependencies that do not run natively on Windows and make the script unfriendly to Windows users; see #13 for details. It seems that pyspark is used only for reading parquet files rather than for actual distributed computing. To make the script more OS-agnostic, I suggest finding an alternative, cross-platform solution for reading parquet files.

To my knowledge, pyspark uses pyarrow under the hood to read parquet files, so pandas could perhaps replace pyspark, as it can optionally use pyarrow to read parquet as well. However, the code uses pyspark SQL-style queries to pre-process the data, which makes the substitution less straightforward. Further investigation is needed.
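
For illustration, a minimal sketch of what the substitution might look like, assuming a hypothetical nodes.parquet input file (the actual file names in the project may differ); pandas delegates the reading to pyarrow when engine="pyarrow" is passed:

```python
import pandas as pd

# pyspark equivalent: spark.read.parquet("nodes.parquet")
# "nodes.parquet" is a hypothetical placeholder for one of the input files.
df = pd.read_parquet("nodes.parquet", engine="pyarrow")
print(df.head())
```

Note that pd.read_parquet loads the whole file into memory eagerly, so this only addresses portability, not memory use.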

@laulopezreal
Contributor

Hi @kpto,

This is a very valid point. Have you come across Polars? https://docs.pola.rs/
It is a pretty good alternative to pandas, written in Rust, and it is gaining traction within the Python data science community.
It is open source (though its creator has founded a start-up to commercialise products built on top of the library).

Not sure how it compares to pyspark, but it claims up to 10x speed-ups over pandas for big data processing in Python.

This is a very cool podcast episode in which they interview the co-founder; worth watching or listening to (it is also available for free on Spotify):
https://www.youtube.com/watch?v=TZeTO-CJPns

@kpto
Collaborator Author

kpto commented Nov 7, 2024

@laulopezreal Hi, thank you for your input, and apologies for not responding promptly; I missed your comment here.

Our need is actually simple. Since BioCypher processes data row by row, we just need a package that can stream rows into the Python runtime without loading the whole dataset into memory. It turns out to be surprisingly difficult to find one. I have tested Polars, but its streaming feature is still experimental and it is unclear to me how to truly iterate rows one by one; please see #25, where a linked issue also shows that Polars is not ready for this. Their streaming engine is still improving and I look forward to its development.
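
For context, a minimal sketch of the kind of Polars attempt described above, with nodes.parquet and process() as hypothetical placeholders; collect(streaming=True) engages the experimental streaming engine, but it still materializes a complete DataFrame before iter_rows can run, so it does not give true row-by-row streaming of a larger-than-memory file:

```python
import polars as pl

# "nodes.parquet" and process() are hypothetical placeholders.
lazy = pl.scan_parquet("nodes.parquet")

# Experimental streaming engine; the result is still a fully
# materialized DataFrame, so memory use scales with the dataset.
df = lazy.collect(streaming=True)
for row in df.iter_rows(named=True):
    process(row)  # row is a dict mapping column names to values
```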

In the end, duckdb does this best. Its memory control is excellent and the streaming overhead is reasonable, so I plan to use duckdb. I also tried pyarrow, but it lacks query operators and its memory behaviour is not comprehensible to me.
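
A minimal sketch of the duckdb approach, again with nodes.parquet, the score filter column, and process() as hypothetical placeholders; fetch_record_batch returns a pyarrow RecordBatchReader, so only one batch at a time is held in memory, and SQL-style pre-processing happens before rows ever reach Python:

```python
import duckdb

# The query runs lazily; nothing is materialized yet.
rel = duckdb.sql(
    "SELECT * FROM read_parquet('nodes.parquet') WHERE score > 0.5"
)

# Stream the results as Arrow record batches of bounded size.
reader = rel.fetch_record_batch(rows_per_batch=10_000)
for batch in reader:
    for row in batch.to_pylist():  # each row is a plain dict
        process(row)
```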
