
Explore the possibility of removing the Spark and Hadoop dependencies #17

Open
kpto opened this issue Oct 25, 2024 · 2 comments
Labels: enhancement (New feature or request)

@kpto
Collaborator

kpto commented Oct 25, 2024

Currently the project uses pyspark, which depends on Spark and Hadoop, dependencies that do not run natively on Windows and make the script unfriendly to Windows users; see #13 for details. It seems that pyspark is used only for reading parquet files rather than for actual distributed computing. To make the script more OS-agnostic, I suggest finding an alternative, cross-platform solution for reading parquet files.

To my knowledge, pyspark uses pyarrow under the hood to read parquet files, so pandas could perhaps replace pyspark, as it can optionally use pyarrow to read parquet as well. However, the code uses pyspark SQL-style queries to pre-process the data, which makes the substitution less straightforward. Further investigation is needed.
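
For illustration, a minimal sketch of what the substitution might look like, assuming a hypothetical nodes.parquet input file (the actual file names in the project may differ); pandas delegates the reading to pyarrow when engine="pyarrow" is passed:

```python
import pandas as pd

# pyspark equivalent: spark.read.parquet("nodes.parquet")
# "nodes.parquet" is a hypothetical placeholder for one of the input files.
df = pd.read_parquet("nodes.parquet", engine="pyarrow")
print(df.head())
```

Note that pd.read_parquet loads the whole file into memory eagerly, so this only addresses portability, not memory use.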

@laulopezreal
Contributor

Hi @kpto,

This is a very valid point. Have you come across Polars? https://docs.pola.rs/
It is a pretty good alternative to pandas, written in Rust, and it is gaining traction within the Python data science community.
It is open source (though its creator has founded a start-up to commercialise products built on top of the library).

Not sure how it compares to pyspark, but it claims up to 10x speed-ups over pandas for big data processing in Python.

This is a very cool podcast episode in which they interview the co-founder; worth watching or listening to (it is also available for free on Spotify):
https://www.youtube.com/watch?v=TZeTO-CJPns

@kpto
Collaborator Author

kpto commented Nov 7, 2024

@laulopezreal Hi, thank you for your input, and apologies for not responding promptly; I missed your comment here.

Our need is actually simple. Since BioCypher processes data row by row, we just need a package that can stream rows into the Python runtime without loading the whole dataset into memory. It turns out to be surprisingly difficult to find one. I have tested Polars, but its streaming feature is still experimental and it is unclear to me how to truly iterate rows one by one; please see #25, where a linked issue also shows that Polars is not ready for this. Their streaming engine is still improving and I look forward to its development.
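
For context, a minimal sketch of the kind of Polars attempt described above, with nodes.parquet and process() as hypothetical placeholders; collect(streaming=True) engages the experimental streaming engine, but it still materializes a complete DataFrame before iter_rows can run, so it does not give true row-by-row streaming of a larger-than-memory file:

```python
import polars as pl

# "nodes.parquet" and process() are hypothetical placeholders.
lazy = pl.scan_parquet("nodes.parquet")

# Experimental streaming engine; the result is still a fully
# materialized DataFrame, so memory use scales with the dataset.
df = lazy.collect(streaming=True)
for row in df.iter_rows(named=True):
    process(row)  # row is a dict mapping column names to values
```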

In the end, duckdb does this best. Its memory control is excellent and the streaming overhead is reasonable, so I plan to use duckdb. I also tried pyarrow, but it lacks query operators and its memory behaviour is not comprehensible to me.
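
A minimal sketch of the duckdb approach, again with nodes.parquet, the score filter column, and process() as hypothetical placeholders; fetch_record_batch returns a pyarrow RecordBatchReader, so only one batch at a time is held in memory, and SQL-style pre-processing happens before rows ever reach Python:

```python
import duckdb

# The query runs lazily; nothing is materialized yet.
rel = duckdb.sql(
    "SELECT * FROM read_parquet('nodes.parquet') WHERE score > 0.5"
)

# Stream the results as Arrow record batches of bounded size.
reader = rel.fetch_record_batch(rows_per_batch=10_000)
for batch in reader:
    for row in batch.to_pylist():  # each row is a plain dict
        process(row)
```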
