Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adopt pyarrow and using it for pure python side hdfs handling #1430

Open
ninjapapa opened this issue Oct 25, 2018 · 4 comments
Open

Adopt pyarrow and using it for pure python side hdfs handling #1430

ninjapapa opened this issue Oct 25, 2018 · 4 comments

Comments

@ninjapapa
Copy link
Contributor

pyarrow has a solid hdfs interface, we can easily use it to replace our current Scala HDFS interface.
Some relevant links:

An obvious side benefit is that we can turn on
spark.sql.execution.arrow.enabled=true
It significantly improved the toPanda performance of Spark DF.

https://arrow.apache.org/blog/2017/07/26/spark-arrow/

@AliTajeldin
Copy link
Contributor

I think arrow integration is only in latest versions of spark. Should we make that support optional to maintain backward compatibility with older versions of spark. That is, fall back to scala hdfs or some other hdfs library (the toPandas will fall back automatically).

@ninjapapa
Copy link
Contributor Author

The hdfs part does not need spark integration. So we can add pyarrow to requirement.txt and use it purely for hdfs api for now. Turn on spark.sql.execution.arrow.enabled or not is a cluster config choice. SMV does not need to change anything.

@AliTajeldin
Copy link
Contributor

makes sense

@ninjapapa
Copy link
Contributor Author

Adding pyarrow python package is easy, however its hdfs interface used libhdfs.so or libhdfs3.so. The first one should be part of standard hadoop package, and exists on real cluster environment, and the second could be installed through condo, as in the 2nd link of the main message shows.

However, when we build SMV, none of those 2 libraries exist. To access HDFS fully from python, we may need to use another library, such as
https://hdfscli.readthedocs.io/en/latest/

Or for better performance we may do:

try:
    fs = pyarrow.hdfs.connect(...)
except:
    fs = hdfs.TokenClient(...)

For short term, will still use the SmvHDFS interface on Scala side.

Defer this issue.

@ninjapapa ninjapapa removed their assignment Oct 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants