-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adopt pyarrow and using it for pure python side hdfs handling #1430
Comments
I think arrow integration is only in latest versions of spark. Should we make that support optional to maintain backward compatibility with older versions of spark. That is, fall back to scala hdfs or some other hdfs library (the |
The hdfs part does not need spark integration. So we can add |
makes sense |
Adding However, when we build SMV, none of those 2 libraries exist. To access HDFS fully from python, we may need to use another library, such as Or for better performance we may do: try:
fs = pyarrow.hdfs.connect(...)
except:
fs = hdfs.TokenClient(...) For short term, will still use the SmvHDFS interface on Scala side. Defer this issue. |
pyarrow has a solid hdfs interface, we can easily use it to replace our current Scala HDFS interface.
Some relevant links:
An obvious side benefit is that we can turn on
spark.sql.execution.arrow.enabled=true
It significantly improved the
toPanda
performance of Spark DF.https://arrow.apache.org/blog/2017/07/26/spark-arrow/
The text was updated successfully, but these errors were encountered: