Adopt pyarrow and use it for pure Python-side HDFS handling #1430

ninjapapa · 2018-10-25T21:28:57Z

pyarrow has a solid HDFS interface, and we can easily use it to replace our current Scala HDFS interface.
Some relevant links:

An obvious side benefit is that we can turn on
spark.sql.execution.arrow.enabled=true
which significantly improves the toPandas performance of a Spark DataFrame.

https://arrow.apache.org/blog/2017/07/26/spark-arrow/
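As a rough illustration of how the Arrow setting could be applied only where the running Spark supports it, here is a minimal sketch. The helper names `arrow_conf_key` and `to_pandas_with_arrow` are hypothetical and not part of SMV or Spark. It assumes Arrow-based toPandas is available from Spark 2.3 onward and that Spark 3.x prefers the renamed key `spark.sql.execution.arrow.pyspark.enabled`.

```python
def arrow_conf_key(spark_version):
    """Return the Arrow toPandas config key for a Spark version, or None.

    Hypothetical helper: only the first two version components are used,
    so strings such as "2.3.1" or "2.4.0-SNAPSHOT" are accepted.
    """
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if (major, minor) < (2, 3):
        return None  # no Arrow integration; toPandas uses the regular path
    if major >= 3:
        return "spark.sql.execution.arrow.pyspark.enabled"
    return "spark.sql.execution.arrow.enabled"


def to_pandas_with_arrow(spark, df):
    """Enable Arrow on the given SparkSession when supported, then convert."""
    key = arrow_conf_key(spark.version)
    if key is not None:
        spark.conf.set(key, "true")
    return df.toPandas()
```

Setting the key per session like this is only one option; the same key can equally be set once in the cluster configuration.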


AliTajeldin · 2018-10-25T23:46:54Z

I think Arrow integration is only in the latest versions of Spark. Should we make that support optional to maintain backward compatibility with older versions of Spark? That is, fall back to the Scala HDFS interface or some other HDFS library (toPandas will fall back automatically).

ninjapapa · 2018-10-26T00:07:21Z

The HDFS part does not need Spark integration, so we can add pyarrow to requirement.txt and use it purely for its HDFS API for now. Whether to turn on spark.sql.execution.arrow.enabled is a cluster config choice; SMV does not need to change anything.

AliTajeldin · 2018-10-26T04:53:21Z

makes sense

ninjapapa · 2018-10-26T22:37:16Z

Adding the pyarrow Python package is easy. However, its HDFS interface uses libhdfs.so or libhdfs3.so. The first should be part of the standard Hadoop package and exist in a real cluster environment; the second can be installed through conda, as the 2nd link in the main message shows.
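To check ahead of time whether libhdfs.so is present on a machine, something like the sketch below could be used. It is not pyarrow's own lookup code; it only mirrors the locations pyarrow documents for the library (the `ARROW_LIBHDFS_DIR` environment variable, then `$HADOOP_HOME/lib/native`). The function name `find_libhdfs` and its `names` parameter are hypothetical.

```python
import os


def find_libhdfs(env=None, names=("libhdfs.so",)):
    """Return the path of the first matching library file, or None.

    env:   mapping of environment variables (defaults to os.environ).
    names: library file names to look for in each candidate directory.
    """
    env = os.environ if env is None else env

    # Candidate directories, in the order pyarrow documents them.
    candidates = []
    if env.get("ARROW_LIBHDFS_DIR"):
        candidates.append(env["ARROW_LIBHDFS_DIR"])
    if env.get("HADOOP_HOME"):
        candidates.append(os.path.join(env["HADOOP_HOME"], "lib", "native"))

    for directory in candidates:
        for name in names:
            path = os.path.join(directory, name)
            if os.path.isfile(path):
                return path
    return None
```

A build or test environment where this returns None would be the case described here, in which pyarrow itself installs fine but its HDFS interface cannot load the native library.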

However, when we build SMV, neither of those two libraries exists. To access HDFS fully from Python, we may need to use another library, such as
https://hdfscli.readthedocs.io/en/latest/

Or for better performance we may do:

import hdfs
import pyarrow

try:
    # Needs libhdfs.so or libhdfs3.so at runtime
    fs = pyarrow.hdfs.connect(...)
except Exception:
    fs = hdfs.TokenClient(...)
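The same try-then-fall-back idea can be written so that it does not depend on either package being importable at module load time. This is only a sketch: `first_working_client` is a hypothetical helper, and the factories passed to it are assumed to be zero-argument callables that either return a client or raise.

```python
def first_working_client(factories):
    """Return (name, client) from the first factory that succeeds.

    factories: iterable of (name, zero-argument callable) pairs, tried in
    order. Any exception from a factory (missing package, missing native
    library, failed connection) moves on to the next one.
    """
    errors = {}
    for name, factory in factories:
        try:
            return name, factory()
        except Exception as exc:
            errors[name] = exc  # remember why this backend was skipped
    raise RuntimeError("no HDFS backend available: %r" % (errors,))
```

Here the first factory would import pyarrow and call pyarrow.hdfs.connect(...), and the second would import hdfs and build hdfs.TokenClient(...), each with whatever connection arguments the cluster needs. The two clients expose different method names, so callers would still need a thin adapter on top of whichever one is returned.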

For the short term, we will still use the SmvHDFS interface on the Scala side.

Defer this issue.

ninjapapa self-assigned this Oct 26, 2018

ninjapapa mentioned this issue Oct 26, 2018

Remove python call back server #1431

Closed

ninjapapa removed their assignment Oct 26, 2018
