PySpark Arrow Stream Serializer #3

Open
wants to merge 3 commits into base: pandas-udf-integration

Conversation

BryanCutler

Enable UDF evaluation with Arrow, using the stream format to load data as Pandas Series; modified PythonRDD to support this while maintaining backwards compatibility.
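
For reference, a rough, hypothetical sketch of the pattern (assuming a recent pyarrow stream IPC API; this is an illustration, not the code in this PR): read Arrow record batches from a stream, apply the Python function to each batch as a Pandas Series, and write the results back out as an Arrow stream. The actual wiring here goes through PythonRDD and the worker-side serializer; the names below (eval_on_arrow_stream, source, func) are made up for illustration.

import pyarrow as pa

def eval_on_arrow_stream(source, func):
    # Read record batches from an Arrow stream, apply func to each batch's
    # first column as a Pandas Series, and serialize the results back out
    # as a new Arrow stream.
    reader = pa.ipc.open_stream(source)
    sink = pa.BufferOutputStream()
    writer = None
    for batch in reader:
        pdf = batch.to_pandas()                      # Arrow batch -> pandas DataFrame
        result = func(pdf[pdf.columns[0]])           # Pandas Series in, Series out
        out = pa.RecordBatch.from_arrays(
            [pa.Array.from_pandas(result)], ["result"])
        if writer is None:
            writer = pa.ipc.new_stream(sink, out.schema)
        writer.write_batch(out)
    if writer is not None:
        writer.close()
    return sink.getvalue()                           # Arrow stream bytes (pyarrow Buffer)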

@BryanCutler force-pushed the pandas-udf-integration branch from f92d865 to e54cd16 on May 22, 2017 21:52
@BryanCutler force-pushed the pandas-udf-integration branch from e54cd16 to 0f294c2 on May 22, 2017 22:05
@BryanCutler
Author

@icexelloss, here is what I had so far. Feel free to use what you like, and let me know if you have any questions.

@BryanCutler force-pushed the pandas-udf-integration branch from 0f294c2 to 45db636 on May 22, 2017 22:11
@icexelloss
Owner

Thanks Bryan! This is quite a bit of a change. I will take a look this week.

@BryanCutler
Author

Sure, no problem!

@icexelloss
Owner

Bryan,

You mentioned a ~2.5x speed-up comparing this with the original UDF method. How did you run the experiments? I am trying to reproduce your results.

@BryanCutler
Author

I was basically using the code below, just manually turning Arrow on/off by commenting out a couple of lines (I left a note in the code as to what needs to be commented out).

import timeit
import pandas as pd
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

nrows = 1 << 24  # ~16.8 million rows
df = spark.range(0, nrows, 1, 4).toDF("a").cache()  # 4 partitions, cached so generation is excluded from timing
is_odd = udf(lambda n: n % 2, LongType())
odd = df.withColumn("is_odd", is_odd(col("a")))
flt_odd = odd.filter("is_odd == 1")

# Run the count 10 times and summarize the timings
t = timeit.repeat(lambda: flt_odd.count(), repeat=10, number=1)
time_df = pd.Series(t)
print(time_df.describe())

icexelloss pushed a commit that referenced this pull request on Aug 15, 2019
* null values properly returned

* create joined row object