Replies: 5 comments
-
I think they are three (possibly four) different things:
-
Very interesting questions to figure out. To add to this, it would be useful to understand how Spark's "repartition" and "partitionBy" work, and whether they could help in any way. It seems that repartition(x) redistributes the rows of the original partitions into x partitions in memory (round-robin?) when an action is called on a dataframe. "partitionBy", on the other hand, seems to be used to partition the data on disk by one or more columns. This is from reading some documentation, however, not from any actual use of them.
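To make the distinction concrete, here is a toy sketch in plain Python (not Spark itself) of the two behaviours as described above. The function names `repartition_round_robin` and `partition_by` and the sample `band` column are made up for illustration; the real calls would be `df.repartition(x)` and `df.write.partitionBy(...)`. Spark's actual round-robin works per input partition and involves a shuffle, so treat this only as a rough mental model.

```python
from collections import defaultdict
from itertools import chain


def repartition_round_robin(partitions, n):
    """Toy stand-in for df.repartition(n): deal every row out to n
    new in-memory partitions in turn (simplified, no shuffle)."""
    new_parts = [[] for _ in range(n)]
    for i, row in enumerate(chain.from_iterable(partitions)):
        new_parts[i % n].append(row)
    return new_parts


def partition_by(rows, column):
    """Toy stand-in for df.write.partitionBy(column): group rows into
    one 'column=value' directory per distinct value of the column."""
    layout = defaultdict(list)
    for row in rows:
        layout[f"{column}={row[column]}"].append(row)
    return dict(layout)


# Hypothetical sample data: six rows alternating between two band values.
rows = [{"id": i, "band": "gr"[i % 2]} for i in range(6)]

# Start from two uneven in-memory partitions (5 rows and 1 row).
original = [rows[:5], rows[5:]]

# In memory: rows are spread evenly, regardless of their contents.
balanced = repartition_round_robin(original, 3)
print([len(p) for p in balanced])

# On disk: rows are grouped by column value, one directory per value.
on_disk = partition_by(rows, "band")
print(sorted(on_disk))
```

The point of the sketch is that the in-memory repartition balances partition sizes without looking at the data, while the on-disk layout depends entirely on the values in the chosen column, so a skewed column would give skewed directories.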
-
Yep - a big distinction is between actions on data in memory and actions on data on disk.
-
A useful article on indexing Parquet files
-
Running Spark from Zeppelin on the current YARN-Hadoop deployment stores Spark's temporary data on the Zeppelin node.
-
It would be useful to describe how we think Spark, Parquet and HDFS work together to partition the data in a catalog.
If we write down what we think happens, it might help in figuring out what is actually happening under the hood.