Data partitioning #273

Zarquan (Maintainer) · 2020-12-09T14:13:33Z

It would be useful to describe how we think Spark, Parquet and HDFS work together to partition the data in a catalog.
If we write down what we think happens, it might help in figuring out what is actually happening under the hood.

Zarquan (Maintainer, Author) · 2020-12-09T14:22:26Z

I think they are three (possibly four) different things:

- How Spark partitions data in an RDD in memory, distributed across the worker nodes.
- How data is stored in a collection of Parquet files on disc.
  - Does Spark influence the way that the data is stored in the Parquet files ?
  - How can we influence the way that the data is stored in the Parquet files ?
- How the Hadoop Distributed File System (HDFS) distributes a set of files across storage nodes (which are not necessarily the same machines as the Spark worker nodes).
- How a Yarn scheduler manages the loading of data from storage to collect related data on the same Spark worker node.
  - Depending on how the data is indexed ?
  - Depending on how the data is stored ?
- How a Kubernetes scheduler manages the loading of data from storage to collect related data on the same Spark worker node.
  - Depending on how the data is indexed ?
  - Depending on how the data is stored ?
- How AXS adds to that process with spatial indexes.
  - Does this depend on Yarn and HDFS ?
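
To make the HDFS item in the list above a bit more concrete, here is a rough plain-Python sketch of how one file ends up as several blocks on several storage nodes. It assumes the Hadoop defaults of a 128 MiB block size and a replication factor of 3 are unchanged. The node names and the placement rule (replicas on consecutive nodes) are invented for illustration; real HDFS uses a rack-aware placement policy, so this is only a model of the idea, not of the actual algorithm.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed: Hadoop default block size (128 MiB)
REPLICATION = 3                 # assumed: Hadoop default replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes in bytes of the blocks a file of `file_size` bytes occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_blocks(num_blocks, nodes, replication=REPLICATION):
    """Toy placement: replicas of block i go to `replication` consecutive nodes.

    This is NOT the real HDFS policy (which is rack-aware); it only shows
    that the blocks of a single file are spread over several machines.
    Assumes replication <= len(nodes) so the replicas land on distinct nodes.
    """
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# Hypothetical example: one 300 MiB Parquet file on four storage nodes.
nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_blocks(len(blocks), nodes)
```

The point of the sketch is that a Spark task reading one block may or may not run on a machine that holds a replica of it, which is where the scheduler questions in the list come in.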


stvoutsin (Maintainer) · 2020-12-09T19:33:48Z

Very interesting questions to figure out.
I agree that there is indeed a distinction between the Parquet partitions in HDFS and how those are distributed to each executor in Spark. I'm not sure how the partitioning in HDFS affects performance when reading into a DataFrame in Spark and running some analysis on the data.

To add to this, it would be useful to understand how Spark's "repartition" and "partitionBy" work, and whether these could help in any way. It seems that repartition(x) redistributes the original partitions into x partitions in memory (round-robin?) when an action is called on a DataFrame. "partitionBy", on the other hand, seems to be used to partition the data on disk by one or more columns. This is based on reading some documentation, however, not on any actual use of them.
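
As a way of writing down the difference described above, here is a small plain-Python model (not Spark code). It assumes that repartition(x) without columns deals rows out round-robin, and that partitionBy on the writer produces one `column=value` sub-directory per distinct value, which is how the Spark documentation describes them. The column names, the base path and the helper functions are hypothetical.

```python
from collections import defaultdict


def round_robin_repartition(rows, num_partitions):
    """Toy model of repartition(x) with no columns: deal rows out in turn."""
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        partitions[index % num_partitions].append(row)
    return partitions


def partition_by_layout(rows, column, base_path):
    """Toy model of write.partitionBy(column): one directory per distinct value."""
    layout = defaultdict(list)
    for row in rows:
        layout[f"{base_path}/{column}={row[column]}"].append(row)
    return dict(layout)


# Hypothetical catalog rows; "zone" stands in for whatever column we partition on.
rows = [
    {"source_id": 1, "zone": 10},
    {"source_id": 2, "zone": 11},
    {"source_id": 3, "zone": 10},
    {"source_id": 4, "zone": 12},
    {"source_id": 5, "zone": 11},
]

in_memory = round_robin_repartition(rows, 2)                  # balanced, ignores content
on_disc = partition_by_layout(rows, "zone", "/data/catalog")  # grouped by value
```

In this model the in-memory partitions come out evenly sized regardless of content, while the on-disc layout keeps rows with the same value together but can be very uneven if the values are skewed.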


Zarquan (Maintainer, Author) · 2020-12-09T20:05:58Z

Yep - a big distinction is between actions on data in memory and actions on data on disc.

If we do a repartition() or partitionBy(), does this mean we end up storing a new copy on disc ?
If so, do we need to generate and store multiple copies of the data ?
If not, then how can we arrange the data on disc to minimise the in-memory shuffling needed to run a repartition() or partitionBy() ?
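
One possible angle on the last question, sketched in plain Python rather than Spark: if the files on disc are laid out in `column=value` directories (the layout the writer's partitionBy produces), then a query that filters on that column only needs to open the matching directories. The paths, file names and the `zone` column below are hypothetical, and the helpers are a model of the idea, not Spark's own implementation.

```python
def partition_value(path, column):
    """Return the value of `column` encoded in a column=value path, or None."""
    for segment in path.split("/"):
        key, separator, value = segment.partition("=")
        if separator and key == column:
            return value
    return None


def prune(paths, column, wanted_values):
    """Keep only the files whose directory matches one of the wanted values."""
    wanted = {str(value) for value in wanted_values}
    return [path for path in paths if partition_value(path, column) in wanted]


# Hypothetical on-disc layout of a catalog partitioned by "zone".
paths = [
    "/data/catalog/zone=10/part-0000.parquet",
    "/data/catalog/zone=10/part-0001.parquet",
    "/data/catalog/zone=11/part-0000.parquet",
    "/data/catalog/zone=12/part-0000.parquet",
]

to_read = prune(paths, "zone", [10, 12])
```

As far as the API goes, repartition() on a DataFrame is a transformation that moves data between executors and does not by itself store a new copy of the dataset, whereas partitionBy() on the writer only has an effect at the point where the data is written out. So a layout like the one above would be a choice made once, when the copy on disc is created.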


Zarquan (Maintainer, Author) · 2020-12-15T17:40:08Z

A useful article on indexing Parquet files
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/


Zarquan (Maintainer, Author) · 2021-02-19T17:45:59Z

Running Spark from Zeppelin on the current Yarn-Hadoop deployment stores Spark temporary data on the Zeppelin node.

I think this is because the Spark interpreter in Zeppelin is acting as the Spark master.
This means /var/spark/temp on the Zeppelin node needs to be mounted on a local disc.
The Hadoop master node doesn't use /var/spark/temp.
The Hadoop worker nodes don't use /var/spark/temp.
