"This Project has been archived by the owner, who is no longer providing support. The project remains available to authorized users on a "read only" basis."
This is a POC (Proof of Concept) of Near Data Processing (NDP), on Spark using pushdowns. This includes a demonstration of pushdown on Spark of filter, project and aggregate, for both HDFS, and S3.
NDP is a technique where processing of data is pushed closer to the source of the data in an attempt to leverage locality and limit the a) data transfer and b) processing cycles needed across the set of data.
Our approach will consider use cases on Spark, and specifically focus on the use cases where Spark is processing data using SQL (Structured Query Language). This is typical of data stored in a tabular format, and now very popular in Big Data applications and analytics. Today, these Spark cases use SQL queries, which operate on tables (see background for more detail). These cases require reading very large tables from storage in order to perform relational queries on the data.
Our approach will consist of pushing down portions of the SQL query to the storage itself so that the storage can perform the operations of filter, project, and aggregate and thereby limit the data transfer of these tables back to the compute node. This limiting of data transfer also has a secondary benefit of limiting the amount of data needing to be processed on the compute node after the fetch of data is complete.
This repo has everything needed to demonstrate this capability on a single machine using dockers. Everything from the build to the servers, to running the test are all utilizing dockers.
For more information, see the Design Specification
Spark - The version of spark we build in this repo is a patch which contains
the support for Aggregate pushdown in Spark. apache/spark#29695
pushdown-datasource - This is a new Spark V2 datasource, which has support for
pushdown of filter, project and aggregate.
This data source can operate against either HDFS or S3.
dikeCS - A new S3 server, built on the POCO C++ Libraries,
but supporting S3-select API, which supports SQL Query pushdown, of filter, project and aggregate.
This server utilizes SQLite for the query engine.
dikeHDFS - This contains our NDP server for HDFS.
This server utilizes a proxy in front of HDFS.
This also contains use of SQLite as the query engine.
TPC-H - This is a fork of https://github.com/ssavvides/tpch-spark,
which is: TPC-H queries implemented in Spark using the DataFrames API.
We modified this:
* to support our data source for S3, HDFS
* to support a variety of data formats (csv, tbl)
* to support a variety of testing modes such as selecting the
pushdown used (with or without pushdown, project, filter, aggregate),
selecting the number of partitions, or spark workers, etc.
git git clone https://github.com/futurewei-cloud/caerus-dike.git
cd caerus-dike
git submodule init
git submodule update --recursive --progress
# Alternatively you can update specific submodules only
# git submodule update dikeCS pushdown-datasource
docker network create dike-net
./build.sh
In case you want to delete all the artifacts.
./clean.sh
First, bring up all the server code (hdfs, s3, spark)
./start.sh
Next, run the demo script, and follow the prompts.
./demo.sh
Finally, bring down the servers.
./stop.sh