Dataproc Scala Examples helps you create Spark jobs written in Scala that run on Dataproc. Google provides pre-implemented Spark jobs and technical guides for running them on GCP.
This guide is based on the WordCount ETL example with common sources and sinks (Kafka, GCS, BigQuery, etc.).
It is intended to speed up your development of Spark jobs written in Scala for Dataproc.
It demonstrates how to run Spark jobs using Dataproc Submit, Dataproc Serverless, and Dataproc Workflows, and how to orchestrate them with Cloud Composer.
If you are looking to use Dataproc Templates, please refer to this repository.
See the quickstart documentation to get started.
Scala = 2.12.14
Spark = 3.1.2
sbt = 1.6.1
Python = 3.8.12
Airflow = 2.2.3
Composer = composer-2.0.6-airflow-2.2.3
Dataproc = 2.0-debian10
Note: if you use Dataproc Serverless (described in the guides as one of the options for running jobs), recompile the jobs against Spark 3.2.0.
- This guide uses Parquet as the format for data in GCS.
- The jobs in this guide run the main class, although Dataproc lets you specify a different class to run.
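To illustrate how the versions above could be pinned, here is a minimal `build.sbt` sketch. It is not the repository's actual build file: the project name and the `sparkVersion` value name are assumptions made for the example, and only the Spark SQL dependency is shown.

```scala
// build.sbt (illustrative sketch, not the repository's actual build file)
// The sbt version itself (1.6.1) is set in project/build.properties.

name := "dataproc-scala-examples-sketch" // hypothetical project name
scalaVersion := "2.12.14"

// Assumption: change this to "3.2.0" before compiling for Dataproc Serverless.
val sparkVersion = "3.1.2"

libraryDependencies ++= Seq(
  // "provided" keeps Spark out of the packaged jar, since Dataproc supplies it at runtime.
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
)
```

Keeping the Spark version in a single value means that switching between the cluster image and Dataproc Serverless is a one-line change followed by a recompile.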
Follow the setup instructions for installing, testing and compiling the project.
- Create Mock Dataset
  - Creates input and output mock WordCount datasets in GCS and BQ to use in other examples
- Streaming - Kafka to GCS
  - Runs a Spark Structured Streaming WordCount example from Kafka to GCS
- Batch - GCS to GCS
  - Runs a Spark WordCount example from GCS to GCS
  - Appendix: Load from GCS to BQ
  - Appendix: Create BQ External table pointing to GCS data
- Batch - GCS to BQ
  - Runs a Spark WordCount example from GCS to BQ
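As a rough idea of what a batch GCS to GCS WordCount job can look like, here is a minimal sketch. It is not the code shipped in this repository: the object name `WordCountSketch`, the input column name `text`, and the argument order are assumptions made for illustration. It reads Parquet, counts words, and writes Parquet, which matches the data format used in this guide.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, lower, split}

// Hypothetical object name; the repository's jobs may be organized differently.
object WordCountSketch {

  // Splits the given text column on whitespace and counts each lowercased word.
  def wordCount(lines: DataFrame, textColumn: String): DataFrame =
    lines
      .select(explode(split(lower(col(textColumn)), "\\s+")).as("word"))
      .filter(col("word") =!= "") // drop empty tokens from leading whitespace
      .groupBy("word")
      .count()

  def main(args: Array[String]): Unit = {
    require(args.length == 2, "usage: WordCountSketch <inputPath> <outputPath>")
    val inputPath  = args(0) // a gs:// path to Parquet input (assumed)
    val outputPath = args(1) // a gs:// path for Parquet output (assumed)

    val spark = SparkSession.builder().appName("WordCountSketch").getOrCreate()
    try {
      // Assumption: the input Parquet data has a string column named "text".
      val lines = spark.read.parquet(inputPath)
      wordCount(lines, "text").write.mode("overwrite").parquet(outputPath)
    } finally {
      spark.stop()
    }
  }
}
```

Keeping the transformation in `wordCount` separate from the I/O in `main` lets the same logic be exercised locally on a small in-memory DataFrame before the job is submitted to Dataproc.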
This part of the guide provides example DAGs to run on Cloud Composer to orchestrate the jobs from the section above.
A) Batch - Dataproc Submit - Creating and Deleting Cluster
B) Batch - Dataproc Workflow
C) Batch - Dataproc Serverless
D) Load from GCS to BQ
- Spark to Dataproc
- BigQuery Write API
- BigQuery External Tables
- Dataproc Serverless
- Dataproc Workflows
- Spark BigQuery Connector
- Data Lake on GCS Architecture
GCP = Google Cloud Platform
GCS = Google Cloud Storage
BQ = BigQuery
DAG = Directed Acyclic Graph
See the contributing instructions to get started contributing.
All solutions within this repository are provided under the Apache 2.0 license. Please see the LICENSE file for more detailed terms and conditions.
This repository and its contents are not an official Google Product.