Please refer to the Dataproc Templates (Java - Spark) README for more information.
- BigQueryToGCS
- BigQueryToJDBC (blogpost link)
- CassandraToBigQuery
- CassandraToGCS (blogpost link)
- DataplexGCStoBQ (blogpost link)
- GCSToBigQuery (blogpost link)
- GCSToBigTable
- GCSToGCS (blogpost link)
- GCSToJDBC (blogpost link)
- GCSToSpanner (blogpost link)
- GCSToMongo (blogpost [link](https://medium.com/google-cloud/importing-data-from-gcs-to-mongodb-using-java-dataproc-serverless-6ff5c8d6f6d5))
- GeneralTemplate
- HBaseToGCS (blogpost link)
- HiveToBigQuery (blogpost link)
- HiveToGCS (blogpost link)
- JDBCToBigQuery (blogpost link)
- JDBCToGCS (blogpost link)
- JDBCToJDBC
- JDBCToSpanner
- KafkaToBQ (blogpost link)
- KafkaToBQDstream
- KafkaToGCS (blogpost link)
- KafkaToGCSDstream
- KafkaToPubSub
- MongoToBQ
- MongoToGCS (blogpost link)
- PubSubToBigQuery (blogpost link)
- PubSubToBigTable (blogpost link)
- PubSubLiteToBigTable (blogpost link) Deprecated and will be removed in Q1 2025
- PubSubToGCS (blogpost link)
- RedshiftToGCS Deprecated and will be removed in Q1 2025
- S3ToBigQuery (blogpost link)
- SnowflakeToGCS (blogpost link)
- SpannerToGCS (blogpost link)
- TextToBigquery Deprecated and will be removed in Q1 2025
- WordCount
...
- Java 8
- Maven 3
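Before building, you can confirm the toolchain with standard JDK and Maven commands (nothing here is specific to this repo):

```
java -version   # expect a 1.8.x runtime for Java 8
mvn -version    # expect Maven 3.x; also shows which JDK Maven resolves
```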
The Dataproc Templates (Java - Spark) support both serverless and cluster modes. By default, serverless mode is used. To run on Dataproc clusters, follow these steps:
The `JOB_TYPE` environment variable selects the mode:
- SERVERLESS (default): submits the job to Dataproc Serverless using the `batches submit spark` command.
- CLUSTER: submits the job to a Dataproc Standard cluster using the `jobs submit spark` command.
To run the templates on an existing cluster, you must set the `JOB_TYPE=CLUSTER` and `CLUSTER=<full clusterId>` environment variables. For example:

```
export JOB_TYPE=CLUSTER
export CLUSTER=${DATAPROC_CLUSTER_NAME}
```
Note: Certain templates may require a newer version of the Dataproc image. Before running a template, make sure your cluster's Dataproc image version includes the supported dependency versions listed in the pom.xml.
Some HBase templates that require a custom image to execute are not yet supported in CLUSTER mode.
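Putting the pieces together, a cluster-mode submission might look like the following sketch. The project, region, bucket, and cluster values are placeholders, and `<TEMPLATE TYPE>` stands in for any template listed above:

```
# Placeholder values; substitute your own project, region, bucket, and cluster.
export PROJECT=my-gcp-project
export REGION=us-central1
export GCS_STAGING_LOCATION=gs://my-bucket/temp
export JOB_TYPE=CLUSTER
export CLUSTER=my-dataproc-cluster

bin/start.sh -- --template <TEMPLATE TYPE>
```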
- Format Code [Optional]
From either the root directory or the v2/ directory, depending on whether your changes are under v2/ or not, run:

```
mvn spotless:apply
```

This will format the code and add a license header. To verify that the code is formatted correctly, run:

```
mvn spotless:check
```
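For changes under v2/, a minimal sketch of the same flow from that subdirectory:

```
cd v2/
mvn spotless:apply   # format the code and add license headers
mvn spotless:check   # fail if anything is still unformatted
```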
- Building the Project
Build the entire project using Maven:

```
mvn clean install
```
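While iterating locally, you may want to skip test execution. `-DskipTests` is a standard Maven Surefire flag, not something specific to these templates:

```
# Build without running the test suite (standard Maven flag).
mvn clean install -DskipTests
```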
- Executing a Template File
Once the template is staged on Google Cloud Storage, it can be executed using the gcloud CLI tool. To stage and execute the template, you can use the `start.sh` script. This takes:

- Environment variables on where and how to deploy the templates
- Additional options for `gcloud dataproc jobs submit spark` or `gcloud beta dataproc batches submit spark`
- Template options, such as the critical `--template` option which says which template to run, and `--templateProperty` options for passing in properties at runtime (as an alternative to setting them in `src/main/resources/template.properties`)
- Other common template properties, such as `log.level`, an optional parameter that defines the log level of the Spark Context; it defaults to INFO. Possible choices are the Spark log levels: ["ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"]. Example: `--templateProperty log.level=ERROR`

Usage syntax:

```
start.sh [submit-spark-options] -- --template templateName [--templateProperty key=value] [extra-template-options]
```
For example:

```
# Set required environment variables.
export PROJECT=my-gcp-project
export REGION=gcp-region
export GCS_STAGING_LOCATION=gs://my-bucket/temp

# Set optional environment variables.
export SUBNET=projects/<gcp-project>/regions/<region>/subnetworks/test-subnet1
# ID of Dataproc cluster running permanent history server to access historic logs.
export HISTORY_SERVER_CLUSTER=projects/<gcp-project>/regions/<region>/clusters/<cluster>

# The submit spark options must be separated with a "--" from the template options
bin/start.sh \
  --properties=<spark.something.key>=<value> \
  --version=... \
  -- \
  --template <TEMPLATE TYPE> --templateProperty <key>=<value>
```
- HiveToGCS (detailed instructions at README.md):

```
bin/start.sh \
  --properties=spark.hadoop.hive.metastore.uris=thrift://hostname/ip:9083 \
  -- --template HIVETOGCS
```
- HiveToBigQuery (detailed instructions at README.md):

```
bin/start.sh \
  --properties=spark.hadoop.hive.metastore.uris=thrift://hostname/ip:9083 \
  -- --template HIVETOBIGQUERY
```
- SpannerToGCS (detailed instructions at README.md):

```
bin/start.sh -- --template SPANNERTOGCS
```
- PubSubToBigQuery:

```
bin/start.sh -- --template PUBSUBTOBQ
```
- PubSubToGCS:

```
bin/start.sh -- --template PUBSUBTOGCS
```
- GCSToBigQuery (a fuller invocation with template properties is sketched after this list):

```
bin/start.sh -- --template GCSTOBIGQUERY
```
- BigQueryToGCS:

```
bin/start.sh -- --template BIGQUERYTOGCS
```
- GeneralTemplate (detailed instructions at README.md):

```
bin/start.sh --files="gs://bucket/path/config.yaml" \
  -- --template GENERAL --config config.yaml
```

With, for example, config.yaml:

```
input:
  shakespeare:
    format: bigquery
    options:
      table: "bigquery-public-data:samples.shakespeare"
query:
  wordcount:
    sql: "SELECT word, sum(word_count) cnt FROM shakespeare GROUP by word ORDER BY cnt DESC"
output:
  wordcount:
    format: csv
    options:
      header: true
    path: gs://bucket/output/wordcount/
    mode: Overwrite
```
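If you keep the config in GCS, as the `--files` flag in the example above does, you can upload it with gsutil before launching; the bucket and path are the placeholders from that example:

```
# Upload the local config to the GCS path referenced by --files (placeholder bucket/path).
gsutil cp config.yaml gs://bucket/path/config.yaml
```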
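As promised above, a fuller GCSToBigQuery invocation would pass the template's required properties on the command line. The property names below are illustrative of this template's `gcs.bigquery.*` naming; consult the template's README for the authoritative list and values:

```
# Property names and values are illustrative; see the GCSToBigQuery README.
bin/start.sh -- --template GCSTOBIGQUERY \
  --templateProperty project.id=my-gcp-project \
  --templateProperty gcs.bigquery.input.location=gs://my-bucket/input/ \
  --templateProperty gcs.bigquery.input.format=csv \
  --templateProperty gcs.bigquery.output.dataset=my_dataset \
  --templateProperty gcs.bigquery.output.table=my_table \
  --templateProperty gcs.bigquery.temp.bucket.name=my-temp-bucket
```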