Add DatabricksSparkSubmitJob

external help file: azure.databricks.cicd.tools-help.xml
Module Name: azure.databricks.cicd.tools
online version:
schema: 2.0.0

Add-DatabricksSparkSubmitJob

SYNOPSIS

Creates a Spark-Submit job in Databricks. The script uses the Databricks API 2.0 create job request: https://docs.azuredatabricks.net/api/latest/jobs.html#create

SYNTAX

Add-DatabricksSparkSubmitJob [[-BearerToken] <String>] [[-Region] <String>] [-JobName] <String>
 [-SparkVersion] <String> [-NodeType] <String> [[-DriverNodeType] <String>] [-MinNumberOfWorkers] <Int32>
 [-MaxNumberOfWorkers] <Int32> [[-Timeout] <Int32>] [[-MaxRetries] <Int32>]
 [[-ScheduleCronExpression] <String>] [[-Timezone] <String>] [[-SparkSubmitParameters] <String[]>]
 [[-PythonVersion] <String>] [[-Spark_conf] <Hashtable>] [[-CustomTags] <Hashtable>]
 [[-InitScripts] <String[]>] [[-SparkEnvVars] <Hashtable>] [[-ClusterLogPath] <String>]
 [[-InstancePoolId] <String>] [<CommonParameters>]

DESCRIPTION

Creates a Spark-Submit job in Databricks. The script uses the Databricks API 2.0 create job request: https://docs.azuredatabricks.net/api/latest/jobs.html#create

If a job with the given name already exists, it is updated instead of a new job being created. Spark-Submit does not support installing libraries on the cluster; instead, pass --jars in SparkSubmitParameters. Spark-Submit also does not support existing clusters.
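
As a rough illustration of the --jars approach mentioned above, the value for SparkSubmitParameters might be assembled as in the sketch below. The DBFS paths, the main class name and the trailing argument are hypothetical placeholders; --jars and --class are standard spark-submit options, and --jars takes a single comma-separated list.

```powershell
# Hypothetical dependency JARs stored on DBFS (paths are placeholders).
$dependencyJars = @(
    'dbfs:/mylibraries/test.jar',
    'dbfs:/mylibraries/other.jar'
)

# spark-submit expects --jars as one comma-separated value.
$sparkSubmitParameters = @(
    '--jars', ($dependencyJars -join ','),
    '--class', 'com.example.Main',   # hypothetical main class
    'dbfs:/myapp/app.jar',           # hypothetical application JAR
    'myparam'                        # hypothetical application argument
)

# Show the resulting array, one element per line.
$sparkSubmitParameters
```

The resulting array could then be passed as -SparkSubmitParameters $sparkSubmitParameters.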

EXAMPLES

EXAMPLE 1

Add-DatabricksSparkSubmitJob -BearerToken $BearerToken -Region $Region -JobName "Job1" -SparkVersion "5.3.x-scala2.11" -NodeType "Standard_D3_v2" -MinNumberOfWorkers 2 -MaxNumberOfWorkers 2 -Timeout 100 -MaxRetries 3 -ScheduleCronExpression "0 15 22 ? * *" -Timezone "UTC" -SparkSubmitParameters "--pyFiles", "dbfs:/myscript.py", "myparam"

The above example creates a job on a new cluster.
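
For long calls like this one, splatting can make the command easier to read. The sketch below collects the documented parameters used in Example 1 in a hashtable; the token value is a placeholder, and the final call is left as a comment because it assumes the azure.databricks.cicd.tools module is imported and a real workspace is available.

```powershell
# Placeholder values; supply your own token and region.
$jobParams = @{
    BearerToken            = '<your-bearer-token>'
    Region                 = 'northeurope'
    JobName                = 'Job1'
    SparkVersion           = '5.3.x-scala2.11'
    NodeType               = 'Standard_D3_v2'
    MinNumberOfWorkers     = 2
    MaxNumberOfWorkers     = 2
    Timeout                = 100
    MaxRetries             = 3
    ScheduleCronExpression = '0 15 22 ? * *'
    Timezone               = 'UTC'
    SparkSubmitParameters  = '--pyFiles', 'dbfs:/myscript.py', 'myparam'
}

# With the module imported, the splatted call would be:
# Add-DatabricksSparkSubmitJob @jobParams
```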

PARAMETERS

-BearerToken

Your Databricks bearer token, used to authenticate to your workspace (see User Settings in the Databricks web UI).

Type: String
Parameter Sets: (All)
Aliases:

Required: False
Position: 1
Default value: None
Accept pipeline input: False
Accept wildcard characters: False

-Region

Azure Region - must match the URL of your Databricks workspace, example: northeurope

Type: String
Parameter Sets: (All)
Aliases:

Required: False
Position: 2
Default value: None
Accept pipeline input: False
Accept wildcard characters: False

-JobName

Name of the job that will appear in the Job list. If a job with this name exists it will be updated.

Type: String
Parameter Sets: (All)
Aliases:

Required: True
Position: 3
Default value: None
Accept pipeline input: False
Accept wildcard characters: False

-SparkVersion

Spark version for cluster that will run the job. Example: 5.3.x-scala2.11

Type: String
Parameter Sets: (All)
Aliases:

Required: True
Position: 4
Default value: None
Accept pipeline input: False
Accept wildcard characters: False

-NodeType

Type of worker for cluster that will run the job. Example: Standard_D3_v2.

Type: String
Parameter Sets: (All)
Aliases:

Required: True
Position: 5
Default value: None
Accept pipeline input: False
Accept wildcard characters: False

-DriverNodeType

Type of driver for the cluster that will run the job. Example: Standard_D3_v2. If not provided, the NodeType value is used.

Type: String
Parameter Sets: (All)
Aliases:

Required: False
Position: 6
Default value: None
Accept pipeline input: False
Accept wildcard characters: False

-MinNumberOfWorkers

Minimum number of workers for the cluster that will run the job. Note: if MinNumberOfWorkers and MaxNumberOfWorkers are the same, autoscale is disabled.

Type: Int32
Parameter Sets: (All)
Aliases:

Required: True
Position: 7
Default value: 0
Accept pipeline input: False
Accept wildcard characters: False

-MaxNumberOfWorkers

Maximum number of workers for the cluster that will run the job. Note: if MinNumberOfWorkers and MaxNumberOfWorkers are the same, autoscale is disabled.

Type: Int32
Parameter Sets: (All)
Aliases:

Required: True
Position: 8
Default value: 0
Accept pipeline input: False
Accept wildcard characters: False
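
The relationship between the two worker parameters can be sketched as follows. Get-WorkerSetting is a hypothetical helper written for illustration, not part of the module, and the check that the minimum does not exceed the maximum is an added assumption. The field names num_workers and autoscale (with min_workers and max_workers) are those of the Databricks cluster specification; how the cmdlet builds its request internally is not shown here.

```powershell
# Hypothetical helper: equal Min/Max values map to a fixed-size cluster,
# differing values map to an autoscale range.
function Get-WorkerSetting {
    param(
        [Parameter(Mandatory)][int]$MinNumberOfWorkers,
        [Parameter(Mandatory)][int]$MaxNumberOfWorkers
    )

    # Assumption for this sketch: reject an inverted range.
    if ($MinNumberOfWorkers -gt $MaxNumberOfWorkers) {
        throw 'MinNumberOfWorkers must not exceed MaxNumberOfWorkers.'
    }

    if ($MinNumberOfWorkers -eq $MaxNumberOfWorkers) {
        # Fixed size: autoscale is disabled.
        return @{ num_workers = $MinNumberOfWorkers }
    }

    # Autoscale between the two bounds.
    return @{
        autoscale = @{
            min_workers = $MinNumberOfWorkers
            max_workers = $MaxNumberOfWorkers
        }
    }
}

Get-WorkerSetting -MinNumberOfWorkers 2 -MaxNumberOfWorkers 2 | ConvertTo-Json
Get-WorkerSetting -MinNumberOfWorkers 2 -MaxNumberOfWorkers 8 | ConvertTo-Json
```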

-Timeout

Timeout, in seconds, applied to each run of the job. If not set, there will be no timeout.

Type: Int32
Parameter Sets: (All)
Aliases:

Required: False
Position: 9
Default value: 0
Accept pipeline input: False
Accept wildcard characters: False

-MaxRetries

An optional maximum number of times to retry an unsuccessful run. A run is considered unsuccessful if it completes with a FAILED result_state or an INTERNAL_ERROR life_cycle_state. The value -1 means retry indefinitely and the value 0 means never retry. If not set, the default behavior is to never retry.

Type: Int32
Parameter Sets: (All)
Aliases:

Required: False
Position: 10
Default value: 0
Accept pipeline input: False
Accept wildcard characters: False

-ScheduleCronExpression

By default, the job runs only when triggered from the Jobs UI or by an API request. To run the job periodically, provide a cron schedule expression. How to compose a cron schedule expression: http://www.quartz-scheduler.org/documentation/quartz-2.1.x/tutorials/tutorial-lesson-06.html

Type: String
Parameter Sets: (All)
Aliases:

Required: False
Position: 11
Default value: None
Accept pipeline input: False
Accept wildcard characters: False

-Timezone

Timezone for ScheduleCronExpression, for example UTC. Required if ScheduleCronExpression is provided. For all possible timezones, see: http://joda-time.sourceforge.net/timezones.html

Type: String
Parameter Sets: (All)
Aliases:

Required: False
Position: 12
Default value: None
Accept pipeline input: False
Accept wildcard characters: False
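
The dependency between ScheduleCronExpression and Timezone can be sketched as below. Get-JobSchedule is a hypothetical helper for illustration only; quartz_cron_expression and timezone_id are the schedule fields of a Jobs API 2.0 request, but whether the cmdlet validates its input in exactly this way is an assumption.

```powershell
# Hypothetical helper: builds the schedule part of a job definition.
function Get-JobSchedule {
    param(
        [string]$ScheduleCronExpression,
        [string]$Timezone
    )

    if ([string]::IsNullOrEmpty($ScheduleCronExpression)) {
        # No schedule: the job runs only when triggered from the UI or the API.
        return $null
    }

    if ([string]::IsNullOrEmpty($Timezone)) {
        throw 'Timezone is required when ScheduleCronExpression is provided.'
    }

    return @{
        quartz_cron_expression = $ScheduleCronExpression
        timezone_id            = $Timezone
    }
}

# Quartz fields: seconds, minutes, hours, day-of-month, month, day-of-week.
# '0 15 22 ? * *' fires every day at 22:15:00 in the given timezone.
Get-JobSchedule -ScheduleCronExpression '0 15 22 ? * *' -Timezone 'UTC' | ConvertTo-Json
```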

-SparkSubmitParameters

Array of parameters for the job, for example: "--pyFiles", "dbfs:/myscript.py", "myparam"

Type: String[]
Parameter Sets: (All)
Aliases:

Required: False
Position: 13
Default value: None
Accept pipeline input: False
Accept wildcard characters: False

-PythonVersion

Python version for the cluster: 2 or 3. Defaults to 3.

Type: String
Parameter Sets: (All)
Aliases:

Required: False
Position: 14
Default value: 3
Accept pipeline input: False
Accept wildcard characters: False

-Spark_conf

Hashtable of Spark configuration key-value pairs. Example: @{"spark.speculation"=$true; "spark.streaming.ui.retainedBatches"= 5}

Type: Hashtable
Parameter Sets: (All)
Aliases:

Required: False
Position: 15
Default value: None
Accept pipeline input: False
Accept wildcard characters: False

-CustomTags

Custom Tags to set, provide hash table of tags. Example: @{CreatedBy="SimonDM";NumOfNodes=2;CanDelete=$true}

Type: Hashtable
Parameter Sets: (All)
Aliases:

Required: False
Position: 16
Default value: None
Accept pipeline input: False
Accept wildcard characters: False

-InitScripts

Init scripts to run post creation. Example: "dbfs:/script/script1", "dbfs:/script/script2"

Type: String[]
Parameter Sets: (All)
Aliases:

Required: False
Position: 17
Default value: None
Accept pipeline input: False
Accept wildcard characters: False

-SparkEnvVars

An object containing a set of optional, user-specified environment variable key-value pairs. Key-value pairs of the form (X,Y) are exported as is (i.e., export X='Y') while launching the driver and workers. Example: @{SPARK_WORKER_MEMORY="29000m";SPARK_LOCAL_DIRS="/local_disk0"}

Type: Hashtable
Parameter Sets: (All)
Aliases:

Required: False
Position: 18
Default value: None
Accept pipeline input: False
Accept wildcard characters: False
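
To make the export X='Y' form concrete, the sketch below renders the example hashtable into the corresponding export lines. It only illustrates the mapping described above and is not how the cluster itself applies the variables; sorting the keys is an added step so that the output order is predictable.

```powershell
# Example value from the parameter description.
$sparkEnvVars = @{
    SPARK_WORKER_MEMORY = '29000m'
    SPARK_LOCAL_DIRS    = '/local_disk0'
}

# Render each pair as export X='Y'; keys are sorted for a stable order.
$exportLines = foreach ($key in ($sparkEnvVars.Keys | Sort-Object)) {
    "export {0}='{1}'" -f $key, $sparkEnvVars[$key]
}

$exportLines
```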

-ClusterLogPath

Type: String
Parameter Sets: (All)
Aliases:

Required: False
Position: 19
Default value: None
Accept pipeline input: False
Accept wildcard characters: False

-InstancePoolId

Type: String
Parameter Sets: (All)
Aliases:

Required: False
Position: 20
Default value: None
Accept pipeline input: False
Accept wildcard characters: False

CommonParameters

This cmdlet supports the common parameters: -Debug, -ErrorAction, -ErrorVariable, -InformationAction, -InformationVariable, -OutVariable, -OutBuffer, -PipelineVariable, -Verbose, -WarningAction, and -WarningVariable. For more information, see about_CommonParameters.

INPUTS

OUTPUTS

NOTES

Author: Simon D'Morias / Data Thirst Ltd
