Commit

Refreshing website content from main repo.
GitHub Action Website Snapshot committed Nov 18, 2024
1 parent ef35b7e commit 13239f9
Showing 1 changed file with 71 additions and 1 deletion.
72 changes: 71 additions & 1 deletion docs/integrations/spark/configuration/usage.md
@@ -7,7 +7,7 @@ title: Usage
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

Configuring the OpenLineage Spark integration is straightforward. It uses built-in Spark configuration mechanisms. However, for **Databricks users**, special considerations are required to ensure compatibility and avoid breaking the Spark UI after a cluster shutdown.

Your options are:

@@ -27,6 +27,10 @@ The setting `config("spark.extraListeners", "io.openlineage.spark.agent.OpenLine
the integration ineffective.
:::

:::note Databricks
For Databricks users, you must include `com.databricks.backend.daemon.driver.DBCEventLoggingListener` in addition to `io.openlineage.spark.agent.OpenLineageSparkListener` in the `spark.extraListeners` setting. Failure to do so will make the Spark UI inaccessible after a cluster shutdown.
:::

<Tabs groupId="spark-app-conf">
<TabItem value="scala" label="Scala">

@@ -50,6 +54,27 @@ object OpenLineageExample extends App {

spark.stop()
}

// For Databricks
import org.apache.spark.sql.SparkSession

object OpenLineageDatabricksExample extends App {
val spark = SparkSession.builder()
.appName("OpenLineageExample")
  // This line is EXTREMELY important: without DBCEventLoggingListener, the
  // Spark UI becomes inaccessible after a cluster shutdown
.config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener,com.databricks.backend.daemon.driver.DBCEventLoggingListener")
.config("spark.openlineage.transport.type", "http")
.config("spark.openlineage.transport.url", "http://localhost:5000")
.config("spark.openlineage.namespace", "spark_namespace")
.config("spark.openlineage.parentJobNamespace", "airflow_namespace")
.config("spark.openlineage.parentJobName", "airflow_dag.airflow_task")
.config("spark.openlineage.parentRunId", "xxxx-xxxx-xxxx-xxxx")
.getOrCreate()

// ... your code

spark.stop()
}
```
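
If you want to sanity-check the listener wiring before pointing it at a real backend, the `console` transport logs lineage events on the driver instead of sending them over HTTP. A minimal sketch, assuming a recent `openlineage-spark` version that ships the console transport; the object name and namespace are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object OpenLineageConsoleCheck extends App {
  val spark = SparkSession.builder()
    .appName("OpenLineageConsoleCheck")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    // Console transport: events are written to the driver log instead of POSTed over HTTP
    .config("spark.openlineage.transport.type", "console")
    .config("spark.openlineage.namespace", "spark_namespace")
    .getOrCreate()

  spark.range(10).count() // Any action triggers lineage events

  spark.stop()
}
```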

</TabItem>
@@ -71,6 +96,24 @@ spark = SparkSession.builder

# ... your code

spark.stop()

# For Databricks
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("OpenLineageExample")
    # This line is EXTREMELY important: without DBCEventLoggingListener, the
    # Spark UI becomes inaccessible after a cluster shutdown
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener,com.databricks.backend.daemon.driver.DBCEventLoggingListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .config("spark.openlineage.namespace", "spark_namespace")
    .config("spark.openlineage.parentJobNamespace", "airflow_namespace")
    .config("spark.openlineage.parentJobName", "airflow_dag.airflow_task")
    .config("spark.openlineage.parentRunId", "xxxx-xxxx-xxxx-xxxx")
    .getOrCreate())

# ... your code

spark.stop()
```

@@ -81,6 +124,10 @@ spark.stop()

The example below demonstrates how to use the `--conf` option with `spark-submit`.

:::note Databricks
Remember to include `com.databricks.backend.daemon.driver.DBCEventLoggingListener` along with the OpenLineage listener.
:::

```bash
spark-submit \
--conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
@@ -91,6 +138,17 @@ spark-submit \
--conf "spark.openlineage.parentJobName=airflow_dag.airflow_task" \
--conf "spark.openlineage.parentRunId=xxxx-xxxx-xxxx-xxxx" \
# ... other options

# For Databricks
spark-submit \
--conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener,com.databricks.backend.daemon.driver.DBCEventLoggingListener" \
--conf "spark.openlineage.transport.type=http" \
--conf "spark.openlineage.transport.url=http://localhost:5000" \
--conf "spark.openlineage.namespace=spark_namespace" \
--conf "spark.openlineage.parentJobNamespace=airflow_namespace" \
--conf "spark.openlineage.parentJobName=airflow_dag.airflow_task" \
--conf "spark.openlineage.parentRunId=xxxx-xxxx-xxxx-xxxx" \
# ... other options
```

#### Adding properties to the `spark-defaults.conf` file in the `${SPARK_HOME}/conf` directory
@@ -104,13 +162,25 @@ installation, particularly in a shared environment.

The example below demonstrates how to add properties to the `spark-defaults.conf` file.

:::note Databricks
For Databricks users, include `com.databricks.backend.daemon.driver.DBCEventLoggingListener` in the `spark.extraListeners` property.
:::

```properties
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type=http
spark.openlineage.transport.url=http://localhost:5000
spark.openlineage.namespace=MyNamespace
```

For Databricks:
```properties
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener,com.databricks.backend.daemon.driver.DBCEventLoggingListener
spark.openlineage.transport.type=http
spark.openlineage.transport.url=http://localhost:5000
spark.openlineage.namespace=MyNamespace
```
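
Because `spark.extraListeners` is non-additive (see the note below), it can help to read back the effective value at runtime and confirm that no override dropped a listener. A minimal sketch, assuming a Scala entry point; the object name and the `require` check are illustrative, not part of the integration:

```scala
import org.apache.spark.sql.SparkSession

object ListenerConfigCheck extends App {
  val spark = SparkSession.builder().appName("ListenerConfigCheck").getOrCreate()

  // Read back the effective value; whichever source set it last (CLI,
  // SparkSession#config, or spark-defaults.conf) wins outright.
  val listeners = spark.sparkContext.getConf.get("spark.extraListeners", "")
  require(
    listeners.contains("io.openlineage.spark.agent.OpenLineageSparkListener"),
    s"OpenLineage listener missing from spark.extraListeners: '$listeners'")

  spark.stop()
}
```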

:::info
The `spark.extraListeners` configuration parameter is **non-additive**. This means that if you set
`spark.extraListeners` via the CLI or via `SparkSession#config`, it will **replace** the value set in
`spark-defaults.conf`.
:::
