diff --git a/versioned_docs/version-1.26.0/before-ol.svg b/versioned_docs/version-1.26.0/before-ol.svg new file mode 100644 index 0000000..a36cbbc --- /dev/null +++ b/versioned_docs/version-1.26.0/before-ol.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/client/_category_.json b/versioned_docs/version-1.26.0/client/_category_.json new file mode 100644 index 0000000..2aa263f --- /dev/null +++ b/versioned_docs/version-1.26.0/client/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Client Libraries", + "position": 4 +} diff --git a/versioned_docs/version-1.26.0/client/java/_category_.json b/versioned_docs/version-1.26.0/client/java/_category_.json new file mode 100644 index 0000000..8360ca5 --- /dev/null +++ b/versioned_docs/version-1.26.0/client/java/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Java", + "position": 1 +} diff --git a/versioned_docs/version-1.26.0/client/java/configuration.md b/versioned_docs/version-1.26.0/client/java/configuration.md new file mode 100644 index 0000000..5911391 --- /dev/null +++ b/versioned_docs/version-1.26.0/client/java/configuration.md @@ -0,0 +1,118 @@ +--- +sidebar_position: 2 +title: Configuration +--- + +We recommend configuring the client with an `openlineage.yml` file that contains all the +details of how to connect to your OpenLineage backend. + +See [example configurations.](#transports) + +You can make this file available to the client in three ways (the list also presents precedence of the configuration): + +1. Set an `OPENLINEAGE_CONFIG` environment variable to a file path: `OPENLINEAGE_CONFIG=path/to/openlineage.yml`. +2. Place an `openlineage.yml` in the user's current working directory. +3. Place an `openlineage.yml` under `.openlineage/` in the user's home directory (`~/.openlineage/openlineage.yml`). + +## Environment Variables + +The following environment variables are available: + +| Name | Description | Since | +|----------------------|-----------------------------------------------------------------------------|-------| +| OPENLINEAGE_CONFIG | The path to the YAML configuration file. Example: `path/to/openlineage.yml` | | +| OPENLINEAGE_DISABLED | When `true`, OpenLineage will not emit events. | 0.9.0 | + +You can also configure the client with dynamic environment variables. + +import DynamicEnvVars from './partials/java_dynamic_env_vars.md'; + + + +## Facets Configuration + +In YAML configuration file you can also disable facets to filter them out from the OpenLineage event. + +*YAML Configuration* + +```yaml +transport: + type: console +facets: + spark_unknown: + disabled: true + spark: + logicalPlan: + disabled: true +``` + +### Deprecated syntax + +The following syntax is deprecated and soon will be removed: + +```yaml +transport: + type: console +facets: + disabled: + - spark_unknown + - spark.logicalPlan +``` + +The rationale behind deprecation is that some of the facets were disabled by default in some integrations. When we added +something extra but didn't include the defaults, they were unintentionally enabled. 
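A configuration file like the ones above (including the `facets` section) is picked up automatically when the client is constructed through the `Clients` factory, as in the usage example; a minimal sketch:

```java
import io.openlineage.client.Clients;
import io.openlineage.client.OpenLineageClient;

// Resolves configuration from OPENLINEAGE_CONFIG, ./openlineage.yml or
// ~/.openlineage/openlineage.yml (in that order of precedence), including
// the transport settings and any facets disabled there.
OpenLineageClient client = Clients.newClient();
```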
+ +## Transports + +import Transports from './partials/java_transport.md'; + + + +### Error Handling via Transport + +```java +// Connect to http://localhost:5000 +OpenLineageClient client = OpenLineageClient.builder() + .transport( + HttpTransport.builder() + .uri("http://localhost:5000") + .apiKey("f38d2189-c603-4b46-bdea-e573a3b5a7d5") + .build()) + .registerErrorHandler(new EmitErrorHandler() { + @Override + public void handleError(Throwable throwable) { + // Handle emit error here + } + }).build(); +``` + +### Defining Your Own Transport + +```java +OpenLineageClient client = OpenLineageClient.builder() + .transport( + new MyTransport() { + @Override + public void emit(OpenLineage.RunEvent runEvent) { + // Add emit logic here + } + }).build(); +``` + +## Circuit Breakers + +import CircuitBreakers from './partials/java_circuit_breaker.md'; + + + +## Metrics + +import Metrics from './partials/java_metrics.md'; + + + +## Dataset Namespace Resolver + +import DatasetNamespaceResolver from './partials/java_namespace_resolver.md'; + + diff --git a/versioned_docs/version-1.26.0/client/java/java.md b/versioned_docs/version-1.26.0/client/java/java.md new file mode 100644 index 0000000..cff5f99 --- /dev/null +++ b/versioned_docs/version-1.26.0/client/java/java.md @@ -0,0 +1,39 @@ +--- +sidebar_position: 5 +--- + +# Java + +## Overview + +The OpenLineage Java is a SDK for Java programming language that users can use to generate and emit OpenLineage events to OpenLineage backends. +The core data structures currently offered by the client are the `RunEvent`, `RunState`, `Run`, `Job`, `Dataset`, +and `Transport` classes, along with various `Facets` that can come under run, job, and dataset. + +There are various [transport classes](#transports) that the library provides that carry the lineage events into various target endpoints (e.g. HTTP). + +You can also use the Java client to create your own custom integrations. + +## Installation + +Java client is provided as library that can either be imported into your Java project using Maven or Gradle. + +Maven: + +```xml + + io.openlineage + openlineage-java + {{PREPROCESSOR:OPENLINEAGE_VERSION}} + +``` + +or Gradle: + +```groovy +implementation("io.openlineage:openlineage-java:{{PREPROCESSOR:OPENLINEAGE_VERSION}}") +``` + +For more information on the available versions of the `openlineage-java`, +please refer to the [maven repository](https://search.maven.org/artifact/io.openlineage/openlineage-java). + diff --git a/versioned_docs/version-1.26.0/client/java/mqz_job_complete.png b/versioned_docs/version-1.26.0/client/java/mqz_job_complete.png new file mode 100644 index 0000000..f412c5e Binary files /dev/null and b/versioned_docs/version-1.26.0/client/java/mqz_job_complete.png differ diff --git a/versioned_docs/version-1.26.0/client/java/mqz_job_running.png b/versioned_docs/version-1.26.0/client/java/mqz_job_running.png new file mode 100644 index 0000000..ff2ad08 Binary files /dev/null and b/versioned_docs/version-1.26.0/client/java/mqz_job_running.png differ diff --git a/versioned_docs/version-1.26.0/client/java/partials/java_circuit_breaker.md b/versioned_docs/version-1.26.0/client/java/partials/java_circuit_breaker.md new file mode 100644 index 0000000..059afc9 --- /dev/null +++ b/versioned_docs/version-1.26.0/client/java/partials/java_circuit_breaker.md @@ -0,0 +1,107 @@ +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +:::info +This feature is available in OpenLineage versions >= 1.9.0. 
+::: + +To prevent from over-instrumentation OpenLineage integration provides a circuit breaker mechanism +that stops OpenLineage from creating, serializing and sending OpenLineage events. + +### Simple Memory Circuit Breaker + +Simple circuit breaker which is working based only on free memory within JVM. Configuration should +contain free memory threshold limit (percentage). Default value is `20%`. The circuit breaker +will close within first call if free memory is low. `circuitCheckIntervalInMillis` parameter is used +to configure a frequency circuit breaker is called. Default value is `1000ms`, when no entry in config. +`timeoutInSeconds` is optional. If set, OpenLineage code execution is terminated when a timeout +is reached (added in version 1.13). + + + + +```yaml +circuitBreaker: + type: simpleMemory + memoryThreshold: 20 + circuitCheckIntervalInMillis: 1000 + timeoutInSeconds: 90 +``` + + + +| Parameter | Definition | Example | +--------------------------------------|----------------------------------------------------------------|-------------- +| spark.openlineage.circuitBreaker.type | Circuit breaker type selected | simpleMemory | +| spark.openlineage.circuitBreaker.memoryThreshold | Memory threshold | 20 | +| spark.openlineage.circuitBreaker.circuitCheckIntervalInMillis | Frequency of checking circuit breaker | 1000 | +| spark.openlineage.circuitBreaker.timeoutInSeconds | Optional timeout for OpenLineage execution (Since version 1.13)| 90 | + + + + +| Parameter | Definition | Example | +--------------------------------------|---------------------------------------------|------------- +| openlineage.circuitBreaker.type | Circuit breaker type selected | simpleMemory | +| openlineage.circuitBreaker.memoryThreshold | Memory threshold | 20 | +| openlineage.circuitBreaker.circuitCheckIntervalInMillis | Frequency of checking circuit breaker | 1000 | +| spark.openlineage.circuitBreaker.timeoutInSeconds | Optional timeout for OpenLineage execution (Since version 1.13) | 90 | + + + + +### Java Runtime Circuit Breaker + +More complex version of circuit breaker. The amount of free memory can be low as long as +amount of time spent on Garbage Collection is acceptable. `JavaRuntimeCircuitBreaker` closes +when free memory drops below threshold and amount of time spent on garbage collection exceeds +given threshold (`10%` by default). The circuit breaker is always open when checked for the first time +as GC threshold is computed since the previous circuit breaker call. +`circuitCheckIntervalInMillis` parameter is used +to configure a frequency circuit breaker is called. +Default value is `1000ms`, when no entry in config. +`timeoutInSeconds` is optional. If set, OpenLineage code execution is terminated when a timeout +is reached (added in version 1.13). 
+ + + + +```yaml +circuitBreaker: + type: javaRuntime + memoryThreshold: 20 + gcCpuThreshold: 10 + circuitCheckIntervalInMillis: 1000 + timeoutInSeconds: 90 +``` + + + +| Parameter | Definition | Example | +--------------------------------------|---------------------------------------|------------- +| spark.openlineage.circuitBreaker.type | Circuit breaker type selected | javaRuntime | +| spark.openlineage.circuitBreaker.memoryThreshold | Memory threshold | 20 | +| spark.openlineage.circuitBreaker.gcCpuThreshold | Garbage Collection CPU threshold | 10 | +| spark.openlineage.circuitBreaker.circuitCheckIntervalInMillis | Frequency of checking circuit breaker | 1000 | +| spark.openlineage.circuitBreaker.timeoutInSeconds | Optional timeout for OpenLineage execution (Since version 1.13)| 90 | + + + + + +| Parameter | Definition | Example | +--------------------------------------|---------------------------------------|------------- +| openlineage.circuitBreaker.type | Circuit breaker type selected | javaRuntime | +| openlineage.circuitBreaker.memoryThreshold | Memory threshold | 20 | +| openlineage.circuitBreaker.gcCpuThreshold | Garbage Collection CPU threshold | 10 | +| openlineage.circuitBreaker.circuitCheckIntervalInMillis | Frequency of checking circuit breaker | 1000 | +| spark.openlineage.circuitBreaker.timeoutInSeconds | Optional timeout for OpenLineage execution (Since version 1.13) | 90 | + + + + + +### Custom Circuit Breaker + +List of available circuit breakers can be extended with custom one loaded via ServiceLoader +with own implementation of `io.openlineage.client.circuitBreaker.CircuitBreakerBuilder`. \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/client/java/partials/java_dynamic_env_vars.md b/versioned_docs/version-1.26.0/client/java/partials/java_dynamic_env_vars.md new file mode 100644 index 0000000..7fd93b5 --- /dev/null +++ b/versioned_docs/version-1.26.0/client/java/partials/java_dynamic_env_vars.md @@ -0,0 +1,163 @@ +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + + +The OpenLineage client supports configuration through dynamic environment variables. + +### Configuring OpenLineage Client via Dynamic Environment Variables + +These environment variables must begin with `OPENLINEAGE__`, followed by sections of the configuration separated by a double underscore `__`. +All values in the environment variables are automatically converted to lowercase, +and variable names using snake_case (single underscore) are converted into camelCase within the final configuration. + +#### Key Features + +1. Prefix Requirement: All environment variables must begin with `OPENLINEAGE__`. +2. Sections Separation: Configuration sections are separated using double underscores `__` to form the hierarchy. +3. Lowercase Conversion: Environment variable values are automatically converted to lowercase. +4. CamelCase Conversion: Any environment variable name using single underscore `_` will be converted to camelCase in the final configuration. +5. JSON String Support: You can pass a JSON string at any level of the configuration hierarchy, which will be merged into the final configuration structure. +6. Hyphen Restriction: You cannot use `-` in environment variable names. If a name strictly requires a hyphen, use a JSON string as the value of the environment variable. +7. Precedence Rules: +* Top-level keys have precedence and will not be overwritten by more nested entries. 
+* For example, `OPENLINEAGE__TRANSPORT='{..}'` will not have its keys overwritten by `OPENLINEAGE__TRANSPORT__AUTH__KEY='key'`. + +#### Examples + + + + +Setting following environment variables: + +```sh +OPENLINEAGE__TRANSPORT__TYPE=http +OPENLINEAGE__TRANSPORT__URL=http://localhost:5050 +OPENLINEAGE__TRANSPORT__ENDPOINT=/api/v1/lineage +OPENLINEAGE__TRANSPORT__AUTH__TYPE=api_key +OPENLINEAGE__TRANSPORT__AUTH__API_KEY=random_token +OPENLINEAGE__TRANSPORT__COMPRESSION=gzip +``` + +is equivalent to passing following YAML configuration: +```yaml +transport: + type: http + url: http://localhost:5050 + endpoint: api/v1/lineage + auth: + type: api_key + apiKey: random_token + compression: gzip +``` + + + + +Setting following environment variables: + +```sh +OPENLINEAGE__TRANSPORT__TYPE=composite +OPENLINEAGE__TRANSPORT__TRANSPORTS__FIRST__TYPE=http +OPENLINEAGE__TRANSPORT__TRANSPORTS__FIRST__URL=http://localhost:5050 +OPENLINEAGE__TRANSPORT__TRANSPORTS__FIRST__ENDPOINT=/api/v1/lineage +OPENLINEAGE__TRANSPORT__TRANSPORTS__FIRST__AUTH__TYPE=api_key +OPENLINEAGE__TRANSPORT__TRANSPORTS__FIRST__AUTH__API_KEY=random_token +OPENLINEAGE__TRANSPORT__TRANSPORTS__FIRST__AUTH__COMPRESSION=gzip +OPENLINEAGE__TRANSPORT__TRANSPORTS__SECOND__TYPE=console +``` + +is equivalent to passing following YAML configuration: +```yaml +transport: + type: composite + transports: + first: + type: http + url: http://localhost:5050 + endpoint: api/v1/lineage + auth: + type: api_key + apiKey: random_token + compression: gzip + second: + type: console +``` + + + + +Setting following environment variables: + +```sh +OPENLINEAGE__TRANSPORT='{"type":"console"}' +OPENLINEAGE__TRANSPORT__TYPE=http +``` + +is equivalent to passing following YAML configuration: +```yaml +transport: + type: console +``` + + + + +Setting following environment variables: + +```sh +OPENLINEAGE__TRANSPORT__TYPE=kafka +OPENLINEAGE__TRANSPORT__TOPIC_NAME=test +OPENLINEAGE__TRANSPORT__MESSAGE_KEY=explicit-key +OPENLINEAGE__TRANSPORT__PROPERTIES='{"key.serializer": "org.apache.kafka.common.serialization.StringSerializer"}' +``` + +is equivalent to passing following YAML configuration: +```yaml +transport: + type: kafka + topicName: test + messageKey: explicit-key + properties: + key.serializer: org.apache.kafka.common.serialization.StringSerializer +``` + +Please note that you can't use environment variables to set Spark properties, as they are not part of the configuration hierarchy. +Following environment variable: +```sh +OPENLINEAGE__TRANSPORT__PROPERTIES__KEY__SERIALIZER="org.apache.kafka.common.serialization.StringSerializer" +``` +would be equivalent to below YAML structure: +```yaml +transport: + properties: + key: + serializer: org.apache.kafka.common.serialization.StringSerializer +``` +which is not a valid configuration for Spark. 
+ + + + + +Setting following environment variables: + +```sh +OPENLINEAGE__DATASET__NAMESPACE_RESOLVERS__RESOLVED_NAME__TYPE=hostList +OPENLINEAGE__DATASET__NAMESPACE_RESOLVERS__RESOLVED_NAME__HOSTS='["kafka-prod13.company.com", "kafka-prod15.company.com"]' +OPENLINEAGE__DATASET__NAMESPACE_RESOLVERS__RESOLVED_NAME__SCHEMA=kafka +``` + +is equivalent to passing following YAML configuration: +```yaml +dataset: + namespaceResolvers: + resolvedName: + type: hostList + hosts: + - kafka-prod13.company.com + - kafka-prod15.company.com + schema: kafka +``` + + + \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/client/java/partials/java_metrics.md b/versioned_docs/version-1.26.0/client/java/partials/java_metrics.md new file mode 100644 index 0000000..1b0f369 --- /dev/null +++ b/versioned_docs/version-1.26.0/client/java/partials/java_metrics.md @@ -0,0 +1,64 @@ +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +:::info +This feature is available in OpenLineage 1.11 and above +::: + +To ease the operational experience of using the OpenLineage integrations, this document details the metrics collected by the Java client and the configuration settings for various metric backends. + +### Metrics collected by Java Client + +The following table outlines the metrics collected by the OpenLineage Java client, which help in monitoring the integration's performance: + +| Metric | Definition | Type | +|-------------------------------------|-------------------------------------------------------|--------| +| `openlineage.emit.start` | Number of events the integration started to send | Counter| +| `openlineage.emit.complete` | Number of events the integration completed sending | Counter| +| `openlineage.emit.time` | Time spent on emitting events | Timer | +| `openlineage.circuitbreaker.engaged`| Status of the Circuit Breaker (engaged or not) | Gauge | + +## Metric Backends + +OpenLineage uses [Micrometer](https://micrometer.io) for metrics collection, similar to how SLF4J operates for logging. Micrometer provides a facade over different metric backends, allowing metrics to be dispatched to various destinations. + +### Configuring Metric Backends + +Below are the available backends and potential configurations using Micrometer's facilities. + +### StatsD + +Full configuration options for StatsD can be found in the [Micrometer's StatsDConfig implementation](https://github.com/micrometer-metrics/micrometer/blob/main/implementations/micrometer-registry-statsd/src/main/java/io/micrometer/statsd/StatsdConfig.java). 
+ + + + +```yaml +metrics: + type: statsd + flavor: datadog + host: localhost + port: 8125 +``` + + + +| Parameter | Definition | Example | +--------------------------------------|---------------------------------------|------------- +| spark.openlineage.metrics.type | Metrics type selected | statsd | +| spark.openlineage.metrics.flavor | Flavor of StatsD configuration | datadog | +| spark.openlineage.metrics.host | Host that receives StatsD metrics | localhost | +| spark.openlineage.metrics.port | Port that receives StatsD metrics | 8125 | + + + + +| Parameter | Definition | Example | +--------------------------------------|---------------------------------------|------------- +| openlineage.metrics.type | Metrics type selected | statsd | +| openlineage.metrics.flavor | Flavor of StatsD configuration | datadog | +| openlineage.metrics.host | Host that receives StatsD metrics | localhost | +| openlineage.metrics.port | Port that receives StatsD metrics | 8125 | + + + diff --git a/versioned_docs/version-1.26.0/client/java/partials/java_namespace_resolver.md b/versioned_docs/version-1.26.0/client/java/partials/java_namespace_resolver.md new file mode 100644 index 0000000..3597177 --- /dev/null +++ b/versioned_docs/version-1.26.0/client/java/partials/java_namespace_resolver.md @@ -0,0 +1,141 @@ +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +:::info +This feature is available in OpenLineage 1.17 and above +::: + +Oftentimes host addresses are used to access data and a single dataset can be accessed via different +addresses. For example, a Kafka topic can be accessed by a list of kafka bootstrap servers or any +server from the list. In general, a problem can be solved by adding mechanism which resolves host addresses into +logical identifier understood within the organisation. This applies for all clusters like Kafka or Cassandra +which should be identified regardless of current list of hosts they contain. This also applies +for JDBC urls where a physical address of database can change over time. + +### Host List Resolver + +Host List Resolver given a list of hosts, replaces host name within +the dataset namespace into the resolved value defined. + + + + +```yaml +dataset: + namespaceResolvers: + resolved-name: + type: hostList + hosts: ['kafka-prod13.company.com', 'kafka-prod15.company.com'] + schema: "kafka" +``` + + + +| Parameter | Definition | Example | +------------------- ----------------------------------------------------|---------------|-- +| spark.openlineage.dataset.namespaceResolvers.resolved-name.type | Resolver type | hostList | +| spark.openlineage.dataset.namespaceResolvers.resolved-name.hosts | List of hosts | `['kafka-prod13.company.com', 'kafka-prod15.company.com']` | +| spark.openlineage.dataset.namespaceResolvers.resolved-name.schema | Optional schema to be specified. Resolver will be only applied if schema matches the configure one. | `kafka` | + + + + +| Parameter | Definition | Example | +------------------- -------------------------------------------|---------------|-- +| openlineage.dataset.namespaceResolvers.resolved-name.type | Resolver type | hostList | +| openlineage.dataset.namespaceResolvers.resolved-name.hosts | List of hosts | `['kafka-prod13.company.com', 'kafka-prod15.company.com']` | +| openlineage.dataset.namespaceResolvers.resolved-name.schema | Optional schema to be specified. Resolver will be only applied if schema matches the configure one. 
| `kafka` | + + + + + +### Pattern Namespace Resolver + +Java regex pattern is used to identify a host. Substrings matching a pattern will be replaced with resolved name. + + + + +```yaml +dataset: + namespaceResolvers: + resolved-name: + type: pattern + # 'cassandra-prod7.company.com', 'cassandra-prod8.company.com' + regex: 'cassandra-prod(\d)+\.company\.com' + schema: "cassandra" +``` + + + +| Parameter | Definition | Example | +------------------- -------------------------------------------------|---------------|-------- +| spark.openlineage.dataset.namespaceResolvers.resolved-name.type | Resolver type | pattern | +| spark.openlineage.dataset.namespaceResolvers.resolved-name.hosts | Regex pattern to find and replace | `cassandra-prod(\d)+\.company\.com` | +| spark.openlineage.dataset.namespaceResolvers.resolved-name.schema | Optional schema to be specified. Resolver will be only applied if schema matches the configure one. | `kafka` | + + + + +| Parameter | Definition | Example | +------------------- -------------------------------------------|---------------|-- +| openlineage.dataset.namespaceResolvers.resolved-name.type | Resolver type | pattern | +| openlineage.dataset.namespaceResolvers.resolved-name.hosts | Regex pattern to find and replace | `cassandra-prod(\d)+\.company\.com` | +| openlineage.dataset.namespaceResolvers.resolved-name.schema | Optional schema to be specified. Resolver will be only applied if schema matches the configure one. | `kafka` | + + + + +### Pattern Group Namespace Resolver + +For this resolver, Java regex pattern is used to identify a host. However, instead of configured resolved name, +a `matchingGroup` is used a resolved name. This can be useful when having several clusters +made from hosts with a well-defined host naming convention. + + + + +```yaml +dataset: + namespaceResolvers: + test-pattern: + type: patternGroup + # 'cassandra-test-7.company.com', 'cassandra-test-8.company.com', 'kafka-test-7.company.com', 'kafka-test-8.company.com' + regex: '(?[a-zA-Z-]+)-(\d)+\.company\.com:[\d]*' + matchingGroup: "cluster" + schema: "cassandra" +``` + + + +| Parameter | Definition | Example | +------------------- ----------------------------------------------------|---------------|-- +| spark.openlineage.dataset.namespaceResolvers.pattern-group-resolver.type | Resolver type | patternGroup | +| spark.openlineage.dataset.namespaceResolvers.pattern-group-resolver.regex | Regex pattern to find and replace | `(?[a-zA-Z-]+)-(\d)+\.company\.com:[\d]*` | +| spark.openlineage.dataset.namespaceResolvers.pattern-group-resolver.matchingGroup | Matching group named within the regex | `cluster` | +| spark.openlineage.dataset.namespaceResolvers.pattern-group-resolver.schema | Optional schema to be specified. Resolver will be only applied if schema matches the configure one. | `kafka` | + + + + +| Parameter | Definition | Example | +------------------- ----------------------------------------------------|---------------|-- +| openlineage.dataset.namespaceResolvers.pattern-group-resolver.type | Resolver type | patternGroup | +| openlineage.dataset.namespaceResolvers.pattern-group-resolver.regex | Regex pattern to find and replace | `(?[a-zA-Z-]+)-(\d)+\.company\.com` | +| openlineage.dataset.namespaceResolvers.pattern-group-resolver.matchingGroup | Matching group named within the regex | `cluster` | +| openlineage.dataset.namespaceResolvers.pattern-group-resolver.schema | Optional schema to be specified. Resolver will be only applied if schema matches the configure one. 
| `kafka` | + + + + +### Custom Resolver + +Custom resolver can be added by implementing: + * `io.openlineage.client.dataset.namespaceResolver.DatasetNamespaceResolver` + * `io.openlineage.client.dataset.namespaceResolver.DatasetNamespaceResolverBuilder` + * `io.openlineage.client.dataset.namespaceResolver.DatasetNamespaceResolverConfig` + +Config class can be used to pass any namespace resolver parameters through standard configuration +mechanism (Spark & Flink configuration or `openlineage.yml` file provided). Standard `ServiceLoader` +approach is used to load and initiate custom classes. \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/client/java/partials/java_transport.md b/versioned_docs/version-1.26.0/client/java/partials/java_transport.md new file mode 100644 index 0000000..a79c3d5 --- /dev/null +++ b/versioned_docs/version-1.26.0/client/java/partials/java_transport.md @@ -0,0 +1,899 @@ +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +**Tip:** See current list of [all supported transports](https://github.com/OpenLineage/OpenLineage/tree/main/client/java/src/main/java/io/openlineage/client/transports). + +### [HTTP](https://github.com/OpenLineage/OpenLineage/tree/main/client/java/src/main/java/io/openlineage/client/transports/HttpTransport.java) + +Allows sending events to HTTP endpoint, using [ApacheHTTPClient](https://hc.apache.org/index.html). + +#### Configuration + +- `type` - string, must be `"http"`. Required. +- `url` - string, base url for HTTP requests. Required. +- `endpoint` - string specifying the endpoint to which events are sent, appended to `url`. Optional, default: `/api/v1/lineage`. +- `urlParams` - dictionary specifying query parameters send in HTTP requests. Optional. +- `timeoutInMillis` - integer specifying timeout (in milliseconds) value used while connecting to server. Optional, default: `5000`. +- `auth` - dictionary specifying authentication options. Optional, by default no authorization is used. If set, requires the `type` property. + - `type` - string specifying the "api_key" or the fully qualified class name of your TokenProvider. Required if `auth` is provided. + - `apiKey` - string setting the Authentication HTTP header as the Bearer. Required if `type` is `api_key`. +- `headers` - dictionary specifying HTTP request headers. Optional. +- `compression` - string, name of algorithm used by HTTP client to compress request body. Optional, default value `null`, allowed values: `gzip`. Added in v1.13.0. + +#### Behavior + +Events are serialized to JSON, and then are send as HTTP POST request with `Content-Type: application/json`. 
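If the built-in `api_key` scheme is not enough, the `auth.type` option described above can also point at your own token provider. A rough sketch, assuming the `TokenProvider` interface exposes a single `getToken()` method that returns the `Authorization` header value (as `ApiKeyTokenProvider` does — verify against your client version):

```java
import io.openlineage.client.transports.TokenProvider;

// Illustrative provider that reads a token from an environment variable;
// MY_LINEAGE_TOKEN is a made-up name used only for this example.
public class EnvTokenProvider implements TokenProvider {
  @Override
  public String getToken() {
    return "Bearer " + System.getenv("MY_LINEAGE_TOKEN");
  }
}
```

The class then needs to be available on the classpath and referenced by its fully qualified name in `auth.type`.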
+ +#### Examples + + + + +Anonymous connection: + +```yaml +transport: + type: http + url: http://localhost:5000 +``` + +With authorization: + +```yaml +transport: + type: http + url: http://localhost:5000 + auth: + type: api_key + api_key: f38d2189-c603-4b46-bdea-e573a3b5a7d5 +``` + +Full example: + +```yaml +transport: + type: http + url: http://localhost:5000 + endpoint: /api/v1/lineage + urlParams: + param0: value0 + param1: value1 + timeoutInMillis: 5000 + auth: + type: api_key + api_key: f38d2189-c603-4b46-bdea-e573a3b5a7d5 + headers: + X-Some-Extra-Header: abc + compression: gzip +``` + + + + +Anonymous connection: + +```ini +spark.openlineage.transport.type=http +spark.openlineage.transport.url=http://localhost:5000 +``` + +With authorization: + +```ini +spark.openlineage.transport.type=http +spark.openlineage.transport.url=http://localhost:5000 +spark.openlineage.transport.auth.type=api_key +spark.openlineage.transport.auth.apiKey=f38d2189-c603-4b46-bdea-e573a3b5a7d5 +``` + +Full example: + +```ini +spark.openlineage.transport.type=http +spark.openlineage.transport.url=http://localhost:5000 +spark.openlineage.transport.endpoint=/api/v1/lineage +spark.openlineage.transport.urlParams.param0=value0 +spark.openlineage.transport.urlParams.param1=value1 +spark.openlineage.transport.timeoutInMillis=5000 +spark.openlineage.transport.auth.type=api_key +spark.openlineage.transport.auth.apiKey=f38d2189-c603-4b46-bdea-e573a3b5a7d5 +spark.openlineage.transport.headers.X-Some-Extra-Header=abc +spark.openlineage.transport.compression=gzip +``` + +
+URL parsing within Spark integration +

You can supply HTTP parameters as values embedded in the URL; the parsed `spark.openlineage.*` properties are taken from the URL as follows:

`{transport.url}/{transport.endpoint}/namespaces/{namespace}/jobs/{parentJobName}/runs/{parentRunId}?app_name={appName}&api_key={transport.apiKey}&timeout={transport.timeout}&xxx={transport.urlParams.xxx}`

For example:

`http://localhost:5000/api/v1/namespaces/ns_name/jobs/job_name/runs/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx?app_name=app&api_key=abc&timeout=5000&xxx=xxx`

+
+ +
Anonymous connection:

```ini
openlineage.transport.type=http
openlineage.transport.url=http://localhost:5000
```

With authorization:

```ini
openlineage.transport.type=http
openlineage.transport.url=http://localhost:5000
openlineage.transport.auth.type=api_key
openlineage.transport.auth.apiKey=f38d2189-c603-4b46-bdea-e573a3b5a7d5
```

Full example:

```ini
openlineage.transport.type=http
openlineage.transport.url=http://localhost:5000
openlineage.transport.endpoint=/api/v1/lineage
openlineage.transport.urlParams.param0=value0
openlineage.transport.urlParams.param1=value1
openlineage.transport.timeoutInMillis=5000
openlineage.transport.auth.type=api_key
openlineage.transport.auth.apiKey=f38d2189-c603-4b46-bdea-e573a3b5a7d5
openlineage.transport.headers.X-Some-Extra-Header=abc
openlineage.transport.compression=gzip
```

Anonymous connection:

```java
import io.openlineage.client.OpenLineageClient;
import io.openlineage.client.transports.HttpConfig;
import io.openlineage.client.transports.HttpTransport;

HttpConfig httpConfig = new HttpConfig();
httpConfig.setUrl("http://localhost:5000");

OpenLineageClient client = OpenLineageClient.builder()
  .transport(
    new HttpTransport(httpConfig))
  .build();
```

With authorization:

```java
import io.openlineage.client.OpenLineageClient;
import io.openlineage.client.transports.ApiKeyTokenProvider;
import io.openlineage.client.transports.HttpConfig;
import io.openlineage.client.transports.HttpTransport;

ApiKeyTokenProvider apiKeyTokenProvider = new ApiKeyTokenProvider();
apiKeyTokenProvider.setApiKey("f38d2189-c603-4b46-bdea-e573a3b5a7d5");

HttpConfig httpConfig = new HttpConfig();
httpConfig.setUrl("http://localhost:5000");
httpConfig.setAuth(apiKeyTokenProvider);

OpenLineageClient client = OpenLineageClient.builder()
  .transport(
    new HttpTransport(httpConfig))
  .build();
```

Full example:

```java
import java.util.Map;

import io.openlineage.client.OpenLineageClient;
import io.openlineage.client.transports.ApiKeyTokenProvider;
import io.openlineage.client.transports.HttpConfig;
import io.openlineage.client.transports.HttpTransport;

Map<String, String> queryParams = Map.of(
    "param0", "value0",
    "param1", "value1"
);

Map<String, String> headers = Map.of(
    "X-Some-Extra-Header", "abc"
);

ApiKeyTokenProvider apiKeyTokenProvider = new ApiKeyTokenProvider();
apiKeyTokenProvider.setApiKey("f38d2189-c603-4b46-bdea-e573a3b5a7d5");

HttpConfig httpConfig = new HttpConfig();
httpConfig.setUrl("http://localhost:5000");
httpConfig.setEndpoint("/api/v1/lineage");
httpConfig.setUrlParams(queryParams);
httpConfig.setAuth(apiKeyTokenProvider);
httpConfig.setTimeoutInMillis(5000);
httpConfig.setHeaders(headers);
httpConfig.setCompression(HttpConfig.Compression.GZIP);

OpenLineageClient client = OpenLineageClient.builder()
  .transport(
    new HttpTransport(httpConfig))
  .build();
```
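Whichever configuration style is used, the resulting client emits events the same way. A condensed sketch based on the builders from the usage example (the namespace, job name and producer URI below are placeholders):

```java
import java.net.URI;
import java.time.ZoneId;
import java.time.ZonedDateTime;

import io.openlineage.client.OpenLineage;
import io.openlineage.client.OpenLineageClient;
import io.openlineage.client.transports.HttpConfig;
import io.openlineage.client.transports.HttpTransport;
import io.openlineage.client.utils.UUIDUtils;

HttpConfig httpConfig = new HttpConfig();
httpConfig.setUrl("http://localhost:5000");

OpenLineageClient client = OpenLineageClient.builder()
    .transport(new HttpTransport(httpConfig))
    .build();

// Build a bare-bones START event and send it through the configured transport
OpenLineage ol = new OpenLineage(URI.create("https://example.com/my-producer"));
OpenLineage.RunEvent runEvent = ol.newRunEventBuilder()
    .eventType(OpenLineage.RunEvent.EventType.START)
    .eventTime(ZonedDateTime.now(ZoneId.of("UTC")))
    .run(ol.newRunBuilder().runId(UUIDUtils.generateNewUUID()).build())
    .job(ol.newJobBuilder().namespace("my-namespace").name("my-job").build())
    .build();

client.emit(runEvent);
```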
+ +### [Kafka](https://github.com/OpenLineage/OpenLineage/tree/main/client/java/src/main/java/io/openlineage/client/transports/KafkaTransport.java) +If a transport type is set to `kafka`, then the below parameters would be read and used when building KafkaProducer. +This transport requires the artifact `org.apache.kafka:kafka-clients:3.1.0` (or compatible) on your classpath. + +#### Configuration + +- `type` - string, must be `"kafka"`. Required. +- `topicName` - string specifying the topic on what events will be sent. Required. +- `properties` - a dictionary containing a Kafka producer config as in [Kafka producer config](http://kafka.apache.org/0100/documentation.html#producerconfigs). Required. +- `localServerId` - **deprecated**, renamed to `messageKey` since v1.13.0. +- `messageKey` - string, key for all Kafka messages produced by transport. Optional, default value described below. Added in v1.13.0. + + Default values for `messageKey` are: + - `run:{parentJob.namespace}/{parentJob.name}` - for RunEvent with parent facet + - `run:{job.namespace}/{job.name}` - for RunEvent + - `job:{job.namespace}/{job.name}` - for JobEvent + - `dataset:{dataset.namespace}/{dataset.name}` - for DatasetEvent + +#### Behavior + +Events are serialized to JSON, and then dispatched to the Kafka topic. + +#### Notes + +It is recommended to provide `messageKey` if Job hierarchy is used. It can be any string, but it should be the same for all jobs in +hierarchy, like `Airflow task -> Spark application -> Spark task runs`. + +#### Examples + + + + +```yaml +transport: + type: kafka + topicName: openlineage.events + properties: + bootstrap.servers: localhost:9092,another.host:9092 + acks: all + retries: 3 + key.serializer: org.apache.kafka.common.serialization.StringSerializer + value.serializer: org.apache.kafka.common.serialization.StringSerializer + messageKey: some-value +``` + + + + +```ini +spark.openlineage.transport.type=kafka +spark.openlineage.transport.topicName=openlineage.events +spark.openlineage.transport.properties.bootstrap.servers=localhost:9092,another.host:9092 +spark.openlineage.transport.properties.acks=all +spark.openlineage.transport.properties.retries=3 +spark.openlineage.transport.properties.key.serializer=org.apache.kafka.common.serialization.StringSerializer +spark.openlineage.transport.properties.value.serializer=org.apache.kafka.common.serialization.StringSerializer +spark.openlineage.transport.messageKey=some-value +``` + + + + +```ini +openlineage.transport.type=kafka +openlineage.transport.topicName=openlineage.events +openlineage.transport.properties.bootstrap.servers=localhost:9092,another.host:9092 +openlineage.transport.properties.acks=all +openlineage.transport.properties.retries=3 +openlineage.transport.properties.key.serializer=org.apache.kafka.common.serialization.StringSerializer +openlineage.transport.properties.value.serializer=org.apache.kafka.common.serialization.StringSerializer +openlineage.transport.messageKey=some-value +``` + + + + +```java +import java.util.Properties; + +import io.openlineage.client.OpenLineageClient; +import io.openlineage.client.transports.KafkaConfig; +import io.openlineage.client.transports.KafkaTransport; + +Properties kafkaProperties = new Properties(); +kafkaProperties.setProperty("bootstrap.servers", "localhost:9092,another.host:9092"); +kafkaProperties.setProperty("acks", "all"); +kafkaProperties.setProperty("retries", "3"); +kafkaProperties.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer"); 
+kafkaProperties.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer"); + +KafkaConfig kafkaConfig = new KafkaConfig(); +KafkaConfig.setTopicName("openlineage.events"); +KafkaConfig.setProperties(kafkaProperties); +KafkaConfig.setLocalServerId("some-value"); + +OpenLineageClient client = OpenLineageClient.builder() + .transport( + new KafkaTransport(httpConfig)) + .build(); +``` + + + + +*Notes*: +It is recommended to provide `messageKey` if Job hierarchy is used. It can be any string, but it should be the same for all jobs in +hierarchy, like `Airflow task -> Spark application`. + +Default values are: +- `run:{parentJob.namespace}/{parentJob.name}/{parentRun.id}` - for RunEvent with parent facet +- `run:{job.namespace}/{job.name}/{run.id}` - for RunEvent +- `job:{job.namespace}/{job.name}` - for JobEvent +- `dataset:{dataset.namespace}/{dataset.name}` - for DatasetEvent + +### [Console](https://github.com/OpenLineage/OpenLineage/tree/main/client/java/src/main/java/io/openlineage/client/transports/ConsoleTransport.java) + +This straightforward transport emits OpenLineage events directly to the console through a logger. +No additional configuration is required. + +#### Behavior + +Events are serialized to JSON. Then each event is logged with `INFO` level to logger with name `ConsoleTransport`. + +#### Notes + +Be cautious when using the `DEBUG` log level, as it might result in double-logging due to the `OpenLineageClient` also logging. + +#### Configuration + +- `type` - string, must be `"console"`. Required. + +#### Examples + + + + +```yaml +transport: + type: console +``` + + + + +```ini +spark.openlineage.transport.type=console +``` + + + + +```ini +openlineage.transport.type=console +``` + + + + +```java +import java.util.Properties; + +import io.openlineage.client.OpenLineageClient; +import io.openlineage.client.transports.ConsoleTransport; + +OpenLineageClient client = OpenLineageClient.builder() + .transport( + new ConsoleTransport()) + .build(); +``` + + + + +### [File](https://github.com/OpenLineage/OpenLineage/tree/main/client/java/src/main/java/io/openlineage/client/transports/FileTransport.java) + +Designed mainly for integration testing, the `FileTransport` emits OpenLineage events to a given file. + +#### Configuration + +- `type` - string, must be `"file"`. Required. +- `location` - string specifying the path of the file. Required. + +#### Behavior + +- If the target file is absent, it's created. +- Events are serialized to JSON, and then appended to a file, separated by newlines. +- Intrinsic newline characters within the event JSON are eliminated to ensure one-line events. + +#### Notes for Yarn/Kubernetes + +This transport type is pretty useless on Spark/Flink applications deployed to Yarn or Kubernetes cluster: +- Each executor will write file to a local filesystem of Yarn container/K8s pod. So resulting file will be removed when such container/pod is destroyed. +- Kubernetes persistent volumes are not destroyed after pod removal. But all the executors will write to the same network disk in parallel, producing a broken file. 
+ +#### Examples + + + + +```yaml +transport: + type: file + location: /path/to/your/file +``` + + + + +```ini +spark.openlineage.transport.type=file +spark.openlineage.transport.location=/path/to/your/filext +``` + + + + +```ini +openlineage.transport.type=file +openlineage.transport.location=/path/to/your/file +``` + + + + +```java +import java.util.Properties; + +import io.openlineage.client.OpenLineageClient; +import io.openlineage.client.transports.FileConfig; +import io.openlineage.client.transports.FileTransport; + +FileConfig fileConfig = new FileConfig("/path/to/your/file"); + +OpenLineageClient client = OpenLineageClient.builder() + .transport( + new FileTransport(fileConfig)) + .build(); +``` + + + + +## [Composite](https://github.com/OpenLineage/OpenLineage/tree/main/client/java/src/main/java/io/openlineage/client/transports/CompositeTransport.java) + +The `CompositeTransport` is designed to combine multiple transports, allowing event emission to several destinations. This is useful when events need to be sent to multiple targets, such as a logging system and an API endpoint. The events are delivered sequentially - one after another in a defined order. + +#### Configuration + +- `type` - string, must be "composite". Required. +- `transports` - a list or a map of transport configurations. Required. +- `continueOnFailure` - boolean flag, determines if the process should continue even when one of the transports fails. Default is `true`. +- `withThreadPool` - boolean flag, determines if a thread pool for parallel event emission should be kept between event emissions. Default is `true`. + +#### Behavior + +- The configured transports will be initialized and used in sequence (sorted by transport name) to emit OpenLineage events. +- If `continueOnFailure` is set to `false`, a failure in one transport will stop the event emission process, and an exception will be raised. +- If `continueOnFailure` is `true`, the failure will be logged, but the remaining transports will still attempt to send the event. + +#### Notes for Multiple Transports +The composite transport can be used with any OpenLineage transport (e.g. `HttpTransport`, `KafkaTransport`, etc). +Ideal for scenarios where OpenLineage events need to reach multiple destinations for redundancy or different types of processing. + +The `transports` configuration can be provided in two formats: + +1. A list of transport configurations, where each transport may optionally include a `name` field. +2. A map of transport configurations, where the key acts as the name for each transport. +The map format is particularly useful for configurations set via environment variables or Java properties, providing a more convenient and flexible setup. + +##### Why are transport names used? +Transport names are not required for basic functionality. Their primary purpose is to enable configuration of composite transports via environment variables, which is only supported when names are defined. 
+ +#### Examples + + + + +```yaml +transport: + type: composite + continueOnFailure: true + transports: + - type: http + url: http://example.com/api + name: my_http + - type: kafka + topicName: openlineage.events + properties: + bootstrap.servers: localhost:9092,another.host:9092 + acks: all + retries: 3 + key.serializer: org.apache.kafka.common.serialization.StringSerializer + value.serializer: org.apache.kafka.common.serialization.StringSerializer + messageKey: some-value + continueOnFailure: true +``` + + + + +```yaml +transport: + type: composite + continueOnFailure: true + transports: + my_http: + type: http + url: http://example.com/api + name: my_http + my_kafka: + type: kafka + topicName: openlineage.events + properties: + bootstrap.servers: localhost:9092,another.host:9092 + acks: all + retries: 3 + key.serializer: org.apache.kafka.common.serialization.StringSerializer + value.serializer: org.apache.kafka.common.serialization.StringSerializer + messageKey: some-value + continueOnFailure: true +``` + + + + +```ini +spark.openlineage.transport.type=composite +spark.openlineage.transport.continueOnFailure=true +spark.openlineage.transport.transports.my_http.type=http +spark.openlineage.transport.transports.my_http.url=http://example.com/api +spark.openlineage.transport.transports.my_kafka.type=kafka +spark.openlineage.transport.transports.my_kafka.topicName=openlineage.events +spark.openlineage.transport.transports.my_kafka.properties.bootstrap.servers=localhost:9092,another.host:9092 +spark.openlineage.transport.transports.my_kafka.properties.acks=all +spark.openlineage.transport.transports.my_kafka.properties.retries=3 +spark.openlineage.transport.transports.my_kafka.properties.key.serializer=org.apache.kafka.common.serialization.StringSerializer +spark.openlineage.transport.transports.my_kafka.properties.value.serializer=org.apache.kafka.common.serialization.StringSerializer +``` + + + + +```ini +openlineage.transport.type=composite +openlineage.transport.continueOnFailure=true +openlineage.transport.transports.my_http.type=http +openlineage.transport.transports.my_http.url=http://example.com/api +openlineage.transport.transports.my_kafka.type=kafka +openlineage.transport.transports.my_kafka.topicName=openlineage.events +openlineage.transport.transports.my_kafka.properties.bootstrap.servers=localhost:9092,another.host:9092 +openlineage.transport.transports.my_kafka.properties.acks=all +openlineage.transport.transports.my_kafka.properties.retries=3 +openlineage.transport.transports.my_kafka.properties.key.serializer=org.apache.kafka.common.serialization.StringSerializer +openlineage.transport.transports.my_kafka.properties.value.serializer=org.apache.kafka.common.serialization.StringSerializer +``` + + + + +```java +import java.util.Arrays; +import io.openlineage.client.OpenLineageClient; +import io.openlineage.client.transports.CompositeConfig; +import io.openlineage.client.transports.HttpConfig; +import io.openlineage.client.transports.HttpTransport; +import io.openlineage.client.transports.KafkaConfig; +import io.openlineage.client.transports.KafkaTransport; + +HttpConfig httpConfig = new HttpConfig(); +httpConfig.setUrl("http://example.com/api"); +KafkaConfig kafkaConfig = new KafkaConfig(); +KafkaConfig.setTopicName("openlineage.events"); +KafkaConfig.setLocalServerId("some-value"); + +CompositeConfig compositeConfig = new CompositeConfig(Arrays.asList( + new HttpTransport(httpConfig), + new KafkaTransport(kafkaConfig) +), true); + +OpenLineageClient client = 
OpenLineageClient.builder() + .transport( + new CompositeTransport(compositeConfig)) + .build(); +``` + + + + +### [Transform](https://github.com/OpenLineage/OpenLineage/tree/main/client/java/src/main/java/io/openlineage/client/transports/transform/TransformTransport.java) + +The `TransformTransport` is designed to enable event manipulation before emitting the event. Together with `CompositeTransport`, it can be used to +send different events into multiple backends. + +#### Configuration + +- `type` - string, must be "transform". Required. +- `transformerClass` - class name of the event transformer. Class has to implement `io.openlineage.client.transports.transform.EventTransformer` interface and provide public no-arg constructor. Class needs to be available on the classpath. Required. +- `transformerProperties` - Extra properties that can be passed into `transformerClass` based on the configuration. Optional. +- `transport` - Transport configuration to emit modified events. Required. + +#### Behavior + +- The configured `transformerClass` will be used to alter events before the emission. +- Modified events will be passed into the configured `transport` for further processing. + +#### `EventTransformer` interface + +```java +public class CustomEventTransformer implements EventTransformer { + @Override + public void initialize(Map properties) { ... } + + @Override + public RunEvent transform(RunEvent event) { ... } + + @Override + public DatasetEvent transform(DatasetEvent event) { .. } + + @Override + public JobEvent transform(JobEvent event) { ... } +} +``` + +#### Examples + + + + +```yaml +transport: + type: transform + transformerClass: io.openlineage.CustomEventTransformer + transformerProperties: + key1: value1 + key2: value2 + transport: + type: http + url: http://example.com/api + name: my_http +``` + + + + +```ini +spark.openlineage.transport.type=transform +spark.openlineage.transport.transformerClass=io.openlineage.CustomEventTransformer +spark.openlineage.transport.transformerProperties.key1=value1 +spark.openlineage.transport.transformerProperties.key2=value2 +spark.openlineage.transport.transport.type=http +spark.openlineage.transport.transport.url=http://example.com/api +``` + + + + +```ini +openlineage.transport.type=transform +openlineage.transport.transformerClass=io.openlineage.CustomEventTransformer +openlineage.transport.transformerProperties.key1=value1 +openlineage.transport.transformerProperties.key2=value2 +openlineage.transport.transport.type=http +openlineage.transport.transport.url=http://example.com/api +``` + + + + +```java +import java.util.Arrays; +import io.openlineage.client.OpenLineageClient; +import io.openlineage.client.transports.TransformConfig; +import io.openlineage.client.transports.HttpConfig; +import io.openlineage.client.transports.HttpTransport; + +HttpConfig httpConfig = new HttpConfig(); +httpConfig.setUrl(URI.create("http://example.com/api")); + +TransformConfig transformConfig = new TransformConfig(); +transformConfig.setTransformerClass(CustomEventTransformer.class.getName()); +transformConfig.setTransport(httpConfig); + +OpenLineageClient client = OpenLineageClient + .builder() + .transport(new TransformTransport(transformConfig)) + .build(); +``` + + + + +### [GcpLineage](https://github.com/OpenLineage/OpenLineage/blob/main/client/transports-dataplex/src/main/java/io/openlineage/client/transports/gcplineage/GcpLineageTransport.java) + +To use this transport in your project, you need to include `io.openlineage:transports-gcplineage` 
artifact in +your build configuration. This is particularly important for environments like `Spark`, where this transport must be on +the classpath for lineage events to be emitted correctly. + +#### Configuration + +- `type` - string, must be `"gcplineage"`. Required. +- `endpoint` - string, specifies the endpoint to which events are sent, default value is + `datalineage.googleapis.com:443`. Optional. +- `projectId` - string, the project quota identifier. If not provided, it is determined based on user credentials. + Optional. +- `location` - string, [Dataplex location](https://cloud.google.com/dataplex/docs/locations). Optional, default: + `"us"`. +- `credentialsFile` - string, path + to + the [Service Account credentials JSON file](https://developers.google.com/workspace/guides/create-credentials#create_credentials_for_a_service_account). + Optional, if not + provided [Application Default Credentials](https://cloud.google.com/docs/authentication/application-default-credentials) + are used +- `mode` - enum that specifies the type of client used for publishing OpenLineage events to GCP Lineage service. Possible values: + `sync` (synchronous) or `async` (asynchronous). Optional, default: `async`. + +#### Behavior + +- Events are serialized to JSON, included as part of a `gRPC` request, and then dispatched to the `GCP Lineage service` endpoint. +- Depending on the `mode` chosen, requests are sent using either a synchronous or asynchronous client. + +#### Examples + + + + +```yaml +transport: + type: gcplineage + projectId: your_gcp_project_id + location: us + mode: sync + credentialsFile: path/to/credentials.json +``` + + + + +```ini +spark.openlineage.transport.type=gcplineage +spark.openlineage.transport.projectId=your_gcp_project_id +spark.openlineage.transport.location=us +spark.openlineage.transport.mode=sync +spark.openlineage.transport.credentialsFile=path/to/credentials.json +``` + + + + +```ini +openlineage.transport.type=gcplineage +openlineage.transport.projectId=your_gcp_project_id +openlineage.transport.location=us +openlineage.transport.mode=sync +openlineage.transport.credentialsFile=path/to/credentials.json +``` + + + + +```java +import io.openlineage.client.OpenLineageClient; +import io.openlineage.client.transports.gcplineage.GcpLineageTransportConfig; +import io.openlineage.client.transports.dataplex.GcpLineageTransport; + + +GcpLineageTransportConfig gcplineageConfig = new GcpLineageTransportConfig(); + +gcplineageConfig.setProjectId("your_gcp_project_id"); +gcplineageConfig.setLocation("your_gcp_location"); +gcplineageConfig.setMode("sync"); +gcplineageConfig.setCredentialsFile("path/to/credentials.json"); + +OpenLineageClient client = OpenLineageClient.builder() + .transport( + new GcpLineageTransport(gcplineageConfig)) + .build(); +``` + + + + +### [Google Cloud Storage](https://github.com/OpenLineage/OpenLineage/blob/main/client/java/transports-gcs/src/main/java/io/openlineage/client/transports/gcs/GcsTransport.java) + +To use this transport in your project, you need to include `io.openlineage:transports-gcs` artifact in +your build configuration. This is particularly important for environments like `Spark`, where this transport must be on +the classpath for lineage events to be emitted correctly. + +#### Configuration + +- `type` - string, must be `"gcs"`. Required. +- `projectId` - string, the project quota identifier. Required. 
+- `credentialsFile` - string, path + to the [Service Account credentials JSON file](https://developers.google.com/workspace/guides/create-credentials#create_credentials_for_a_service_account). + Optional, if not + provided [Application Default Credentials](https://cloud.google.com/docs/authentication/application-default-credentials) + are used +- `bucketName` - string, the GCS bucket name. Required +- `fileNamePrefix` - string, prefix for the event file names. Optional. + +#### Behavior + +- Events are serialized to JSON and stored in the specified GCS bucket. +- Each event file is named based on its `eventTime`, converted to epoch milliseconds, with an optional prefix if configured. +- Two constructors are available: one accepting both `Storage` and `GcsTransportConfig` and another solely accepting + `GcsTransportConfig`. + +#### Examples + + + + +```yaml +transport: + type: gcs + bucketName: my-gcs-bucket + fileNamePrefix: /file/name/prefix/ + credentialsFile: path/to/credentials.json +``` + + + + +```ini +spark.openlineage.transport.type=gcs +spark.openlineage.transport.bucketName=my-gcs-bucket +spark.openlineage.transport.credentialsFile=path/to/credentials.json +spark.openlineage.transport.credentialsFile=file/name/prefix/ +``` + + + + +```ini +openlineage.transport.type=gcs +openlineage.transport.bucketName=my-gcs-bucket +openlineage.transport.credentialsFile=path/to/credentials.json +openlineage.transport.credentialsFile=file/name/prefix/ +``` + + + + +```java +import io.openlineage.client.OpenLineageClient; +import io.openlineage.client.transports.gcs.GcsTransportConfig; +import io.openlineage.client.transports.dataplex.GcsTransport; + + +DataplexConfig gcsConfig = new GcsTransportConfig(); + +gcsConfig.setBucketName("my-bucket-name"); +gcsConfig.setFileNamePrefix("/file/name/prefix/"); +gcsConfig.setCredentialsFile("path/to/credentials.json"); + +OpenLineageClient client = OpenLineageClient.builder() + .transport( + new GcsTransport(dataplexConfig)) + .build(); +``` + + + + + +import S3Transport from './s3_transport.md'; + + \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/client/java/partials/s3_transport.md b/versioned_docs/version-1.26.0/client/java/partials/s3_transport.md new file mode 100644 index 0000000..99e6a34 --- /dev/null +++ b/versioned_docs/version-1.26.0/client/java/partials/s3_transport.md @@ -0,0 +1,102 @@ +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + + +### [S3](https://github.com/OpenLineage/OpenLineage/blob/main/client/transports-s3/src/main/java/io/openlineage/client/transports/s3/S3Transport.java) + +To use this transport in your project, you need to include the following dependency in your build configuration. This is +particularly important for environments like `Spark`, where this transport must be on the classpath for lineage events +to be emitted correctly. + +#### Maven + +```xml + + + io.openlineage + transports-s3 + {{PREPROCESSOR:OPENLINEAGE_VERSION}} + +``` + +#### Configuration + +- `type` - string, must be `"s3"`. Required. +- `endpoint` - string, the endpoint for S3 compliant service like MinIO, Ceph, etc. Optional +- `bucketName` - string, the S3 bucket name. Required +- `fileNamePrefix` - string, prefix for the event file names. It is separated from the timestamp with underscore. It can + include path and file name prefix. Optional. 
+ +##### Credentials + +To authenticate, the transport uses +the [default credentials provider chain](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials-chain.html). The possible authentication methods include: +- Java system properties +- Environment variables +- Shared credentials config file (by default `~/.aws/config`) +- EC2 instance credentials (convenient in EMR and Glue) +- and other + +Refer to the documentation for details. + +#### Behavior + +- Events are serialized to JSON and stored in the specified S3 bucket. +- Each event file is named based on its `eventTime`, converted to epoch milliseconds, with an optional prefix if + configured. + +#### Examples + + + + +```yaml +transport: + type: s3 + endpoint: https://my-minio.example.com + bucketName: events + fileNamePrefix: my/service/events/event +``` + + + + +```ini +spark.openlineage.transport.type=s3 +spark.openlineage.transport.endpoint=https://my-minio.example.com +spark.openlineage.transport.bucketName=events +spark.openlineage.transport.fileNamePrefix=my/service/events/event +``` + + + + +```ini +openlineage.transport.type=s3 +openlineage.transport.endpoint=https://my-minio.example.com +openlineage.transport.bucketName=events +openlineage.transport.fileNamePrefix=my/service/events/event +``` + + + + +```java +import io.openlineage.client.OpenLineageClient; +import io.openlineage.client.transports.s3.S3TransportConfig; +import io.openlineage.client.transports.s3.S3Transport; + + +S3TransportConfig s3Config = new S3TransportConfig(); + +s3Config.setEndpoint("https://my-minio.example.com"); +s3Config.setBucketName("events"); +s3Config.setFileNamePrefix("my/service/events/event"); + +OpenLineageClient client = OpenLineageClient.builder() + .transport(new S3Transport(s3Config)) + .build(); +``` + + + \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/client/java/usage.md b/versioned_docs/version-1.26.0/client/java/usage.md new file mode 100644 index 0000000..48da91d --- /dev/null +++ b/versioned_docs/version-1.26.0/client/java/usage.md @@ -0,0 +1,368 @@ +--- +sidebar_position: 2 +title: Usage Example +--- + +```java +// Use openlineage.yml +OpenLineageClient client = Clients.newClient(); + +// Define a simple OpenLineage START or COMPLETE event +OpenLineage.RunEvent startOrCompleteRun = ... + +// Emit OpenLineage event +client.emit(startOrCompleteRun); +``` + +### 1. Simple OpenLineage Client Test for Console Transport +First, let's explore how we can create OpenLineage client instance, but not using any actual transport to emit the data yet, except only to our `Console.` This would be a good exercise to run tests and check the data payloads. 
+ +```java + OpenLineageClient client = OpenLineageClient.builder() + .transport(new ConsoleTransport()).build(); +``` + +Also, we will then get a sample payload to produce a `RunEvent`: + +```java + // create one start event for testing + RunEvent event = buildEvent(EventType.START); +``` + +Lastly, we will emit this event using the client that we instantiated\: + +```java + // emit the event + client.emit(event); +``` + +Here is the full source code of the test client application: + +```java +package ol.test; + +import io.openlineage.client.OpenLineage; +import io.openlineage.client.OpenLineageClient; +import io.openlineage.client.OpenLineage.RunEvent; +import io.openlineage.client.OpenLineage.InputDataset; +import io.openlineage.client.OpenLineage.Job; +import io.openlineage.client.OpenLineage.JobFacets; +import io.openlineage.client.OpenLineage.OutputDataset; +import io.openlineage.client.OpenLineage.Run; +import io.openlineage.client.OpenLineage.RunFacets; +import io.openlineage.client.OpenLineage.RunEvent.EventType; +import io.openlineage.client.transports.ConsoleTransport; +import io.openlineage.client.utils.UUIDUtils; + +import java.net.URI; +import java.time.ZoneId; +import java.time.ZonedDateTime; +import java.util.Arrays; +import java.util.List; +import java.util.UUID; + +/** + * My first openlinage client code + */ +public class OpenLineageClientTest +{ + public static void main( String[] args ) + { + try { + OpenLineageClient client = OpenLineageClient.builder() + .transport(new ConsoleTransport()).build(); + + // create one start event for testing + RunEvent event = buildEvent(EventType.START); + + // emit the event + client.emit(event); + + } catch (Exception e) { + e.printStackTrace(); + } + + } + + // sample code to build event + public static RunEvent buildEvent(EventType eventType) { + ZonedDateTime now = ZonedDateTime.now(ZoneId.of("UTC")); + URI producer = URI.create("producer"); + OpenLineage ol = new OpenLineage(producer); + UUID runId = UUIDUtils.generateNewUUID(); + + // run facets + RunFacets runFacets = + ol.newRunFacetsBuilder() + .nominalTime( + ol.newNominalTimeRunFacetBuilder() + .nominalStartTime(now) + .nominalEndTime(now) + .build()) + .build(); + + // a run is composed of run id, and run facets + Run run = ol.newRunBuilder().runId(runId).facets(runFacets).build(); + + // job facets + JobFacets jobFacets = ol.newJobFacetsBuilder().build(); + + // job + String name = "jobName"; + String namespace = "namespace"; + Job job = ol.newJobBuilder().namespace(namespace).name(name).facets(jobFacets).build(); + + // input dataset + List inputs = + Arrays.asList( + ol.newInputDatasetBuilder() + .namespace("ins") + .name("input") + .facets( + ol.newDatasetFacetsBuilder() + .version(ol.newDatasetVersionDatasetFacet("input-version")) + .build()) + .inputFacets( + ol.newInputDatasetInputFacetsBuilder() + .dataQualityMetrics( + ol.newDataQualityMetricsInputDatasetFacetBuilder() + .rowCount(10L) + .bytes(20L) + .columnMetrics( + ol.newDataQualityMetricsInputDatasetFacetColumnMetricsBuilder() + .put( + "mycol", + ol.newDataQualityMetricsInputDatasetFacetColumnMetricsAdditionalBuilder() + .count(10D) + .distinctCount(10L) + .max(30D) + .min(5D) + .nullCount(1L) + .sum(3000D) + .quantiles( + ol.newDataQualityMetricsInputDatasetFacetColumnMetricsAdditionalQuantilesBuilder() + .put("25", 52D) + .build()) + .build()) + .build()) + .build()) + .build()) + .build()); + // output dataset + List outputs = + Arrays.asList( + ol.newOutputDatasetBuilder() + .namespace("ons") + 
.name("output") + .facets( + ol.newDatasetFacetsBuilder() + .version(ol.newDatasetVersionDatasetFacet("output-version")) + .build()) + .outputFacets( + ol.newOutputDatasetOutputFacetsBuilder() + .outputStatistics(ol.newOutputStatisticsOutputDatasetFacet(10L, 20L)) + .build()) + .build()); + + // run state update which encapsulates all - with START event in this case + RunEvent runStateUpdate = + ol.newRunEventBuilder() + .eventType(OpenLineage.RunEvent.EventType.START) + .eventTime(now) + .run(run) + .job(job) + .inputs(inputs) + .outputs(outputs) + .build(); + + return runStateUpdate; + } +} +``` + +The result of running this will result in the following output from your Java application: + +``` +[main] INFO io.openlineage.client.transports.ConsoleTransport - {"eventType":"START","eventTime":"2022-08-05T15:11:24.858414Z","run":{"runId":"bb46bbc4-fb1a-495a-ad3b-8d837f566749","facets":{"nominalTime":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/NominalTimeRunFacet.json#/$defs/NominalTimeRunFacet","nominalStartTime":"2022-08-05T15:11:24.858414Z","nominalEndTime":"2022-08-05T15:11:24.858414Z"}}},"job":{"namespace":"namespace","name":"jobName","facets":{}},"inputs":[{"namespace":"ins","name":"input","facets":{"version":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json#/$defs/DatasetVersionDatasetFacet","datasetVersion":"input-version"}},"inputFacets":{"dataQualityMetrics":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DataQualityMetricsInputDatasetFacet.json#/$defs/DataQualityMetricsInputDatasetFacet","rowCount":10,"bytes":20,"columnMetrics":{"mycol":{"nullCount":1,"distinctCount":10,"sum":3000.0,"count":10.0,"min":5.0,"max":30.0,"quantiles":{"25":52.0}}}}}}],"outputs":[{"namespace":"ons","name":"output","facets":{"version":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json#/$defs/DatasetVersionDatasetFacet","datasetVersion":"output-version"}},"outputFacets":{"outputStatistics":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet","rowCount":10,"size":20}}}],"producer":"producer","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"} +``` + +### 2. Simple OpenLineage Client Test for Http Transport + +Now, using the same code base, we will change how the client application works by switching the Console transport into `Http Transport` as shown below. This code will now be able to send the OpenLineage events into a compatible backends such as [Marquez](https://marquezproject.ai/). + +Before making this change and running it, make sure you have an instance of Marquez running on your local environment. Setting up and running Marquez can be found [here](https://marquezproject.github.io/marquez/quickstart.html). + +```java +OpenLineageClient client = OpenLineageClient.builder() + .transport( + HttpTransport.builder() + .uri("http://localhost:5000") + .build()) + .build(); +``` +If we ran the same application, you will now see the event data not emitted in the output console, but rather via the HTTP transport to the marquez backend that was running. + +![the Marquez graph](mqz_job_running.png) + +Notice that the Status of this job run will be in `RUNNING` state, as it will be in that state until it receives an `end` event that will close off its gaps. 
That is how the OpenLineage events would work. + +Now, let's change the previous example to have lineage event doing a complete cycle of `START` -> `COMPLETE`: + +```java +package ol.test; + +import io.openlineage.client.OpenLineage; +import io.openlineage.client.OpenLineageClient; +import io.openlineage.client.OpenLineage.RunEvent; +import io.openlineage.client.OpenLineage.InputDataset; +import io.openlineage.client.OpenLineage.Job; +import io.openlineage.client.OpenLineage.JobFacets; +import io.openlineage.client.OpenLineage.OutputDataset; +import io.openlineage.client.OpenLineage.Run; +import io.openlineage.client.OpenLineage.RunFacets; +import io.openlineage.client.OpenLineage.RunEvent.EventType; +import io.openlineage.client.transports.HttpTransport; +import io.openlineage.client.utils.UUIDUtils; + +import java.net.URI; +import java.time.ZoneId; +import java.time.ZonedDateTime; +import java.util.Arrays; +import java.util.List; +import java.util.UUID; + +/** + * My first openlinage client code + */ +public class OpenLineageClientTest +{ + public static void main( String[] args ) + { + try { + + OpenLineageClient client = OpenLineageClient.builder() + .transport( + HttpTransport.builder() + .uri("http://localhost:5000") + .build()) + .build(); + + // create one start event for testing + RunEvent event = buildEvent(EventType.START, null); + + // emit the event + client.emit(event); + + // another event to COMPLETE the run + event = buildEvent(EventType.COMPLETE, event.getRun().getRunId()); + + // emit the second COMPLETE event + client.emit(event); + + } catch (Exception e) { + e.printStackTrace(); + } + } + + // sample code to build event + public static RunEvent buildEvent(EventType eventType, UUID runId) { + ZonedDateTime now = ZonedDateTime.now(ZoneId.of("UTC")); + URI producer = URI.create("producer"); + OpenLineage ol = new OpenLineage(producer); + + if (runId == null) { + runId = UUIDUtils.generateNewUUID(); + } + + // run facets + RunFacets runFacets = + ol.newRunFacetsBuilder() + .nominalTime( + ol.newNominalTimeRunFacetBuilder() + .nominalStartTime(now) + .nominalEndTime(now) + .build()) + .build(); + + // a run is composed of run id, and run facets + Run run = ol.newRunBuilder().runId(runId).facets(runFacets).build(); + + // job facets + JobFacets jobFacets = ol.newJobFacetsBuilder().build(); + + // job + String name = "jobName"; + String namespace = "namespace"; + Job job = ol.newJobBuilder().namespace(namespace).name(name).facets(jobFacets).build(); + + // input dataset + List inputs = + Arrays.asList( + ol.newInputDatasetBuilder() + .namespace("ins") + .name("input") + .facets( + ol.newDatasetFacetsBuilder() + .version(ol.newDatasetVersionDatasetFacet("input-version")) + .build()) + .inputFacets( + ol.newInputDatasetInputFacetsBuilder() + .dataQualityMetrics( + ol.newDataQualityMetricsInputDatasetFacetBuilder() + .rowCount(10L) + .bytes(20L) + .columnMetrics( + ol.newDataQualityMetricsInputDatasetFacetColumnMetricsBuilder() + .put( + "mycol", + ol.newDataQualityMetricsInputDatasetFacetColumnMetricsAdditionalBuilder() + .count(10D) + .distinctCount(10L) + .max(30D) + .min(5D) + .nullCount(1L) + .sum(3000D) + .quantiles( + ol.newDataQualityMetricsInputDatasetFacetColumnMetricsAdditionalQuantilesBuilder() + .put("25", 52D) + .build()) + .build()) + .build()) + .build()) + .build()) + .build()); + // output dataset + List outputs = + Arrays.asList( + ol.newOutputDatasetBuilder() + .namespace("ons") + .name("output") + .facets( + ol.newDatasetFacetsBuilder() + 
.version(ol.newDatasetVersionDatasetFacet("output-version")) + .build()) + .outputFacets( + ol.newOutputDatasetOutputFacetsBuilder() + .outputStatistics(ol.newOutputStatisticsOutputDatasetFacet(10L, 20L)) + .build()) + .build()); + + // run state update which encapsulates all - with START event in this case + RunEvent runStateUpdate = + ol.newRunEventBuilder() + .eventType(eventType) + .eventTime(now) + .run(run) + .job(job) + .inputs(inputs) + .outputs(outputs) + .build(); + + return runStateUpdate; + } +} +``` +Now, when you run this application, the Marquez would have an output that would looke like this: + +![the Marquez graph](mqz_job_complete.png) + diff --git a/versioned_docs/version-1.26.0/client/mqz_graph.png b/versioned_docs/version-1.26.0/client/mqz_graph.png new file mode 100644 index 0000000..0336268 Binary files /dev/null and b/versioned_docs/version-1.26.0/client/mqz_graph.png differ diff --git a/versioned_docs/version-1.26.0/client/mqz_graph_example.png b/versioned_docs/version-1.26.0/client/mqz_graph_example.png new file mode 100644 index 0000000..571e29f Binary files /dev/null and b/versioned_docs/version-1.26.0/client/mqz_graph_example.png differ diff --git a/versioned_docs/version-1.26.0/client/mqz_jobs.png b/versioned_docs/version-1.26.0/client/mqz_jobs.png new file mode 100644 index 0000000..444ee6b Binary files /dev/null and b/versioned_docs/version-1.26.0/client/mqz_jobs.png differ diff --git a/versioned_docs/version-1.26.0/client/python.md b/versioned_docs/version-1.26.0/client/python.md new file mode 100644 index 0000000..1f80060 --- /dev/null +++ b/versioned_docs/version-1.26.0/client/python.md @@ -0,0 +1,953 @@ +--- +sidebar_position: 5 +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Python + +## Overview + +The Python client is the basis of existing OpenLineage integrations such as Airflow and dbt. + +The client enables the creation of lineage metadata events with Python code. +The core data structures currently offered by the client are the `RunEvent`, `RunState`, `Run`, `Job`, `Dataset`, +and `Transport` classes. These either configure or collect data for the emission of lineage events. + +You can use the client to create your own custom integrations. + +## Installation + +Download the package using `pip` with +```bash +pip install openlineage-python +``` + +To install the package from source, use +```bash +python -m pip install . +``` + +## Configuration + +We recommend configuring the client with an `openlineage.yml` file that contains all the +details of how to connect to your OpenLineage backend. + +You can make this file available to the client in three ways (the list also presents precedence of the configuration): + +1. Set an `OPENLINEAGE_CONFIG` environment variable to a file path: `OPENLINEAGE_CONFIG=path/to/openlineage.yml`. +2. Place an `openlineage.yml` file in the current working directory (the absolute path of the directory where your script or process is currently running). +3. Place an `openlineage.yml` file under `.openlineage/` in the user's home directory (`~/.openlineage/openlineage.yml`). + +In `openlineage.yml`, use a standard `Transport` interface to specify the transport type +(`http`, `console`, `kafka`, `file`, or [custom](#custom-transport-type)) and authorization parameters. +See the [example config file](#built-in-transport-types) for each transport type. 
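For example, a minimal `openlineage.yml` that sends events to an HTTP backend running locally could look like the following (the URL here is only an illustration):

```yaml
transport:
  type: http
  url: http://localhost:5000
```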
+ +If there is no config file found, the OpenLineage client looks at environment variables for [HTTP transport](#http-transport-configuration-with-environment-variables). + +At the end, if no configuration is found, ``ConsoleTransport`` is used, the events are printed in the console. + +### Environment Variables + +The following environment variables are available to use: + +| Name | Description | Example | Since | +|----------------------------|-------------------------------------------------------------------|-------------------------|--------| +| OPENLINEAGE_CONFIG | The path to the YAML configuration file | path/to/openlineage.yml | | +| OPENLINEAGE_CLIENT_LOGGING | Logging level of OpenLineage client and its child modules | DEBUG | | +| OPENLINEAGE_DISABLED | When `true`, OpenLineage will not emit events (default: false) | false | 0.9.0 | +| OPENLINEAGE_URL | The URL to send lineage events to (also see OPENLINEAGE_ENDPOINT) | https://myapp.com | | +| OPENLINEAGE_ENDPOINT | Endpoint to which events are sent (default: api/v1/lineage) | api/v2/events | | +| OPENLINEAGE_API_KEY | Token included in the Authentication HTTP header as the Bearer | secret_token_123 | | + +If you are using Airflow integration, there are additional [environment variables available](../integrations/airflow/usage.md#environment-variables). + +#### Dynamic configuration with environment variables + +You can also configure the client with dynamic environment variables. +Environment variables that configure the OpenLineage client follow a specific pattern. All variables that affect the client configuration start with the prefix `OPENLINEAGE__`, followed by nested keys separated by double underscores (`__`). + +##### Key Features + +1. Prefix Requirement: All environment variables must begin with `OPENLINEAGE__`. +2. Sections Separation: Configuration sections are separated using double underscores `__` to form the hierarchy. +3. Lowercase Conversion: Environment variable values are automatically converted to lowercase. +4. JSON String Support: You can pass a JSON string at any level of the configuration hierarchy, which will be merged into the final configuration structure. +5. Hyphen Restriction: Since environment variable names cannot contain `-` (hyphen), if a name strictly requires a hyphen, use a JSON string as the value of the environment variable. +6. Precedence Rules: +* Top-level keys have precedence and will not be overwritten by more nested entries. +* For example, `OPENLINEAGE__TRANSPORT='{..}'` will not have its keys overwritten by `OPENLINEAGE__TRANSPORT__AUTH__KEY='key'`. + +##### Dynamic Alias for Transport Variables + +To facilitate easier management of environment variables, aliases are dynamically created for certain variables like `OPENLINEAGE_URL`. If `OPENLINEAGE_URL` is set, it automatically translates into specific transport configurations +that can be used with Composite transport with `default_http` as the name of the HTTP transport. 
+ +Alias rules are following: +* If environment variable `OPENLINEAGE_URL`="http://example.com" is set, it would insert following environment variables: +```sh +OPENLINEAGE__TRANSPORT__TRANSPORTS__DEFAULT_HTTP__TYPE="http" +OPENLINEAGE__TRANSPORT__TRANSPORTS__DEFAULT_HTTP__URL="http://example.com" +``` +* Similarly if environment variable `OPENLINEAGE_API_KEY`="random_key" is set, it will be translated to: +```sh +OPENLINEAGE__TRANSPORT__TRANSPORTS__DEFAULT_HTTP__AUTH='{"type": "api_key", "apiKey": "random_key"}' +``` +qually with environment variable `OPENLINEAGE_ENDPOINT`="api/v1/lineage", that translates to: +```sh +OPENLINEAGE__TRANSPORT__TRANSPORTS__DEFAULT_HTTP__ENDPOINT="api/v1/lineage" +``` +* If one does not want to use aliased HTTP transport in Composite Transport, they can set `OPENLINEAGE__TRANSPORT__TRANSPORTS__DEFAULT_HTTP` to `{}`. + + +#### Examples + + + + +Setting following environment variables: + +```sh +OPENLINEAGE__TRANSPORT__TYPE=http +OPENLINEAGE__TRANSPORT__URL=http://localhost:5050 +OPENLINEAGE__TRANSPORT__ENDPOINT=/api/v1/lineage +OPENLINEAGE__TRANSPORT__AUTH='{"type":"api_key", "apiKey":"random_token"}' +OPENLINEAGE__TRANSPORT__COMPRESSION=gzip +``` + +is equivalent to passing following YAML configuration: +```yaml +transport: + type: http + url: http://localhost:5050 + endpoint: api/v1/lineage + auth: + type: api_key + apiKey: random_token + compression: gzip +``` + + + + +Setting following environment variables: + +```sh +OPENLINEAGE__TRANSPORT__TYPE=composite +OPENLINEAGE__TRANSPORT__TRANSPORTS__FIRST__TYPE=http +OPENLINEAGE__TRANSPORT__TRANSPORTS__FIRST__URL=http://localhost:5050 +OPENLINEAGE__TRANSPORT__TRANSPORTS__FIRST__ENDPOINT=/api/v1/lineage +OPENLINEAGE__TRANSPORT__TRANSPORTS__FIRST__AUTH='{"type":"api_key", "apiKey":"random_token"}' +OPENLINEAGE__TRANSPORT__TRANSPORTS__FIRST__COMPRESSION=gzip +OPENLINEAGE__TRANSPORT__TRANSPORTS__SECOND__TYPE=console +``` + +is equivalent to passing following YAML configuration: +```yaml +transport: + type: composite + transports: + first: + type: http + url: http://localhost:5050 + endpoint: api/v1/lineage + auth: + type: api_key + apiKey: random_token + compression: gzip + second: + type: console +``` + + + + +Setting following environment variables: + +```sh +OPENLINEAGE__TRANSPORT='{"type":"console"}' +OPENLINEAGE__TRANSPORT__TYPE=http +``` + +is equivalent to passing following YAML configuration: +```yaml +transport: + type: console +``` + + + + +Setting following environment variables: + +```sh +OPENLINEAGE__TRANSPORT__TYPE=kafka +OPENLINEAGE__TRANSPORT__TOPIC=my_topic +OPENLINEAGE__TRANSPORT__CONFIG='{"bootstrap.servers": "localhost:9092,another.host:9092", "acks": "all", "retries": 3}' +OPENLINEAGE__TRANSPORT__FLUSH=true +OPENLINEAGE__TRANSPORT__MESSAGE_KEY=some-value +``` + +is equivalent to passing following YAML configuration: +```yaml +transport: + type: kafka + topic: my_topic + config: + bootstrap.servers: localhost:9092,another.host:9092 + acks: all + retries: 3 + flush: true + message_key: some-value # this has been aliased to messageKey +``` + + + + +#### HTTP transport configuration with environment variables + +For backwards compatibility, the simplest HTTP transport configuration, with only a subset of its config, can be done with environment variables +(all other transport types are only configurable with YAML file). 
This setup can be done with the following +environment variables: + +- `OPENLINEAGE_URL` (required) +- `OPENLINEAGE_ENDPOINT` (optional, default: `api/v1/lineage`) +- `OPENLINEAGE_API_KEY` (optional). + +## Built-in Transport Types + +### HTTP + +Allows sending events to HTTP endpoint, using [requests](https://requests.readthedocs.io/). + +#### Configuration + +- `type` - string, must be `"http"`. Required. +- `url` - string, base url for HTTP requests. Required. +- `endpoint` - string specifying the endpoint to which events are sent, appended to `url`. Optional, default: `api/v1/lineage`. +- `timeout` - float specifying timeout (in seconds) value used while connecting to server. Optional, default: `5`. +- `verify` - boolean specifying whether the client should verify TLS certificates from the backend. Optional, default: `true`. +- `auth` - dictionary specifying authentication options. Optional, by default no authorization is used. If set, requires the `type` property. + - `type` - string specifying the "api_key" or the fully qualified class name of your TokenProvider. Required if `auth` is provided. + - `apiKey` - string setting the Authentication HTTP header as the Bearer. Required if `type` is `api_key`. +- `compression` - string, name of algorithm used by HTTP client to compress request body. Optional, default value `null`, allowed values: `gzip`. Added in v1.13.0. +- `custom_headers` - dictionary of additional headers to be sent with each request. Optional, default: `{}`. + +#### Behavior + +Events are serialized to JSON, and then are send as HTTP POST request with `Content-Type: application/json`. + +#### Examples + + + + +```yaml +transport: + type: http + url: https://backend:5000 + endpoint: api/v1/lineage + timeout: 5 + verify: false + auth: + type: api_key + apiKey: f048521b-dfe8-47cd-9c65-0cb07d57591e + compression: gzip +``` + + + + +```python +from openlineage.client import OpenLineageClient +from openlineage.client.transport.http import ApiKeyTokenProvider, HttpConfig, HttpCompression, HttpTransport + +http_config = HttpConfig( + url="https://backend:5000", + endpoint="api/v1/lineage", + timeout=5, + verify=False, + auth=ApiKeyTokenProvider({"apiKey": "f048521b-dfe8-47cd-9c65-0cb07d57591e"}), + compression=HttpCompression.GZIP, +) + +client = OpenLineageClient(transport=HttpTransport(http_config)) +``` + + + + +### Console + +This straightforward transport emits OpenLineage events directly to the console through a logger. +No additional configuration is required. + +#### Configuration + +- `type` - string, must be `"console"`. Required. + +#### Behavior + +Events are serialized to JSON. Then each event is logged with `INFO` level to logger with name `openlineage.client.transport.console`. + +#### Notes + +Be cautious when using the `DEBUG` log level, as it might result in double-logging due to the `OpenLineageClient` also logging. + +#### Examples + + + + +```yaml +transport: + type: console +``` + + + + +```python +from openlineage.client import OpenLineageClient +from openlineage.client.transport.console import ConsoleConfig, ConsoleTransport + +console_config = ConsoleConfig() +client = OpenLineageClient(transport=ConsoleTransport(console_config)) +``` + + + + +### Kafka + +Kafka transport requires `confluent-kafka` package to be additionally installed. +It can be installed also by specifying kafka client extension: `pip install openlineage-python[kafka]` + +#### Configuration + +- `type` - string, must be `"kafka"`. Required. 
+- `topic` - string specifying the topic on what events will be sent. Required. +- `config` - a dictionary containing a Kafka producer config as in [Kafka producer config](https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#kafka-client-configuration). Required. +- `flush` - boolean specifying whether Kafka should flush after each event. Optional, default: `true`. +- `messageKey` - string, key for all Kafka messages produced by transport. Optional, default value described below. Added in v1.13.0. + + Default values for `messageKey` are: + - `run:{parentJob.namespace}/{parentJob.name}` - for RunEvent with parent facet + - `run:{job.namespace}/{job.name}` - for RunEvent + - `job:{job.namespace}/{job.name}` - for JobEvent + - `dataset:{dataset.namespace}/{dataset.name}` - for DatasetEvent + +#### Behavior + +- Events are serialized to JSON, and then dispatched to the Kafka topic. +- If `flush` is `true`, messages will be flushed to the topic after each event being sent. + +#### Notes + +It is recommended to provide `messageKey` if Job hierarchy is used. It can be any string, but it should be the same for all jobs in +hierarchy, like `Airflow task -> Spark application -> Spark task runs`. + +#### Using with Airflow integration + +There's a caveat for using `KafkaTransport` with Airflow integration. In this integration, a Kafka producer needs to be created +for each OpenLineage event. +It happens due to the Airflow execution and plugin model, which requires us to send messages from worker processes. +These are created dynamically for each task execution. + +#### Examples + + + + +```yaml +transport: + type: kafka + topic: my_topic + config: + bootstrap.servers: localhost:9092,another.host:9092 + acks: all + retries: 3 + flush: true + messageKey: some-value +``` + + + + +```python +from openlineage.client import OpenLineageClient +from openlineage.client.transport.kafka import KafkaConfig, KafkaTransport + +kafka_config = KafkaConfig( + topic="my_topic", + config={ + "bootstrap.servers": "localhost:9092,another.host:9092", + "acks": "all", + "retries": "3", + }, + flush=True, + messageKey="some", +) + +client = OpenLineageClient(transport=KafkaTransport(kafka_config)) +``` + + + + +### File + +Designed mainly for integration testing, the `FileTransport` emits OpenLineage events to a given file(s). + +#### Configuration + +- `type` - string, must be `"file"`. Required. +- `log_file_path` - string specifying the path of the file or file prefix (when `append` is true). Required. +- `append` - boolean, see *Behavior* section below. Optional, default: `false`. + +#### Behavior + +- If the target file is absent, it's created. +- If `append` is `true`, each event will be appended to a single file `log_file_path`, separated by newlines. +- If `append` is `false`, each event will be written to as separated file with name `{log_file_path}-{datetime}`. + +#### Examples + + + + +```yaml +transport: + type: file + log_file_path: /path/to/your/file + append: false +``` + + + + +```python +from openlineage.client import OpenLineageClient +from openlineage.client.transport.file import FileConfig, FileTransport + +file_config = FileConfig( + log_file_path="/path/to/your/file", + append=False, +) + +client = OpenLineageClient(transport=FileTransport(file_config)) +``` + + + + +### Composite + +The `CompositeTransport` is designed to combine multiple transports, allowing event emission to several destinations. 
This is useful when events need to be sent to multiple targets, such as a logging system and an API endpoint. The events are delivered sequentially - one after another in a defined order. + +#### Configuration + +- `type` - string, must be "composite". Required. +- `transports` - a list or a map of transport configurations. Required. +- `continue_on_failure` - boolean flag, determines if the process should continue even when one of the transports fails. Default is `false`. + +#### Behavior + +- The configured transports will be initialized and used in sequence to emit OpenLineage events. +- If `continue_on_failure` is set to `false`, a failure in one transport will stop the event emission process, and an exception will be raised. +- If `continue_on_failure` is `true`, the failure will be logged, but the remaining transports will still attempt to send the event. + +#### Notes for Multiple Transports +The composite transport can be used with any OpenLineage transport (e.g. `HttpTransport`, `KafkaTransport`, etc). + +The `transports` configuration can be provided in two formats: + +1. A list of transport configurations, where each transport may optionally include a `name` field. +2. A map of transport configurations, where the key acts as the name for each transport. +The map format is particularly useful for configurations set via environment variables. + +##### Why are transport names used? +Transport names are not required for basic functionality. Their primary purpose is to enable configuration of composite transports via environment variables, which is only supported when names are defined. + +#### Examples + + + + +```yaml +transport: + type: composite + continueOnFailure: true + transports: + - type: http + url: http://example.com/api + name: my_http + - type: http + url: http://localhost:5000 + endpoint: /api/v1/lineage +``` + + + + +```yaml +transport: + type: composite + continueOnFailure: true + transports: + my_http: + type: http + url: http://example.com/api + local_http: + type: http + url: http://localhost:5000 + endpoint: /api/v1/lineage +``` + + + + +```python +from openlineage.client import OpenLineageClient +from openlineage.client.transport.composite import CompositeTransport, CompositeConfig + +config = CompositeConfig.from_dict( + { + "type": "composite", + "transports": [ + { + "type": "kafka", + "config": {"bootstrap.servers": "localhost:9092"}, + "topic": "random-topic", + "messageKey": "key", + "flush": False, + }, + {"type": "console"}, + ], + }, + ) +client = OpenLineageClient(transport=CompositeTransport(config)) +``` + + + +### Custom Transport Type + +To implement a custom transport, follow the instructions in [`transport.py`](https://github.com/OpenLineage/OpenLineage/blob/main/client/python/openlineage/client/transport/transport.py). + +The `type` property (required) must be a fully qualified class name that can be imported. + +## Environment Variables Run Facet + +To include specific environment variables in OpenLineage events, the `OpenLineageClient` can add them as a facet called `EnvironmentVariablesRunFacet`. This feature allows you to specify which environment variables should be collected and attached to each emitted event. + +To enable this, configure the `environment_variables` option within the `facets` section of your `OpenLineageClient` configuration. 
+ + + + +```yaml +facets: + environment_variables: + - VAR1 + - VAR2 +``` + + + + +```sh +OPENLINEAGE__FACETS__ENVIRONMENT_VARIABLES='["VAR1", "VAR2"]' +``` + + + + +## Getting Started + +To try out the client, follow the steps below to install and explore OpenLineage, Marquez (the reference implementation of OpenLineage), and the client itself. Then, the instructions will show you how to use these tools to add a run event and datasets to an existing namespace. + +### Prerequisites +- Docker 17.05+ +- Docker Compose 1.29.1+ +- Git (preinstalled on most versions of MacOS; verify your version with `git version`) +- 4 GB of available memory (the minimum for Docker — more is strongly recommended) + +### Install OpenLineage and Marquez + +Clone the Marquez Github repository: +```bash +git clone https://github.com/MarquezProject/marquez.git +``` + +### Install the Python client +```bash +pip install openlineage-python +``` + +### Start Docker and Marquez +Start Docker Desktop +Run Marquez with preloaded data: +```bash +cd marquez +./docker/up.sh --seed +``` + +Marquez should be up and running at `http://localhost:3000`. + +Take a moment to explore Marquez to get a sense of how metadata is displayed in the UI. Namespaces – the global contexts for runs and datasets – can be found in the top right corner, and icons for jobs and runs can be found in a tray along the left side. + +Next, configure OpenLineage and add a script to your project that will generate a new job and new datasets within an existing namespace (here we’re using the `food_delivery` namespace that got passed to Marquez with the `–seed` argument we used earlier). + +Create a directory for your script: +```bash +.. +mkdir python_scripts && cd python_scripts +``` + +In the python_scripts directory, create a Python script (we used the name `generate_events.py` for ours) and an `openlineage.yml` file. + +In `openlineage.yml`, define a transport type and URL to tell OpenLineage where and how to send metadata: + +```yaml +transport: + type: http + url: http://localhost:5000 +``` + +In `generate_events.py`, import the Python client and the methods needed to create a job and datasets. Also required (to create a run): the `datetime` and `uuid` packages: + +```python +from openlineage.client import OpenLineageClient +from openlineage.client.event_v2 import ( + Dataset, + InputDataset, + Job, + OutputDataset, + Run, + RunEvent, + RunState, +) +from openlineage.client.uuid import generate_new_uuid +from datetime import datetime +``` + +Then, in the same file, initialize the Python client: +```python +client = OpenLineageClient.from_environment() +``` + +It is also possible to specify parameters such as URL for client to connect to, without using environment variables or `openlineage.yaml` file, by directly setting it up when instantiating OpenLineageClient: + +```python +client = OpenLineageClient(url="http://localhost:5000") +``` + +> For more details about options to setup OpenLineageClient such as API tokens or HTTP transport settings, please refer to the following [example](https://github.com/OpenLineage/OpenLineage/blob/main/client/python/tests/test_http.py) + + +Specify the producer of the new lineage metadata with a string: +```python +producer = "OpenLineage.io/website/blog" +``` + +Now you can create some basic dataset objects. 
These require a namespace and name: +```python +inventory = Dataset(namespace="food_delivery", name="public.inventory") +menus = Dataset(namespace="food_delivery", name="public.menus_1") +orders = Dataset(namespace="food_delivery", name="public.orders_1") +``` + +You can also create a job object (we’ve borrowed this one from the existing `food_delivery` namespace): +```python +job = Job(namespace="food_delivery", name="example.order_data") +``` + +To create a run object you’ll need to specify a unique ID: +```python +run = Run(runId=str(generate_new_uuid())) +``` + +a START run event: +```python +client.emit( + RunEvent( + eventType=RunState.START, + eventTime=datetime.now().isoformat(), + run=run, + job=job, + producer=producer, + ) +) +``` + +and, finally, a COMPLETE run event: +```python +client.emit( + RunEvent( + eventType=RunState.COMPLETE, + eventTime=datetime.now().isoformat(), + run=run, job=job, producer=producer, + inputs=[inventory], + outputs=[menus, orders], + ) +) +``` + +Now you have a complete script for creating datasets and a run event! Execute it in the terminal to send the metadata to Marquez: +```bash +python3 generate_scripts.py +``` + +Marquez will update itself automatically, so the new job and datasets should now be visible in the UI. Clicking on the jobs icon (the icon with the three interlocking gears), will make the `example.order_data` job appear in the list of jobs: + +![the Marquez jobs list](./mqz_jobs.png) + +When you click on the job, you will see a new map displaying the job, input and outputs we created with our script: + +![the Marquez graph](./mqz_graph.png) + +## Full Example Source Code + +```python +#!/usr/bin/env python3 +from datetime import datetime, timedelta, timezone +from random import random + +from openlineage.client.client import OpenLineageClient, OpenLineageClientOptions +from openlineage.client.event_v2 import ( + Dataset, + InputDataset, + Job, + OutputDataset, + Run, + RunEvent, + RunState, +) +from openlineage.client.facet_v2 import ( + nominal_time_run, + schema_dataset, + source_code_location_job, + sql_job, +) +from openlineage.client.uuid import generate_new_uuid + +PRODUCER = "https://github.com/openlineage-user" +namespace = "python_client" +dag_name = "user_trends" + +# update to your host +url = "http://mymarquez.host:5000" +api_key = "1234567890ckcu028rzu5l" + +client = OpenLineageClient( + url=url, + # optional api key in case marquez requires it. When running marquez in + # your local environment, you usually do not need this. 
+ options=OpenLineageClientOptions(api_key=api_key), +) + +# If you want to log to a file instead of Marquez +# from openlineage.client import OpenLineageClient +# from openlineage.client.transport.file import FileConfig, FileTransport +# +# file_config = FileConfig( +# log_file_path="ol.json", +# append=True, +# ) +# +# client = OpenLineageClient(transport=FileTransport(file_config)) + + +# generates job facet +def job(job_name, sql, location): + facets = {"sql": sql_job.SQLJobFacet(query=sql)} + if location != None: + facets.update( + { + "sourceCodeLocation": source_code_location_job.SourceCodeLocationJobFacet( + "git", location + ) + } + ) + return Job(namespace=namespace, name=job_name, facets=facets) + + +# geneartes run racet +def run(run_id, hour): + return Run( + runId=run_id, + facets={ + "nominalTime": nominal_time_run.NominalTimeRunFacet( + nominalStartTime=f"2022-04-14T{twoDigits(hour)}:12:00Z", + # nominalEndTime=None + ) + }, + ) + + +# generates dataset +def dataset(name, schema=None, ns=namespace): + if schema == None: + facets = {} + else: + facets = {"schema": schema} + return Dataset(namespace=ns, name=name, facets=facets) + + +# generates output dataset +def outputDataset(dataset, stats): + output_facets = {"stats": stats, "outputStatistics": stats} + return OutputDataset(dataset.namespace, + dataset.name, + facets=dataset.facets, + outputFacets=output_facets) + + +# generates input dataset +def inputDataset(dataset, dq): + input_facets = { + "dataQuality": dq, + } + return InputDataset(dataset.namespace, dataset.name, + facets=dataset.facets, + inputFacets=input_facets) + + +def twoDigits(n): + if n < 10: + result = f"0{n}" + elif n < 100: + result = f"{n}" + else: + raise f"error: {n}" + return result + + +now = datetime.now(timezone.utc) + + +# generates run Event +def runEvents(job_name, sql, inputs, outputs, hour, min, location, duration): + run_id = str(generate_new_uuid()) + myjob = job(job_name, sql, location) + myrun = run(run_id, hour) + started_at = now + timedelta(hours=hour, minutes=min, seconds=20 + round(random() * 10)) + ended_at = started_at + timedelta(minutes=duration, seconds=20 + round(random() * 10)) + return ( + RunEvent( + eventType=RunState.START, + eventTime=started_at.isoformat(), + run=myrun, + job=myjob, + producer=PRODUCER, + inputs=inputs, + outputs=outputs, + ), + RunEvent( + eventType=RunState.COMPLETE, + eventTime=ended_at.isoformat(), + run=myrun, + job=myjob, + producer=PRODUCER, + inputs=inputs, + outputs=outputs, + ), + ) + + +# add run event to the events list +def addRunEvents(events, job_name, sql, inputs, outputs, hour, minutes, location=None, duration=2): + (start, complete) = runEvents(job_name, sql, inputs, outputs, hour, minutes, location, duration) + events.append(start) + events.append(complete) + + +events = [] + +# create dataset data +for i in range(0, 5): + user_counts = dataset("tmp_demo.user_counts") + user_history = dataset( + "temp_demo.user_history", + schema_dataset.SchemaDatasetFacet( + fields=[ + schema_dataset.SchemaDatasetFacetFields( + name="id", type="BIGINT", description="the user id" + ), + schema_dataset.SchemaDatasetFacetFields( + name="email_domain", type="VARCHAR", description="the user id" + ), + schema_dataset.SchemaDatasetFacetFields( + name="status", type="BIGINT", description="the user id" + ), + schema_dataset.SchemaDatasetFacetFields( + name="created_at", + type="DATETIME", + description="date and time of creation of the user", + ), + schema_dataset.SchemaDatasetFacetFields( + 
name="updated_at", + type="DATETIME", + description="the last time this row was updated", + ), + schema_dataset.SchemaDatasetFacetFields( + name="fetch_time_utc", + type="DATETIME", + description="the time the data was fetched", + ), + schema_dataset.SchemaDatasetFacetFields( + name="load_filename", + type="VARCHAR", + description="the original file this data was ingested from", + ), + schema_dataset.SchemaDatasetFacetFields( + name="load_filerow", + type="INT", + description="the row number in the original file", + ), + schema_dataset.SchemaDatasetFacetFields( + name="load_timestamp", + type="DATETIME", + description="the time the data was ingested", + ), + ] + ), + "snowflake://", + ) + + create_user_counts_sql = """CREATE OR REPLACE TABLE TMP_DEMO.USER_COUNTS AS ( + SELECT DATE_TRUNC(DAY, created_at) date, COUNT(id) as user_count + FROM TMP_DEMO.USER_HISTORY + GROUP BY date + )""" + + # location of the source code + location = "https://github.com/some/airflow/dags/example/user_trends.py" + + # run simulating Airflow DAG with snowflake operator + addRunEvents( + events, + dag_name + ".create_user_counts", + create_user_counts_sql, + [user_history], + [user_counts], + i, + 11, + location, + ) + + +for event in events: + from openlineage.client.serde import Serde + + print(event) + print(Serde.to_json(event)) + # time.sleep(1) + client.emit(event) + +``` +The resulting lineage events received by Marquez would look like this. + +![the Marquez graph](./mqz_graph_example.png) diff --git a/versioned_docs/version-1.26.0/datamodel.svg b/versioned_docs/version-1.26.0/datamodel.svg new file mode 100644 index 0000000..d2f7f54 --- /dev/null +++ b/versioned_docs/version-1.26.0/datamodel.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/development/_category_.json b/versioned_docs/version-1.26.0/development/_category_.json new file mode 100644 index 0000000..3f59497 --- /dev/null +++ b/versioned_docs/version-1.26.0/development/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Development", + "position": 8 +} diff --git a/versioned_docs/version-1.26.0/development/developing/_category.json b/versioned_docs/version-1.26.0/development/developing/_category.json new file mode 100644 index 0000000..18dc469 --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/_category.json @@ -0,0 +1,8 @@ +{ + "label": "Developing with OpenLineage", + "position": 4, + "link": { + "type": "doc", + "id": "developing" + } +} \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/development/developing/developing.md b/versioned_docs/version-1.26.0/development/developing/developing.md new file mode 100644 index 0000000..d2382cd --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/developing.md @@ -0,0 +1,48 @@ +--- +sidebar_position: 1 +--- + +# Developing With OpenLineage + +As there are hundreds and possibly thousands databases, query engines and other tools you could use to process, create and move data, there's great chance that existing OpenLineage integration won't cover your needs. + +However, OpenLineage project also provides libraries you could use to write your own integration. + +### Clients + +For [Python](../../client/python.md) and [Java](../../client/java/java.md), we've created clients that you can use to properly create and emit OpenLineage events to HTTP, Kafka, and other consumers. 
### API Documentation

- [OpenAPI documentation](https://openlineage.io/apidocs/openapi/)
- [Java Doc](https://openlineage.io/apidocs/javadoc/)

### Common Library (Python)

Getting lineage from systems like BigQuery or Redshift isn't necessarily tied to the orchestrator or processing engine you're using. For this reason, we've extracted that functionality from our Airflow library and [packaged it for separate use](https://pypi.org/project/openlineage-integration-common/).

### Environment Variables

The list of available environment variables for **Python** can be found [here](../../client/python.md#environment-variables).
The list of available environment variables for **Java** can be found [here](../../client/java/java.md#environment-variables).

### SQL parser

We've created a SQL parser that allows you to extract lineage from SQL statements. The parser is implemented in Rust; however, it's also available as a [Python library](https://pypi.org/project/openlineage-sql/).
You can take a look at its code on [GitHub](https://github.com/OpenLineage/OpenLineage/tree/main/integration/sql).

## Contributing

When contributing changes, additions, or fixes, please include the following header in any new files:

```
/*
/* Copyright 2018-2024 contributors to the OpenLineage project
/* SPDX-License-Identifier: Apache-2.0
*/
```

There is a pre-commit step that checks the license headers of new files when pull requests are opened.

Thanks for your contributions to the project!

diff --git a/versioned_docs/version-1.26.0/development/developing/java/_category_.json b/versioned_docs/version-1.26.0/development/developing/java/_category_.json new file mode 100644 index 0000000..5d19ed2 --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/java/_category_.json @@ -0,0 +1,4 @@
{
  "label": "Java",
  "position": 2
}
diff --git a/versioned_docs/version-1.26.0/development/developing/java/adding_metrics.md b/versioned_docs/version-1.26.0/development/developing/java/adding_metrics.md new file mode 100644 index 0000000..a48899f --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/java/adding_metrics.md @@ -0,0 +1,18 @@
---
title: Metrics Backends
sidebar_position: 2
---

To integrate an additional metrics backend into the OpenLineage client, implement the `MeterRegistryFactory` interface and ensure it is used by the `MicrometerProvider`'s `getMetricsBuilders` method.

The `MeterRegistryFactory` interface is designed to construct a `MeterRegistry` object from the OpenLineage configuration map. This interface allows the integration of either custom implementations or existing ones provided by Micrometer.

If your metrics backend requires external dependencies (e.g., `io.micrometer:micrometer-registry-otlp:latest`), add them to your project's `build.gradle` as `compileOnly` dependencies. This ensures they are available during compilation but optional at runtime.

Use `ReflectionUtils.hasClass` to check for the required classes on the classpath before using them. This prevents runtime failures due to missing dependencies.
+ +``` + if (ReflectionUtils.hasClass("io.micrometer.statsd.StatsdMeterRegistry")) { + builders.add(new StatsDMeterRegistryFactory()); + } +``` \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/development/developing/java/setup.md b/versioned_docs/version-1.26.0/development/developing/java/setup.md new file mode 100644 index 0000000..e3f7978 --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/java/setup.md @@ -0,0 +1,8 @@ +--- +title: Setup a development environment +sidebar_position: 1 +--- + +:::info +This page needs your contribution! Please contribute new examples using the edit link at the bottom. +::: \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/development/developing/java/troubleshooting/_category_.json b/versioned_docs/version-1.26.0/development/developing/java/troubleshooting/_category_.json new file mode 100644 index 0000000..c99297e --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/java/troubleshooting/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Troubleshooting", + "position": 1 +} \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/development/developing/java/troubleshooting/logging.md b/versioned_docs/version-1.26.0/development/developing/java/troubleshooting/logging.md new file mode 100644 index 0000000..5356f43 --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/java/troubleshooting/logging.md @@ -0,0 +1,315 @@ +--- +title: Logging +sidebar_position: 1 +--- + +OpenLineage Java library is based on [slf4j](https://www.slf4j.org/) when generating logs. Being able to emit logs for various purposes is very helpful when troubleshooting OpenLineage. + +Consider the following sample java code that emits OpenLineage events: + +```java +package ol.test; + +import io.openlineage.client.OpenLineage; +import io.openlineage.client.OpenLineageClient; +import io.openlineage.client.OpenLineage.RunEvent; +import io.openlineage.client.OpenLineage.InputDataset; +import io.openlineage.client.OpenLineage.Job; +import io.openlineage.client.OpenLineage.JobFacets; +import io.openlineage.client.OpenLineage.OutputDataset; +import io.openlineage.client.OpenLineage.Run; +import io.openlineage.client.OpenLineage.RunFacets; +import io.openlineage.client.OpenLineage.RunEvent.EventType; +import io.openlineage.client.transports.HttpTransport; +import io.openlineage.client.utils.UUIDUtils; + +import java.net.URI; +import java.time.ZoneId; +import java.time.ZonedDateTime; +import java.util.Arrays; +import java.util.List; +import java.util.UUID; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * sample openlinage client code + */ +public class OpenLineageClientTest +{ + private static Logger logger = LoggerFactory.getLogger(OpenLineageClientTest.class); + + public static void main( String[] args ) + { + logger.info("Running OpenLineage Client Test..."); + try { + + OpenLineageClient client = OpenLineageClient.builder() + .transport( + HttpTransport.builder() + .uri("http://localhost:5000") + .apiKey("abcdefghijklmn") + .build()) + .build(); + + // create one start event for testing + RunEvent event = buildEvent(EventType.START, null); + + // emit the event + client.emit(event); + + // another event to COMPLETE the run + event = buildEvent(EventType.COMPLETE, event.getRun().getRunId()); + + // emit the second COMPLETE event + client.emit(event); + + } catch (Exception e) { + e.printStackTrace(); + } + } + + // sample code to build event + public static RunEvent 
buildEvent(EventType eventType, UUID runId) { + ZonedDateTime now = ZonedDateTime.now(ZoneId.of("UTC")); + URI producer = URI.create("producer"); + OpenLineage ol = new OpenLineage(producer); + + if (runId == null) { + runId = UUIDUtils.generateNewUUID(); + } + + // run facets + RunFacets runFacets = + ol.newRunFacetsBuilder() + .nominalTime( + ol.newNominalTimeRunFacetBuilder() + .nominalStartTime(now) + .nominalEndTime(now) + .build()) + .build(); + + // a run is composed of run id, and run facets + Run run = ol.newRunBuilder().runId(runId).facets(runFacets).build(); + + // job facets + JobFacets jobFacets = ol.newJobFacetsBuilder().build(); + + // job + String name = "jobName"; + String namespace = "namespace"; + Job job = ol.newJobBuilder().namespace(namespace).name(name).facets(jobFacets).build(); + + // input dataset + List inputs = + Arrays.asList( + ol.newInputDatasetBuilder() + .namespace("ins") + .name("input") + .facets( + ol.newDatasetFacetsBuilder() + .version(ol.newDatasetVersionDatasetFacet("input-version")) + .build()) + .inputFacets( + ol.newInputDatasetInputFacetsBuilder() + .dataQualityMetrics( + ol.newDataQualityMetricsInputDatasetFacetBuilder() + .rowCount(10L) + .bytes(20L) + .columnMetrics( + ol.newDataQualityMetricsInputDatasetFacetColumnMetricsBuilder() + .put( + "mycol", + ol.newDataQualityMetricsInputDatasetFacetColumnMetricsAdditionalBuilder() + .count(10D) + .distinctCount(10L) + .max(30D) + .min(5D) + .nullCount(1L) + .sum(3000D) + .quantiles( + ol.newDataQualityMetricsInputDatasetFacetColumnMetricsAdditionalQuantilesBuilder() + .put("25", 52D) + .build()) + .build()) + .build()) + .build()) + .build()) + .build()); + // output dataset + List outputs = + Arrays.asList( + ol.newOutputDatasetBuilder() + .namespace("ons") + .name("output") + .facets( + ol.newDatasetFacetsBuilder() + .version(ol.newDatasetVersionDatasetFacet("output-version")) + .build()) + .outputFacets( + ol.newOutputDatasetOutputFacetsBuilder() + .outputStatistics(ol.newOutputStatisticsOutputDatasetFacet(10L, 20L)) + .build()) + .build()); + + // run state update which encapsulates all - with START event in this case + RunEvent runStateUpdate = + ol.newRunEventBuilder() + .eventType(eventType) + .eventTime(now) + .run(run) + .job(job) + .inputs(inputs) + .outputs(outputs) + .build(); + + return runStateUpdate; + } +} + +``` + +When you use OpenLineage backend such as Marquez on your local environment, the program would emit OpenLienage events to it. + +```bash +java ol.test.OpenLineageClientTest +``` + +However, this short program does not produce any logging information, as the logging configuration is required to be setup. Below are the examples of adding dependencies of the libraries that you need to use `log4j2` as the target implementation for the slf4j, on [maven](https://maven.apache.org/) or [gradle](https://gradle.org/). + +### Maven +pom.xml +```xml + + ... + + org.apache.logging.log4j + log4j-api + 2.7 + + + org.apache.logging.log4j + log4j-core + 2.7 + + + org.apache.logging.log4j + log4j-slf4j-impl + 2.7 + + ... + +``` +### Gradle +build.gradle +``` +dependencies { + ... + implementation "org.apache.logging.log4j:log4j-api:2.7" + implementation "org.apache.logging.log4j:log4j-core:2.7" + implementation "org.apache.logging.log4j:log4j-slf4j-impl:2.7" + ... +} +``` + +You also need to create a log4j configuration file, `log4j2.properties` on the classpath. Here is the sample log configuration. 
+ +``` +# Set to debug or trace if log4j initialization is failing +status = warn + +# Name of the configuration +name = ConsoleLogConfigDemo + +# Console appender configuration +appender.console.type = Console +appender.console.name = consoleLogger +appender.console.layout.type = PatternLayout +appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n + +# Root logger level +rootLogger.level = debug + +# Root logger referring to console appender +rootLogger.appenderRef.stdout.ref = consoleLogger +``` + +Re-compiling and running the `ol.test.OpenLineageClientTest` again will produce the following outputs: + +``` +2022-12-07 08:57:24 INFO OpenLineageClientTest:33 - Running OpenLineage Client Test... +2022-12-07 08:57:25 DEBUG HttpTransport:96 - POST http://localhost:5000/api/v1/lineage: {"eventType":"START","eventTime":"2022-12-07T14:57:25.072781Z","run":{"runId":"0142c998-3416-49e7-92aa-d025c4c93697","facets":{"nominalTime":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/NominalTimeRunFacet.json#/$defs/NominalTimeRunFacet","nominalStartTime":"2022-12-07T14:57:25.072781Z","nominalEndTime":"2022-12-07T14:57:25.072781Z"}}},"job":{"namespace":"namespace","name":"jobName","facets":{}},"inputs":[{"namespace":"ins","name":"input","facets":{"version":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json#/$defs/DatasetVersionDatasetFacet","datasetVersion":"input-version"}},"inputFacets":{"dataQualityMetrics":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DataQualityMetricsInputDatasetFacet.json#/$defs/DataQualityMetricsInputDatasetFacet","rowCount":10,"bytes":20,"columnMetrics":{"mycol":{"nullCount":1,"distinctCount":10,"sum":3000.0,"count":10.0,"min":5.0,"max":30.0,"quantiles":{"25":52.0}}}}}}],"outputs":[{"namespace":"ons","name":"output","facets":{"version":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json#/$defs/DatasetVersionDatasetFacet","datasetVersion":"output-version"}},"outputFacets":{"outputStatistics":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet","rowCount":10,"size":20}}}],"producer":"producer","schemaURL":"https://openlineage.io/spec/1-0-4/OpenLineage.json#/$defs/RunEvent"} +2022-12-07 08:57:25 DEBUG HttpTransport:96 - POST http://localhost:5000/api/v1/lineage: 
{"eventType":"COMPLETE","eventTime":"2022-12-07T14:57:25.42041Z","run":{"runId":"0142c998-3416-49e7-92aa-d025c4c93697","facets":{"nominalTime":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/NominalTimeRunFacet.json#/$defs/NominalTimeRunFacet","nominalStartTime":"2022-12-07T14:57:25.42041Z","nominalEndTime":"2022-12-07T14:57:25.42041Z"}}},"job":{"namespace":"namespace","name":"jobName","facets":{}},"inputs":[{"namespace":"ins","name":"input","facets":{"version":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json#/$defs/DatasetVersionDatasetFacet","datasetVersion":"input-version"}},"inputFacets":{"dataQualityMetrics":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DataQualityMetricsInputDatasetFacet.json#/$defs/DataQualityMetricsInputDatasetFacet","rowCount":10,"bytes":20,"columnMetrics":{"mycol":{"nullCount":1,"distinctCount":10,"sum":3000.0,"count":10.0,"min":5.0,"max":30.0,"quantiles":{"25":52.0}}}}}}],"outputs":[{"namespace":"ons","name":"output","facets":{"version":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json#/$defs/DatasetVersionDatasetFacet","datasetVersion":"output-version"}},"outputFacets":{"outputStatistics":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet","rowCount":10,"size":20}}}],"producer":"producer","schemaURL":"https://openlineage.io/spec/1-0-4/OpenLineage.json#/$defs/RunEvent"} +``` + +Logs will also produce meaningful error messages when something does not work correctly. For example, if the backend server does not exist, you would get the following messages in your console output: + +``` +2022-12-07 09:15:16 INFO OpenLineageClientTest:33 - Running OpenLineage Client Test... 
+2022-12-07 09:15:16 DEBUG HttpTransport:96 - POST http://localhost:5000/api/v1/lineage: {"eventType":"START","eventTime":"2022-12-07T15:15:16.668979Z","run":{"runId":"69861937-55ba-43f5-ab5e-fe78ef6a283d","facets":{"nominalTime":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/NominalTimeRunFacet.json#/$defs/NominalTimeRunFacet","nominalStartTime":"2022-12-07T15:15:16.668979Z","nominalEndTime":"2022-12-07T15:15:16.668979Z"}}},"job":{"namespace":"namespace","name":"jobName","facets":{}},"inputs":[{"namespace":"ins","name":"input","facets":{"version":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json#/$defs/DatasetVersionDatasetFacet","datasetVersion":"input-version"}},"inputFacets":{"dataQualityMetrics":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DataQualityMetricsInputDatasetFacet.json#/$defs/DataQualityMetricsInputDatasetFacet","rowCount":10,"bytes":20,"columnMetrics":{"mycol":{"nullCount":1,"distinctCount":10,"sum":3000.0,"count":10.0,"min":5.0,"max":30.0,"quantiles":{"25":52.0}}}}}}],"outputs":[{"namespace":"ons","name":"output","facets":{"version":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json#/$defs/DatasetVersionDatasetFacet","datasetVersion":"output-version"}},"outputFacets":{"outputStatistics":{"_producer":"producer","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet","rowCount":10,"size":20}}}],"producer":"producer","schemaURL":"https://openlineage.io/spec/1-0-4/OpenLineage.json#/$defs/RunEvent"} +io.openlineage.client.OpenLineageClientException: org.apache.http.conn.HttpHostConnectException: Connect to localhost:5000 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused + at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:113) + at io.openlineage.client.OpenLineageClient.emit(OpenLineageClient.java:42) + at ol.test.OpenLineageClientTest.main(OpenLineageClientTest.java:48) +Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:5000 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused + at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156) + at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) + at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) + at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) + at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) + at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) + at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) + at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) + at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) + at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) + at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:108) + ... 
2 more +Caused by: java.net.ConnectException: Connection refused + at java.base/sun.nio.ch.Net.pollConnect(Native Method) + at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672) + at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542) + at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:585) + at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327) + at java.base/java.net.Socket.connect(Socket.java:666) + at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75) + at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) + ... 12 more +``` + +If you wish to output loggigng message to a file, you can modify the basic configuration by adding a file appender configuration as follows: + +``` +# Set to debug or trace if log4j initialization is failing +status = warn + +# Name of the configuration +name = ConsoleLogConfigDemo + +# Console appender configuration +appender.console.type = Console +appender.console.name = consoleLogger +appender.console.layout.type = PatternLayout +appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n + +# File appender configuration +appender.file.type = File +appender.file.name = fileLogger +appender.file.fileName = app.log +appender.file.layout.type = PatternLayout +appender.file.layout.pattern = %d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n + +# Root logger level +rootLogger.level = debug + +# Root logger referring to console appender +rootLogger.appenderRef.stdout.ref = consoleLogger +rootLogger.appenderRef.file.ref = fileLogger +``` + +And the logs will be saved to a file `app.log`. +Outputting logs using `log4j2` is just one way of doing it, so below are some additional resources of undersatnding how Java logging works, and other ways to output the logs. + +### Further readings +- https://www.baeldung.com/java-logging-intro +- https://www.baeldung.com/slf4j-with-log4j2-logback#Log4j2 +- https://mkyong.com/logging/log4j2-properties-example/ diff --git a/versioned_docs/version-1.26.0/development/developing/python/_category_.json b/versioned_docs/version-1.26.0/development/developing/python/_category_.json new file mode 100644 index 0000000..e38b23f --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/python/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Python", + "position": 1 +} diff --git a/versioned_docs/version-1.26.0/development/developing/python/api-reference/_category_.json b/versioned_docs/version-1.26.0/development/developing/python/api-reference/_category_.json new file mode 100644 index 0000000..f92b09d --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/python/api-reference/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "API Reference", + "position": 10 +} diff --git a/versioned_docs/version-1.26.0/development/developing/python/api-reference/openlineage.client.md b/versioned_docs/version-1.26.0/development/developing/python/api-reference/openlineage.client.md new file mode 100644 index 0000000..514d94f --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/python/api-reference/openlineage.client.md @@ -0,0 +1,8 @@ +--- +title: Python Client +--- + + + +

openlineage.client.client module

\n
\n
\nclass openlineage.client.client.OpenLineageClientOptions(timeout=5.0, verify=True, api_key=None, adapter=None)
\n

Bases: object

\n
\n
Parameters:
\n
    \n
  • timeout (float)

  • \n
  • verify (bool)

  • \n
  • api_key (Optional[str])

  • \n
  • adapter (Optional[HTTPAdapter])

  • \n
\n
\n
\n
\n
\ntimeout: float
\n
\n
\n
\nverify: bool
\n
\n
\n
\napi_key: str
\n
\n
\n
\nadapter: HTTPAdapter
\n
\n
\n
\n
\nclass openlineage.client.client.OpenLineageConfig(transport=_Nothing.NOTHING, facets=_Nothing.NOTHING, filters=_Nothing.NOTHING)
\n

Bases: object

\n
\n
Parameters:
\n
    \n
  • transport (dict[str, Any] | None)

  • \n
  • facets (FacetsConfig)

  • \n
  • filters (list[FilterConfig])

  • \n
\n
\n
\n
\n
\ntransport: dict[str, Any] | None
\n
\n
\n
\nfacets: FacetsConfig
\n
\n
\n
\nfilters: list[FilterConfig]
\n
\n
\n
\nclassmethod from_dict(params)
\n
\n
Parameters:
\n

params (dict[str, Any])

\n
\n
Return type:
\n

OpenLineageConfig

\n
\n
\n
\n
\n
\n
\nclass openlineage.client.client.OpenLineageClient(url=None, options=None, session=None, transport=None, factory=None, *, config=None)
\n

Bases: object

\n
\n
Parameters:
\n
    \n
  • url (str | None)

  • \n
  • options (OpenLineageClientOptions | None)

  • \n
  • session (Session | None)

  • \n
  • transport (Transport | None)

  • \n
  • factory (TransportFactory | None)

  • \n
  • config (dict[str, Any] | None)

  • \n
\n
\n
\n
\n
\nDYNAMIC_ENV_VARS_PREFIX = 'OPENLINEAGE__'
\n
\n
\n
\nDEFAULT_URL_TRANSPORT_NAME = 'default_http'
\n
\n
\n
\nclassmethod from_environment()
\n
\n
Return type:
\n

_T

\n
\n
\n
\n
\n
\nclassmethod from_dict(config)
\n
\n
Parameters:
\n

config (dict[str, str])

\n
\n
Return type:
\n

_T

\n
\n
\n
\n
\n
\nfilter_event(event)
\n

Filters jobs according to config-defined events

\n
\n
Parameters:
\n

event (Event)

\n
\n
Return type:
\n

Event | None

\n
\n
\n
\n
\n
\nemit(event)
\n
\n
Parameters:
\n

event (Union[RunEvent, DatasetEvent, JobEvent, RunEvent, DatasetEvent, JobEvent])

\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\nproperty config: OpenLineageConfig
\n

Retrieves the OpenLineage configuration.

\n

This property method returns the content of the OpenLineage YAML config file. The configuration is determined by merging sources in the following order of precedence:

1. User-defined configuration passed to the client constructor.
2. YAML config file located in one of the following paths:
   - Path specified by the OPENLINEAGE_CONFIG environment variable.
   - Current working directory.
   - $HOME/.openlineage.
3. Environment variables with the OPENLINEAGE__ prefix.

If the configuration is not already loaded, it will be constructed by merging the above sources. In case of a TypeError during the parsing of the configuration, a ValueError will be raised indicating that the structure of the config does not match the expected format.

\n
\n
\n
\nadd_environment_facets(event)
\n

Adds environment variables as facets to the event object.

\n
\n
Parameters:
\n

event (Union[RunEvent, DatasetEvent, JobEvent, RunEvent, DatasetEvent, JobEvent])

\n
\n
Return type:
\n

Union[RunEvent, DatasetEvent, JobEvent, RunEvent, DatasetEvent, JobEvent]

\n
\n
\n
\n
\n
\n
\n
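For orientation, here is a minimal sketch of constructing a client from the classes documented above; the backend URL and option values are illustrative placeholders, not recommended defaults.

```python
from openlineage.client.client import OpenLineageClient, OpenLineageClientOptions

# Explicit HTTP backend; the URL and option values are placeholders.
client = OpenLineageClient(
    url="http://localhost:5000",
    options=OpenLineageClientOptions(timeout=5.0, verify=True),
)

# Or let the client discover its configuration (environment variables or an
# openlineage.yml file) following the precedence described in `config` above.
client_from_env = OpenLineageClient()
```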

openlineage.client.event_v2 module

\n
\n
\nclass openlineage.client.event_v2.BaseEvent(*, eventTime, producer='')
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • eventTime (str)

  • \n
  • producer (str)

  • \n
\n
\n
\n
\n
\neventTime: str
\n

the time the event occurred at

\n
\n
\n
\nproducer: str
\n
\n
\n
\nschemaURL: str
\n
\n
\n
\nproperty skip_redact: list[str]
\n
\n
\n
\neventtime_check(attribute, value)
\n
\n
Parameters:
\n
    \n
  • attribute (str)

  • \n
  • value (str)

  • \n
\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\nproducer_check(attribute, value)
\n
\n
Parameters:
\n
    \n
  • attribute (str)

  • \n
  • value (str)

  • \n
\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\nschemaurl_check(attribute, value)
\n
\n
Parameters:
\n
    \n
  • attribute (str)

  • \n
  • value (str)

  • \n
\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\n
\nclass openlineage.client.event_v2.RunEvent(*, eventTime, producer='', run, job, eventType=None, inputs=_Nothing.NOTHING, outputs=_Nothing.NOTHING)
\n

Bases: BaseEvent

\n
\n
Parameters:
\n
    \n
  • eventTime (str)

  • \n
  • producer (str)

  • \n
  • run (Run)

  • \n
  • job (Job)

  • \n
  • eventType (EventType | None)

  • \n
  • inputs (list[InputDataset] | None)

  • \n
  • outputs (list[OutputDataset] | None)

  • \n
\n
\n
\n
\n
\nrun: Run
\n
\n
\n
\njob: Job
\n
\n
\n
\neventType: EventType | None
\n

the current transition of the run state. It is required to issue 1 START event and 1 of [ COMPLETE, ABORT, FAIL ] event per run. Additional events with OTHER eventType can be added to the same run. For example to send additional metadata after the run is complete

\n
\n
\n
\ninputs: list[InputDataset] | None
\n

The set of input datasets.

\n
\n
\n
\noutputs: list[OutputDataset] | None
\n

The set of output datasets.

\n
\n
\n
\n
\nclass openlineage.client.event_v2.JobEvent(*, eventTime, producer='', job, inputs=_Nothing.NOTHING, outputs=_Nothing.NOTHING)
\n

Bases: BaseEvent

\n
\n
Parameters:
\n
\n
\n
\n
\n
\njob: Job
\n
\n
\n
\ninputs: list[InputDataset] | None
\n

The set of input datasets.

\n
\n
\n
\noutputs: list[OutputDataset] | None
\n

The set of output datasets.

\n
\n
\n
\n
\nclass openlineage.client.event_v2.DatasetEvent(*, eventTime, producer='', dataset)
\n

Bases: BaseEvent

\n
\n
Parameters:
\n
    \n
  • eventTime (str)

  • \n
  • producer (str)

  • \n
  • dataset (StaticDataset)

  • \n
\n
\n
\n
\n
\ndataset: StaticDataset
\n
\n
\n
\n
\nopenlineage.client.event_v2.RunState
\n

alias of EventType

\n
\n
\n
\nclass openlineage.client.event_v2.Dataset(namespace, name, *, facets=_Nothing.NOTHING)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • namespace (str)

  • \n
  • name (str)

  • \n
  • facets (dict[str, DatasetFacet] | None)

  • \n
\n
\n
\n
\n
\nnamespace: str
\n

The namespace containing that dataset

\n
\n
\n
\nname: str
\n

The unique name for that dataset within that namespace

\n
\n
\n
\nfacets: dict[str, DatasetFacet] | None
\n

The facets for this dataset

\n
\n
\n
\n
\nclass openlineage.client.event_v2.InputDataset(namespace, name, inputFacets=_Nothing.NOTHING, *, facets=_Nothing.NOTHING)
\n

Bases: Dataset

\n

An input dataset

\n
\n
Parameters:
\n
\n
\n
\n
\n
\ninputFacets: dict[str, InputDatasetFacet] | None
\n

The input facets for this dataset.

\n
\n
\n
\n
\nclass openlineage.client.event_v2.OutputDataset(namespace, name, outputFacets=_Nothing.NOTHING, *, facets=_Nothing.NOTHING)
\n

Bases: Dataset

\n

An output dataset

\n
\n
Parameters:
\n
\n
\n
\n
\n
\noutputFacets: dict[str, OutputDatasetFacet] | None
\n

The output facets for this dataset

\n
\n
\n
\n
\nclass openlineage.client.event_v2.Run(runId, facets=_Nothing.NOTHING)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • runId (str)

  • \n
  • facets (dict[str, RunFacet] | None)

  • \n
\n
\n
\n
\n
\nrunId: str
\n

The globally unique ID of the run associated with the job.

\n
\n
\n
\nfacets: dict[str, RunFacet] | None
\n

The run facets.

\n
\n
\n
\nrunid_check(attribute, value)
\n
\n
Parameters:
\n
    \n
  • attribute (str)

  • \n
  • value (str)

  • \n
\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\n
\nclass openlineage.client.event_v2.Job(namespace, name, facets=_Nothing.NOTHING)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • namespace (str)

  • \n
  • name (str)

  • \n
  • facets (dict[str, JobFacet] | None)

  • \n
\n
\n
\n
\n
\nnamespace: str
\n

The namespace containing that job

\n
\n
\n
\nname: str
\n

The unique name for that job within that namespace

\n
\n
\n
\nfacets: dict[str, JobFacet] | None
\n

The job facets.

\n
\n
\n
\n
\nopenlineage.client.event_v2.set_producer(producer)
\n
\n
Parameters:
\n

producer (str)

\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\n
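A minimal sketch of emitting a start event with the `event_v2` classes above, assuming a local HTTP backend; the namespace, job name, and producer URI are placeholders.

```python
from datetime import datetime, timezone

from openlineage.client.client import OpenLineageClient
from openlineage.client.event_v2 import Job, Run, RunEvent, RunState
from openlineage.client.uuid import generate_new_uuid

client = OpenLineageClient(url="http://localhost:5000")  # placeholder backend URL

run = Run(runId=str(generate_new_uuid()))
job = Job(namespace="my-namespace", name="my-job")       # illustrative coordinates

client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=run,
        job=job,
        producer="https://example.com/my-producer",      # placeholder producer URI
    )
)
```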

openlineage.client.facet module

\n
\n
\nopenlineage.client.facet.set_producer(producer)
\n
\n
Parameters:
\n

producer (str)

\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\nclass openlineage.client.facet.BaseFacet
\n

Bases: RedactMixin

\n
\n
\n
\n
\nproperty skip_redact: List[str]
\n
\n
\n
\n
\nclass openlineage.client.facet.NominalTimeRunFacet(nominalStartTime, nominalEndTime=None)
\n

Bases: BaseFacet

\n
\n
Parameters:
\n
    \n
  • nominalStartTime (str)

  • \n
  • nominalEndTime (Optional[str])

  • \n
\n
\n
\n
\n
\nnominalStartTime: str
\n
\n
\n
\nnominalEndTime: Optional[str]
\n
\n
\n
\n
\nclass openlineage.client.facet.ParentRunFacet(run, job)
\n

Bases: BaseFacet

\n
\n
Parameters:
\n
    \n
  • run (Dict[Any, Any])

  • \n
  • job (Dict[Any, Any])

  • \n
\n
\n
\n
\n
\nrun: Dict[Any, Any]
\n
\n
\n
\njob: Dict[Any, Any]
\n
\n
\n
\nclassmethod create(runId, namespace, name)
\n
\n
Parameters:
\n
    \n
  • runId (str)

  • \n
  • namespace (str)

  • \n
  • name (str)

  • \n
\n
\n
Return type:
\n

ParentRunFacet

\n
\n
\n
\n
\n
\n
\nclass openlineage.client.facet.DocumentationJobFacet(description)
\n

Bases: BaseFacet

\n
\n
Parameters:
\n

description (str)

\n
\n
\n
\n
\ndescription: str
\n
\n
\n
\n
\nclass openlineage.client.facet.SourceCodeLocationJobFacet(type, url)
\n

Bases: BaseFacet

\n
\n
Parameters:
\n
    \n
  • type (str)

  • \n
  • url (str)

  • \n
\n
\n
\n
\n
\ntype: str
\n
\n
\n
\nurl: str
\n
\n
\n
\n
\nclass openlineage.client.facet.SqlJobFacet(query)
\n

Bases: BaseFacet

\n
\n
Parameters:
\n

query (str)

\n
\n
\n
\n
\nquery: str
\n
\n
\n
\n
\nclass openlineage.client.facet.DocumentationDatasetFacet(description)
\n

Bases: BaseFacet

\n
\n
Parameters:
\n

description (str)

\n
\n
\n
\n
\ndescription: str
\n
\n
\n
\n
\nclass openlineage.client.facet.SchemaField(name, type, description=None)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • name (str)

  • \n
  • type (str)

  • \n
  • description (Optional[str])

  • \n
\n
\n
\n
\n
\nname: str
\n
\n
\n
\ntype: str
\n
\n
\n
\ndescription: Optional[str]
\n
\n
\n
\n
\nclass openlineage.client.facet.SchemaDatasetFacet(fields)
\n

Bases: BaseFacet

\n
\n
Parameters:
\n

fields (List[SchemaField])

\n
\n
\n
\n
\nfields: List[SchemaField]
\n
\n
\n
\n
\nclass openlineage.client.facet.DataSourceDatasetFacet(name, uri)
\n

Bases: BaseFacet

\n
\n
Parameters:
\n
    \n
  • name (str)

  • \n
  • uri (str)

  • \n
\n
\n
\n
\n
\nname: str
\n
\n
\n
\nuri: str
\n
\n
\n
\n
\nclass openlineage.client.facet.OutputStatisticsOutputDatasetFacet(rowCount=None, size=None, fileCount=None)
\n

Bases: BaseFacet

\n
\n
Parameters:
\n
    \n
  • rowCount (Optional[int])

  • \n
  • size (Optional[int])

  • \n
  • fileCount (Optional[int])

  • \n
\n
\n
\n
\n
\nrowCount: Optional[int]
\n
\n
\n
\nsize: Optional[int]
\n
\n
\n
\nfileCount: Optional[int]
\n
\n
\n
\n
\nclass openlineage.client.facet.ColumnMetric(nullCount=None, distinctCount=None, sum=None, count=None, min=None, max=None, quantiles=None)
\n

Bases: object

\n
\n
Parameters:
\n
    \n
  • nullCount (Optional[int])

  • \n
  • distinctCount (Optional[int])

  • \n
  • sum (Optional[int])

  • \n
  • count (Optional[int])

  • \n
  • min (Optional[float])

  • \n
  • max (Optional[float])

  • \n
  • quantiles (Optional[Dict[str, float]])

  • \n
\n
\n
\n
\n
\nnullCount: Optional[int]
\n
\n
\n
\ndistinctCount: Optional[int]
\n
\n
\n
\nsum: Optional[int]
\n
\n
\n
\ncount: Optional[int]
\n
\n
\n
\nmin: Optional[float]
\n
\n
\n
\nmax: Optional[float]
\n
\n
\n
\nquantiles: Optional[Dict[str, float]]
\n
\n
\n
\n
\nclass openlineage.client.facet.DataQualityMetricsInputDatasetFacet(rowCount=None, bytes=None, fileCount=None, columnMetrics=_Nothing.NOTHING)
\n

Bases: BaseFacet

\n
\n
Parameters:
\n
    \n
  • rowCount (Optional[int])

  • \n
  • bytes (Optional[int])

  • \n
  • fileCount (Optional[int])

  • \n
  • columnMetrics (Dict[str, ColumnMetric])

  • \n
\n
\n
\n
\n
\nrowCount: Optional[int]
\n
\n
\n
\nbytes: Optional[int]
\n
\n
\n
\nfileCount: Optional[int]
\n
\n
\n
\ncolumnMetrics: Dict[str, ColumnMetric]
\n
\n
\n
\n
\nclass openlineage.client.facet.Assertion(assertion, success, column=None)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • assertion (str)

  • \n
  • success (bool)

  • \n
  • column (Optional[str])

  • \n
\n
\n
\n
\n
\nassertion: str
\n
\n
\n
\nsuccess: bool
\n
\n
\n
\ncolumn: Optional[str]
\n
\n
\n
\n
\nclass openlineage.client.facet.DataQualityAssertionsDatasetFacet(assertions)
\n

Bases: BaseFacet

\n

This facet represents asserted expectations on a dataset or its column.

\n
\n
Parameters:
\n

assertions (List[Assertion])

\n
\n
\n
\n
\nassertions: List[Assertion]
\n
\n
\n
\n
\nclass openlineage.client.facet.SourceCodeJobFacet(language, source)
\n

Bases: BaseFacet

\n

This facet represents source code that the job executed.

\n
\n
Parameters:
\n
    \n
  • language (str)

  • \n
  • source (str)

  • \n
\n
\n
\n
\n
\nlanguage: str
\n
\n
\n
\nsource: str
\n
\n
\n
\n
\nclass openlineage.client.facet.ExternalQueryRunFacet(externalQueryId, source)
\n

Bases: BaseFacet

\n
\n
Parameters:
\n
    \n
  • externalQueryId (str)

  • \n
  • source (str)

  • \n
\n
\n
\n
\n
\nexternalQueryId: str
\n
\n
\n
\nsource: str
\n
\n
\n
\n
\nclass openlineage.client.facet.ErrorMessageRunFacet(message, programmingLanguage, stackTrace=None)
\n

Bases: BaseFacet

\n

This facet represents an error message that was the result of a job run.

\n
\n
Parameters:
\n
    \n
  • message (str)

  • \n
  • programmingLanguage (str)

  • \n
  • stackTrace (Optional[str])

  • \n
\n
\n
\n
\n
\nmessage: str
\n
\n
\n
\nprogrammingLanguage: str
\n
\n
\n
\nstackTrace: Optional[str]
\n
\n
\n
\n
\nclass openlineage.client.facet.SymlinksDatasetFacetIdentifiers(namespace, name, type)
\n

Bases: object

\n
\n
Parameters:
\n
    \n
  • namespace (str)

  • \n
  • name (str)

  • \n
  • type (str)

  • \n
\n
\n
\n
\n
\nnamespace: str
\n
\n
\n
\nname: str
\n
\n
\n
\ntype: str
\n
\n
\n
\n
\nclass openlineage.client.facet.SymlinksDatasetFacet(identifiers=_Nothing.NOTHING)
\n

Bases: BaseFacet

\n

This facet represents dataset symlink names.

\n
\n
Parameters:
\n

identifiers (List[SymlinksDatasetFacetIdentifiers])

\n
\n
\n
\n
\nidentifiers: List[SymlinksDatasetFacetIdentifiers]
\n
\n
\n
\n
\nclass openlineage.client.facet.StorageDatasetFacet(storageLayer, fileFormat)
\n

Bases: BaseFacet

\n

This facet represents dataset symlink names.

\n
\n
Parameters:
\n
    \n
  • storageLayer (str)

  • \n
  • fileFormat (str)

  • \n
\n
\n
\n
\n
\nstorageLayer: str
\n
\n
\n
\nfileFormat: str
\n
\n
\n
\n
\nclass openlineage.client.facet.OwnershipJobFacetOwners(name, type=None)
\n

Bases: object

\n
\n
Parameters:
\n
    \n
  • name (str)

  • \n
  • type (Optional[str])

  • \n
\n
\n
\n
\n
\nname: str
\n
\n
\n
\ntype: Optional[str]
\n
\n
\n
\n
\nclass openlineage.client.facet.OwnershipJobFacet(owners=_Nothing.NOTHING)
\n

Bases: BaseFacet

\n

This facet represents ownership of a job.

\n
\n
Parameters:
\n

owners (List[OwnershipJobFacetOwners])

\n
\n
\n
\n
\nowners: List[OwnershipJobFacetOwners]
\n
\n
\n
\n
\nclass openlineage.client.facet.JobTypeJobFacet(processingType, integration, jobType)
\n

Bases: BaseFacet

\n

This facet represents job type properties.

\n
\n
Parameters:
\n
    \n
  • processingType (str)

  • \n
  • integration (str)

  • \n
  • jobType (str)

  • \n
\n
\n
\n
\n
\nprocessingType: str
\n
\n
\n
\nintegration: str
\n
\n
\n
\njobType: str
\n
\n
\n
\n
\nclass openlineage.client.facet.DatasetVersionDatasetFacet(datasetVersion)
\n

Bases: BaseFacet

\n

This facet represents version of a dataset.

\n
\n
Parameters:
\n

datasetVersion (str)

\n
\n
\n
\n
\ndatasetVersion: str
\n
\n
\n
\n
\nclass openlineage.client.facet.LifecycleStateChange(value)
\n

Bases: Enum

\n

An enumeration.

\n
\n
\nALTER = 'ALTER'
\n
\n
\n
\nCREATE = 'CREATE'
\n
\n
\n
\nDROP = 'DROP'
\n
\n
\n
\nOVERWRITE = 'OVERWRITE'
\n
\n
\n
\nRENAME = 'RENAME'
\n
\n
\n
\nTRUNCATE = 'TRUNCATE'
\n
\n
\n
\n
\nclass openlineage.client.facet.LifecycleStateChangeDatasetFacetPreviousIdentifier(name, namespace)
\n

Bases: object

\n
\n
Parameters:
\n
    \n
  • name (str)

  • \n
  • namespace (str)

  • \n
\n
\n
\n
\n
\nname: str
\n
\n
\n
\nnamespace: str
\n
\n
\n
\n
\nclass openlineage.client.facet.LifecycleStateChangeDatasetFacet(lifecycleStateChange, previousIdentifier)
\n

Bases: BaseFacet

\n

This facet represents information of lifecycle changes of a dataset.

\n
\n
Parameters:
\n
\n
\n
\n
\n
\nlifecycleStateChange: LifecycleStateChange
\n
\n
\n
\npreviousIdentifier: LifecycleStateChangeDatasetFacetPreviousIdentifier
\n
\n
\n
\n
\nclass openlineage.client.facet.OwnershipDatasetFacetOwners(name, type)
\n

Bases: object

\n
\n
Parameters:
\n
    \n
  • name (str)

  • \n
  • type (str)

  • \n
\n
\n
\n
\n
\nname: str
\n
\n
\n
\ntype: str
\n
\n
\n
\n
\nclass openlineage.client.facet.OwnershipDatasetFacet(owners=_Nothing.NOTHING)
\n

Bases: BaseFacet

\n

This facet represents ownership of a dataset.

\n
\n
Parameters:
\n

owners (List[OwnershipDatasetFacetOwners])

\n
\n
\n
\n
\nowners: List[OwnershipDatasetFacetOwners]
\n
\n
\n
\n
\nclass openlineage.client.facet.ColumnLineageDatasetFacetFieldsAdditionalInputFields(namespace, name, field)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • namespace (str)

  • \n
  • name (str)

  • \n
  • field (str)

  • \n
\n
\n
\n
\n
\nnamespace: str
\n
\n
\n
\nname: str
\n
\n
\n
\nfield: str
\n
\n
\n
\n
\nclass openlineage.client.facet.ColumnLineageDatasetFacetFieldsAdditional(inputFields, transformationDescription, transformationType)
\n

Bases: object

\n
\n
Parameters:
\n
\n
\n
\n
\n
\ninputFields: ClassVar[List[ColumnLineageDatasetFacetFieldsAdditionalInputFields]]
\n
\n
\n
\ntransformationDescription: str
\n
\n
\n
\ntransformationType: str
\n
\n
\n
\n
\nclass openlineage.client.facet.ColumnLineageDatasetFacet(fields=_Nothing.NOTHING)
\n

Bases: BaseFacet

\n

This facet contains column lineage of a dataset.

\n
\n
Parameters:
\n

fields (Dict[str, ColumnLineageDatasetFacetFieldsAdditional])

\n
\n
\n
\n
\nfields: Dict[str, ColumnLineageDatasetFacetFieldsAdditional]
\n
\n
\n
\n
\nclass openlineage.client.facet.ProcessingEngineRunFacet(version, name, openlineageAdapterVersion)
\n

Bases: BaseFacet

\n
\n
Parameters:
\n
    \n
  • version (str)

  • \n
  • name (str)

  • \n
  • openlineageAdapterVersion (str)

  • \n
\n
\n
\n
\n
\nversion: str
\n
\n
\n
\nname: str
\n
\n
\n
\nopenlineageAdapterVersion: str
\n
\n
\n
\n
\nclass openlineage.client.facet.ExtractionError(errorMessage, stackTrace, task, taskNumber)
\n

Bases: BaseFacet

\n
\n
Parameters:
\n
    \n
  • errorMessage (str)

  • \n
  • stackTrace (Optional[str])

  • \n
  • task (Optional[str])

  • \n
  • taskNumber (Optional[int])

  • \n
\n
\n
\n
\n
\nerrorMessage: str
\n
\n
\n
\nstackTrace: Optional[str]
\n
\n
\n
\ntask: Optional[str]
\n
\n
\n
\ntaskNumber: Optional[int]
\n
\n
\n
\n
\nclass openlineage.client.facet.ExtractionErrorRunFacet(totalTasks, failedTasks, errors)
\n

Bases: BaseFacet

\n
\n
Parameters:
\n
\n
\n
\n
\n
\ntotalTasks: int
\n
\n
\n
\nfailedTasks: int
\n
\n
\n
\nerrors: List[ExtractionError]
\n
\n
\n
\n
\n
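A small sketch attaching the older-style facets above to the legacy `run` module classes documented later on this page; the facet dictionary keys (`schema`, `sql`) follow the usual facet-naming convention, and the dataset and job coordinates are placeholders.

```python
from openlineage.client.facet import SchemaDatasetFacet, SchemaField, SqlJobFacet
from openlineage.client.run import Dataset, Job

users = Dataset(
    namespace="postgres://db.example.com",   # illustrative dataset namespace
    name="public.users",
    facets={
        "schema": SchemaDatasetFacet(
            fields=[
                SchemaField(name="id", type="BIGINT"),
                SchemaField(name="email", type="VARCHAR", description="user email"),
            ]
        )
    },
)

job = Job(
    namespace="my-namespace",
    name="daily_users_load",
    facets={"sql": SqlJobFacet(query="SELECT id, email FROM users")},
)
```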

openlineage.client.facet_v2 module

\n
\n
\nclass openlineage.client.facet_v2.BaseFacet(*, producer='')
\n

Bases: RedactMixin

\n

all fields of the base facet are prefixed with _ to avoid name conflicts in facets

\n
\n
Parameters:
\n

producer (str)

\n
\n
\n
\n
\nproperty skip_redact: list[str]
\n
\n
\n
\n
\nclass openlineage.client.facet_v2.DatasetFacet(*, producer='', deleted=None)
\n

Bases: BaseFacet

\n

A Dataset Facet

\n
\n
Parameters:
\n
    \n
  • producer (str)

  • \n
  • deleted (bool | None)

  • \n
\n
\n
\n
\n
\n
\nclass openlineage.client.facet_v2.InputDatasetFacet(*, producer='')
\n

Bases: BaseFacet

\n

An Input Dataset Facet

\n
\n
Parameters:
\n

producer (str)

\n
\n
\n
\n
\n
\nclass openlineage.client.facet_v2.JobFacet(*, producer='', deleted=None)
\n

Bases: BaseFacet

\n

A Job Facet

\n
\n
Parameters:
\n
    \n
  • producer (str)

  • \n
  • deleted (bool | None)

  • \n
\n
\n
\n
\n
\n
\nclass openlineage.client.facet_v2.OutputDatasetFacet(*, producer='')
\n

Bases: BaseFacet

\n

An Output Dataset Facet

\n
\n
Parameters:
\n

producer (str)

\n
\n
\n
\n
\n
\nclass openlineage.client.facet_v2.RunFacet(*, producer='')
\n

Bases: BaseFacet

\n

A Run Facet

\n
\n
Parameters:
\n

producer (str)

\n
\n
\n
\n
\n
\nopenlineage.client.facet_v2.set_producer(producer)
\n
\n
Parameters:
\n

producer (str)

\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\n

openlineage.client.filter module

\n
\n
\nclass openlineage.client.filter.FilterConfig(type=None, match=None, regex=None)
\n

Bases: object

\n
\n
Parameters:
\n
    \n
  • type (str | None)

  • \n
  • match (str | None)

  • \n
  • regex (str | None)

  • \n
\n
\n
\n
\n
\ntype: str | None
\n
\n
\n
\nmatch: str | None
\n
\n
\n
\nregex: str | None
\n
\n
\n
\n
\nclass openlineage.client.filter.Filter
\n

Bases: object

\n
\n
\nfilter_event(event)
\n
\n
Parameters:
\n

event (RunEventType)

\n
\n
Return type:
\n

RunEventType | None

\n
\n
\n
\n
\n
\n
\nclass openlineage.client.filter.ExactMatchFilter(match)
\n

Bases: Filter

\n
\n
Parameters:
\n

match (str)

\n
\n
\n
\n
\nfilter_event(event)
\n
\n
Parameters:
\n

event (RunEventType)

\n
\n
Return type:
\n

RunEventType | None

\n
\n
\n
\n
\n
\n
\nclass openlineage.client.filter.RegexFilter(regex)
\n

Bases: Filter

\n
\n
Parameters:
\n

regex (str)

\n
\n
\n
\n
\nfilter_event(event)
\n
\n
Parameters:
\n

event (RunEventType)

\n
\n
Return type:
\n

RunEventType | None

\n
\n
\n
\n
\n
\n
\nopenlineage.client.filter.create_filter(conf)
\n
\n
Parameters:
\n

conf (FilterConfig)

\n
\n
Return type:
\n

Filter | None

\n
\n
\n
\n
\n
\n
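A short sketch of building filters from `FilterConfig`; the `"exact"` and `"regex"` type strings are assumptions about the configuration format, chosen to match the two filter classes above.

```python
from openlineage.client.filter import FilterConfig, create_filter

exact_filter = create_filter(FilterConfig(type="exact", match="unwanted_job"))
regex_filter = create_filter(FilterConfig(type="regex", regex=r"^tmp_.*"))
no_filter = create_filter(FilterConfig())  # nothing configured

# A Filter's filter_event returns the event unchanged, or None when it should
# be dropped; create_filter itself returns None for an empty config.
print(exact_filter, regex_filter, no_filter)
```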

openlineage.client.run module

\n
\n
\nclass openlineage.client.run.RunState(value)
\n

Bases: Enum

\n

An enumeration.

\n
\n
\nSTART = 'START'
\n
\n
\n
\nRUNNING = 'RUNNING'
\n
\n
\n
\nCOMPLETE = 'COMPLETE'
\n
\n
\n
\nABORT = 'ABORT'
\n
\n
\n
\nFAIL = 'FAIL'
\n
\n
\n
\nOTHER = 'OTHER'
\n
\n
\n
\n
\nclass openlineage.client.run.Dataset(namespace, name, facets=_Nothing.NOTHING)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • namespace (str)

  • \n
  • name (str)

  • \n
  • facets (Dict[Any, Any])

  • \n
\n
\n
\n
\n
\nnamespace: str
\n
\n
\n
\nname: str
\n
\n
\n
\nfacets: Dict[Any, Any]
\n
\n
\n
\n
\nclass openlineage.client.run.InputDataset(namespace, name, facets=_Nothing.NOTHING, inputFacets=_Nothing.NOTHING)
\n

Bases: Dataset

\n
\n
Parameters:
\n
    \n
  • namespace (str)

  • \n
  • name (str)

  • \n
  • facets (Dict[Any, Any])

  • \n
  • inputFacets (Dict[Any, Any])

  • \n
\n
\n
\n
\n
\ninputFacets: Dict[Any, Any]
\n
\n
\n
\n
\nclass openlineage.client.run.OutputDataset(namespace, name, facets=_Nothing.NOTHING, outputFacets=_Nothing.NOTHING)
\n

Bases: Dataset

\n
\n
Parameters:
\n
    \n
  • namespace (str)

  • \n
  • name (str)

  • \n
  • facets (Dict[Any, Any])

  • \n
  • outputFacets (Dict[Any, Any])

  • \n
\n
\n
\n
\n
\noutputFacets: Dict[Any, Any]
\n
\n
\n
\n
\nclass openlineage.client.run.DatasetEvent(eventTime, producer, schemaURL, dataset)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • eventTime (str)

  • \n
  • producer (str)

  • \n
  • schemaURL (str)

  • \n
  • dataset (Dataset)

  • \n
\n
\n
\n
\n
\neventTime: str
\n
\n
\n
\nproducer: str
\n
\n
\n
\nschemaURL: str
\n
\n
\n
\ndataset: Dataset
\n
\n
\n
\n
\nclass openlineage.client.run.Job(namespace, name, facets=_Nothing.NOTHING)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • namespace (str)

  • \n
  • name (str)

  • \n
  • facets (Dict[Any, Any])

  • \n
\n
\n
\n
\n
\nnamespace: str
\n
\n
\n
\nname: str
\n
\n
\n
\nfacets: Dict[Any, Any]
\n
\n
\n
\n
\nclass openlineage.client.run.JobEvent(eventTime, producer, schemaURL, job, inputs=_Nothing.NOTHING, outputs=_Nothing.NOTHING)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • eventTime (str)

  • \n
  • producer (str)

  • \n
  • schemaURL (str)

  • \n
  • job (Job)

  • \n
  • inputs (Optional[List[Dataset]])

  • \n
  • outputs (Optional[List[Dataset]])

  • \n
\n
\n
\n
\n
\neventTime: str
\n
\n
\n
\nproducer: str
\n
\n
\n
\nschemaURL: str
\n
\n
\n
\njob: Job
\n
\n
\n
\ninputs: Optional[List[Dataset]]
\n
\n
\n
\noutputs: Optional[List[Dataset]]
\n
\n
\n
\n
\nclass openlineage.client.run.Run(runId, facets=_Nothing.NOTHING)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • runId (str)

  • \n
  • facets (Dict[Any, Any])

  • \n
\n
\n
\n
\n
\nrunId: str
\n
\n
\n
\nfacets: Dict[Any, Any]
\n
\n
\n
\ncheck(attribute, value)
\n
\n
Parameters:
\n
    \n
  • attribute (str)

  • \n
  • value (str)

  • \n
\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\n
\nclass openlineage.client.run.RunEvent(eventType, eventTime, run, job, producer, inputs=_Nothing.NOTHING, outputs=_Nothing.NOTHING, schemaURL='https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent')
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • eventType (RunState)

  • \n
  • eventTime (str)

  • \n
  • run (Run)

  • \n
  • job (Job)

  • \n
  • producer (str)

  • \n
  • inputs (Optional[List[Dataset]])

  • \n
  • outputs (Optional[List[Dataset]])

  • \n
  • schemaURL (str)

  • \n
\n
\n
\n
\n
\neventType: RunState
\n
\n
\n
\neventTime: str
\n
\n
\n
\nrun: Run
\n
\n
\n
\njob: Job
\n
\n
\n
\nproducer: str
\n
\n
\n
\ninputs: Optional[List[Dataset]]
\n
\n
\n
\noutputs: Optional[List[Dataset]]
\n
\n
\n
\nschemaURL: str
\n
\n
\n
\ncheck(attribute, value)
\n
\n
Parameters:
\n
    \n
  • attribute (str)

  • \n
  • value (str)

  • \n
\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\n
\n
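A minimal sketch using the legacy `run` module classes above (newer code can use `event_v2`); the job coordinates and producer URI are placeholders.

```python
from datetime import datetime, timezone

from openlineage.client.run import Job, Run, RunEvent, RunState
from openlineage.client.uuid import generate_new_uuid

event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(generate_new_uuid())),
    job=Job(namespace="my-namespace", name="my-job"),   # illustrative coordinates
    producer="https://example.com/my-producer",         # placeholder producer URI
)
# schemaURL defaults to the 1-0-5 RunEvent schema shown in the signature above.
```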

openlineage.client.serde module

\n
\n
\nclass openlineage.client.serde.Serde
\n

Bases: object

\n
\n
\nclassmethod remove_nulls_and_enums(obj)
\n
\n
Parameters:
\n

obj (Any)

\n
\n
Return type:
\n

Any

\n
\n
\n
\n
\n
\nclassmethod to_dict(obj)
\n
\n
Parameters:
\n

obj (Any)

\n
\n
Return type:
\n

dict[Any, Any]

\n
\n
\n
\n
\n
\nclassmethod to_json(obj)
\n
\n
Parameters:
\n

obj (Any)

\n
\n
Return type:
\n

str

\n
\n
\n
\n
\n
\n
\n
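A quick sketch of serializing one of the model classes with `Serde`; the job coordinates are illustrative.

```python
from openlineage.client.run import Job
from openlineage.client.serde import Serde

job = Job(namespace="my-namespace", name="my-job")

print(Serde.to_dict(job))  # plain dict representation
print(Serde.to_json(job))  # JSON string representation
```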

openlineage.client.utils module

\n
\n
\nopenlineage.client.utils.import_from_string(path)
\n
\n
Parameters:
\n

path (str)

\n
\n
Return type:
\n

type[Any]

\n
\n
\n
\n
\n
\nopenlineage.client.utils.try_import_from_string(path)
\n
\n
Parameters:
\n

path (str)

\n
\n
Return type:
\n

type[Any] | None

\n
\n
\n
\n
\n
\nopenlineage.client.utils.get_only_specified_fields(clazz, params)
\n
\n
Parameters:
\n
    \n
  • clazz (type[Any])

  • \n
  • params (dict[str, Any])

  • \n
\n
\n
Return type:
\n

dict[str, Any]

\n
\n
\n
\n
\n
\nopenlineage.client.utils.deep_merge_dicts(dict1, dict2)
\n

Deep merges two dictionaries.

\n

This function merges two dictionaries while handling nested dictionaries. For keys that exist in both dictionaries, the values from dict2 take precedence. If a key exists in both dictionaries and the values are dictionaries themselves, they are merged recursively. This function merges only dictionaries; if a value is of a different type, e.g. a list, it is not handled properly.

\n
\n
Parameters:
\n
    \n
  • dict1 (dict[Any, Any])

  • \n
  • dict2 (dict[Any, Any])

  • \n
\n
\n
Return type:
\n

dict[Any, Any]

\n
\n
\n
\n
\n
\nclass openlineage.client.utils.RedactMixin
\n

Bases: object

\n
\n
\nproperty skip_redact: list[str]
\n
\n
\n
\n
\n
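A small sketch of `deep_merge_dicts` with illustrative config-style dictionaries.

```python
from openlineage.client.utils import deep_merge_dicts

defaults = {"transport": {"type": "http", "timeout": 5}, "facets": {}}
overrides = {"transport": {"timeout": 30}}

# Nested dicts are merged recursively; values from the second dict win.
merged = deep_merge_dicts(defaults, overrides)
print(merged)  # {'transport': {'type': 'http', 'timeout': 30}, 'facets': {}}
```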

openlineage.client.uuid module

\n
\n
\nopenlineage.client.uuid.generate_new_uuid(instant=None)
\n

Generate new UUID for an instant of time. Each function call returns a new UUID value.

\n

UUID version is an implementation detail and should not be relied on. For now it is [UUIDv7](https://datatracker.ietf.org/doc/rfc9562/), so for increasing instant values, the returned UUID is always greater than the previous one.

Based on the uuid6 lib implementation (MIT License): oittaa/uuid6-python, with a few changes.

\n

Added in v1.15.0

\n
\n
Parameters:
\n

instant (datetime | None) \u2013 instant of time used to generate UUID. If not provided, current time is used.

\n
\n
Return type:
\n

UUID

\n
\n
Returns:
\n

UUID

\n
\n
\n
\n
\n
\nopenlineage.client.uuid.generate_static_uuid(instant, data)
\n

Generate a UUID for an instant of time and input data. Calling the function with the same arguments always produces the same result.

UUID version is an implementation detail and should not be relied on. For now it is [UUIDv7](https://datatracker.ietf.org/doc/rfc9562/), so for increasing instant values, the returned UUID is always greater than the previous one. The only difference from RFC 9562 is that the least significant bytes are not random, but instead a SHA-1 hash of the input data.

Based on the uuid6 lib implementation (MIT License): oittaa/uuid6-python, with a few changes.

\n

Added in v1.15.0

\n
\n
Parameters:
\n
    \n
  • instant (datetime) \u2013 instant of time used to generate UUID. If not provided, current time is used.

  • \n
  • data (bytes) \u2013 input data to generate random part from.

  • \n
\n
\n
Return type:
\n

UUID

\n
\n
Returns:
\n

UUID

\n
\n
\n
\n
\n
\n
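A short sketch of both UUID helpers above; the instant and input data are illustrative.

```python
from datetime import datetime, timezone

from openlineage.client.uuid import generate_new_uuid, generate_static_uuid

run_id = generate_new_uuid()  # fresh, time-ordered UUID

# Deterministic UUID: the same instant and data always give the same value.
stable_id = generate_static_uuid(
    instant=datetime(2024, 1, 1, tzinfo=timezone.utc),
    data=b"my-namespace:my-job",  # illustrative input data
)
print(run_id, stable_id)
```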

openlineage.client.generated.base module

\n
\n
\nopenlineage.client.generated.base.set_producer(producer)
\n
\n
Parameters:
\n

producer (str)

\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\nclass openlineage.client.generated.base.BaseEvent(*, eventTime, producer='')
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • eventTime (str)

  • \n
  • producer (str)

  • \n
\n
\n
\n
\n
\neventTime: str
\n

the time the event occurred at

\n
\n
\n
\nproducer: str
\n
\n
\n
\nschemaURL: str
\n
\n
\n
\nproperty skip_redact
\n
\n
\n
\neventtime_check(attribute, value)
\n
\n
Parameters:
\n
    \n
  • attribute (str)

  • \n
  • value (str)

  • \n
\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\nproducer_check(attribute, value)
\n
\n
Parameters:
\n
    \n
  • attribute (str)

  • \n
  • value (str)

  • \n
\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\nschemaurl_check(attribute, value)
\n
\n
Parameters:
\n
    \n
  • attribute (str)

  • \n
  • value (str)

  • \n
\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\n
\nclass openlineage.client.generated.base.BaseFacet(*, producer='')
\n

Bases: RedactMixin

\n

all fields of the base facet are prefixed with _ to avoid name conflicts in facets

\n
\n
Parameters:
\n

producer (str)

\n
\n
\n
\n
\nproperty skip_redact
\n
\n
\n
\n
\nclass openlineage.client.generated.base.Dataset(namespace, name, *, facets=_Nothing.NOTHING)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • namespace (str)

  • \n
  • name (str)

  • \n
  • facets (dict[str, DatasetFacet] | None)

  • \n
\n
\n
\n
\n
\nnamespace: str
\n

The namespace containing that dataset

\n
\n
\n
\nname: str
\n

The unique name for that dataset within that namespace

\n
\n
\n
\nfacets: dict[str, DatasetFacet] | None
\n

The facets for this dataset

\n
\n
\n
\n
\nclass openlineage.client.generated.base.DatasetEvent(*, eventTime, producer='', dataset)
\n

Bases: BaseEvent

\n
\n
Parameters:
\n
    \n
  • eventTime (str)

  • \n
  • producer (str)

  • \n
  • dataset (StaticDataset)

  • \n
\n
\n
\n
\n
\ndataset: StaticDataset
\n
\n
\n
\n
\nclass openlineage.client.generated.base.DatasetFacet(*, producer='', deleted=None)
\n

Bases: BaseFacet

\n

A Dataset Facet

\n
\n
Parameters:
\n
    \n
  • producer (str)

  • \n
  • deleted (bool | None)

  • \n
\n
\n
\n
\n
\n
\nclass openlineage.client.generated.base.EventType(value)
\n

Bases: Enum

\n

the current transition of the run state. It is required to issue 1 START event and 1 of [ COMPLETE, ABORT, FAIL ] event per run. Additional events with OTHER eventType can be added to the same run. For example to send additional metadata after the run is complete

\n
\n
\nSTART = 'START'
\n
\n
\n
\nRUNNING = 'RUNNING'
\n
\n
\n
\nCOMPLETE = 'COMPLETE'
\n
\n
\n
\nABORT = 'ABORT'
\n
\n
\n
\nFAIL = 'FAIL'
\n
\n
\n
\nOTHER = 'OTHER'
\n
\n
\n
\n
\nclass openlineage.client.generated.base.InputDataset(namespace, name, inputFacets=_Nothing.NOTHING, *, facets=_Nothing.NOTHING)
\n

Bases: Dataset

\n

An input dataset

\n
\n
\ninputFacets: dict[str, InputDatasetFacet] | None
\n

The input facets for this dataset.

\n
\n
\n
\n
\nclass openlineage.client.generated.base.InputDatasetFacet(*, producer='')
\n

Bases: BaseFacet

\n

An Input Dataset Facet

\n
\n
Parameters:
\n

producer (str)

\n
\n
\n
\n
\n
\nclass openlineage.client.generated.base.Job(namespace, name, facets=_Nothing.NOTHING)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • namespace (str)

  • \n
  • name (str)

  • \n
  • facets (dict[str, JobFacet] | None)

  • \n
\n
\n
\n
\n
\nnamespace: str
\n

The namespace containing that job

\n
\n
\n
\nname: str
\n

The unique name for that job within that namespace

\n
\n
\n
\nfacets: dict[str, JobFacet] | None
\n

The job facets.

\n
\n
\n
\n
\nclass openlineage.client.generated.base.JobEvent(*, eventTime, producer='', job, inputs=_Nothing.NOTHING, outputs=_Nothing.NOTHING)
\n

Bases: BaseEvent

\n
\n
\njob: Job
\n
\n
\n
\ninputs: list[InputDataset] | None
\n

The set of input datasets.

\n
\n
\n
\noutputs: list[OutputDataset] | None
\n

The set of output datasets.

\n
\n
\n
\n
\nclass openlineage.client.generated.base.JobFacet(*, producer='', deleted=None)
\n

Bases: BaseFacet

\n

A Job Facet

\n
\n
Parameters:
\n
    \n
  • producer (str)

  • \n
  • deleted (bool | None)

  • \n
\n
\n
\n
\n
\n
\nclass openlineage.client.generated.base.OutputDataset(namespace, name, outputFacets=_Nothing.NOTHING, *, facets=_Nothing.NOTHING)
\n

Bases: Dataset

\n

An output dataset

\n
\n
\noutputFacets: dict[str, OutputDatasetFacet] | None
\n

The output facets for this dataset

\n
\n
\n
\n
\nclass openlineage.client.generated.base.OutputDatasetFacet(*, producer='')
\n

Bases: BaseFacet

\n

An Output Dataset Facet

\n
\n
Parameters:
\n

producer (str)

\n
\n
\n
\n
\n
\nclass openlineage.client.generated.base.Run(runId, facets=_Nothing.NOTHING)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • runId (str)

  • \n
  • facets (dict[str, RunFacet] | None)

  • \n
\n
\n
\n
\n
\nrunId: str
\n

The globally unique ID of the run associated with the job.

\n
\n
\n
\nfacets: dict[str, RunFacet] | None
\n

The run facets.

\n
\n
\n
\nrunid_check(attribute, value)
\n
\n
Parameters:
\n
    \n
  • attribute (str)

  • \n
  • value (str)

  • \n
\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\n
\nclass openlineage.client.generated.base.RunEvent(*, eventTime, producer='', run, job, eventType=None, inputs=_Nothing.NOTHING, outputs=_Nothing.NOTHING)
\n

Bases: BaseEvent

\n
\n
Parameters:
\n
    \n
  • eventTime (str)

  • \n
  • producer (str)

  • \n
  • run (Run)

  • \n
  • job (Job)

  • \n
  • eventType (EventType | None)

  • \n
  • inputs (list[InputDataset] | None)

  • \n
  • outputs (list[OutputDataset] | None)

  • \n
\n
\n
\n
\n
\nrun: Run
\n
\n
\n
\njob: Job
\n
\n
\n
\neventType: EventType | None
\n

the current transition of the run state. It is required to issue 1 START event and 1 of [ COMPLETE, ABORT, FAIL ] event per run. Additional events with OTHER eventType can be added to the same run. For example to send additional metadata after the run is complete

\n
\n
\n
\ninputs: list[InputDataset] | None
\n

The set of input datasets.

\n
\n
\n
\noutputs: list[OutputDataset] | None
\n

The set of output datasets.

\n
\n
\n
\n
\nclass openlineage.client.generated.base.RunFacet(*, producer='')
\n

Bases: BaseFacet

\n

A Run Facet

\n
\n
Parameters:
\n

producer (str)

\n
\n
\n
\n
\n
\nclass openlineage.client.generated.base.StaticDataset(namespace, name, *, facets=_Nothing.NOTHING)
\n

Bases: Dataset

\n

A Dataset sent within static metadata events

\n
\n
Parameters:
\n
    \n
  • namespace (str)

  • \n
  • name (str)

  • \n
  • facets (dict[str, DatasetFacet] | None)

  • \n
\n
\n
\n
\n
\n
\n
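A minimal sketch of a static metadata event built from the classes above; the producer URI and dataset coordinates are placeholders.

```python
from datetime import datetime, timezone

from openlineage.client.generated.base import DatasetEvent, StaticDataset

# Static (design-time) dataset metadata, not tied to a particular run.
event = DatasetEvent(
    eventTime=datetime.now(timezone.utc).isoformat(),
    producer="https://example.com/my-producer",  # placeholder producer URI
    dataset=StaticDataset(namespace="s3://my-bucket", name="raw/orders"),  # illustrative
)
```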

openlineage.client.generated.column_lineage_dataset module

\n
\n
\nclass openlineage.client.generated.column_lineage_dataset.ColumnLineageDatasetFacet(fields, dataset=_Nothing.NOTHING, *, producer='', deleted=None)
\n

Bases: DatasetFacet

\n
\n
Parameters:
\n
    \n
  • fields (dict[str, Fields])

  • \n
  • dataset (list[InputField] | None)

  • \n
  • producer (str)

  • \n
  • deleted (bool | None)

  • \n
\n
\n
\n
\n
\nfields: dict[str, Fields]
\n

Column level lineage that maps output fields into input fields used to evaluate them.

\n
\n
\n
\ndataset: list[InputField] | None
\n

Column level lineage that affects the whole dataset. This includes filtering, sorting, grouping (aggregates), joining, window functions, etc.

\n
\n
\n
\n
\nclass openlineage.client.generated.column_lineage_dataset.Fields(inputFields, transformationDescription=None, transformationType=None)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • inputFields (list[InputField])

  • \n
  • transformationDescription (str | None)

  • \n
  • transformationType (str | None)

  • \n
\n
\n
\n
\n
\ninputFields: list[InputField]
\n
\n
\n
\ntransformationDescription: str | None
\n

a string representation of the transformation applied

\n
\n
\n
\ntransformationType: str | None
\n

IDENTITY|MASKED reflects a clearly defined behavior. IDENTITY: exact same as input; MASKED: no original data available (like a hash of PII for example)

\n
\n
\n
\n
\n
\n
\nclass openlineage.client.generated.column_lineage_dataset.InputField(namespace, name, field, transformations=_Nothing.NOTHING)
\n

Bases: RedactMixin

\n

Represents a single dependency on some field (column).

\n
\n
Parameters:
\n
    \n
  • namespace (str)

  • \n
  • name (str)

  • \n
  • field (str)

  • \n
  • transformations (list[Transformation] | None)

  • \n
\n
\n
\n
\n
\nnamespace: str
\n

The input dataset namespace

\n
\n
\n
\nname: str
\n

The input dataset name

\n
\n
\n
\nfield: str
\n

The input field

\n
\n
\n
\ntransformations: list[Transformation] | None
\n
\n
\n
\n
\nclass openlineage.client.generated.column_lineage_dataset.Transformation(type, subtype=None, description=None, masking=None)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • type (str)

  • \n
  • subtype (str | None)

  • \n
  • description (str | None)

  • \n
  • masking (bool | None)

  • \n
\n
\n
\n
\n
\ntype: str
\n

The type of the transformation. Allowed values are: DIRECT, INDIRECT

\n
\n
\n
\n
\n
\nsubtype: str | None
\n

The subtype of the transformation

\n
\n
\n
\ndescription: str | None
\n

a string representation of the transformation applied

\n
\n
\n
\nmasking: bool | None
\n

is transformation masking the data or not

\n
\n
\n
\n
\n
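A small sketch of describing column-level lineage with the classes above; the dataset coordinates and SQL expression are illustrative.

```python
from openlineage.client.generated.column_lineage_dataset import (
    ColumnLineageDatasetFacet,
    Fields,
    InputField,
)

# Output column "full_name" is computed from two input columns.
column_lineage = ColumnLineageDatasetFacet(
    fields={
        "full_name": Fields(
            inputFields=[
                InputField(namespace="postgres://db.example.com", name="public.users", field="first_name"),
                InputField(namespace="postgres://db.example.com", name="public.users", field="last_name"),
            ],
            transformationDescription="CONCAT(first_name, ' ', last_name)",
        )
    }
)
```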

openlineage.client.generated.data_quality_assertions_dataset module

\n
\n
\nclass openlineage.client.generated.data_quality_assertions_dataset.Assertion(assertion, success, column=None)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • assertion (str)

  • \n
  • success (bool)

  • \n
  • column (str | None)

  • \n
\n
\n
\n
\n
\nassertion: str
\n

Type of expectation test that dataset is subjected to

\n
\n
\n
\nsuccess: bool
\n
\n
\n
\ncolumn: str | None
\n

Column that expectation is testing. It should match the name provided in SchemaDatasetFacet. If column field is empty, then expectation refers to whole dataset.

\n
\n
\n
\n
\nclass openlineage.client.generated.data_quality_assertions_dataset.DataQualityAssertionsDatasetFacet(assertions, *, producer='')
\n

Bases: InputDatasetFacet

\n

list of tests performed on dataset or dataset columns, and their results

\n
\n
Parameters:
\n
    \n
  • assertions (list[Assertion])

  • \n
  • producer (str)

  • \n
\n
\n
\n
\n
\nassertions: list[Assertion]
\n
\n
\n
\n
\n

openlineage.client.generated.data_quality_metrics_input_dataset module

\n
\n
\nclass openlineage.client.generated.data_quality_metrics_input_dataset.ColumnMetrics(nullCount=None, distinctCount=None, sum=None, count=None, min=None, max=None, quantiles=_Nothing.NOTHING)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • nullCount (int | None)

  • \n
  • distinctCount (int | None)

  • \n
  • sum (float | None)

  • \n
  • count (float | None)

  • \n
  • min (float | None)

  • \n
  • max (float | None)

  • \n
  • quantiles (dict[str, float] | None)

  • \n
\n
\n
\n
\n
\nnullCount: int | None
\n

The number of null values in this column for the rows evaluated

\n
\n
\n
\ndistinctCount: int | None
\n

The number of distinct values in this column for the rows evaluated

\n
\n
\n
\nsum: float | None
\n

The total sum of values in this column for the rows evaluated

\n
\n
\n
\ncount: float | None
\n

The number of values in this column

\n
\n
\n
\nmin: float | None
\n
\n
\n
\nmax: float | None
\n
\n
\n
\nquantiles: dict[str, float] | None
\n

The property key is the quantile. Examples: 0.1 0.25 0.5 0.75 1

\n
\n
\n
\n
\n
\n
\nclass openlineage.client.generated.data_quality_metrics_input_dataset.DataQualityMetricsInputDatasetFacet(columnMetrics, rowCount=None, bytes=None, fileCount=None, *, producer='')
\n

Bases: InputDatasetFacet

\n
\n
Parameters:
\n
    \n
  • columnMetrics (dict[str, ColumnMetrics])

  • \n
  • rowCount (int | None)

  • \n
  • bytes (int | None)

  • \n
  • fileCount (int | None)

  • \n
  • producer (str)

  • \n
\n
\n
\n
\n
\ncolumnMetrics: dict[str, ColumnMetrics]
\n

The property key is the column name

\n
\n
\n
\nrowCount: int | None
\n

The number of rows evaluated

\n
\n
\n
\nbytes: int | None
\n

The size in bytes

\n
\n
\n
\nfileCount: int | None
\n

The number of files evaluated

\n
\n
\n
\n
\n
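A short sketch of reporting data quality metrics with the classes above; the column names and values are illustrative.

```python
from openlineage.client.generated.data_quality_metrics_input_dataset import (
    ColumnMetrics,
    DataQualityMetricsInputDatasetFacet,
)

metrics = DataQualityMetricsInputDatasetFacet(
    columnMetrics={
        "email": ColumnMetrics(nullCount=3, distinctCount=995),
        "age": ColumnMetrics(min=18.0, max=99.0, quantiles={"0.5": 41.0}),
    },
    rowCount=1000,
    bytes=20000,
)
```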

openlineage.client.generated.dataset_version_dataset module

\n
\n
\nclass openlineage.client.generated.dataset_version_dataset.DatasetVersionDatasetFacet(datasetVersion, *, producer='', deleted=None)
\n

Bases: DatasetFacet

\n
\n
Parameters:
\n
    \n
  • datasetVersion (str)

  • \n
  • producer (str)

  • \n
  • deleted (bool | None)

  • \n
\n
\n
\n
\n
\ndatasetVersion: str
\n

The version of the dataset.

\n
\n
\n
\n
\n

openlineage.client.generated.datasource_dataset module

\n
\n
\nclass openlineage.client.generated.datasource_dataset.DatasourceDatasetFacet(name=None, uri=None, *, producer='', deleted=None)
\n

Bases: DatasetFacet

\n
\n
Parameters:
\n
    \n
  • name (str | None)

  • \n
  • uri (str | None)

  • \n
  • producer (str)

  • \n
  • deleted (bool | None)

  • \n
\n
\n
\n
\n
\nname: str | None
\n
\n
\n
\nuri: str | None
\n
\n
\n
\nuri_check(attribute, value)
\n
\n
Parameters:
\n
    \n
  • attribute (str)

  • \n
  • value (str)

  • \n
\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\n
\n

openlineage.client.generated.documentation_dataset module

\n
\n
\nclass openlineage.client.generated.documentation_dataset.DocumentationDatasetFacet(description, *, producer='', deleted=None)
\n

Bases: DatasetFacet

\n
\n
Parameters:
\n
    \n
  • description (str)

  • \n
  • producer (str)

  • \n
  • deleted (bool | None)

  • \n
\n
\n
\n
\n
\ndescription: str
\n

The description of the dataset.

\n
\n
\n
\n
\n

openlineage.client.generated.documentation_job module

\n
\n
\nclass openlineage.client.generated.documentation_job.DocumentationJobFacet(description, *, producer='', deleted=None)
\n

Bases: JobFacet

\n
\n
Parameters:
\n
    \n
  • description (str)

  • \n
  • producer (str)

  • \n
  • deleted (bool | None)

  • \n
\n
\n
\n
\n
\ndescription: str
\n

The description of the job.

\n
\n
\n
\n
\n

openlineage.client.generated.error_message_run module

\n
\n
\nclass openlineage.client.generated.error_message_run.ErrorMessageRunFacet(message, programmingLanguage, stackTrace=None, *, producer='')
\n

Bases: RunFacet

\n
\n
Parameters:
\n
    \n
  • message (str)

  • \n
  • programmingLanguage (str)

  • \n
  • stackTrace (str | None)

  • \n
  • producer (str)

  • \n
\n
\n
\n
\n
\nmessage: str
\n

A human-readable string representing error message generated by observed system

\n
\n
\n
\nprogrammingLanguage: str
\n

Programming language the observed system uses.

\n
\n
\n
\nstackTrace: str | None
\n

A language-specific stack trace generated by observed system

\n
\n
\n
\n
\n

openlineage.client.generated.external_query_run module

\n
\n
\nclass openlineage.client.generated.external_query_run.ExternalQueryRunFacet(externalQueryId, source, *, producer='')
\n

Bases: RunFacet

\n
\n
Parameters:
\n
    \n
  • externalQueryId (str)

  • \n
  • source (str)

  • \n
  • producer (str)

  • \n
\n
\n
\n
\n
\nexternalQueryId: str
\n

Identifier for the external system

\n
\n
\n
\nsource: str
\n

source of the external query

\n
\n
\n
\n
\n

openlineage.client.generated.extraction_error_run module

\n
\n
\nclass openlineage.client.generated.extraction_error_run.Error(errorMessage, stackTrace=None, task=None, taskNumber=None)
\n

Bases: RedactMixin

\n
\n
Parameters:
\n
    \n
  • errorMessage (str)

  • \n
  • stackTrace (str | None)

  • \n
  • task (str | None)

  • \n
  • taskNumber (int | None)

  • \n
\n
\n
\n
\n
\nerrorMessage: str
\n

Text representation of extraction error message.

\n
\n
\n
\nstackTrace: str | None
\n

Stack trace of extraction error message

\n
\n
\n
\ntask: str | None
\n

Text representation of task that failed. This can be, for example, SQL statement that parser could not interpret.

\n
\n
\n
\ntaskNumber: int | None
\n

Order of task (counted from 0).

\n
\n
\n
\n
\nclass openlineage.client.generated.extraction_error_run.ExtractionErrorRunFacet(totalTasks, failedTasks, errors, *, producer='')
\n

Bases: RunFacet

\n
\n
Parameters:
\n
    \n
  • totalTasks (int)

  • \n
  • failedTasks (int)

  • \n
  • errors (list[Error])

  • \n
  • producer (str)

  • \n
\n
\n
\n
\n
\ntotalTasks: int
\n

The number of distinguishable tasks in a run that were processed by OpenLineage, whether successfully or not. Those could be, for example, distinct SQL statements.

\n
\n
\n
\nfailedTasks: int
\n

The number of distinguishable tasks in a run that were not processed successfully by OpenLineage. Those could be, for example, distinct SQL statements.

\n
\n
\n
\nerrors: list[Error]
\n
\n
\n
\n
\n

openlineage.client.generated.job_type_job module

\n
\n
\nclass openlineage.client.generated.job_type_job.JobTypeJobFacet(processingType, integration, jobType=None, *, producer='', deleted=None)
\n

Bases: JobFacet

\n
\n
Parameters:
\n
    \n
  • processingType (str)

  • \n
  • integration (str)

  • \n
  • jobType (str | None)

  • \n
  • producer (str)

  • \n
  • deleted (bool | None)

  • \n
\n
\n
\n
\n
\nprocessingType: str
\n

Job processing type, like: BATCH or STREAMING

\n
\n
\n
\n
\n
\nintegration: str
\n

OpenLineage integration type of this job: for example SPARK|DBT|AIRFLOW|FLINK

\n
\n
\n
\n
\n
\njobType: str | None
\n

Run type, for example: QUERY|COMMAND|DAG|TASK|JOB|MODEL. This is an integration-specific field.

\n
\n
\n
\n
\n
\n
\n
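A minimal sketch of the job type facet; the value strings follow the conventions described above and in practice depend on the emitting integration.

```python
from openlineage.client.generated.job_type_job import JobTypeJobFacet

job_type = JobTypeJobFacet(
    processingType="BATCH",   # BATCH or STREAMING
    integration="SPARK",      # e.g. SPARK|DBT|AIRFLOW|FLINK
    jobType="QUERY",          # integration-specific, e.g. QUERY|COMMAND|DAG|TASK
)
```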

openlineage.client.generated.lifecycle_state_change_dataset module

\n
\n
\nclass openlineage.client.generated.lifecycle_state_change_dataset.LifecycleStateChange(value)
\n

Bases: Enum

\n

The lifecycle state change.

\n
\n
\nALTER = 'ALTER'
\n
\n
\n
\nCREATE = 'CREATE'
\n
\n
\n
\nDROP = 'DROP'
\n
\n
\n
\nOVERWRITE = 'OVERWRITE'
\n
\n
\n
\nRENAME = 'RENAME'
\n
\n
\n
\nTRUNCATE = 'TRUNCATE'
\n
\n
\n
\n
\nclass openlineage.client.generated.lifecycle_state_change_dataset.LifecycleStateChangeDatasetFacet(lifecycleStateChange, previousIdentifier=None, *, producer='', deleted=None)
\n

Bases: DatasetFacet

\n
\n
Parameters:
\n
    \n
  • lifecycleStateChange (LifecycleStateChange)

  • \n
  • previousIdentifier (PreviousIdentifier | None)

  • \n
  • producer (str)

  • \n
  • deleted (bool | None)

  • \n
\n
\n
\n
\n
\nlifecycleStateChange: LifecycleStateChange
\n

The lifecycle state change.

\n
\n
\n
\npreviousIdentifier: PreviousIdentifier | None
\n

Previous name of the dataset in case of renaming it.

\n
\n
\n
\n
\nclass openlineage.client.generated.lifecycle_state_change_dataset.PreviousIdentifier(name, namespace)
\n

Bases: RedactMixin

\n

Previous name of the dataset in case of renaming it.

\n
\n
Parameters:
\n
    \n
  • name (str)

  • \n
  • namespace (str)

  • \n
\n
\n
\n
\n
\nname: str
\n
\n
\n
\nnamespace: str
\n
\n
\n
\n
\n

openlineage.client.generated.nominal_time_run module

\n
\n
\nclass openlineage.client.generated.nominal_time_run.NominalTimeRunFacet(nominalStartTime, nominalEndTime=None, *, producer='')
\n

Bases: RunFacet

\n
\n
Parameters:
\n
    \n
  • nominalStartTime (str)

  • \n
  • nominalEndTime (str | None)

  • \n
  • producer (str)

  • \n
\n
\n
\n
\n
\nnominalStartTime: str
\n

An [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) timestamp representing the nominal start time (included) of the run. AKA the schedule time

\n
\n
\n
\n
\n
\nnominalEndTime: str | None
\n

An [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) timestamp representing the nominal end time (excluded) of the run. (Should be the nominal start time of the next run)

\n
\n
\n
\n
\n
\nnominalstarttime_check(attribute, value)
\n
\n
Parameters:
\n
    \n
  • attribute (str)

  • \n
  • value (str)

  • \n
\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\nnominalendtime_check(attribute, value)
\n
\n
Parameters:
\n
    \n
  • attribute (str)

  • \n
  • value (str)

  • \n
\n
\n
Return type:
\n

None

\n
\n
\n
\n
\n
\n
\n

openlineage.client.generated.output_statistics_output_dataset module

\n
\n
\nclass openlineage.client.generated.output_statistics_output_dataset.OutputStatisticsOutputDatasetFacet(rowCount=None, size=None, fileCount=None, *, producer='')
\n

Bases: OutputDatasetFacet

\n
\n
Parameters:
\n
    \n
  • rowCount (int | None)

  • \n
  • size (int | None)

  • \n
  • fileCount (int | None)

  • \n
  • producer (str)

  • \n
\n
\n
\n
\n
\nrowCount: int | None
\n

The number of rows written to the dataset

\n
\n
\n
\nsize: int | None
\n

The size in bytes written to the dataset

\n
\n
\n
\nfileCount: int | None
\n

The number of files written to the dataset

\n
\n
\n
\n
\n

openlineage.client.generated.ownership_dataset module

class openlineage.client.generated.ownership_dataset.Owner(name, type=None)
    Bases: RedactMixin
    Parameters: name (str), type (str | None)
    name: str - The identifier of the owner of the dataset. It is recommended to define this as a URN, for example application:foo, user:jdoe, team:data.
    type: str | None - The type of ownership (optional).

class openlineage.client.generated.ownership_dataset.OwnershipDatasetFacet(owners=_Nothing.NOTHING, *, producer='', deleted=None)
    Bases: DatasetFacet
    Parameters: owners (list[Owner] | None), producer (str), deleted (bool | None)
    owners: list[Owner] | None - The owners of the dataset.
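A short example of describing dataset ownership with URN-style identifiers; the owner names and type values below are illustrative, not prescribed by the spec:

```python
from openlineage.client.generated.ownership_dataset import Owner, OwnershipDatasetFacet

# URN-style owner identifiers, as recommended above.
facet = OwnershipDatasetFacet(
    owners=[
        Owner(name="team:data", type="TEAM"),
        Owner(name="user:jdoe", type="USER"),
    ]
)
```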

openlineage.client.generated.ownership_job module

class openlineage.client.generated.ownership_job.Owner(name, type=None)
    Bases: RedactMixin
    Parameters: name (str), type (str | None)
    name: str - The identifier of the owner of the job. It is recommended to define this as a URN, for example application:foo, user:jdoe, team:data.
    type: str | None - The type of ownership (optional).

class openlineage.client.generated.ownership_job.OwnershipJobFacet(owners=_Nothing.NOTHING, *, producer='', deleted=None)
    Bases: JobFacet
    Parameters: owners (list[Owner] | None), producer (str), deleted (bool | None)
    owners: list[Owner] | None - The owners of the job.

openlineage.client.generated.parent_run module

class openlineage.client.generated.parent_run.Job(namespace, name)
    Bases: RedactMixin
    Parameters: namespace (str), name (str)
    namespace: str - The namespace containing that job.
    name: str - The unique name for that job within that namespace.

class openlineage.client.generated.parent_run.ParentRunFacet(run, job, *, producer='')
    Bases: RunFacet
    The ID of the parent run and job, if this run was spawned from another run (for example, the DAG run scheduling its tasks).
    Parameters: run (Run), job (Job), producer (str)
    run: Run
    job: Job
    classmethod create(runId, namespace, name) - Parameters: runId (str), namespace (str), name (str). Return type: ParentRunFacet.

class openlineage.client.generated.parent_run.Run(runId)
    Bases: RedactMixin
    Parameters: runId (str)
    runId: str - The globally unique ID of the run associated with the job.
    runid_check(attribute, value) - Parameters: attribute (str), value (str). Return type: None.
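For example, a parent facet linking a task run to the scheduler run that spawned it can be built with the create classmethod; the identifiers below are made up:

```python
from openlineage.client.generated.parent_run import ParentRunFacet

# Points back to the parent run (e.g. the DAG run) that scheduled this task run.
facet = ParentRunFacet.create(
    runId="3f5e83fa-3480-44ff-99c5-ff943904e5e8",
    namespace="my-scheduler",
    name="daily_etl",
)
```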

openlineage.client.generated.processing_engine_run module

class openlineage.client.generated.processing_engine_run.ProcessingEngineRunFacet(version, name=None, openlineageAdapterVersion=None, *, producer='')
    Bases: RunFacet
    Parameters: version (str), name (str | None), openlineageAdapterVersion (str | None), producer (str)
    version: str - Processing engine version. Might be Airflow or Spark version.
    name: str | None - Processing engine name, e.g. Airflow or Spark.
    openlineageAdapterVersion: str | None - OpenLineage adapter package version. Might be e.g. OpenLineage Airflow integration package version.

openlineage.client.generated.schema_dataset module

class openlineage.client.generated.schema_dataset.SchemaDatasetFacet(fields=_Nothing.NOTHING, *, producer='', deleted=None)
    Bases: DatasetFacet
    Parameters: fields (list[SchemaDatasetFacetFields] | None), producer (str), deleted (bool | None)
    fields: list[SchemaDatasetFacetFields] | None - The fields of the data source.

class openlineage.client.generated.schema_dataset.SchemaDatasetFacetFields(name, type=None, description=None, fields=_Nothing.NOTHING)
    Bases: RedactMixin
    Parameters: name (str), type (str | None), description (str | None), fields (list[SchemaDatasetFacetFields] | None)
    name: str - The name of the field.
    type: str | None - The type of the field.
    description: str | None - The description of the field.
    fields: list[SchemaDatasetFacetFields] | None - Nested struct fields.
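A small example of a schema facet, including one nested struct field; the column names and types are illustrative:

```python
from openlineage.client.generated.schema_dataset import (
    SchemaDatasetFacet,
    SchemaDatasetFacetFields,
)

# A two-column schema; "address" shows how nested struct fields are expressed.
facet = SchemaDatasetFacet(
    fields=[
        SchemaDatasetFacetFields(name="id", type="BIGINT", description="the user id"),
        SchemaDatasetFacetFields(
            name="address",
            type="STRUCT",
            fields=[SchemaDatasetFacetFields(name="city", type="VARCHAR")],
        ),
    ]
)
```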

openlineage.client.generated.source_code_job module

class openlineage.client.generated.source_code_job.SourceCodeJobFacet(language, sourceCode, *, producer='', deleted=None)
    Bases: JobFacet
    Parameters: language (str), sourceCode (str), producer (str), deleted (bool | None)
    language: str - Language in which the source code of this job was written.
    sourceCode: str - Source code of this job.

openlineage.client.generated.source_code_location_job module

class openlineage.client.generated.source_code_location_job.SourceCodeLocationJobFacet(type, url, repoUrl=None, path=None, version=None, tag=None, branch=None, *, producer='', deleted=None)
    Bases: JobFacet
    Parameters: type (str), url (str), repoUrl (str | None), path (str | None), version (str | None), tag (str | None), branch (str | None), producer (str), deleted (bool | None)
    type: str - The source control system.
    url: str - The full HTTP URL to locate the file.
    repoUrl: str | None - The URL to the repository.
    path: str | None - The path in the repo containing the source files.
    version: str | None - The current version deployed (not a branch name, the actual unique version).
    tag: str | None - Optional tag name.
    branch: str | None - Optional branch name.
    url_check(attribute, value) - Parameters: attribute (str), value (str). Return type: None.

openlineage.client.generated.sql_job module

class openlineage.client.generated.sql_job.SQLJobFacet(query, *, producer='', deleted=None)
    Bases: JobFacet
    Parameters: query (str), producer (str), deleted (bool | None)
    query: str
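For illustration, attaching the SQL text of a job is a one-liner; the query below is a made-up example:

```python
from openlineage.client.generated.sql_job import SQLJobFacet

# The raw SQL text executed by the job.
facet = SQLJobFacet(query="SELECT id, email FROM users WHERE created_at >= '2024-01-01'")
```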

openlineage.client.generated.storage_dataset module

class openlineage.client.generated.storage_dataset.StorageDatasetFacet(storageLayer, fileFormat=None, *, producer='', deleted=None)
    Bases: DatasetFacet
    Parameters: storageLayer (str), fileFormat (str | None), producer (str), deleted (bool | None)
    storageLayer: str - Storage layer provider with allowed values: iceberg, delta.
    fileFormat: str | None - File format with allowed values: parquet, orc, avro, json, csv, text, xml.

openlineage.client.transport.composite module

class openlineage.client.transport.composite.CompositeConfig(transports, continue_on_failure=True)
    Bases: Config
    CompositeConfig is a configuration class for CompositeTransport.
    Parameters: transports (list[dict[str, Any]] | dict[str, dict[str, Any]]), continue_on_failure (bool)
    transports: list[dict[str, Any]] | dict[str, dict[str, Any]] - A list of dictionaries, where each dictionary represents the configuration for a child transport. Each dictionary should contain the necessary parameters to initialize a specific transport instance.
    continue_on_failure: bool - If set to True, the CompositeTransport will attempt to emit the event using all configured transports, regardless of whether any previous transport in the list failed to emit the event. If set to False, an error in one transport will halt the emission process for subsequent transports.
    classmethod from_dict(params) - Create a CompositeConfig object from a dictionary. Parameters: params (dict[str, Any]). Return type: CompositeConfig.

class openlineage.client.transport.composite.CompositeTransport(config)
    Bases: Transport
    CompositeTransport is a transport class that emits events using multiple transports.
    Parameters: config (CompositeConfig)
    kind: str | None = 'composite'
    config_class - alias of CompositeConfig
    property transports: list[Transport] - Create and return a list of transports based on the config.
    emit(event) - Emit an event using all transports in the config. Parameters: event (Event). Return type: None.
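As a sketch, a composite transport can be built from a plain dictionary via from_dict; each child entry uses the same keys as that transport's own config, with the type field selecting the transport. The URL below is a placeholder:

```python
from openlineage.client.transport.composite import CompositeConfig, CompositeTransport

# Fan events out to an HTTP backend and to the console; keep going if one child fails.
config = CompositeConfig.from_dict(
    {
        "transports": [
            {"type": "http", "url": "http://localhost:5000"},
            {"type": "console"},
        ],
        "continue_on_failure": True,
    }
)
transport = CompositeTransport(config)
```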

openlineage.client.transport.console module

class openlineage.client.transport.console.ConsoleConfig
    Bases: Config

class openlineage.client.transport.console.ConsoleTransport(config)
    Bases: Transport
    Parameters: config (ConsoleConfig)
    kind: str | None = 'console'
    config_class - alias of ConsoleConfig
    emit(event) - Parameters: event (Union[RunEvent, DatasetEvent, JobEvent]). Return type: None.

openlineage.client.transport.factory module

class openlineage.client.transport.factory.DefaultTransportFactory
    Bases: TransportFactory
    register_transport(of_type, clazz) - Parameters: of_type (str), clazz (type[Transport] | str). Return type: None.
    create(config=None) - Initializes and returns a transport mechanism based on the provided configuration. If OPENLINEAGE_DISABLED is set to 'true', a NoopTransport instance is returned, effectively disabling transport. If a configuration dictionary is provided, the transport specified by the config is initialized. If no configuration is provided, the function defaults to a console-based transport, logging a warning and printing events to the console. Parameters: config (dict[str, str] | None). Return type: Transport.
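A brief usage sketch of the factory, assuming a simple console configuration:

```python
from openlineage.client.transport.factory import DefaultTransportFactory

factory = DefaultTransportFactory()

# The "type" field selects which registered transport class to instantiate.
transport = factory.create({"type": "console"})
```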

openlineage.client.transport.file module

class openlineage.client.transport.file.FileConfig(log_file_path, append=False)
    Bases: Config
    Parameters: log_file_path (str), append (bool)
    log_file_path: str
    append: bool = False
    classmethod from_dict(params) - Parameters: params (dict[str, Any]). Return type: FileConfig.

class openlineage.client.transport.file.FileTransport(config)
    Bases: Transport
    Parameters: config (FileConfig)
    kind: str | None = 'file'
    config_class - alias of FileConfig
    emit(event) - Parameters: event (Union[RunEvent, DatasetEvent, JobEvent]). Return type: None.

openlineage.client.transport.http module

class openlineage.client.transport.http.TokenProvider(config)
    Bases: object
    Parameters: config (dict[str, str])
    get_bearer() - Return type: str | None.

class openlineage.client.transport.http.HttpCompression(value)
    Bases: Enum
    An enumeration. Members: GZIP = 'gzip'.

class openlineage.client.transport.http.ApiKeyTokenProvider(config)
    Bases: TokenProvider
    Parameters: config (dict[str, str])
    get_bearer() - Return type: str | None.

openlineage.client.transport.http.create_token_provider(auth)
    Parameters: auth (dict[str, str]). Return type: TokenProvider.

openlineage.client.transport.http.get_session()
    Return type: Session.

class openlineage.client.transport.http.HttpConfig(url, endpoint='api/v1/lineage', timeout=5.0, verify=True, auth=_Nothing.NOTHING, compression=None, session=None, adapter=None, custom_headers=_Nothing.NOTHING)
    Bases: Config
    Parameters: url (str), endpoint (str), timeout (float), verify (bool), auth (TokenProvider), compression (HttpCompression | None), session (Session | None), adapter (HTTPAdapter | None), custom_headers (dict[str, str])
    url: str
    endpoint: str
    timeout: float
    verify: bool
    auth: TokenProvider
    compression: HttpCompression | None
    session: Session | None
    adapter: HTTPAdapter | None
    custom_headers: dict[str, str]
    classmethod from_dict(params) - Parameters: params (dict[str, Any]). Return type: HttpConfig.
    classmethod from_options(url, options, session) - Parameters: url (str), options (OpenLineageClientOptions), session (Session | None). Return type: HttpConfig.

class openlineage.client.transport.http.HttpTransport(config)
    Bases: Transport
    Parameters: config (HttpConfig)
    kind: str | None = 'http'
    config_class - alias of HttpConfig
    set_adapter(adapter) - Parameters: adapter (HTTPAdapter). Return type: None.
    emit(event) - Parameters: event (Union[RunEvent, DatasetEvent, JobEvent]). Return type: Response.
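A minimal sketch of building an HTTP transport from a dictionary; the URL and endpoint are placeholders, and authentication options are omitted here:

```python
from openlineage.client.transport.http import HttpConfig, HttpTransport

# Build the transport configuration from a plain dict.
config = HttpConfig.from_dict(
    {
        "url": "http://localhost:5000",
        "endpoint": "api/v1/lineage",
        "timeout": 5,
        "verify": True,
    }
)
transport = HttpTransport(config)
```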

openlineage.client.transport.kafka module

class openlineage.client.transport.kafka.KafkaConfig(config, topic, messageKey=None, flush=True)
    Bases: Config
    Parameters: config (dict[str, str]), topic (str), messageKey (str | None), flush (bool)
    config: dict[str, str]
    topic: str
    messageKey: str | None
    flush: bool
    classmethod from_dict(params) - Parameters: params (dict[str, Any]). Return type: _T.

openlineage.client.transport.kafka.on_delivery(err, msg)
    Parameters: err (KafkaError), msg (Message). Return type: None.

class openlineage.client.transport.kafka.KafkaTransport(config)
    Bases: Transport
    Parameters: config (KafkaConfig)
    kind: str | None = 'kafka'
    config_class - alias of KafkaConfig
    emit(event) - Parameters: event (Event). Return type: None.
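A hedged example of a Kafka transport configuration; it assumes a Kafka producer library (typically confluent-kafka) is installed, and the broker address and topic below are placeholders:

```python
from openlineage.client.transport.kafka import KafkaConfig, KafkaTransport

# The "config" dict is passed through to the underlying Kafka producer.
config = KafkaConfig(
    config={"bootstrap.servers": "localhost:9092"},
    topic="openlineage.events",
    flush=True,
)
transport = KafkaTransport(config)
```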

openlineage.client.transport.msk_iam module

class openlineage.client.transport.msk_iam.MSKIAMConfig(config, topic, messageKey=None, flush=True, region=None, aws_profile=None, role_arn=None, aws_debug_creds=False)
    Bases: KafkaConfig
    Parameters: config (dict[str, str]), topic (str), messageKey (str | None), flush (bool), region (str), aws_profile (None | str), role_arn (None | str), aws_debug_creds (bool)
    region: str
    aws_profile: None | str
    role_arn: None | str
    aws_debug_creds: bool

class openlineage.client.transport.msk_iam.MSKIAMTransport(config)
    Bases: KafkaTransport
    Parameters: config (MSKIAMConfig)
    kind: str | None = 'msk-iam'
    config_class - alias of MSKIAMConfig

openlineage.client.transport.noop module

class openlineage.client.transport.noop.NoopConfig
    Bases: Config

class openlineage.client.transport.noop.NoopTransport(config)
    Bases: Transport
    Parameters: config (NoopConfig)
    kind: str | None = 'noop'
    config_class - alias of NoopConfig
    emit(event) - Parameters: event (Union[RunEvent, DatasetEvent, JobEvent]). Return type: None.

openlineage.client.transport.transport module

To implement a custom transport, implement both the Config and Transport classes.

The Transport subclass needs to:
* set the config_class class variable to the Config class that the transport requires
* provide an __init__ that accepts an instance of that Config class
* implement an emit method that accepts a RunEvent

The config file is read and its parameters are passed to the from_dict classmethod. The config class can have more complex attributes, but it needs to be able to instantiate them in the from_dict method.

DefaultTransportFactory instantiates custom transports by looking at the type field in the config.

class openlineage.client.transport.transport.Config
    Bases: object
    classmethod from_dict(params) - Parameters: params (dict[str, Any]). Return type: _T.

class openlineage.client.transport.transport.Transport
    Bases: object
    kind: str | None = None
    name: str | None = None
    config_class - alias of Config
    emit(event) - Parameters: event (Event). Return type: Any.

class openlineage.client.transport.transport.TransportFactory
    Bases: object
    create(config=None) - Parameters: config (dict[str, str] | None). Return type: Transport.
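To make the contract above concrete, here is a minimal sketch of a custom transport; the PrintConfig and PrintTransport names and the use of Serde for serialization are illustrative choices, not part of the client API:

```python
from typing import Any

from openlineage.client.serde import Serde
from openlineage.client.transport.factory import DefaultTransportFactory
from openlineage.client.transport.transport import Config, Transport


class PrintConfig(Config):
    """No options needed; from_dict just returns an empty config."""

    @classmethod
    def from_dict(cls, params: dict) -> "PrintConfig":
        return cls()


class PrintTransport(Transport):
    kind = "print"
    config_class = PrintConfig

    def __init__(self, config: PrintConfig) -> None:
        self.config = config

    def emit(self, event: Any) -> None:
        # Serialize the event to JSON and print it.
        print(Serde.to_json(event))


# Register the transport so the factory can create it from a {"type": "print"} config.
factory = DefaultTransportFactory()
factory.register_transport("print", PrintTransport)
```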
"}}> + diff --git a/versioned_docs/version-1.26.0/development/developing/python/setup.md b/versioned_docs/version-1.26.0/development/developing/python/setup.md new file mode 100644 index 0000000..5201ab1 --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/python/setup.md @@ -0,0 +1,41 @@ +--- +title: Setup a development environment +sidebar_position: 1 +--- + +There are four Python OpenLineage packages that you can install locally when setting up a development environment.
+Two of them: [openlineage-integration-common](https://pypi.org/project/openlineage-integration-common/) and [openlineage-airflow](https://pypi.org/project/openlineage-airflow/) have dependency on [openlineage-python](https://pypi.org/project/openlineage-python/) client and [openlineage-sql](https://pypi.org/project/openlineage-sql/). + +Typically, you first need to build `openlineage-sql` locally (see [README](https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/README.md)). After each release you have to repeat this step in order to bump local version of the package. + +To install Openlineage Common, Python Client & Dagster integration you need to run pip install command with a link to local directory: + +```bash +$ python -m pip install -e .[dev] +``` +In zsh: +```bash +$ python -m pip install -e .\[dev\] +``` + +To make Airflow integration setup easier you can use run following command in package directory: +```bash +$ pip install -r dev-requirements.txt +``` +This should install all needed integrations locally. + +### Docker Compose development environment +There is also possibility to create local Docker-based development environment that has OpenLineage libraries setup along with Airflow and some helpful services. +To do that you should run `run-dev-airflow.sh` script located [here](https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/scripts/run-dev-airflow.sh). + +The script uses the same Docker Compose files as [integration tests](./tests/airflow.md#integration-tests). Two main differences are: +* it runs in non-blocking way +* it mounts OpenLineage Python packages as editable and mounted to Airflow containers. This allows to change code and test it live without need to rebuild whole environment. + + +When using above script, you can add the `-i` flag or `--attach-integration` flag. +This can be helpful when you need to run arbitrary integration tests during development. For example, the following command run in the integration container... +```bash +python -m pytest test_integration.py::test_integration[great_expectations_validation-requests/great_expectations.json] +``` +...runs a single test which you can repeat after changes in code. \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/development/developing/python/tests/_category_.json b/versioned_docs/version-1.26.0/development/developing/python/tests/_category_.json new file mode 100644 index 0000000..3ac452b --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/python/tests/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Tests", + "position": 2 +} diff --git a/versioned_docs/version-1.26.0/development/developing/python/tests/airflow.md b/versioned_docs/version-1.26.0/development/developing/python/tests/airflow.md new file mode 100644 index 0000000..c1f37ef --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/python/tests/airflow.md @@ -0,0 +1,180 @@ +--- +title: Airflow +sidebar_position: 2 +--- + +OpenLineage provides an integration with Apache Airflow. As Airflow is actively developed and major changes happen quite often it is advised to test OpenLineage integration against multiple Airflow versions. In the current CI process OpenLineage is tested against following versions: +* 2.1.4 (2.0+ upgrade) +* 2.2.4 +* 2.3.4 (TaskListener API introduced) +* 2.4.3 +* 2.5.2 +* 2.6.1 + +### Unit tests +In order to make running unit tests against multiple Airflow versions easier there is possibility to use [tox](https://tox.wiki/). 
+To run unit tests against all configured Airflow versions just run: +``` +tox +``` +You can also list existing environments with: +``` +tox -l +``` +that should list: +``` +py3-airflow-2.1.4 +py3-airflow-2.2.4 +py3-airflow-2.3.4 +py3-airflow-2.4.3 +py3-airflow.2.5.0 +``` +Then you can run tests in chosen environment, e.g.: +``` +tox -e py3-airflow-2.3.4 +``` +`setup.cfg` contains tox-related configuration. By default `tox` command runs: +1. `flake8` linting +2. `pytest` command + +Additionally, outside of `tox` you should run `mypy` static code analysis. You can do that with: +``` +python -m mypy openlineage +``` + +### Integration tests +Integration tests are located in `tests/integration/tests` directory. They require running Docker containers to provision local test environment: Airflow components (worker, scheduler), databases (PostgreSQL, MySQL) and OpenLineage events consumer. + +#### How to run +Integration tests require usage of _docker compose_. There are scripts prepared to make build images and run tests easier. + +```bash +AIRFLOW_IMAGE= ./tests/integration/docker/up.sh +``` +e.g. +```bash +AIRFLOW_IMAGE=apache/airflow:2.3.4-python3.7 ./tests/integration/docker/up.sh +``` +#### What tests are ran +The actual setup is to run all defined Airflow DAGs, collect OpenLineage events and check if they meet requirements. +The test you should pay most attention to is `test_integration`. It compares produced events to expected JSON structures recursively, with a respect if fields are not missing. + +Some of the tests are skipped if database connection specific environment variables are not set. The example is set of `SNOWFLAKE_PASSWORD` and `SNOWFLAKE_ACCOUNT_ID` variables. + +#### View stored OpenLineage events +OpenLineage events produced from Airflow runs are stored locally in `./tests/integration/tests/events` directory. The files are not overwritten, rather new events are appended to existing files. + +#### Example how to add new integration test +Let's take following `CustomOperator` for which we should add `CustomExtractor` and test it. First we create DAG in integration tests DAGs folder: [airflow/tests/integration/tests/airflow/dags](https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/tests/integration/tests/airflow/dags). + +```python +from airflow.models import BaseOperator +from airflow.utils.dates import days_ago +from airflow import DAG + + +default_args = { + 'depends_on_past': False, + 'start_date': days_ago(7) +} + + +dag = DAG( + 'custom_extractor', + schedule_interval='@once', + default_args=default_args +) + +class CustomOperator(BaseOperator): + def execute(self, context: Any): + for i in range(10): + print(i) + +t1 = CustomOperator( + task_id='custom_extractor', + dag=dag +) +``` +In the same folder we create `custom_extractor.py`: +```python +from typing import Union, Optional, List + +from openlineage.client.run import Dataset +from openlineage.airflow.extractors import TaskMetadata +from openlineage.airflow.extractors.base import BaseExtractor + + +class CustomExtractor(BaseExtractor): + @classmethod + def get_operator_classnames(cls) -> List[str]: + return ['CustomOperator'] + + def extract(self) -> Union[Optional[TaskMetadata], List[TaskMetadata]]: + return TaskMetadata( + "test", + inputs=[ + Dataset( + namespace="test", + name="dataset", + facets={} + ) + ] + ) +``` +Typically we want to compare produced metadata against expected. 
In order to do that we create JSON file `custom_extractor.json` in [airflow/tests/integration/requests](https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/tests/integration/requests): +``` + [{ + "eventType": "START", + "inputs": [{ + "facets": {}, + "name": "dataset", + "namespace": "test" + }], + "job": { + "facets": { + "documentation": { + "description": "Test dag." + } + }, + "name": "custom_extractor.custom_extractor", + "namespace": "food_delivery" + }, + "run": { + "facets": { + "airflow_runArgs": { + "externalTrigger": false + }, + "parent": { + "job": { + "name": "custom_extractor", + "namespace": "food_delivery" + } + } + } + } + }, + { + "eventType": "COMPLETE", + "inputs": [{ + "facets": {}, + "name": "dataset", + "namespace": "test" + }], + "job": { + "facets": {}, + "name": "custom_extractor.custom_extractor", + "namespace": "food_delivery" + } + } + ] + ``` + and add parameter for `test_integration` in [airflow/tests/integration/test_integration.py](https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/tests/integration/test_integration.py): +``` +("source_code_dag", "requests/source_code.json"), ++ ("custom_extractor", "requests/custom_extractor.json"), +("unknown_operator_dag", "requests/unknown_operator.json"), +``` + +That should setup a check for existence of both `START` and `COMPLETE` events, custom input facet and correct job facet. + +Full example can be found in source code available in integration tests [directory](https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/tests/integration/). \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/development/developing/python/tests/client.md b/versioned_docs/version-1.26.0/development/developing/python/tests/client.md new file mode 100644 index 0000000..d8a286b --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/python/tests/client.md @@ -0,0 +1,10 @@ +--- +title: Client +sidebar_position: 1 +--- + +:::info +This page needs your contribution! Please contribute new examples using the edit link at the bottom. +::: + +There are unit tests available for OpenLineage Python client. You can run them with a simple `pytest` command with directory set to client base path. \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/development/developing/python/tests/common.md b/versioned_docs/version-1.26.0/development/developing/python/tests/common.md new file mode 100644 index 0000000..2e3b585 --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/python/tests/common.md @@ -0,0 +1,10 @@ +--- +title: Common +sidebar_position: 3 +--- + +:::info +This page needs your contribution! Please contribute new examples using the edit link at the bottom. +::: + +There are unit tests available for OpenLineage [common package](../../developing.md#common-library-python). You can run them with a simple `pytest` command with directory set to package base path. \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/development/developing/python/tests/dagster.md b/versioned_docs/version-1.26.0/development/developing/python/tests/dagster.md new file mode 100644 index 0000000..5e2a241 --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/python/tests/dagster.md @@ -0,0 +1,10 @@ +--- +title: Dagster +sidebar_position: 4 +--- + +:::info +This page needs your contribution! Please contribute new examples using the edit link at the bottom. 
+::: + +There are unit tests available for Dagster integration. You can run them with a simple `pytest` command with directory set to integration base path. diff --git a/versioned_docs/version-1.26.0/development/developing/python/troubleshooting/_category_.json b/versioned_docs/version-1.26.0/development/developing/python/troubleshooting/_category_.json new file mode 100644 index 0000000..6aba13b --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/python/troubleshooting/_category_.json @@ -0,0 +1,5 @@ +{ + "label": "Troubleshooting", + "position": 3 +} + \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/development/developing/python/troubleshooting/logging.md b/versioned_docs/version-1.26.0/development/developing/python/troubleshooting/logging.md new file mode 100644 index 0000000..f4e2ac4 --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/python/troubleshooting/logging.md @@ -0,0 +1,290 @@ +--- +title: Logging +sidebar_position: 1 +--- + +OpenLineage uses python's [logging facility](https://docs.python.org/3/library/logging.html) when generating logs. Being able to emit logs for various purposes is very helpful when troubleshooting OpenLineage. + +Consider the following sample python script that emits OpenLineage events: + +```python +#!/usr/bin/env python3 +from openlineage.client.run import ( + RunEvent, + RunState, + Run, + Job, + Dataset, + OutputDataset, + InputDataset, +) +from openlineage.client.client import OpenLineageClient, OpenLineageClientOptions +from openlineage.client.facet import ( + SqlJobFacet, + SchemaDatasetFacet, + SchemaField, + OutputStatisticsOutputDatasetFacet, + SourceCodeLocationJobFacet, + NominalTimeRunFacet, + DataQualityMetricsInputDatasetFacet, + ColumnMetric, +) +from openlineage.client.uuid import generate_new_uuid +from datetime import datetime, timezone, timedelta +import time +from random import random + +PRODUCER = f"https://github.com/openlineage-user" +namespace = "python_client" + +url = "http://localhost:5000" +api_key = "1234567890ckcu028rzu5l" + +client = OpenLineageClient( + url=url, + # optional api key in case the backend requires it + options=OpenLineageClientOptions(api_key=api_key), +) + +# generates job facet +def job(job_name, sql, location): + facets = {"sql": SqlJobFacet(sql)} + if location != None: + facets.update( + {"sourceCodeLocation": SourceCodeLocationJobFacet("git", location)} + ) + return Job(namespace=namespace, name=job_name, facets=facets) + + +# geneartes run racet +def run(run_id, hour): + return Run( + runId=run_id, + facets={ + "nominalTime": NominalTimeRunFacet( + nominalStartTime=f"2022-04-14T{twoDigits(hour)}:12:00Z" + ) + }, + ) + + +# generates dataset +def dataset(name, schema=None, ns=namespace): + if schema == None: + facets = {} + else: + facets = {"schema": schema} + return Dataset(namespace, name, facets) + + +# generates output dataset +def outputDataset(dataset, stats): + output_facets = {"stats": stats, "outputStatistics": stats} + return OutputDataset(dataset.namespace, dataset.name, dataset.facets, output_facets) + + +# generates input dataset +def inputDataset(dataset, dq): + input_facets = { + "dataQuality": dq, + } + return InputDataset(dataset.namespace, dataset.name, dataset.facets, input_facets) + + +def twoDigits(n): + if n < 10: + result = f"0{n}" + elif n < 100: + result = f"{n}" + else: + raise f"error: {n}" + return result + + +now = datetime.now(timezone.utc) + + +# generates run Event +def runEvents(job_name, sql, inputs, 
outputs, hour, min, location, duration): + run_id = str(generate_new_uuid()) + myjob = job(job_name, sql, location) + myrun = run(run_id, hour) + started_at = now + timedelta(hours=hour, minutes=min, seconds=20 + round(random() * 10)) + ended_at = started_at + timedelta(minutes=duration, seconds=20 + round(random() * 10)) + return ( + RunEvent( + eventType=RunState.START, + eventTime=started_at.isoformat(), + run=myrun, + job=myjob, + producer=PRODUCER, + inputs=inputs, + outputs=outputs, + ), + RunEvent( + eventType=RunState.COMPLETE, + eventTime=ended_at.isoformat(), + run=myrun, + job=myjob, + producer=PRODUCER, + inputs=inputs, + outputs=outputs, + ), + ) + + +# add run event to the events list +def addRunEvents( + events, job_name, sql, inputs, outputs, hour, minutes, location=None, duration=2 +): + (start, complete) = runEvents( + job_name, sql, inputs, outputs, hour, minutes, location, duration + ) + events.append(start) + events.append(complete) + + +events = [] + +# create dataset data +for i in range(0, 5): + + user_counts = dataset("tmp_demo.user_counts") + user_history = dataset( + "temp_demo.user_history", + SchemaDatasetFacet( + fields=[ + SchemaField(name="id", type="BIGINT", description="the user id"), + SchemaField( + name="email_domain", type="VARCHAR", description="the user id" + ), + SchemaField(name="status", type="BIGINT", description="the user id"), + SchemaField( + name="created_at", + type="DATETIME", + description="date and time of creation of the user", + ), + SchemaField( + name="updated_at", + type="DATETIME", + description="the last time this row was updated", + ), + SchemaField( + name="fetch_time_utc", + type="DATETIME", + description="the time the data was fetched", + ), + SchemaField( + name="load_filename", + type="VARCHAR", + description="the original file this data was ingested from", + ), + SchemaField( + name="load_filerow", + type="INT", + description="the row number in the original file", + ), + SchemaField( + name="load_timestamp", + type="DATETIME", + description="the time the data was ingested", + ), + ] + ), + "snowflake://", + ) + + create_user_counts_sql = """CREATE OR REPLACE TABLE TMP_DEMO.USER_COUNTS AS ( + SELECT DATE_TRUNC(DAY, created_at) date, COUNT(id) as user_count + FROM TMP_DEMO.USER_HISTORY + GROUP BY date + )""" + + # location of the source code + location = "https://github.com/some/airflow/dags/example/user_trends.py" + + # run simulating Airflow DAG with snowflake operator + addRunEvents( + events, + "create_user_counts", + create_user_counts_sql, + [user_history], + [user_counts], + i, + 11, + location, + ) + + +for event in events: + from openlineage.client.serde import Serde + client.emit(event) + +``` + +When you use OpenLineage backend such as Marquez on your local environment, the script would emit OpenLienage events to it. + +```bash +python oltest.py +``` + +However, this short script does not produce any logging information, as the logging configuration is not setup. + +Add the following line to `oltest.py`, to configure the logging level as `DEBUG`. + +```python +... +import logging +... +logging.basicConfig(level=logging.DEBUG) +... 
+``` + +Re-running the `oltest.py` again will now produce the following outputs: + +``` +DEBUG:openlineage.client.transport.http:Constructing openlineage client to send events to http://localhost:5000 +DEBUG:openlineage.client.transport.http:Sending openlineage event {"eventTime": "2022-12-07T02:10:24.369600+00:00", "eventType": "START", "inputs": [{"facets": {"schema": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SchemaDatasetFacet", "fields": [{"description": "the user id", "name": "id", "type": "BIGINT"}, {"description": "the user id", "name": "email_domain", "type": "VARCHAR"}, {"description": "the user id", "name": "status", "type": "BIGINT"}, {"description": "date and time of creation of the user", "name": "created_at", "type": "DATETIME"}, {"description": "the last time this row was updated", "name": "updated_at", "type": "DATETIME"}, {"description": "the time the data was fetched", "name": "fetch_time_utc", "type": "DATETIME"}, {"description": "the original file this data was ingested from", "name": "load_filename", "type": "VARCHAR"}, {"description": "the row number in the original file", "name": "load_filerow", "type": "INT"}, {"description": "the time the data was ingested", "name": "load_timestamp", "type": "DATETIME"}]}}, "name": "temp_demo.user_history", "namespace": "python_client"}], "job": {"facets": {"sourceCodeLocation": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SourceCodeLocationJobFacet", "type": "git", "url": "https://github.com/some/airflow/dags/example/user_trends.py"}, "sql": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SqlJobFacet", "query": "CREATE OR REPLACE TABLE TMP_DEMO.USER_COUNTS AS (\n\t\t\tSELECT DATE_TRUNC(DAY, created_at) date, COUNT(id) as user_count\n\t\t\tFROM TMP_DEMO.USER_HISTORY\n\t\t\tGROUP BY date\n\t\t\t)"}}, "name": "create_user_counts", "namespace": "python_client"}, "outputs": [{"facets": {}, "name": "tmp_demo.user_counts", "namespace": "python_client"}], "producer": "https://github.com/openlineage-user", "run": {"facets": {"nominalTime": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/NominalTimeRunFacet", "nominalStartTime": "2022-04-14T00:12:00Z"}}, "runId": "e74f805a-0fde-4480-84a3-6919011eb14d"}} +DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:5000 +DEBUG:urllib3.connectionpool:http://localhost:5000 "POST /api/v1/lineage HTTP/1.1" 201 0 +DEBUG:openlineage.client.transport.http:Sending openlineage event {"eventTime": "2022-12-07T02:12:47.369600+00:00", "eventType": "COMPLETE", "inputs": [{"facets": {"schema": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SchemaDatasetFacet", "fields": [{"description": "the user id", "name": "id", "type": "BIGINT"}, {"description": "the user id", "name": "email_domain", "type": "VARCHAR"}, {"description": "the user id", 
"name": "status", "type": "BIGINT"}, {"description": "date and time of creation of the user", "name": "created_at", "type": "DATETIME"}, {"description": "the last time this row was updated", "name": "updated_at", "type": "DATETIME"}, {"description": "the time the data was fetched", "name": "fetch_time_utc", "type": "DATETIME"}, {"description": "the original file this data was ingested from", "name": "load_filename", "type": "VARCHAR"}, {"description": "the row number in the original file", "name": "load_filerow", "type": "INT"}, {"description": "the time the data was ingested", "name": "load_timestamp", "type": "DATETIME"}]}}, "name": "temp_demo.user_history", "namespace": "python_client"}], "job": {"facets": {"sourceCodeLocation": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SourceCodeLocationJobFacet", "type": "git", "url": "https://github.com/some/airflow/dags/example/user_trends.py"}, "sql": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SqlJobFacet", "query": "CREATE OR REPLACE TABLE TMP_DEMO.USER_COUNTS AS (\n\t\t\tSELECT DATE_TRUNC(DAY, created_at) date, COUNT(id) as user_count\n\t\t\tFROM TMP_DEMO.USER_HISTORY\n\t\t\tGROUP BY date\n\t\t\t)"}}, "name": "create_user_counts", "namespace": "python_client"}, "outputs": [{"facets": {}, "name": "tmp_demo.user_counts", "namespace": "python_client"}], "producer": "https://github.com/openlineage-user", "run": {"facets": {"nominalTime": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/NominalTimeRunFacet", "nominalStartTime": "2022-04-14T00:12:00Z"}}, "runId": "e74f805a-0fde-4480-84a3-6919011eb14d"}} +DEBUG:urllib3.connectionpool:http://localhost:5000 "POST /api/v1/lineage HTTP/1.1" 201 0 +DEBUG:openlineage.client.transport.http:Sending openlineage event {"eventTime": "2022-12-07T03:10:20.369600+00:00", "eventType": "START", "inputs": [{"facets": {"schema": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SchemaDatasetFacet", "fields": [{"description": "the user id", "name": "id", "type": "BIGINT"}, {"description": "the user id", "name": "email_domain", "type": "VARCHAR"}, {"description": "the user id", "name": "status", "type": "BIGINT"}, {"description": "date and time of creation of the user", "name": "created_at", "type": "DATETIME"}, {"description": "the last time this row was updated", "name": "updated_at", "type": "DATETIME"}, {"description": "the time the data was fetched", "name": "fetch_time_utc", "type": "DATETIME"}, {"description": "the original file this data was ingested from", "name": "load_filename", "type": "VARCHAR"}, {"description": "the row number in the original file", "name": "load_filerow", "type": "INT"}, {"description": "the time the data was ingested", "name": "load_timestamp", "type": "DATETIME"}]}}, "name": "temp_demo.user_history", "namespace": "python_client"}], "job": {"facets": {"sourceCodeLocation": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": 
"https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SourceCodeLocationJobFacet", "type": "git", "url": "https://github.com/some/airflow/dags/example/user_trends.py"}, "sql": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SqlJobFacet", "query": "CREATE OR REPLACE TABLE TMP_DEMO.USER_COUNTS AS (\n\t\t\tSELECT DATE_TRUNC(DAY, created_at) date, COUNT(id) as user_count\n\t\t\tFROM TMP_DEMO.USER_HISTORY\n\t\t\tGROUP BY date\n\t\t\t)"}}, "name": "create_user_counts", "namespace": "python_client"}, "outputs": [{"facets": {}, "name": "tmp_demo.user_counts", "namespace": "python_client"}], "producer": "https://github.com/openlineage-user", "run": {"facets": {"nominalTime": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/NominalTimeRunFacet", "nominalStartTime": "2022-04-14T01:12:00Z"}}, "runId": "ff034dc3-e3e9-4e4b-bcf1-efba104ac4d4"}} +DEBUG:urllib3.connectionpool:http://localhost:5000 "POST /api/v1/lineage HTTP/1.1" 201 0 +DEBUG:openlineage.client.transport.http:Sending openlineage event {"eventTime": "2022-12-07T03:12:42.369600+00:00", "eventType": "COMPLETE", "inputs": [{"facets": {"schema": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SchemaDatasetFacet", "fields": [{"description": "the user id", "name": "id", "type": "BIGINT"}, {"description": "the user id", "name": "email_domain", "type": "VARCHAR"}, {"description": "the user id", "name": "status", "type": "BIGINT"}, {"description": "date and time of creation of the user", "name": "created_at", "type": "DATETIME"}, {"description": "the last time this row was updated", "name": "updated_at", "type": "DATETIME"}, {"description": "the time the data was fetched", "name": "fetch_time_utc", "type": "DATETIME"}, {"description": "the original file this data was ingested from", "name": "load_filename", "type": "VARCHAR"}, {"description": "the row number in the original file", "name": "load_filerow", "type": "INT"}, {"description": "the time the data was ingested", "name": "load_timestamp", "type": "DATETIME"}]}}, "name": "temp_demo.user_history", "namespace": "python_client"}], "job": {"facets": {"sourceCodeLocation": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SourceCodeLocationJobFacet", "type": "git", "url": "https://github.com/some/airflow/dags/example/user_trends.py"}, "sql": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SqlJobFacet", "query": "CREATE OR REPLACE TABLE TMP_DEMO.USER_COUNTS AS (\n\t\t\tSELECT DATE_TRUNC(DAY, created_at) date, COUNT(id) as user_count\n\t\t\tFROM TMP_DEMO.USER_HISTORY\n\t\t\tGROUP BY date\n\t\t\t)"}}, "name": "create_user_counts", "namespace": "python_client"}, "outputs": [{"facets": {}, "name": "tmp_demo.user_counts", "namespace": "python_client"}], "producer": "https://github.com/openlineage-user", "run": {"facets": 
{"nominalTime": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/NominalTimeRunFacet", "nominalStartTime": "2022-04-14T01:12:00Z"}}, "runId": "ff034dc3-e3e9-4e4b-bcf1-efba104ac4d4"}} +DEBUG:urllib3.connectionpool:http://localhost:5000 "POST /api/v1/lineage HTTP/1.1" 201 0 +DEBUG:openlineage.client.transport.http:Sending openlineage event {"eventTime": "2022-12-07T04:10:22.369600+00:00", "eventType": "START", "inputs": [{"facets": {"schema": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SchemaDatasetFacet", "fields": [{"description": "the user id", "name": "id", "type": "BIGINT"}, {"description": "the user id", "name": "email_domain", "type": "VARCHAR"}, {"description": "the user id", "name": "status", "type": "BIGINT"}, {"description": "date and time of creation of the user", "name": "created_at", "type": "DATETIME"}, {"description": "the last time this row was updated", "name": "updated_at", "type": "DATETIME"}, {"description": "the time the data was fetched", "name": "fetch_time_utc", "type": "DATETIME"}, {"description": "the original file this data was ingested from", "name": "load_filename", "type": "VARCHAR"}, {"description": "the row number in the original file", "name": "load_filerow", "type": "INT"}, {"description": "the time the data was ingested", "name": "load_timestamp", "type": "DATETIME"}]}}, "name": "temp_demo.user_history", "namespace": "python_client"}], "job": {"facets": {"sourceCodeLocation": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SourceCodeLocationJobFacet", "type": "git", "url": "https://github.com/some/airflow/dags/example/user_trends.py"}, "sql": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SqlJobFacet", "query": "CREATE OR REPLACE TABLE TMP_DEMO.USER_COUNTS AS (\n\t\t\tSELECT DATE_TRUNC(DAY, created_at) date, COUNT(id) as user_count\n\t\t\tFROM TMP_DEMO.USER_HISTORY\n\t\t\tGROUP BY date\n\t\t\t)"}}, "name": "create_user_counts", "namespace": "python_client"}, "outputs": [{"facets": {}, "name": "tmp_demo.user_counts", "namespace": "python_client"}], "producer": "https://github.com/openlineage-user", "run": {"facets": {"nominalTime": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/NominalTimeRunFacet", "nominalStartTime": "2022-04-14T02:12:00Z"}}, "runId": "b7304cdf-7c9e-4183-bd9d-1474cb86bad3"}} +DEBUG:urllib3.connectionpool:http://localhost:5000 "POST /api/v1/lineage HTTP/1.1" 201 0 +DEBUG:openlineage.client.transport.http:Sending openlineage event {"eventTime": "2022-12-07T04:12:45.369600+00:00", "eventType": "COMPLETE", "inputs": [{"facets": {"schema": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SchemaDatasetFacet", "fields": [{"description": "the user id", 
"name": "id", "type": "BIGINT"}, {"description": "the user id", "name": "email_domain", "type": "VARCHAR"}, {"description": "the user id", "name": "status", "type": "BIGINT"}, {"description": "date and time of creation of the user", "name": "created_at", "type": "DATETIME"}, {"description": "the last time this row was updated", "name": "updated_at", "type": "DATETIME"}, {"description": "the time the data was fetched", "name": "fetch_time_utc", "type": "DATETIME"}, {"description": "the original file this data was ingested from", "name": "load_filename", "type": "VARCHAR"}, {"description": "the row number in the original file", "name": "load_filerow", "type": "INT"}, {"description": "the time the data was ingested", "name": "load_timestamp", "type": "DATETIME"}]}}, "name": "temp_demo.user_history", "namespace": "python_client"}], "job": {"facets": {"sourceCodeLocation": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SourceCodeLocationJobFacet", "type": "git", "url": "https://github.com/some/airflow/dags/example/user_trends.py"}, "sql": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SqlJobFacet", "query": "CREATE OR REPLACE TABLE TMP_DEMO.USER_COUNTS AS (\n\t\t\tSELECT DATE_TRUNC(DAY, created_at) date, COUNT(id) as user_count\n\t\t\tFROM TMP_DEMO.USER_HISTORY\n\t\t\tGROUP BY date\n\t\t\t)"}}, "name": "create_user_counts", "namespace": "python_client"}, "outputs": [{"facets": {}, "name": "tmp_demo.user_counts", "namespace": "python_client"}], "producer": "https://github.com/openlineage-user", "run": {"facets": {"nominalTime": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/NominalTimeRunFacet", "nominalStartTime": "2022-04-14T02:12:00Z"}}, "runId": "b7304cdf-7c9e-4183-bd9d-1474cb86bad3"}} +DEBUG:urllib3.connectionpool:http://localhost:5000 "POST /api/v1/lineage HTTP/1.1" 201 0 +.... +``` + +DEBUG will also produce meaningful error messages when something does not work correctly. 
For example, if the backend server does not exist, you would get the following messages in your console output: + +``` +DEBUG:openlineage.client.transport.http:Constructing openlineage client to send events to http://localhost:5000 +DEBUG:openlineage.client.transport.http:Sending openlineage event {"eventTime": "2022-12-07T02:15:58.090994+00:00", "eventType": "START", "inputs": [{"facets": {"schema": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SchemaDatasetFacet", "fields": [{"description": "the user id", "name": "id", "type": "BIGINT"}, {"description": "the user id", "name": "email_domain", "type": "VARCHAR"}, {"description": "the user id", "name": "status", "type": "BIGINT"}, {"description": "date and time of creation of the user", "name": "created_at", "type": "DATETIME"}, {"description": "the last time this row was updated", "name": "updated_at", "type": "DATETIME"}, {"description": "the time the data was fetched", "name": "fetch_time_utc", "type": "DATETIME"}, {"description": "the original file this data was ingested from", "name": "load_filename", "type": "VARCHAR"}, {"description": "the row number in the original file", "name": "load_filerow", "type": "INT"}, {"description": "the time the data was ingested", "name": "load_timestamp", "type": "DATETIME"}]}}, "name": "temp_demo.user_history", "namespace": "python_client"}], "job": {"facets": {"sourceCodeLocation": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SourceCodeLocationJobFacet", "type": "git", "url": "https://github.com/some/airflow/dags/example/user_trends.py"}, "sql": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SqlJobFacet", "query": "CREATE OR REPLACE TABLE TMP_DEMO.USER_COUNTS AS (\n\t\t\tSELECT DATE_TRUNC(DAY, created_at) date, COUNT(id) as user_count\n\t\t\tFROM TMP_DEMO.USER_HISTORY\n\t\t\tGROUP BY date\n\t\t\t)"}}, "name": "create_user_counts", "namespace": "python_client"}, "outputs": [{"facets": {}, "name": "tmp_demo.user_counts", "namespace": "python_client"}], "producer": "https://github.com/openlineage-user", "run": {"facets": {"nominalTime": {"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/NominalTimeRunFacet", "nominalStartTime": "2022-04-14T00:12:00Z"}}, "runId": "c321058c-276b-4d1a-a260-8e16f2137c2b"}} +DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:5000 +Traceback (most recent call last): + File "/opt/homebrew/Caskroom/miniconda/base/envs/openlineage/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn + conn = connection.create_connection( + File "/opt/homebrew/Caskroom/miniconda/base/envs/openlineage/lib/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection + raise err + File "/opt/homebrew/Caskroom/miniconda/base/envs/openlineage/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection + sock.connect(sa) +ConnectionRefusedError: [Errno 61] Connection refused +``` + +If you wish to output loggigng 
message to a file, you can modify the basic configuration as following: +```python +... +logging.basicConfig(filename='debug.log', encoding='utf-8', level=logging.DEBUG) +... +``` + +And the output will be saved to a file `debug.log`. + +### Further readings +- https://docs.python.org/3/library/logging.html +- https://realpython.com/python-logging/ diff --git a/versioned_docs/version-1.26.0/development/developing/spark/_category_.json b/versioned_docs/version-1.26.0/development/developing/spark/_category_.json new file mode 100644 index 0000000..92ba6d8 --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/spark/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Spark", + "position": 3 +} diff --git a/versioned_docs/version-1.26.0/development/developing/spark/built_in_lineage.md b/versioned_docs/version-1.26.0/development/developing/spark/built_in_lineage.md new file mode 100644 index 0000000..cc4921c --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/spark/built_in_lineage.md @@ -0,0 +1,269 @@ +--- +sidebar_position: 2 +title: Integrating with Spark extensions +--- + +:::info +Feature available since 1.11. +::: + +:::info +To get even better lineage coverage for Spark extensions, we recommend implementing lineage extraction +directly within the extensions' code and this page contains documentation on that. +::: + +Spark ecosystem comes with a plenty of extensions that affect lineage extraction logic. +`spark-interfaces-scala` package contains Scala traits which can be implemented on the extension's side to +generate high quality metadata for OpenLineage events. + +In general, a mechanism works in a following way: + * Package `spark-interfaces-scala` is a simple and lightweight. + Its only purpose is to contain methods to generate OpenLineage model objects (like facets, datasets) programmatically + and interfaces' definitions (Scala traits) to expose lineage information from nodes of Spark logical plan. + * Any extension that adds custom node to Spark logical plan can implement the interfaces. + * Spark OpenLineage integration, when traversing logical plan tree, checks if its nodes implement + those interfaces and uses their methods to extract lineage metadata from those nodes. + +## Problem definition + +OpenLineage Spark integration is based on `openlineage-spark-*.jar` library attached +to a running Spark job. The library traverses Spark logical plan on run state updates to generate +OpenLineage events. While traversing plan's tree, the library extracts input and output datasets +as well as other interesting aspects of this particular job, run or datasets involved in the processing. +Extraction code for each node is contained within `openlineage-spark.jar`. + +Two main issues with this approach are: +* Spark ecosystem comes with plenty of extensions and many of them add + custom nodes into the logical plan of the query executed. + These nodes need to be traversed and understood by `openlineage-spark` to + extract lineage out of them. This brings serious complexity to the code base. Not only OpenLineage + has to cover multiple Spark versions, but also each Spark version supports multiple versions of + multiple extensions. + +* Spark extensions know a lot of valuable metadata that can be published within OpenLineage events. + It makes sense to allow extensions publish facets on their own. This [issue](https://github.com/OpenLineage/OpenLineage/issues/167) + contains great example of useful aspects that can be retrieved from Iceberg extension. 
+ +## Solution + +A remedy to the problems above is to migrate the lineage extraction logic directly to +Spark `LogicalPlan` nodes. The advantages of this approach are: +* **One-to-one version matching** - there is no longer a need for a single integration codebase to support + multiple versions of a Spark extension. +* **Avoid breaking changes** - this approach limits the number of upgrades that break the integration between + `openlineage-spark` and other extensions, because the lineage extraction code lives directly in the extension's + codebase, which ensures that changes on the Spark extension side do not break it. + +The `spark-interfaces-scala` package contains the traits to be implemented as well as extra utility +classes for integrating OpenLineage with any Spark extension. + +The package code should not be shipped with the extension that implements the traits. The dependency should be marked +as compile-only. The implementation calling the methods is responsible for providing +`spark-interfaces-scala` on the classpath. + +Please note that this package as well as the traits should be considered experimental and may evolve +in the future. All the current logic has been put into the `*.scala.v1` package. First, it is possible +that we will develop the same interfaces in Java. Secondly, in case of incompatible changes, +we are going to release `v2` interfaces. We aim to support different versions within the Spark +integration. + +## Extracting lineage from plan nodes + +### The easy way - return all the metadata about a dataset + +The Spark optimized logical plan is a tree made of `LogicalPlan` nodes. Oftentimes, it is a Spark extension's +internal class that implements `LogicalPlan` and becomes a node within the tree. In this case, it is +reasonable to implement the lineage extraction logic directly within that class. + +Two interfaces have been prepared: +* `io.openlineage.spark.builtin.scala.v1.InputLineageNode` with the `getInputs` method, +* `io.openlineage.spark.builtin.scala.v1.OutputLineageNode` with the `getOutputs` method. + +They return lists of `InputDatasetWithFacets` and `OutputDatasetWithFacets` respectively. Each trait has methods +to expose dataset facets as well as facets that relate to a particular dataset only in the context of the +current run, like the amount of bytes read from a certain dataset. + +### When extracting dataset name and namespace is non-trivial + +The simple approach is to let the extension provide a dataset identifier containing `namespace` and `name`. +However, in some cases this can be cumbersome. +For example, within Spark's codebase there are several nodes whose output dataset is +`DatasourceV2Relation`, and extracting the dataset's `name` and `namespace` from such nodes involves +non-trivial logic. In such scenarios, it does not make sense to require an extension to re-implement +the logic already present within the `spark-openlineage` code. To solve this, the traits introduce datasets +with delegates, which don't contain an exact dataset identifier with name and namespace. Instead, they contain +a pointer to another member of the plan from which `spark-openlineage` should extract the identifier. + +For this scenario, the case classes `InputDatasetWithDelegate` and +`OutputDatasetWithDelegate` have been created. They allow assigning facets to a dataset, while +still letting other code extract metadata for the same dataset. The classes contain a `node` object +property which points to the node within the logical plan that holds more metadata about the dataset.
+In other words, returning a delegate will make OpenLineage Spark integration extract lineage from +the delegate and enrich it with facets attached to a delegate. + +An example implementation for `ReplaceIcebergData` node: + +```scala +override def getOutputs(context: OpenLineageContext): List[OutputDatasetWithFacets] = { + if (!table.isInstanceOf[DataSourceV2Relation]) { + List() + } else { + val relation = table.asInstanceOf[DataSourceV2Relation] + val datasetFacetsBuilder: DatasetFacetsBuilder = { + new OpenLineage.DatasetFacetsBuilder() + .lifecycleStateChange( + context + .openLineage + .newLifecycleStateChangeDatasetFacet( + OpenLineage.LifecycleStateChangeDatasetFacet.LifecycleStateChange.OVERWRITE, + null + ) + ) + } + + // enrich dataset with additional facets like a dataset version + DatasetVersionUtils.getVersionOf(relation) match { + case Some(version) => datasetFacetsBuilder.version( + context + .openLineage + .newDatasetVersionDatasetFacet(version) + ) + case None => + } + + // return output dataset while pointing that more dataset details shall be extracted from + // `relation` object. + List( + OutputDatasetWithDelegate( + relation, + datasetFacetsBuilder, + new OpenLineage.OutputDatasetOutputFacetsBuilder() + ) + ) + } + } +``` + +### When extension implements a relation within standard LogicalRelation + +In this scenario, Spark extension is using standard `LogicalRelation` node within the logical plan. +However, the node may contain extension's specific `relation` property which extends +`org.apache.spark.sql.sources.BaseRelation`. In this case, we allow `BaseRelation` implementation +to implement `io.openlineage.spark.builtin.scala.v1.LineageRelation` interface. + +### When extension implements a provider to create relations + +An extension can contain implementation of `org.apache.spark.sql.sources.RelationProvider` +which again does not use any custom nodes within the logical plan, but provides classes to +create relations. To support this scenario, `io.openlineage.spark.builtin.scala.v1.LineageDatasetProvider` +can be implemented. + +### When extension uses Spark DataSource v2 API + +Some extensions rely on Spark DataSource V2 API and implement TableProvider, Table, ScanBuilder etc. +that are used within Spark to create `DataSourceV2Relation` instances. + +A logical plan node `DataSourceV2Relation` contains `Table` field with a properties map of type +`Map`. `openlineage-spark` uses this map to extract dataset information for lineage +event from `DataSourceV2Relation`. It is checking for the properties `openlineage.dataset.name` and +`openlineage.dataset.namespace`. If they are present, it uses them to identify a dataset. Please +be aware that namespace and name need to conform to [naming convention](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md). + +Properties can be also used to pass any dataset facet. For example: +``` +openlineage.dataset.facets.customFacet={"property1": "value1", "property2": "value2"} +``` +will enrich dataset with `customFacet`: +```json +"inputs": [{ +"name": "...", +"namespace": "...", +"facets": { + "customFacet": { + "property1": "value1", + "property2": "value2", + "_producer": "..." + }, + "schema": { } +}] +``` + +## Column Level Lineage + +Lineage is extracted from the optimized logical plan. The plan is a tree with the root being the output +dataset and leaves the input datasets. In order to collect column level lineage we need to track dependencies between input and output fields. 
+ +Each node within plan has to understand which input attributes it consumes and how they affect output attributes produced by the node. +Attribute fields within plan are identified by `ExprId`. In order to build column level lineage, +dependencies between input and output attributes for each plan's node need to be identified. + +In order to emit column level lineage from a given spark node, `io.openlineage.spark.builtin.scala.v1.ColumnLevelLineageNode` +trait has to be implemented. The trait should implement following methods +* `def columnLevelLineageInputs(context: OpenLineageContext): List[DatasetFieldLineage]` +* `def columnLevelLineageOutputs(context: OpenLineageContext): List[DatasetFieldLineage]` +* `columnLevelLineageDependencies(context: OpenLineageContext): List[ExpressionDependency]` + +First two methods are used to identify input and output fields as well as matching the fields +to expressions which use the fields. Returned field lineage can contain identifier, which is mostly +field name, but can also be represented by a delegate object pointing to expression where +the identifier shall be extracted from. + +`ExpressionDependency` allows matching, for each Spark plan node, input expressions onto output +expressions. Having all the inputs and outputs identified, as well as intermediate dependencies between +the expressions used, allow building column level lineage facet. + +Code below contains an example of `ColumnLevelLineageNode` within Iceberg's `MergeRows` class +that implements `MERGE INTO` for Spark 3.4: + +```scala +case class MergeRows( + ..., + matchedOutputs: Seq[Seq[Seq[Expression]]], + notMatchedOutputs: Seq[Seq[Expression]], + output: Seq[Attribute], + child: LogicalPlan +) extends UnaryNode with ColumnLevelLineageNode { + + override def columnLevelLineageDependencies(context: OpenLineageContext): List[ExpressionDependency] = { + val deps: ListBuffer[ExpressionDependency] = ListBuffer() + + // For each matched and not-matched outputs `ExpressionDependencyWithDelegate` is created + // This means for output expression id `attr.exprId.id`, `expr` node needs to be examined to + // detect input expression ids. + output.zipWithIndex.foreach { + case (attr: Attribute, index: Int) => + notMatchedOutputs + .toStream + .filter(exprs => exprs.size > index) + .map(exprs => exprs(index)) + .foreach(expr => deps += ExpressionDependencyWithDelegate(OlExprId(attr.exprId.id), expr)) + matchedOutputs + .foreach { + matched => + matched + .toStream + .filter(exprs => exprs.size > index) + .map(exprs => exprs(index)) + .foreach(expr => deps += ExpressionDependencyWithDelegate(OlExprId(attr.exprId.id), expr)) + } + } + + deps.toList + } + + override def columnLevelLineageInputs(context: OpenLineageContext): List[DatasetFieldLineage] = { + // Delegates input field extraction to other logical plan node + List(InputDatasetFieldFromDelegate(child)) + } + + override def columnLevelLineageOutputs(context: OpenLineageContext): List[DatasetFieldLineage] = { + // For each output attribute return its name and ExprId assigned to it. + // We're aiming for lineage traits to stay Spark version agnostic and don't want to rely + // on Spark classes. That's why `OlExprId` is used to pass `ExprId` + output.map(a => OutputDatasetField(a.name, OlExprId(a.exprId.id))).toList + } + } +``` + +Please note that `ExpressionDependency` can be extended in the future to contain more information +on how inputs were used to produce a certain output attribute. 
\ No newline at end of file diff --git a/versioned_docs/version-1.26.0/development/developing/spark/setup.md b/versioned_docs/version-1.26.0/development/developing/spark/setup.md new file mode 100644 index 0000000..d3131fb --- /dev/null +++ b/versioned_docs/version-1.26.0/development/developing/spark/setup.md @@ -0,0 +1,56 @@ +--- +sidebar_position: 1 +title: Build +--- + +# Build + +## Java 17 + +Testing requires a Java 17 JVM to test the Scala Spark components. +Use your favourite tool (sdkman, `/usr/libexec/java_home`) to set `JAVA_HOME` and `PATH` environmental variables properly. + +## Preparation + +The integration depends on four libraries that are build locally `openlineage-java`, `spark-extension-interfaces`, `spark-extension-entrypoint` and `openlineage-sql-java`, +so before any testing or building of a package you need to publish the appropriate artifacts to local maven repository. +To build the packages you need to execute: + +```sh +./buildDependencies.sh +``` + +## Testing + +To run the tests, from the current directory run: + +```sh +./gradlew test +``` + +To run the integration tests, from the current directory run: + +```sh +./gradlew integrationTest +``` + +## Build jar + +```sh +./gradlew shadowJar +``` + +## Contributing + +If contributing changes, additions or fixes to the Spark integration, please include the following header in any new `.java` files: + +``` +/* +/* Copyright 2018-2024 contributors to the OpenLineage project +/* SPDX-License-Identifier: Apache-2.0 +*/ +``` + +A Github Action checks for headers in new `.java` files when pull requests are opened. + +Thank you for your contributions to the project! \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/development/examples.md b/versioned_docs/version-1.26.0/development/examples.md new file mode 100644 index 0000000..39337ca --- /dev/null +++ b/versioned_docs/version-1.26.0/development/examples.md @@ -0,0 +1,173 @@ +--- +title: Example Lineage Events +sidebar_position: 2 +--- + +## Simple Examples + +### START event with single input + +This is a START event with a single PostgreSQL input dataset. + +```json +{ + "eventType": "START", + "eventTime": "2020-12-28T19:52:00.001+10:00", + "run": { + "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" + }, + "job": { + "namespace": "workshop", + "name": "process_taxes" + }, + "inputs": [{ + "namespace": "postgres://workshop-db:None", + "name": "workshop.public.taxes" + }], + "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client" +} +``` + +### COMPLETE event with single output + +This is a COMPLETE event with a single PostgreSQL output dataset. + +```json +{ + "eventType": "COMPLETE", + "eventTime": "2020-12-28T20:52:00.001+10:00", + "run": { + "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" + }, + "job": { + "namespace": "workshop", + "name": "process_taxes" + }, + "outputs": [{ + "namespace": "postgres://workshop-db:None", + "name": "workshop.public.unpaid_taxes" + }], + "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client" +} +``` + +## Complex Examples + +### START event with Facets (run and job) + +This is a START event with run and job facets of Apache Airflow. 
+ +```json +{ +  "eventType": "START", +  "eventTime": "2020-12-28T19:52:00.001+10:00", +  "run": { +    "runId": "d46e465b-d358-4d32-83d4-df660ff614dd", +    "facets": { +      "airflow_runArgs": { +        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.10.0/integration/airflow", +        "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/BaseFacet", +        "externalTrigger": true +      }, +      "nominalTime": { +        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.10.0/integration/airflow", +        "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/NominalTimeRunFacet", +        "nominalStartTime": "2022-07-29T14:14:31.458067Z" +      }, +      "parentRun": { +        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.10.0/integration/airflow", +        "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/ParentRunFacet", +        "job": { +          "name": "etl_orders", +          "namespace": "cosmic_energy" +        }, +        "run": { +          "runId": "1ba6fdaa-fb80-36ce-9c5b-295f544ec462" +        } +      } +    } +  }, +  "job": { +    "namespace": "workshop", +    "name": "process_taxes", +    "facets": { +      "documentation": { +        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.10.0/integration/airflow", +        "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/DocumentationJobFacet", +        "description": "Process taxes." +      }, +      "sql": { +        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.10.0/integration/airflow", +        "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SqlJobFacet", +        "query": "INSERT into taxes values(1, 100, 1000, 4000);" +      } +    } +  }, +  "inputs": [{ +    "namespace": "postgres://workshop-db:None", +    "name": "workshop.public.taxes" +  }], +  "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client" +} +``` + +### COMPLETE event with Facets (dataset) + +This is a COMPLETE event with a dataset facet describing a database table.
+ +```json +{ + "eventType": "COMPLETE", + "eventTime": "2020-12-28T20:52:00.001+10:00", + "run": { + "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" + }, + "job": { + "namespace": "workshop", + "name": "process_taxes" + }, + "outputs": [{ + "namespace": "postgres://workshop-db:None", + "name": "workshop.public.unpaid_taxes", + "facets": { + "dataSource": { + "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.10.0/integration/airflow", + "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/DataSourceDatasetFacet", + "name": "postgres://workshop-db:None", + "uri": "workshop-db" + }, + "schema": { + "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.10.0/integration/airflow", + "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SchemaDatasetFacet", + "fields": [ + { + "name": "id", + "type": "SERIAL PRIMARY KEY" + }, + { + "name": "tax_dt", + "type": "TIMESTAMP NOT NULL" + }, + { + "name": "tax_item_id", + "type": "INTEGER REFERENCES tax_itemsid" + }, + { + "name": "amount", + "type": "INTEGER NOT NULL" + }, + { + "name": "ref_id", + "type": "INTEGER REFERENCES refid" + }, + { + "name": "comment", + "type": "TEXT" + } + ] + } + } + }], + "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client" +} +``` \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/development/ol-proxy.md b/versioned_docs/version-1.26.0/development/ol-proxy.md new file mode 100644 index 0000000..a9f046c --- /dev/null +++ b/versioned_docs/version-1.26.0/development/ol-proxy.md @@ -0,0 +1,57 @@ +--- +title: OpenLineage Proxy +sidebar_position: 3 +--- + +OpenLineage Proxy is a simple Java server that can be used to monitor the JSON events that OpenLineage client emits, as well as tunnel the transmission to the OpenLineage backend such as [Marquez](https://marquezproject.ai/). + +When you are unable to collect logs on the client side, but want to make sure the event that gets emitted are valid and correct, you can use OpenLineage Proxy to verify the messages. + +## Accessing the proxy +OpenLineage proxy can be obtained via github: +``` +git clone https://github.com/OpenLineage/OpenLineage.git +cd OpenLineage/proxy/backend +``` + +## Building the proxy +To build the proxy jar, run +``` +$ ./gradlew build +``` + +The packaged jar file can be found under `./build/libs/` + +## Running the proxy + +OpenLineage Proxy requires configuration file named `proxy.yml`. There is an [example](https://github.com/OpenLineage/OpenLineage/blob/main/proxy/backend/proxy.example.yml) that you can copy and name it as `proxy.yml`. + +``` +cp proxy.example.yml proxy.yml +``` + +By default, the OpenLineage proxy uses the following ports: + +- TCP port 8080 is available for the HTTP API server. +- TCP port 8081 is available for the admin interface. + +You can then run the proxy using gradlew: +``` +$ ./gradlew runShadow +``` + +## Monitoring OpenLineage events via Proxy + +When proxy is running, you can start sending your OpenLineage events just as the same way as you would be sending to any OpenLineage backend server. For example, in your URL for the OpenLineage backend, you can specify it as `http://localhost:8080/api/v1/lineage`. + +Once the message is sent to the proxy, you will see the OpenLineage message content (JSON) to the console output of the proxy. You can also specify in the configuration to store the messages into the log file. 
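+ +If you just want to verify that the proxy is receiving data before wiring up a full integration, you can also post a hand-written event to it. The following is a minimal sketch using only the Python standard library; it assumes the proxy is listening on the default port 8080, and the job name and namespace are arbitrary placeholders. + +```python +import json +import uuid +import urllib.request +from datetime import datetime, timezone + +# A minimal OpenLineage RunEvent; the job name and namespace below are arbitrary examples. +event = { +    "eventType": "COMPLETE", +    "eventTime": datetime.now(timezone.utc).isoformat(), +    "run": {"runId": str(uuid.uuid4())}, +    "job": {"namespace": "my-test-namespace", "name": "my-test-job"}, +    "inputs": [], +    "outputs": [], +    "producer": "https://github.com/openlineage-user" +} + +# POST the event to the proxy; the payload should show up in the proxy's console output. +request = urllib.request.Request( +    "http://localhost:8080/api/v1/lineage", +    data=json.dumps(event).encode("utf-8"), +    headers={"Content-Type": "application/json"}, +    method="POST", +) +with urllib.request.urlopen(request) as response: +    print(response.status) +``` + +Posting to the full `/api/v1/lineage` URL mimics what the client libraries do internally, as described in the note below.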
+ +> You might have noticed that OpenLineage client (python, java) simply requires `http://localhost:8080` as the URL endpoint. This is possible because the client code adds the `/api/v1/lineage` internally before it makes the request. If you are not using OpenLineage client library to emit OpenLineage events, you must use the full URL in order for the proxy to receive the data correctly. + +## Forwarding the data +Not only the OpenLineage proxy is useful in receiving the monitoring the OpenLineage events, it can also be used to relay the events to other endpoints. Please see the [example](https://github.com/OpenLineage/OpenLineage/blob/main/proxy/backend/proxy.example.yml) of how to set the proxy to relay the events via Kafka topic or HTTP endpoint. + +## Other ways to run OpenLineage Proxy +- You do not have to clone the git repo and build all the time. OpenLineage proxy is published and available in [Maven Repository](https://mvnrepository.com/artifact/io.openlineage/openlineage-proxy/). +- You can also run OpenLineage Proxy as a [docker container](https://github.com/OpenLineage/OpenLineage/blob/main/proxy/backend/Dockerfile). +- There is also a [helm chart for Kubernetes](https://github.com/OpenLineage/OpenLineage/tree/main/proxy/backend/chart) available. diff --git a/versioned_docs/version-1.26.0/faq.md b/versioned_docs/version-1.26.0/faq.md new file mode 100644 index 0000000..5ab1a26 --- /dev/null +++ b/versioned_docs/version-1.26.0/faq.md @@ -0,0 +1,18 @@ +--- +title: Frequently Asked Questions +sidebar_position: 7 +--- + +:::info +This page needs your contribution! Please contribute new questions (or answers) using the edit link at the bottom. +::: + +### Is OpenLineage a metadata server? + +No. OpenLineage is, at its core, a specification for lineage metadata. But it also contains a collection of integrations, examples, and tools. + +If you are looking for a metadata server that can receive and analyze OpenLineage events, check out [Marquez](https://marquezproject.ai). + +### Is there room for another question on this page? + +You bet! There's always room. Submit an issue or pull request using the edit button at the bottom. diff --git a/versioned_docs/version-1.26.0/guides/_category_.json b/versioned_docs/version-1.26.0/guides/_category_.json new file mode 100644 index 0000000..6b51b67 --- /dev/null +++ b/versioned_docs/version-1.26.0/guides/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Guides", + "position": 6 +} diff --git a/versioned_docs/version-1.26.0/guides/about.md b/versioned_docs/version-1.26.0/guides/about.md new file mode 100644 index 0000000..6a2782c --- /dev/null +++ b/versioned_docs/version-1.26.0/guides/about.md @@ -0,0 +1,15 @@ +--- +sidebar_position: 1 +--- + +# About These Guides + +The following tutorials take you through the process of exploiting the lineage metadata provided by Marquez and OpenLineage to solve common data engineering problems and make new analytical and historical insights into your pipelines. + +The first tutorial, "Using OpenLineage with Spark," provides an introduction to OpenLineage's integration with Apache Spark. You will learn how to use Marquez and the OpenLineage standard to produce lineage metadata about jobs and datasets created using Spark and BigQuery in a Jupyter notebook environment. + +The second tutorial, "Using OpenLineage with Airflow," shows you how to use OpenLineage on Apache Airflow to produce data lineage on supported operators to emit lineage events to Marquez backend. 
The tutorial also introduces you to the OpenLineage proxy to monitor the event data being emitted. + +The third tutorial, "Backfilling Airflow DAGs Using Marquez," shows you how to use Marquez's Airflow integration and the Marquez CLI to backfill failing runs with the help of lineage metadata. You will learn how data lineage can be used to automate the backfilling process. + +The fourth tutorial, "Using Marquez with dbt," takes you through the process of setting up Marquez's dbt integration to harvest metadata produced by dbt. You will learn how to create a Marquez instance, install the integration, configure your dbt installation, and test the configuration using dbt. \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/guides/airflow-backfill-dags.md b/versioned_docs/version-1.26.0/guides/airflow-backfill-dags.md new file mode 100644 index 0000000..5e85c52 --- /dev/null +++ b/versioned_docs/version-1.26.0/guides/airflow-backfill-dags.md @@ -0,0 +1,197 @@ +--- +sidebar_position: 3 +--- + +# Backfilling Airflow DAGs Using Marquez + +#### Adapted from a [blog post](https://openlineage.io/blog/backfilling-airflow-dags-using-marquez/) by Willy Lulciuc + +This tutorial covers the use of lineage metadata in Airflow to backfill DAGs. Thanks to data lineage, backfilling does not have to be a tedious chore. + +Airflow supports backfilling DAG runs for a historical time window with a given start and end date. If a DAG (`example.etl_orders_7_days`) started failing on 2021-06-06, for example, you might want to reprocess the daily table partitions for that week (assuming all partitions have been backfilled upstream). This is possible using the [Airflow CLI](https://openlineage.io/blog/backfilling-airflow-dags-using-marquez/). In order to run the backfill for `example.etl_orders_7_days` using Airflow, create an Airflow instance and execute the following backfill command in a terminal window: + +``` +# Backfill weekly food orders +$ airflow dags backfill \ + --start-date 2021-06-06 \ + --end-date 2021-06-06 \ + example.etl_orders_7_days +``` + +Unfortunately, backfills are rarely so straightforward. Some questions remain: + +- How quickly can data quality issues be identified and explored? +- What alerting rules should be in place to notify downstream DAGs of possible upstream processing issues or failures? +- What effects (if any) would upstream DAGs have on downstream DAGs if dataset consumption were delayed? + +Managing lineage metadata with Marquez clears up much of the ambiguity that has surrounded backfilling. The key is to maintain inter-DAG dependencies and catalog historical runs of DAGs. + +## Exploring Lineage Metadata using Marquez + +### Prerequisites + +- Sample data (for the dataset used here, follow the instructions in the [Write Sample Lineage Metadata to Marquez](https://marquezproject.github.io/marquez/quickstart.html#write-sample-lineage-metadata-to-marquez) section of Marquez's [quickstart](https://marquezproject.github.io/marquez/quickstart.html) guide) +- Docker 17.05+ +- Docker Desktop +- Docker Compose +- jq + +:::info +If you are using macOS Monterey (macOS 12), port 5000 will have to be released by [disabling the AirPlay Receiver](https://developer.apple.com/forums/thread/682332). Also, port 3000 will need to be free if access to the Marquez web UI is desired. +::: + +### Query the Lineage Graph + +After running the seed command in the quickstart guide, check to make sure Marquez is up by visiting http://localhost:3000. 
The page should display an empty Marquez instance and a message saying there is no data. Also, it should be possible to see the server output from requests in the terminal window where Marquez is running. This window should remain open. As you progress through the tutorial, feel free to experiment with the web UI. Use truncated strings (e.g., "example.etl_orders_7_days" instead of "job:food_delivery:example.etl_orders_7_days") to find the datasets referenced below. + +In Marquez, each dataset and job has its own globally unique node ID that can be used to query the lineage graph. The LineageAPI returns a set of nodes consisting of edges. An edge is directed and has a defined origin and destination. A lineage graph may contain the following node types: `dataset::`, `job::`. + +Start by querying the lineage graph of the seed data via the CLI. The `etl_orders_7_days` DAG has the node ID `job:food_delivery:example.etl_orders_7_days`. To see the graph, run the following in a new terminal window: + +``` +$ curl -X GET "http://localhost:5000/api/v1-beta/lineage?nodeId=job:food_delivery:example.etl_orders_7_days" +``` + +Notice in the returned lineage graph that the DAG input datasets are `public.categories`, `public.orders`, and `public.menus`, while `public.orders_7_days` is the output dataset. The response should look something like this: + +``` +{ + "graph": [{ + "id": "job:food_delivery:example.etl_orders_7_days", + "type": "JOB", + "data": { + "type": "BATCH", + "id": { + "namespace": "food_delivery", + "name": "example.etl_orders_7_days" + }, + "name": "example.etl_orders_7_days", + "createdAt": "2021-06-06T14:50:13.931946Z", + "updatedAt": "2021-06-06T14:57:54.037399Z", + "namespace": "food_delivery", + "inputs": [ + {"namespace": "food_delivery", "name": "public.categories"}, + {"namespace": "food_delivery", "name": "public.menu_items"}, + {"namespace": "food_delivery", "name": "public.orders"}, + {"namespace": "food_delivery", "name": "public.menus"} + ], + "outputs": [ + {"namespace": "food_delivery", "name": "public.orders_7_days"} + ], + "location": "https://github.com/example/jobs/blob/2294bc15eb49071f38425dc927e48655530a2f2e/etl_orders_7_days.py", + "context": { + "sql": "INSERT INTO orders_7_days (order_id, placed_on, discount_id, menu_id, restaurant_id, menu_item_id, category_id)\n SELECT o.id AS order_id, o.placed_on, o.discount_id, m.id AS menu_id, m.restaurant_id, mi.id AS menu_item_id, c.id AS category_id\n FROM orders AS o\n INNER JOIN menu_items AS mi\n ON menu_items.id = o.menu_item_id\n INNER JOIN categories AS c\n ON c.id = mi.category_id\n INNER JOIN menu AS m\n ON m.id = c.menu_id\n WHERE o.placed_on >= NOW() - interval '7 days';" + }, + "description": "Loads newly placed orders weekly.", + "latestRun": { + "id": "5c7f0dc4-d3c1-4f16-9ac3-dc86c5da37cc", + "createdAt": "2021-06-06T14:50:36.853459Z", + "updatedAt": "2021-06-06T14:57:54.037399Z", + "nominalStartTime": "2021-06-06T14:54:00Z", + "nominalEndTime": "2021-06-06T14:57:00Z", + "state": "FAILED", + "startedAt": "2021-06-06T14:54:14.037399Z", + "endedAt": "2021-06-06T14:57:54.037399Z", + "durationMs": 220000, + "args": {}, + "location": "https://github.com/example/jobs/blob/2294bc15eb49071f38425dc927e48655530a2f2e/etl_orders_7_days.py", + "context": { + "sql": "INSERT INTO orders_7_days (order_id, placed_on, discount_id, menu_id, restaurant_id, menu_item_id, category_id)\n SELECT o.id AS order_id, o.placed_on, o.discount_id, m.id AS menu_id, m.restaurant_id, mi.id AS menu_item_id, c.id AS category_id\n FROM orders 
AS o\n INNER JOIN menu_items AS mi\n ON menu_items.id = o.menu_item_id\n INNER JOIN categories AS c\n ON c.id = mi.category_id\n INNER JOIN menu AS m\n ON m.id = c.menu_id\n WHERE o.placed_on >= NOW() - interval '7 days';" + }, + "facets": {} + } + }, + "inEdges": [ + {"origin": "dataset:food_delivery:public.categories", "destination": "job:food_delivery:example.etl_orders_7_days"}, + {"origin": "dataset:food_delivery:public.menu_items", "destination": "job:food_delivery:example.etl_orders_7_days"}, + {"origin": "dataset:food_delivery:public.orders", "destination": "job:food_delivery:example.etl_orders_7_days"}, + {"origin": "dataset:food_delivery:public.menus", "destination": "job:food_delivery:example.etl_orders_7_days"} + ], + "outEdges": [ + {"origin": "job:food_delivery:example.etl_orders_7_days", "destination": "dataset:food_delivery:public.orders_7_days"} + ] + } + }, ...] +} +``` + +To see a visualization of the graph, search for `public.delivery_7_days` in the web UI. + +### Backfill a DAG Run + +![Backfill](backfill.png) + +Figure 1: Backfilled daily table partitions + +To run a backfill for `example.etl_orders_7_days` using the DAG lineage metadata stored in Marquez, query the lineage graph for the upstream DAG where the error originated. In this case, the `example.etl_orders` DAG upstream of `example.etl_orders_7_days` failed to write some of the daily table partitions needed for the weekly food order trends report. To fix the weekly trends report, backfill the missing daily table partitions `public.orders_2021_06_04`, `public.orders_2021_06_05`, and `public.orders_2021_06_06` using the Airflow CLI: + +``` +# Backfill daily food orders +$ airflow dags backfill \ +    --start-date 2021-06-04 \ +    --end-date 2021-06-06 \ +    example.etl_orders +``` + +![DAG Deps](inter-dag-deps.png) + +Figure 2: Airflow inter-DAG dependencies + +Then, using the script `backfill.sh` defined below, we can easily backfill all DAGs downstream of `example.etl_orders`: + +(Note: Make sure you have jq installed before running `backfill.sh`.) + +``` +#!/bin/bash +# +# Backfill DAGs automatically using lineage metadata stored in Marquez. +# +# Usage: $ ./backfill.sh <start_date> <end_date> <dag_id> + +set -e + +# Backfills DAGs downstream of the given node ID, recursively. +backfill_downstream_of() { +  node_id="${1}" +  # Get out edges for node ID +  out_edges=($(echo $lineage_graph \ +    | jq -r --arg NODE_ID "${node_id}" '.graph[] | select(.id==$NODE_ID) | .outEdges[].destination')) +  for out_edge in "${out_edges[@]}"; do +    # Run backfill if out edge is a job node (i.e. 'job:<namespace>:<job>') +    if [[ "${out_edge}" = job:* ]]; then +      dag_id="${out_edge##*:}" +      echo "backfilling ${dag_id}..." +      airflow backfill --start_date "${start_date}" --end_date "${end_date}" "${dag_id}" +    fi +    # Follow out edges downstream, recursively +    backfill_downstream_of "${out_edge}" +  done +} + +start_date="${1}" +end_date="${2}" +dag_id="${3}" + +# (1) Build job node ID (format: 'job:<namespace>:<job>') +node_id="job:food_delivery:${dag_id}" + +# (2) Get lineage graph +lineage_graph=$(curl -s -X GET "http://localhost:5000/api/v1-beta/lineage?nodeId=${node_id}") + +# (3) Run backfill +backfill_downstream_of "${node_id}" +``` + +When run, the script should output all backfilled DAGs to the console: + +``` +$ ./backfill.sh 2021-06-06 2021-06-06 example.etl_orders +backfilling example.etl_orders_7_days... +backfilling example.etl_delivery_7_days... +backfilling example.delivery_times_7_days... +``` + +### Conclusion + +The lineage metadata provided by Marquez can make the task of backfilling much easier.
But lineage metadata can also help avoid the need to backfill altogether. Since Marquez collects DAG run metadata that can be viewed using the Runs API, building automated processes to check DAG run states and notify teams of upstream data quality issues is just one possible preventive measure. + +Explore Marquez's opinionated Metadata API and define your own automated process(es) for analyzing lineage metadata! Also, join our Slack channel or reach out to us on Twitter if you have questions. \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/guides/airflow-quickstart.md b/versioned_docs/version-1.26.0/guides/airflow-quickstart.md new file mode 100644 index 0000000..e4d0219 --- /dev/null +++ b/versioned_docs/version-1.26.0/guides/airflow-quickstart.md @@ -0,0 +1,337 @@ +--- +sidebar_position: 2 +--- + +import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; + +# Getting Started with Apache Airflow® and OpenLineage+Marquez + +In this tutorial, you'll configure Apache Airflow® to send OpenLineage events to [Marquez](https://marquezproject.ai/) and explore a realistic troubleshooting scenario. + +### Table of Contents + +1. [Prerequisites](#prerequisites) +2. [Get and start Marquez](#get-marquez) +3. [Configure Apache Airflow to send OpenLineage events to Marquez](#configure-airflow) +4. [Write Airflow DAGs](#write-airflow-dags) +5. [View Collected Lineage in Marquez](#view-collected-metadata) +6. [Troubleshoot a Failing DAG with Marquez](#troubleshoot-a-failing-dag-with-marquez) + +## Prerequisites {#prerequisites} + +Before you begin, make sure you have installed: + +* [Docker 17.05+](https://docs.docker.com/install) +* [Apache Airflow 2.7+](https://airflow.apache.org/docs/apache-airflow/stable/start.html) running locally. + +:::tip + +For an easy path to installing and running Airflow locally for development purposes, see: [Quick Start](https://airflow.apache.org/docs/apache-airflow/2.10.3/start.html). + +::: + +## Get and start Marquez {#get-marquez} + +1. Create a directory for Marquez. Then, check out the Marquez source code by running: + + + + + ```bash + $ git clone https://github.com/MarquezProject/marquez && cd marquez + ``` + + + + + ```bash + $ git config --global core.autocrlf false + $ git clone https://github.com/MarquezProject/marquez && cd marquez + ``` + + + + +2. Both Airflow and Marquez require port 5432 for their metastores, but the Marquez services are easier to configure. You can also assign the database service to a new port on the fly. To start Marquez using port 2345 for the database, run: + + + + + ```bash + $ ./docker/up.sh --db-port 2345 + ``` + + + + + Verify that Postgres and Bash are in your `PATH`, then run: + + ```bash + $ sh ./docker/up.sh --db-port 2345 + ``` + + + + +3. To view the Marquez UI and verify it's running, open [http://localhost:3000](http://localhost:3000). The UI allows you to: + + - view cross-platform dependencies, meaning you can see the jobs across the tools in your ecosystem that produce or consume a critical table. + - view run-level metadata of current and previous job runs, enabling you to see the latest status of a job and the update history of a dataset. + - get a high-level view of resource usage, allowing you to see trends in your operations. + +## Configure Airflow to send OpenLineage events to Marquez {#configure-airflow} + +1. To configure Airflow to emit OpenLineage events to Marquez, you need to modify your local Airflow environment and add a dependency. First, define an OpenLineage transport. 
One way you can do this is by using an environment variable. To use `http` and send events to the Marquez API running locally on port `5000`, run: + + + + + ```bash + $ export AIRFLOW__OPENLINEAGE__TRANSPORT='{"type": "http", "url": "http://localhost:5000", "endpoint": "api/v1/lineage"}' + ``` + + + + + ```bash + $ set AIRFLOW__OPENLINEAGE__TRANSPORT='{"type": "http", "url": "http://localhost:5000", "endpoint": "api/v1/lineage"}' + ``` + + + + +2. You also need to define a namespace for Airflow jobs. It can be any string. Run: + + + + + ```bash + $ export AIRFLOW__OPENLINEAGE__NAMESPACE='my-team-airflow-instance' + ``` + + + + + ```bash + $ set AIRFLOW__OPENLINEAGE__NAMESPACE='my-team-airflow-instance' + ``` + + + + +3. To add the required Airflow OpenLineage Provider package to your Airflow environment, run: + + + + + ```bash + $ pip install apache-airflow-providers-openlineage + ``` + + + + + ```bash + $ pip install apache-airflow-providers-openlineage + ``` + + + + +4. To complete this tutorial, you also need to enable local Postgres operations in Airflow. To do this, run: + + + + + ```bash + $ pip install apache-airflow-providers-postgres + ``` + + + + + ```bash + $ pip install apache-airflow-providers-postgres + ``` + + + + +5. Create a database in your local Postgres instance and create an Airflow Postgres connection using the default ID (`postgres_default`). For help with the former, see: [Postgres Documentation](https://www.postgresql.org/docs/). For help with the latter, see: [Managing Connections](https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html#managing-connections). + +## Write Airflow DAGs + +In this step, you will create two new Airflow DAGs that perform simple tasks and add them to your existing Airflow instance. The `counter` DAG adds 1 to a column every minute, while the `sum` DAG calculates a sum every five minutes. This will result in a simple pipeline containing two jobs and two datasets. + +1. In `dags/`, create a file named `counter.py` and add the following code: + + ```python + import pendulum + from airflow.decorators import dag, task + from airflow.providers.postgres.operators.postgres import PostgresOperator + from airflow.utils.dates import days_ago + + @dag( + schedule='*/1 * * * *', + start_date=days_ago(1), + catchup=False, + is_paused_upon_creation=False, + max_active_runs=1, + description='DAG that generates a new count value equal to 1.' + ) + + def counter(): + + query1 = PostgresOperator( + task_id='if_not_exists', + postgres_conn_id='postgres_default', + sql=''' + CREATE TABLE IF NOT EXISTS counts (value INTEGER); + ''', + ) + + query2 = PostgresOperator( + task_id='inc', + postgres_conn_id='postgres_default', + sql=''' + INSERT INTO "counts" (value) VALUES (1); + ''', + ) + + query1 >> query2 + + counter() + + ``` + +2. In `dags/`, create a file named `sum.py` and add the following code: + + ```python + import pendulum + from airflow.decorators import dag, task + from airflow.providers.postgres.operators.postgres import PostgresOperator + from airflow.utils.dates import days_ago + + @dag( + start_date=days_ago(1), + schedule='*/5 * * * *', + catchup=False, + is_paused_upon_creation=False, + max_active_runs=1, + description='DAG that sums the total of generated count values.' 
+ ) + + def sum(): + + query1 = PostgresOperator( + task_id='if_not_exists', + postgres_conn_id='postgres_default', + sql=''' + CREATE TABLE IF NOT EXISTS sums ( + value INTEGER + );''' + ) + + query2 = PostgresOperator( + task_id='total', + postgres_conn_id='postgres_default', + sql=''' + INSERT INTO sums (value) + SELECT SUM(value) FROM counts; + ''' + ) + + query1 >> query2 + + sum() + + ``` + +3. Restart Airflow to apply the changes. + +## View Collected Lineage in Marquez + +1. To view lineage collected by Marquez from Airflow, browse to the Marquez UI by visiting [http://localhost:3000](http://localhost:3000). Then, use the _search_ bar in the upper left to search for the `counter.inc` job. To view lineage metadata for `counter.inc`, click on the job from the drop-down list: + +

+ +

+ +2. Look at the lineage graph for `counter.inc`, where you should see `.public.counts` as an output dataset and `sum.total` as a downstream job: + + ![](./docs/counter-inc-graph.png) + +## Troubleshoot a Failing DAG with Marquez + +1. In this step, you'll simulate a pipeline outage due to a cross-DAG dependency change and see how the enhanced lineage from OpenLineage+Marquez makes breaking schema changes easy to troubleshoot. + + Say `Team A` owns the DAG `counter`. `Team A` updates `counter` to rename the `values` column in the `counts` table to `value_1_to_10` without properly communicating the schema change to the team that owns `sum`. + + Apply the following changes to `counter` to simulate the breaking change: + + ```diff + query1 = PostgresOperator( + - task_id='if_not_exists', + + task_id='alter_name_of_column', + postgres_conn_id='example_db', + sql=''' + - CREATE TABLE IF NOT EXISTS counts ( + - value INTEGER + - );''', + + ALTER TABLE "counts" RENAME COLUMN "value" TO "value_1_to_10"; + + ''' + ) + ``` + + ```diff + query2 = PostgresOperator( + task_id='inc', + postgres_conn_id='example_db', + sql=''' + - INSERT INTO counts (value) + + INSERT INTO counts (value_1_to_10) + VALUES (1) + ''', + ) + ``` + + Like the owner of `sum`, `Team B`, would do, note the failed runs in the DataOps view in Marquez: + + ![](./docs/sum-data-ops.png) + + `Team B` can only guess what might have caused the DAG failure as no recent changes have been made to the DAG. So, the team decides to check Marquez. + +2. In Marquez, navigate to the Datasets view and select your Postgres instance from the namespace dropdown menu in the top-right corner. Then, click on the `.public.counts` dataset and inspect the graph. You'll find the schema on the node: + + ![](./docs/counts-graph-new-schema.png) + +3. Imagine you don't recognize the column and want to know what it was originally and when it changed. Clicking on the node will open the detail drawer. There, using the version history, find the run in which the schema changed: + + ![](./docs/counts-detail.png) + +4. In Airflow, fix the downstream DAG that broke by updating the task that calculates the count total to use the new column name: + + ```diff + query2 = PostgresOperator( + task_id='total', + postgres_conn_id='example_db', + sql=''' + - INSERT INTO sums (value) + - SELECT SUM(value) FROM counts; + + SELECT SUM(value_1_to_10) FROM counts; + ''' + ) + ``` + +5. Rerun the DAG. In Marquez, verify the fix by looking at the recent run history in the DataOps view: + + ![](./docs/sum-history.png) + +## Next Steps + +* Review the Marquez [HTTP API](https://marquezproject.github.io/marquez/openapi.html) used to collect Airflow DAG metadata and learn how to build your own integrations using OpenLineage. +* Take a look at the [`openlineage-spark`](https://openlineage.io/docs/integrations/spark/) integration that can be used with Airflow. + +## Feedback + +What did you think of this guide? Let us know in the [OpenLineage Slack](https://join.slack.com/t/openlineage/shared_invite/zt-2u4oiyz5h-TEmqpP4fVM5eCdOGeIbZvA) or the [Marquez Slack](https://join.slack.com/t/marquezproject/shared_invite/zt-2iylxasbq-GG_zXNcJdNrhC9uUMr3B7A). You can also propose changes directly by [opening a pull request](https://github.com/MarquezProject/marquez/blob/main/CONTRIBUTING.md#submitting-a-pull-request). 
diff --git a/versioned_docs/version-1.26.0/guides/airflow_dev_setup.png b/versioned_docs/version-1.26.0/guides/airflow_dev_setup.png new file mode 100644 index 0000000..b454c2d Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/airflow_dev_setup.png differ diff --git a/versioned_docs/version-1.26.0/guides/airflow_proxy.md b/versioned_docs/version-1.26.0/guides/airflow_proxy.md new file mode 100644 index 0000000..18c0bd1 --- /dev/null +++ b/versioned_docs/version-1.26.0/guides/airflow_proxy.md @@ -0,0 +1,281 @@ +--- +sidebar_position: 6 +--- + +# Using the OpenLineage Proxy with Airflow + +This tutorial introduces you to using the [OpenLineage Proxy](https://github.com/OpenLineage/OpenLineage/tree/main/proxy) with Airflow. OpenLineage has various integrations that will enable Airflow to emit OpenLineage events when using [Airflow Integrations](https://openlineage.io/docs/integrations/airflow/). In this tutorial, you will be running a local instance of Airflow using Docker Compose and learning how to enable and setup OpenLineage to emit data lineage events. The tutorial will use two backends to check the data lineage, 1) the Proxy, and 2) [Marquez](https://marquezproject.ai/). + +## Table of Contents +- Setting up a Local Airflow Environment using Docker Compose +- Setting up Marquez +- Running Everything +- Accessing the Airflow UI +- Running an Example DAG + +## Setting up a Local Airflow Environment using Docker Compose + +Airflow has a convenient way to set up and run a fully functional environment using [Docker Compose](https://docs.docker.com/compose/). The following are therefore required to be installed before we begin this tutorial. + +### Prerequisites + +- Docker 20.10.0+ +- Docker Desktop +- Docker Compose +- Java 11 + +:::info +If you are using MacOS Monterey (MacOS 12), port 5000 will have to be released by [disabling the AirPlay Receiver](https://developer.apple.com/forums/thread/682332). Also, port 3000 will need to be free if access to the Marquez Web UI is desired. +::: + +Use the following [instructions](https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html) to set up and run Airflow using Docker Compose. + +First, let's start out by creating a new directory that will contain all of our work. + +``` +mkdir ~/airflow-ol && +cd ~/airflow-ol +``` + +Then, let's download the Docker Compose file that we'll be running in it. + +``` +curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.3/docker-compose.yaml' +``` + +This will allow a new environment variable `OPENLINEAGE_URL` to be passed to the Docker containers, which is needed for OpenLineage to work. + +Then, let's create the following directories that will be mounted and used by the Docker Compose that will start Airflow. + +``` +mkdir dags && +mkdir logs && +mkdir plugins +``` + +Also, create a file `.env` that will contain an environment variable that is going to be used by Airflow to install additional Python packages that are needed. In this tutorial, the `openlineage-airflow` package will be installed. + +``` +echo "_PIP_ADDITIONAL_REQUIREMENTS=openlineage-airflow" > .env +``` + +You also need to let OpenLineage know where to send lineage data. + +``` +echo "OPENLINEAGE_URL=http://host.docker.internal:4433" >> .env +``` + +The reason why we are setting the backend to `host.docker.internal` is that we are going to be running the OpenLineage Proxy outside Airflow's Docker environment on the host machine itself. Port 4433 is where the proxy will be listening for lineage data. 
+ +## Setting up OpenLineage Proxy as Receiving End + +The OpenLineage Proxy is a simple tool that you can easily set up and run to receive OpenLineage data. The proxy does not do anything other than display what it receives. Optionally, it can also forward data to any OpenLineage-compatible backend via HTTP. + +Let's download the proxy code from git and build it: + +``` +cd ~ && +git clone https://github.com/OpenLineage/OpenLineage.git && +cd OpenLineage/proxy/backend && +./gradlew build +``` + +Now, copy `proxy.dev.yml` and edit its content as the following, and save it as `proxy.yml`. + +```yaml +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +server: + applicationConnectors: + - type: http + port: ${OPENLINEAGE_PROXY_PORT:-4433} + adminConnectors: + - type: http + port: ${OPENLINEAGE_PROXY_ADMIN_PORT:-4434} + +logging: + level: ${LOG_LEVEL:-INFO} + appenders: + - type: console + +proxy: + source: openLineageProxyBackend + streams: + - type: Console + - type: Http + url: http://localhost:5000/api/v1/lineage +``` + +## Setting up Marquez + +The last piece of the setup is the Marquez backend. Using Marquez's [quickstart document](https://github.com/MarquezProject/marquez/blob/main/docs/quickstart.md), set up the Marquez environment. + +``` +cd ~ && +git clone https://github.com/MarquezProject/marquez.git +``` + +In marquez/docker-compose.dev.yml, change the ports for pghero to free up port 8080 for Airflow: + +``` +version: "3.7" +services: + api: + build: . + + seed_marquez: + build: . + + pghero: + image: ankane/pghero + container_name: pghero + ports: + - "8888:8888" + environment: + DATABASE_URL: postgres://postgres:password@db:5432 +``` + +## Running Everything + +### Running Marquez + +Start Docker Desktop, then: + +``` +cd ~/marquez && +./docker/up.sh +``` + +### Running OpenLineage proxy + +``` +cd ~/OpenLineage/proxy/backend && +./gradlew runShadow +``` + +### Running Airflow + +``` +cd ~/airflow-ol +docker-compose up +``` + +![airflow_dev_setup](./airflow_dev_setup.png) + +At this point, Apache Airflow should be running and able to send lineage data to the OpenLineage Proxy, with the OpenLineage Proxy forwarding the data to Marquez. Consequently, we can both inspect data payloads and see lineage data in graph form. + +## Accessing the Airflow UI + +With everything up and running, we can now login to Airflow's UI by opening up a browser and accessing `http://localhost:8080`. + +Initial ID and password to login would be `airflow/airflow`. + +## Running an Example DAG + +When you log into Airflow UI, you will notice that there are several example DAGs already populated when it started up. We can start running some of them to see the OpenLineage events they generate. + +### Running Bash Operator + +In the DAGs page, locate the `example_bash_operator`. + +![airflow_trigger_dag](./airflow_trigger_dag.png) + +Clicke the ► button at the right, which will show up a popup. Select `Trigger DAG` to trigger and run the DAG manually. + +You should see DAG running, and eventually completing. 
+ +### Check the OpenLineage events +Once everything is finished, you should be able to see a number of JSON data payloads output in OpenLineage proxy's console. + +```json +INFO [2022-08-16 21:39:41,411] io.openlineage.proxy.api.models.ConsoleLineageStream: { + "eventTime" : "2022-08-16T21:39:40.854926Z", + "eventType" : "START", + "inputs" : [ ], + "job" : { + "facets" : { }, + "name" : "example_bash_operator.runme_2", + "namespace" : "default" + }, + "outputs" : [ ], + "producer" : "https://github.com/OpenLineage/OpenLineage/tree/0.12.0/integration/airflow", + "run" : { + "facets" : { + "airflow_runArgs" : { + "_producer" : "https://github.com/OpenLineage/OpenLineage/tree/0.12.0/integration/airflow", + "_schemaURL" : "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/BaseFacet", + "externalTrigger" : true + }, + "airflow_version" : { + "_producer" : "https://github.com/OpenLineage/OpenLineage/tree/0.12.0/integration/airflow", + "_schemaURL" : "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/BaseFacet", + "airflowVersion" : "2.3.3", + "openlineageAirflowVersion" : "0.12.0", + "operator" : "airflow.operators.bash.BashOperator", + "taskInfo" : "{'_BaseOperator__init_kwargs': {'task_id': 'runme_2', 'params': <***.models.param.ParamsDict object at 0xffff7467b610>, 'bash_command': 'echo \"example_bash_operator__runme_2__20220816\" && sleep 1'}, '_BaseOperator__from_mapped': False, 'task_id': 'runme_2', 'task_group': , 'owner': '***', 'email': None, 'email_on_retry': True, 'email_on_failure': True, 'execution_timeout': None, 'on_execute_callback': None, 'on_failure_callback': None, 'on_success_callback': None, 'on_retry_callback': None, '_pre_execute_hook': None, '_post_execute_hook': None, 'executor_config': {}, 'run_as_user': None, 'retries': 0, 'queue': 'default', 'pool': 'default_pool', 'pool_slots': 1, 'sla': None, 'trigger_rule': , 'depends_on_past': False, 'ignore_first_depends_on_past': True, 'wait_for_downstream': False, 'retry_delay': datetime.timedelta(seconds=300), 'retry_exponential_backoff': False, 'max_retry_delay': None, 'params': <***.models.param.ParamsDict object at 0xffff7467b4d0>, 'priority_weight': 1, 'weight_rule': , 'resources': None, 'max_active_tis_per_dag': None, 'do_xcom_push': True, 'doc_md': None, 'doc_json': None, 'doc_yaml': None, 'doc_rst': None, 'doc': None, 'upstream_task_ids': set(), 'downstream_task_ids': {'run_after_loop'}, 'start_date': DateTime(2021, 1, 1, 0, 0, 0, tzinfo=Timezone('UTC')), 'end_date': None, '_dag': , '_log': , 'inlets': [], 'outlets': [], '_inlets': [], '_outlets': [], '_BaseOperator__instantiated': True, 'bash_command': 'echo \"example_bash_operator__runme_2__20220816\" && sleep 1', 'env': None, 'output_encoding': 'utf-8', 'skip_exit_code': 99, 'cwd': None, 'append_env': False}" + }, + "nominalTime" : { + "_producer" : "https://github.com/OpenLineage/OpenLineage/tree/0.12.0/integration/airflow", + "_schemaURL" : "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/NominalTimeRunFacet", + "nominalStartTime" : "2022-08-16T21:39:38.005668Z" + }, + "parentRun" : { + "_producer" : "https://github.com/OpenLineage/OpenLineage/tree/0.12.0/integration/airflow", + "_schemaURL" : "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/ParentRunFacet", + "job" : { + "name" : "example_bash_operator", + "namespace" : "default" + }, + "run" : { + "runId" : 
"39ad10d1-72d9-3fe9-b2a4-860c651b98b7" + } + } + }, + "runId" : "313b4e71-9cde-4c83-b641-dd6773bf114b" + } +} +``` + +### Check Marquez + +You can also open up the browser and visit `http://localhost:3000` to access Marquez UI, and take a look at the OpenLineage events originating from Airflow. + +![marquez_bash_jobs](./marquez_bash_jobs.png) + +### Running other DAGs + +Due to the length of this tutorial, we are not going to be running additional example DAGs, but you can try running them and it would be interesting to see how each of them are going to be emitting OpenLineage events. Please try running other examples like `example_python_operator` which will also emit OpenLineage events. + +Normally, DataLineage will be much more complete and useful if a DAG run involves certain `datasets` that either get used or created during the runtime of it. When you run those DAGs, you will be able to see the connection between different DAGs and Tasks touching the same dataset that will eventually turn into Data Lineage graph that may look something like this: + +![marquez_graph](https://marquezproject.ai/images/screenshot.png) + +Currently, these are the Airflow operators that have extractors that can extract and emit OpenLineage events. + +- PostgresOperator +- MySqlOperator +- BigQueryOperator +- SnowflakeOperator +- GreatExpectationsOperator +- PythonOperator + +See additional [Apache Examples](https://github.com/MarquezProject/marquez/tree/main/examples/airflow) for DAGs that you can run in Airflow for OpenLineage. + +## Troubleshooting + +- You might not see any data going through the proxy or via Marquez. In that case, please check the task log of Airflow and see if you see the following message: `[2022-08-16, 21:23:19 UTC] {factory.py:122} ERROR - Did not find openlineage.yml and OPENLINEAGE_URL is not set`. In that case, it means that the environment variable `OPENLINEAGE_URL` was not set properly, thus OpenLineage was not able to emit any events. Please make sure to follow instructions in setting up the proper environment variable when setting up the Airflow via docker compose. +- Sometimes, Marquez would not respond and fail to receive any data via its API port 5000. You should be able to notice that if you start receiving response code 500 from Marquez or the Marquez UI hangs. In that case, simply stop and restart Marquez. + +## Conclusion + +In this short tutorial, we have learned how to setup and run a simple Apache Airflow environment that can emit OpenLineage events during its DAG run. We have also monitored and received the lineage events using combination of OpenLineage proxy and Marquez. We hope this tutorial was helpful in understanding how Airflow could be setup with OpenLineage and how you can easily monitor its data and end result using proxy and Marquez. 
diff --git a/versioned_docs/version-1.26.0/guides/airflow_trigger_dag.png b/versioned_docs/version-1.26.0/guides/airflow_trigger_dag.png new file mode 100644 index 0000000..be04118 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/airflow_trigger_dag.png differ diff --git a/versioned_docs/version-1.26.0/guides/backfill.png b/versioned_docs/version-1.26.0/guides/backfill.png new file mode 100644 index 0000000..cdd3ff4 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/backfill.png differ diff --git a/versioned_docs/version-1.26.0/guides/dbt.md b/versioned_docs/version-1.26.0/guides/dbt.md new file mode 100644 index 0000000..dec8896 --- /dev/null +++ b/versioned_docs/version-1.26.0/guides/dbt.md @@ -0,0 +1,128 @@ +--- +sidebar_position: 4 +--- + +# Using Marquez with dbt + +#### Adapted from a [blog post](https://openlineage.io/blog/dbt-with-marquez/) by Ross Turk + +:::caution +This guide was developed using an **earlier version** of this integration and may require modification. +::: + +Each time it runs, dbt generates a trove of metadata about datasets and the work it performs with them. This tutorial covers the harvesting and effective use of this metadata. For data, the tutorial makes use of the Stackoverflow public data set in BigQuery. The end-product will be two tables of data about trends in Stackoverflow discussions of ELT. + +### Prerequisites + +- dbt +- Docker Desktop +- git +- Google Cloud Service account +- Google Cloud Service account JSON key file + +Note: your Google Cloud account should have access to BigQuery and read/write access to your GCS bucket. Giving your key file an easy-to-remember name (bq-dbt-demo.json) is recommended. Finally, if using macOS Monterey (macOS 12), you will need to release port 5000 by [disabling the AirPlay Receiver](https://developer.apple.com/forums/thread/682332). + +### Instructions + +First, run through this excellent [dbt tutorial](https://docs.getdbt.com/tutorial/setting-up). It explains how to create a BigQuery project, provision a service account, download a JSON key, and set up a local dbt environment. The rest of this example assumes the existence of a BigQuery project where models can be run, as well as proper configuration of dbt to connect to the project. + +Next, start a local Marquez instance to store lineage metadata. Make sure Docker is running, and then clone the Marquez repository: + +``` +git clone https://github.com/MarquezProject/marquez.git && cd marquez +./docker/up.sh +``` + +Check to make sure Marquez is up by visiting http://localhost:3000. The page should display an empty Marquez instance and a message saying there is no data. Also, it should be possible to see the server output from requests in the terminal window where Marquez is running. This window should remain open. + +Now, in a new terminal window/pane, clone the following GitHub project, which contains some database models: + +``` +git clone https://github.com/rossturk/stackostudy.git && cd stackostudy +``` + +Now it is time to install dbt and its integration with OpenLineage. Doing this in a Python virtual environment is recommended. To create one and install necessary packages, run the following commands: + +``` +python -m venv virtualenv +source virtualenv/bin/activate +pip install dbt dbt-openlineage +``` + +Keep in mind that dbt learns how to connect to a BigQuery project by looking for a matching profile in `~/.dbt/profiles.yml`. 
Create or edit this file so it contains a section with the project's BigQuery connection details. Also, point to the location of the JSON key for the service account. Consult [this section](https://docs.getdbt.com/tutorial/create-a-project-dbt-cli#connect-to-bigquery) in the dbt documentation for more help with dbt profiles. At this point, profiles.yml should look something like this: + +``` +stackostudy: + target: dev + outputs: + dev: + type: bigquery + method: service-account + keyfile: /Users/rturk/.dbt/dbt-example.json + project: dbt-example + dataset: stackostudy + threads: 1 + timeout_seconds: 300 + location: US + priority: interactive +``` + +The `dbt debug` command checks to see that everything has been configured correctly. Running it now should produce output like the following: + +``` +% dbt debug +Running with dbt=0.20.1 +dbt version: 0.20.1 +python version: 3.8.12 +python path: /opt/homebrew/Cellar/dbt/0.20.1_1/libexec/bin/python3 +os info: macOS-11.5.2-arm64-arm-64bit +Using profiles.yml file at /Users/rturk/.dbt/profiles.yml +Using dbt_project.yml file at /Users/rturk/projects/stackostudy/dbt_project.yml +​ +Configuration: + profiles.yml file [OK found and valid] + dbt_project.yml file [OK found and valid] +​ +Required dependencies: + - git [OK found] +​ +Connection: + method: service-account + database: stacko-study + schema: stackostudy + location: US + priority: interactive + timeout_seconds: 300 + maximum_bytes_billed: None + Connection test: OK connection ok +``` + +### Important Details + +Some important conventions should be followed when designing dbt models for use with OpenLineage. Following these conventions will help ensure that OpenLineage collects the most complete metadata possible. + +First, any datasets existing outside the dbt project should be defined in a schema YAML file inside the `models/` directory: + +``` +version: 2 +​ +sources: + - name: stackoverflow + database: bigquery-public-data + schema: stackoverflow + tables: + - name: posts_questions + - name: posts_answers + - name: users + - name: votes +``` + +This contains the name of the external dataset - in this case, bigquery-public-datasets - and lists the tables that are used by the models in this project. The name of the file does not matter, as long as it ends with .yml and is inside `models/`. Hardcoding dataset and table names into queries can result in incomplete data. + +When writing queries, be sure to use the `{{ ref() }}` and `{{ source() }}` jinja functions when referring to data sources. The `{{ ref() }}` function can be used to refer to tables within the same model, and the `{{ source() }}` function refers to tables we have defined in schema.yml. That way, dbt will properly keep track of the relationships between datasets. 
For example, to select from both an external dataset and one in this model: + +``` +select * from {{ source('stackoverflow', 'posts_answers') }} +where parent_id in (select id from {{ ref('filtered_questions') }} ) +``` + diff --git a/versioned_docs/version-1.26.0/guides/docs/astro-current-lineage-view-job.png b/versioned_docs/version-1.26.0/guides/docs/astro-current-lineage-view-job.png new file mode 100644 index 0000000..dd54df6 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/docs/astro-current-lineage-view-job.png differ diff --git a/versioned_docs/version-1.26.0/guides/docs/astro-job-failure.png b/versioned_docs/version-1.26.0/guides/docs/astro-job-failure.png new file mode 100644 index 0000000..f175a7c Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/docs/astro-job-failure.png differ diff --git a/versioned_docs/version-1.26.0/guides/docs/astro-lineage-view-dataset.png b/versioned_docs/version-1.26.0/guides/docs/astro-lineage-view-dataset.png new file mode 100644 index 0000000..00d8810 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/docs/astro-lineage-view-dataset.png differ diff --git a/versioned_docs/version-1.26.0/guides/docs/astro-lineage-view-job-successful.png b/versioned_docs/version-1.26.0/guides/docs/astro-lineage-view-job-successful.png new file mode 100644 index 0000000..46b509d Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/docs/astro-lineage-view-job-successful.png differ diff --git a/versioned_docs/version-1.26.0/guides/docs/astro-view-dags.png b/versioned_docs/version-1.26.0/guides/docs/astro-view-dags.png new file mode 100644 index 0000000..a3b6c37 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/docs/astro-view-dags.png differ diff --git a/versioned_docs/version-1.26.0/guides/docs/counter-inc-graph.png b/versioned_docs/version-1.26.0/guides/docs/counter-inc-graph.png new file mode 100644 index 0000000..c061428 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/docs/counter-inc-graph.png differ diff --git a/versioned_docs/version-1.26.0/guides/docs/counts-detail.png b/versioned_docs/version-1.26.0/guides/docs/counts-detail.png new file mode 100644 index 0000000..d26793f Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/docs/counts-detail.png differ diff --git a/versioned_docs/version-1.26.0/guides/docs/counts-graph-new-schema.png b/versioned_docs/version-1.26.0/guides/docs/counts-graph-new-schema.png new file mode 100644 index 0000000..8808a19 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/docs/counts-graph-new-schema.png differ diff --git a/versioned_docs/version-1.26.0/guides/docs/current-search-count.png b/versioned_docs/version-1.26.0/guides/docs/current-search-count.png new file mode 100644 index 0000000..c3cc14b Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/docs/current-search-count.png differ diff --git a/versioned_docs/version-1.26.0/guides/docs/marquez-current-lineage-view-job.png b/versioned_docs/version-1.26.0/guides/docs/marquez-current-lineage-view-job.png new file mode 100644 index 0000000..310ea92 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/docs/marquez-current-lineage-view-job.png differ diff --git a/versioned_docs/version-1.26.0/guides/docs/marquez-search.png b/versioned_docs/version-1.26.0/guides/docs/marquez-search.png new file mode 100644 index 0000000..7e71f83 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/docs/marquez-search.png differ diff --git 
a/versioned_docs/version-1.26.0/guides/docs/sum-data-ops.png b/versioned_docs/version-1.26.0/guides/docs/sum-data-ops.png new file mode 100644 index 0000000..f37616c Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/docs/sum-data-ops.png differ diff --git a/versioned_docs/version-1.26.0/guides/docs/sum-history.png b/versioned_docs/version-1.26.0/guides/docs/sum-history.png new file mode 100644 index 0000000..8fa7270 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/docs/sum-history.png differ diff --git a/versioned_docs/version-1.26.0/guides/facets.md b/versioned_docs/version-1.26.0/guides/facets.md new file mode 100644 index 0000000..2800c01 --- /dev/null +++ b/versioned_docs/version-1.26.0/guides/facets.md @@ -0,0 +1,77 @@ +--- +sidebar_position: 5 +--- + +# Understanding and Using Facets + +#### Adapted from the OpenLineage [spec](https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md). + +Facets are pieces of metadata that can be attached to the core entities of the spec: +- Run +- Job +- Dataset (Inputs or Outputs) + +A facet is an atomic piece of metadata identified by its name. This means that emitting a new facet with the same name for the same entity replaces the previous facet instance for that entity entirely. It is defined as a JSON object that can be either part of the spec or a custom facet defined in a different project. + +Custom facets must use a distinct prefix named after the project defining them to avoid collision with standard facets defined in the [OpenLineage.json](https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json) spec. +They have a `\_schemaURL` field pointing to the corresponding version of the facet schema (as a JSONPointer: [$ref URL location](https://swagger.io/docs/specification/using-ref/) ). + +For example: https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/MyCustomJobFacet + +The versioned URL must be an immutable pointer to the version of the facet schema. For example, it should include a tag of a git sha and not a branch name. This should also be a canonical URL. There should be only one URL used for a given version of a schema. + +Custom facets can be promoted to the standard by including them in the spec. + +#### Custom Facet Naming + +The naming of custom facets should follow the pattern `{prefix}{name}{entity}Facet` PascalCased. +The prefix must be a distinct identifier named after the project defining it to avoid collision with standard facets defined in the [OpenLineage.json](https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json) spec. +The entity is the core entity for which the facet is attached. + +When attached to the core entity, the key should follow the pattern `{prefix}_{name}`, where both prefix and name follow snakeCase pattern. + +An example of a valid name is `BigQueryStatisticsJobFacet` and its key `bigQuery_statistics`. + +### Standard Facets + +#### Run Facets + +- **nominalTime**: Captures the time this run is scheduled for. This is a typical usage for time based scheduled job. The job has a nominal schedule time that will be different from the actual time it is running at. + +- **parent**: Captures the parent job and Run when the run was spawn from a parent run. For example in the case of Airflow, there's a run for the DAG that then spawns runs for individual tasks that would refer to the parent run as the DAG run. 
Similarly when a SparkOperator starts a Spark job, this creates a separate run that refers to the task run as its parent. + +- **errorMessage**: Captures potential error message, programming language - and optionally stack trace - with which the run failed. + +#### Job Facets + +- **sourceCodeLocation**: Captures the source code location and version (e.g., the git sha) of the job. + +- **sourceCode**: Captures the language (e.g., Python) and actual source code of the job. + +- **sql**: Capture the SQL query if this job is a SQL query. + +- **ownership**: Captures the owners of the job. + +#### Dataset Facets + +- **schema**: Captures the schema of the dataset. + +- **dataSource**: Captures the database instance containing this dataset (e.g., Database schema, Object store bucket, etc.) + +- **lifecycleStateChange**: Captures the lifecycle states of the dataset (e.g., alter, create, drop, overwrite, rename, truncate). + +- **version**: Captures the dataset version when versioning is defined by database (e.g., Iceberg snapshot ID). + +- [**columnLineage**](https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/ColumnLineageDatasetFacet.json): Captures the column-level lineage. + +- **ownership**: Captures the owners of the dataset. + +#### Input Dataset Facets + +- **dataQualityMetrics**: Captures dataset-level and column-level data quality metrics when scanning a dataset with a DataQuality library (row count, byte size, null count, distinct count, average, min, max, quantiles). + +- **dataQualityAssertions**: Captures the result of running data tests on a dataset or its columns. + +#### Output Dataset Facets +- **outputStatistics**: Captures the size of the output written to a dataset (row count and byte size). + diff --git a/versioned_docs/version-1.26.0/guides/inter-dag-deps.png b/versioned_docs/version-1.26.0/guides/inter-dag-deps.png new file mode 100644 index 0000000..ed6d376 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/inter-dag-deps.png differ diff --git a/versioned_docs/version-1.26.0/guides/job_failure.png b/versioned_docs/version-1.26.0/guides/job_failure.png new file mode 100644 index 0000000..332cfc2 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/job_failure.png differ diff --git a/versioned_docs/version-1.26.0/guides/jupyter_home.png b/versioned_docs/version-1.26.0/guides/jupyter_home.png new file mode 100644 index 0000000..8eabf14 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/jupyter_home.png differ diff --git a/versioned_docs/version-1.26.0/guides/jupyter_new_notebook.png b/versioned_docs/version-1.26.0/guides/jupyter_new_notebook.png new file mode 100644 index 0000000..ffb1984 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/jupyter_new_notebook.png differ diff --git a/versioned_docs/version-1.26.0/guides/marquez_bash_jobs.png b/versioned_docs/version-1.26.0/guides/marquez_bash_jobs.png new file mode 100644 index 0000000..536cbdb Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/marquez_bash_jobs.png differ diff --git a/versioned_docs/version-1.26.0/guides/marquez_bigquery_dataset_latest.png b/versioned_docs/version-1.26.0/guides/marquez_bigquery_dataset_latest.png new file mode 100644 index 0000000..6dceab5 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/marquez_bigquery_dataset_latest.png differ diff --git a/versioned_docs/version-1.26.0/guides/marquez_home.png b/versioned_docs/version-1.26.0/guides/marquez_home.png new file mode 100644 index 
0000000..1b5e6da Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/marquez_home.png differ diff --git a/versioned_docs/version-1.26.0/guides/marquez_job_facets.png b/versioned_docs/version-1.26.0/guides/marquez_job_facets.png new file mode 100644 index 0000000..ac70834 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/marquez_job_facets.png differ diff --git a/versioned_docs/version-1.26.0/guides/marquez_job_graph.png b/versioned_docs/version-1.26.0/guides/marquez_job_graph.png new file mode 100644 index 0000000..706bec7 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/marquez_job_graph.png differ diff --git a/versioned_docs/version-1.26.0/guides/marquez_output_dataset_latest.png b/versioned_docs/version-1.26.0/guides/marquez_output_dataset_latest.png new file mode 100644 index 0000000..c4b2aa2 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/marquez_output_dataset_latest.png differ diff --git a/versioned_docs/version-1.26.0/guides/marquez_output_dataset_version.png b/versioned_docs/version-1.26.0/guides/marquez_output_dataset_version.png new file mode 100644 index 0000000..4541747 Binary files /dev/null and b/versioned_docs/version-1.26.0/guides/marquez_output_dataset_version.png differ diff --git a/versioned_docs/version-1.26.0/guides/spark-connector.md b/versioned_docs/version-1.26.0/guides/spark-connector.md new file mode 100644 index 0000000..c2f3da8 --- /dev/null +++ b/versioned_docs/version-1.26.0/guides/spark-connector.md @@ -0,0 +1,51 @@ +--- +sidebar_position: 6 +--- + +# OpenLineage for Spark Connectors + +### What is OpenLineage +OpenLineage is an open standard for lineage data collection. It tracks metadata about core objects - datasets, jobs and runs - that represent how data is moving through the data pipelines. + +Besides describing standard events, OpenLineage project develops integration for popular open source data processing tools, like Apache Airflow, dbt, Apache Flink and Apache Spark, that allow users to automatically gather lineage metadata while the data jobs are running. +How does Spark OpenLineage integration work? +OpenLineage implements an instance of SparkListener interface, which allows it to listen to Spark events emitted during executions. Amongst those events are those that let us know that Spark Job has started or stopped running, like SparkListenerJobStart, SparkListenerJobEnd. When an OL listener receives that event, it can look up the LogicalPlan of a job, which represents a high level representation of a computation that Spark plans to do. + +LogicalPlan has a tree-like structure. The leafs of the tree are sources of the data that describe where and how Spark is reading the input datasets. Then, data flows through intermediary nodes that describe some computation to be performed - like joins, or reshaping the data structure - like some projection. At the end, the root node describes where the data will end up. The peculiarity of that structure is that there is only one output node - if you write data to multiple output datasets, it’s represented as multiple jobs and LogicalPlan trees. + +### What has OpenLineage to do with Spark connectors? + +LogicalPlan is an abstract class. The particular operations, whether reading data, processing it or writing it are implemented as a subclass of it, with attributes and methods allowing OL listener to interpret that data. 
OL Spark integration has a concept of visitors that receive nodes of the LogicalPlan - visitor defines the conditions - like, whether that LogicalPlan node is a particular subclass, like SaveIntoDataSourceCommand, or it’s received in particular phase of a Spark Job’s lifetime - and how to process data given it wants to do it. + +Spark Connectors, whether included by default in Spark or external to it, have few options on how to implement the necessary operations. This is a very simplified explanation. + +First is to implement your own LogicalPlan nodes together with extending Spark Planner to make sure the right LogicalPlan is generated. This is the hardest route, and it’s how several internal Spark connectors work, including Hive. + +Second is to implement the DataSourceV1 API. This includes implementing interfaces like RelationProvider, FileFormat. This allows users to read or write data using standard DataFrame APIs: +val people: DataFrame = spark.read + .format("csv") + .load("people.csv") + +Third is to implement the DataSourceV2 API. This includes implementing a custom Table interface that represents a dataset, with Traits that allow you to specify implementation of particular operations and optimizations (like predicate pushdown). This also allows users to read or write data using standard DataFrame APIs - Spark detects whether the connector uses V1 or V2 interface and uses correct code paths. + +The point of using DataSource APIs for connectors is that they reuse several structures of Spark, including standard user APIs, and LogicalPlans generated for those connectors are implemented: the planner will check whether relevant format is available, and for example for reading from V2 interface will generate DataSourceV2Relation leaf node, that uses relevant Table implementation under the hood coming from particular connector jar. + +To achieve full coverage of Spark operations, OL has to cover implementation of connectors whether they use V1 or V2 interface - it needs to understand the interface’s structure, what LogicalPlan nodes they use and implement support for it in a way that allows us to expose correct dataset naming from each connector - with possibly more metadata. + +### What does OpenLineage want to do with Spark connectors? + +Right now, OL integration implements support for each connector in the OpenLineage repository. This means OL Spark integration doesn’t only have to understand what LogicalPlan Spark will generate for standard Spark constructs, but also the underlying implementations of DataSource interfaces - for example, OL has an IcebergHandler class that handles getting correct dataset names of Iceberg tables, using internal Iceberg connector classes. + +This could be improved for a few reasons. + +First, the connector can change in a way that breaks our interface and they don’t know anything about it. The OpenLineage team also most likely won’t know anything about it until it gets a bug report. + +Second, even when OL receives a bug report, it has to handle the error in a backwards-compatible manner. Users can use different connector versions with different Spark versions on different Scala versions… The matrix of possible configurations vastly exceeds separate implementations for different versions, so the only solution that is realistically doable is using reflection to catch the change and try different code paths. This happens for the BigQuery connector. 
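+
+As a rough illustration of that reflection-based fallback pattern, here is a minimal Python sketch (the actual integration code is Java, and the accessor names below are hypothetical stand-ins for fields that can move between connector releases):
+
+```python
+def extract_table_id(relation):
+    """Probe for whichever accessor this connector version actually exposes."""
+    for accessor in ("tableId", "getTableId"):  # hypothetical names: older vs. newer release
+        attr = getattr(relation, accessor, None)
+        if attr is None:
+            continue
+        # Some versions expose a plain attribute, others a getter method.
+        return attr() if callable(attr) else attr
+    # No known accessor found: report no dataset rather than failing the user's job.
+    return None
+```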
+ +To solve this problem, OL wants to migrate responsibility to exposing lineage metadata directly to connectors, and has created interfaces for Spark connectors to implement. Given implementation of those interfaces, OL Spark integration can just use the exposed data without need to understand the implementation. It allows connectors to test whether they expose correct lineage metadata, and migrate the internals without breaking any OL Spark integration code. + +The interfaces provide a way to integrate OL support for a variety of ways in which Spark connectors are implemented. For example, if connector implements RelationProvider, OL interfaces allow you to extend it with class LineageRelationProvider, that tells the OL Spark integration that it can call getLineageDatasetIdentifier on it, without the need to use other, internal methods of the RelationProvider. + +It requires the connector to depend on two maven packages: spark-extension-interfaces and spark-extension-entrypoint. The first one contains the necessary classes to implement support for OpenLineage, however, to maintain compatibility with other connectors (that might rely on a different version of the same jar) the relocation of the package is required. The second package, spark-extension-entrypoint acts like a “pointer” for the actual implementation in the connector, allowing OpenLineage-Spark integration use those relocated classes. + +The detailed documentation for interfaces is [here](https://openlineage.io/docs/development/developing/spark/built_in_lineage/). diff --git a/versioned_docs/version-1.26.0/guides/spark.md b/versioned_docs/version-1.26.0/guides/spark.md new file mode 100644 index 0000000..387e088 --- /dev/null +++ b/versioned_docs/version-1.26.0/guides/spark.md @@ -0,0 +1,201 @@ +--- +sidebar_position: 2 +--- + +# Using OpenLineage with Spark + +#### Adapted from a [blog post](https://openlineage.io/blog/openlineage-spark/) by Michael Collado + +:::caution +This guide was developed using an **earlier version** of this integration and may require modification for recent releases. +::: + +Adding OpenLineage to Spark is refreshingly uncomplicated, and this is thanks to Spark's SparkListener interface. OpenLineage integrates with Spark by implementing SparkListener and collecting information about jobs executed inside a Spark application. To activate the listener, add the following properties to your Spark configuration in your cluster's `spark-defaults.conf` file or, alternatively, add them to specific jobs on submission via the `spark-submit` command: + +``` +spark.jars.packages io.openlineage:openlineage-spark:{{PREPROCESSOR:OPENLINEAGE_VERSION}} +spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener +``` + +Once activated, the listener needs to know where to report lineage events, as well as the namespace of your jobs. Add the following additional configuration lines to your `spark-defaults.conf` file or your Spark submission script: + +``` +spark.openlineage.transport.url {your.openlineage.host} +spark.openlineage.transport.type {your.openlineage.transport.type} +spark.openlineage.namespace {your.openlineage.namespace} +``` + +## Running Spark with OpenLineage + +### Prerequisites + +- Docker Desktop +- git +- Google Cloud Service account +- Google Cloud Service account JSON key file + +Note: your Google Cloud account should have access to BigQuery and read/write access to your GCS bucket. Giving your key file an easy-to-remember name (bq-spark-demo.json) is recommended. 
Finally, if using macOS Monterey (macOS 12), port 5000 will have to be released by [disabling the AirPlay Receiver](https://developer.apple.com/forums/thread/682332). + +### Instructions + +Clone the OpenLineage project, navigate to the spark directory, and create a directory for your Google Cloud Service credentials: + +``` +git clone https://github.com/OpenLineage/OpenLineage +cd integration/spark +mkdir -p docker/notebooks/gcs +``` + +Copy your Google Cloud Service credentials file into that directory, then run: + +``` +docker-compose up +``` + +This launches a Jupyter notebook with Spark as well as a Marquez API endpoint already installed to report lineage. Once the notebook server is up and running, you should see something like the following in the logs: + +``` +notebook_1 | [I 21:43:39.014 NotebookApp] Jupyter Notebook 6.4.4 is running at: +notebook_1 | [I 21:43:39.014 NotebookApp] http://082cb836f1ec:8888/?token=507af3cf9c22f627f6c5211d6861fe0804d9f7b19a93ca48 +notebook_1 | [I 21:43:39.014 NotebookApp] or http://127.0.0.1:8888/?token=507af3cf9c22f627f6c5211d6861fe0804d9f7b19a93ca48 +notebook_1 | [I 21:43:39.015 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). +``` + +Copy the URL with 127.0.0.1 as the hostname from your own log (the token will be different from this one) and paste it into your browser window. You should have a blank Jupyter notebook environment ready to go. + +![Jupyter notebook environment](jupyter_home.png) + +Click on the notebooks directory, then click on the New button to create a new Python 3 notebook. + +![Jupyter new notebook](jupyter_new_notebook.png) + +In the first cell in the window paste the below text. Update the GCP project and bucket names and the service account credentials file, then run the code: + +``` +from pyspark.sql import SparkSession +import urllib.request + +# Download dependencies for BigQuery and GCS +gc_jars = ['https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop3-2.1.1/gcs-connector-hadoop3-2.1.1-shaded.jar', + 'https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/bigquery-connector/hadoop3-1.2.0/bigquery-connector-hadoop3-1.2.0-shaded.jar', + 'https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.22.2/spark-bigquery-with-dependencies_2.12-0.22.2.jar'] + +files = [urllib.request.urlretrieve(url)[0] for url in gc_jars] + +# Set these to your own project and bucket +project_id = 'bq-openlineage-spark-demo' +gcs_bucket = 'bq-openlineage-spark-demo-bucket' +credentials_file = '/home/jovyan/notebooks/gcs/bq-spark-demo.json' + +spark = (SparkSession.builder.master('local').appName('openlineage_spark_test') + .config('spark.jars', ",".join(files)) + + # Install and set up the OpenLineage listener + .config('spark.jars.packages', 'io.openlineage:openlineage-spark:{{PREPROCESSOR:OPENLINEAGE_VERSION}}') + .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener') + .config('spark.openlineage.transport.url', 'http://marquez-api:5000') + .config('spark.openlineage.transport.type', 'http') + .config('spark.openlineage.namespace', 'spark_integration') + + # Configure the Google credentials and project id + .config('spark.executorEnv.GCS_PROJECT_ID', project_id) + .config('spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS', '/home/jovyan/notebooks/gcs/bq-spark-demo.json') + .config('spark.hadoop.google.cloud.auth.service.account.enable', 'true') + 
.config('spark.hadoop.google.cloud.auth.service.account.json.keyfile', credentials_file) + .config('spark.hadoop.fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem') + .config('spark.hadoop.fs.AbstractFileSystem.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS') + .config("spark.hadoop.fs.gs.project.id", project_id) + .getOrCreate()) +``` + +Most of this is boilerplate for installing the BigQuery and GCS libraries in the notebook environment. This also sets the configuration parameters to tell the libraries what GCP project to use and how to authenticate with Google. The parameters specific to OpenLineage are the four already mentioned: `spark.jars.packages`, `spark.extraListeners`, `spark.openlineage.host`, `spark.openlineage.namespace`. Here, the host has been configured to be the `marquez-api` container started by Docker. + +With OpenLineage configured, it's time to get some data. The below code populates Spark DataFrames with data from two COVID-19 public data sets. Create a new cell in the notebook and paste the following: + +``` +from pyspark.sql.functions import expr, col + +mask_use = spark.read.format('bigquery') \ + .option('parentProject', project_id) \ + .option('table', 'bigquery-public-data:covid19_nyt.mask_use_by_county') \ + .load() \ + .select(expr("always + frequently").alias("frequent"), + expr("never + rarely").alias("rare"), + "county_fips_code") + +opendata = spark.read.format('bigquery') \ + .option('parentProject', project_id) \ + .option('table', 'bigquery-public-data.covid19_open_data.covid19_open_data') \ + .load() \ + .filter("country_name == 'United States of America'") \ + .filter("date == '2021-10-31'") \ + .select("location_key", + expr('cumulative_deceased/(population/100000)').alias('deaths_per_100k'), + expr('cumulative_persons_fully_vaccinated/(population - population_age_00_09)').alias('vaccination_rate'), + col('subregion2_code').alias('county_fips_code')) +joined = mask_use.join(opendata, 'county_fips_code') + +joined.write.mode('overwrite').parquet(f'gs://{gcs_bucket}/demodata/covid_deaths_and_mask_usage/') +``` + +Some background on the above: the `covid19_open_data` table is being filtered to include only U.S. data and data for Halloween 2021. The `deaths_per_100k` data point is being calculated using the existing `cumulative_deceased` and `population` columns and the `vaccination_rate` using the total population, subtracting the 0-9 year olds, since they were ineligible for vaccination at the time. For the `mask_use_by_county` data, "rarely" and "never" data are being combined into a single number, as are "frequently" and "always." The columns selected from the two datasets are then stored in GCS. + +Now, add a cell to the notebook and paste this line: + +``` +spark.read.parquet(f'gs://{gcs_bucket}/demodata/covid_deaths_and_mask_usage/').count() +``` + +The notebook should print a warning and a stacktrace (probably a debug statement), then return a total of 3142 records. + +Now that the pipeline is operational it is available for lineage collection. + +The `docker-compose.yml` file that ships with the OpenLineage repo includes only the Jupyter notebook and the Marquez API. To explore the lineage visually, start up the Marquez web project. 
Without terminating the existing docker containers, run the following command in a new terminal: + +``` +docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 -e WEB_PORT=3000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1 +``` + +Next, open a new browser tab and navigate to http://localhost:3000, which should look like this: + +![Marquez home](marquez_home.png) + +Note: the `spark_integration` namespace is automatically chosen because there are no other namespaces available. Three jobs are listed on the jobs page of the UI. They all start with `openlineage_spark_test`, which is the appName passed to the SparkSession when the first cell of the notebook was built. Each query execution or RDD action is represented as a distinct job and the name of the action is appended to the application name to form the name of the job. Clicking on the `openlineage_spark_test.execute_insert_into_hadoop_fs_relation_command` node calls up the lineage graph for our notebook: + +![Marquez job graph](marquez_job_graph.png) + +The graph shows that the `openlineage_spark_test.execute_insert_into_hadoop_fs_relation_command` job reads from two input datasets, `bigquery-public-data.covid19_nyt.mask_use_by_county` and `bigquery-public-data.covid19_open_data.covid19_open_data`, and writes to a third dataset, `/demodata/covid_deaths_and_mask_usage`. The namespace is missing from that third dataset, but the fully qualified name is `gs:///demodata/covid_deaths_and_mask_usage`. + +The bottom bar shows some interesting data that was collected from the Spark job. Dragging the bar up expands the view to offer a closer look. + +![Marquez job facets](marquez_job_facets.png) + +Two facets always collected from Spark jobs are the `spark_version` and the `spark.logicalPlan`. The first simply reports what version of Spark was executing, as well as the version of the openlineage-spark library. This is helpful for debugging job runs. + +The second facet is the serialized optimized LogicalPlan Spark reports when the job runs. Spark’s query optimization can have dramatic effects on the execution time and efficiency of the query job. Tracking how query plans change over time can significantly aid in debugging slow queries or `OutOfMemory` errors in production. + +Clicking on the first BigQuery dataset provides information about the data: + +![Marquez BigQuery dataset](marquez_bigquery_dataset_latest.png) + +One can see the schema of the dataset as well as the datasource. + +Similar information is available about the dataset written to in GCS: + +![Marquez output dataset](marquez_output_dataset_latest.png) + +As in the BigQuery dataset, one can see the output schema and the datasource — in this case, the `gs://` scheme and the name of the bucket written to. + +In addition to the schema, one can also see a stats facet, reporting the number of output records and bytes as -1. + +The VERSIONS tab on the bottom bar would display multiple versions if there were any (not the case here). Clicking on the version shows the same schema and statistics facets, but they are specific to the version selected. + +![Marquez output dataset version](marquez_output_dataset_version.png) + +In production, this dataset would have many versions, as each time a job runs a new version of the dataset is created. This permits the tracking of changes to the statistics and schema over time, aiding in debugging slow jobs or data quality issues and job failures. 
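+
+If you want to look at those versions programmatically rather than through the UI, the Marquez REST API exposes them as well. The following is a minimal sketch using the `requests` library against the local `marquez-api` container from this guide, assuming Marquez's dataset-versions endpoint; the namespace and dataset values are placeholders to replace with the names shown in the UI:
+
+```python
+import urllib.parse
+
+import requests
+
+MARQUEZ_API = "http://localhost:5000"  # the marquez-api container from docker-compose
+
+# Placeholders: substitute the namespace and dataset name shown in the Marquez UI.
+namespace = urllib.parse.quote("gs://bq-openlineage-spark-demo-bucket", safe="")
+dataset = urllib.parse.quote("demodata/covid_deaths_and_mask_usage", safe="")
+
+response = requests.get(
+    f"{MARQUEZ_API}/api/v1/namespaces/{namespace}/datasets/{dataset}/versions"
+)
+response.raise_for_status()
+print(response.json())  # one entry per stored version, including its schema and facets
+```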
+ +The final job in the UI is a HashAggregate job. This represents the `count()` method called at the end to show the number of records in the dataset. Rather than a `count()`, this could easily be a `toPandas()` call or some other job that reads and processes that data -- perhaps one that stores output back into GCS or updates a Postgres database, publishes a new model, etc. Regardless of where the output gets stored, the OpenLineage integration allows one to see the entire lineage graph, unifying datasets in object stores, relational databases, and more traditional data warehouses. + +### Conclusion + +The Spark integration from OpenLineage offers users insights into graphs of datasets stored in object stores like S3, GCS, and Azure Blob Storage, as well as BigQuery and relational databases like Postgres. Now with support for Spark 3.1, OpenLineage offers visibility into more environments, such as Databricks, EMR, and Dataproc clusters. \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/index.md b/versioned_docs/version-1.26.0/index.md new file mode 100644 index 0000000..649e5f8 --- /dev/null +++ b/versioned_docs/version-1.26.0/index.md @@ -0,0 +1,65 @@ +--- +sidebar_position: 1 +--- + +# About OpenLineage + +OpenLineage is an open framework for data lineage collection and analysis. At its core is an extensible specification that systems can use to interoperate with lineage metadata. + +### Design + +OpenLineage is an _Open Standard_ for lineage metadata collection designed to record metadata for a _job_ in execution. + +The standard defines a generic model of _dataset_, _job_, and _run_ entities uniquely identified using consistent naming strategies. The core model is highly extensible via facets. A **facet** is user-defined metadata and enables entity enrichment. We encourage you to familiarize yourself with the core model below: + +![image](./model.svg) + + +### How OpenLineage Benefits the Ecosystem + +Below, we illustrate the challenges of collecting lineage metadata from multiple sources, schedulers and/or data processing frameworks. We then outline the design benefits of defining an _Open Standard_ for lineage metadata collection. + +#### BEFORE: + +![image](./before-ol.svg) + +* Each project has to instrument its own custom metadata collection integration, therefore duplicating efforts. +* Integrations are external and can break with new versions of the underlying scheduler and/or data processing framework, requiring projects to ensure _backwards_ compatibility. + +#### WITH OPENLINEAGE: + +![image](./with-ol.svg) + +* Integration efforts are shared _across_ projects. +* Integrations can be _pushed_ to the underlying scheduler and/or data processing framework; no longer does one need to play catch up and ensure compatibility! + +## Scope +OpenLineage defines the metadata for running jobs and their corresponding events. +A configurable backend allows the user to choose what protocol to send the events to. + ![Scope](./scope.svg) + +## Core model + + ![Model](./datamodel.svg) + + A facet is an atomic piece of metadata attached to one of the core entities. + See the spec for more details. + +## Spec +The [specification](https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md) is defined using OpenAPI and allows extension through custom facets. + +## Integrations + +The OpenLineage repository contains integrations with several systems. 
+ +- [Apache Airflow](https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow) +- [Apache Flink](https://github.com/OpenLineage/OpenLineage/tree/main/integration/flink) +- [Apache Spark](https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark) +- [Dagster](https://github.com/OpenLineage/OpenLineage/tree/main/integration/dagster) +- [dbt](https://github.com/OpenLineage/OpenLineage/tree/main/integration/dbt) +- [SQL](https://github.com/OpenLineage/OpenLineage/tree/main/integration/sql) + +## Related projects +- [Marquez](https://marquezproject.ai/): Marquez is an [LF AI & DATA](https://lfaidata.foundation/) project to collect, aggregate, and visualize a data ecosystem's metadata. It is the reference implementation of the OpenLineage API. + - [OpenLineage collection implementation](https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/api/OpenLineageResource.java) +- [Egeria](https://egeria.odpi.org/): Egeria Open Metadata and Governance. A metadata bus. \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/integrations/_category_.json b/versioned_docs/version-1.26.0/integrations/_category_.json new file mode 100644 index 0000000..c2495ff --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Integrations", + "position": 5 +} diff --git a/versioned_docs/version-1.26.0/integrations/about.md b/versioned_docs/version-1.26.0/integrations/about.md new file mode 100644 index 0000000..bedd9ba --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/about.md @@ -0,0 +1,55 @@ +--- +sidebar_position: 1 +--- + +# OpenLineage Integrations + +## Capability Matrix + +:::caution +This matrix is not yet complete. +::: + +The matrix below shows the relationship between an input facet and various mechanisms OpenLineage uses to gather metadata. Not all mechanisms collect data to fill in all facets, and some facets are specific to one integration. +✔️: The mechanism does implement this facet. +✖️: The mechanism does not implement this facet. +An empty column means it is not yet documented if the mechanism implements this facet. + +| Mechanism | Integration | Metadata Gathered | InputDatasetFacet | OutputDatasetFacet | SqlJobFacet | SchemaDatasetFacet | DataSourceDatasetFacet | DataQualityMetricsInputDatasetFacet | DataQualityAssertionsDatasetFacet | SourceCodeJobFacet | ExternalQueryRunFacet | DocumentationDatasetFacet | SourceCodeLocationJobFacet | DocumentationJobFacet | ParentRunFacet | +|:-------------------|:------------------|:----------------------------------------------|:------------------|:-------------------|:------------|:-------------------|:-----------------------|:------------------------------------|:----------------------------------|:-------------------|:----------------------|:--------------------------|:---------------------------|:----------------------|:---------------| +| SnowflakeOperator* | Airflow Extractor | Lineage
Job duration | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✖️ | ✖️ | | | | | | | +| BigQueryOperator** | Airflow Extractor | Lineage
Schema details
Job duration | ✔️ | ✔️ | | ✔️ | | | | | | | | | | +| PostgresOperator* | Airflow Extractor | Lineage
Job duration | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | | | | | | | | | +| SqlCheckOperators | Airflow Extractor | Lineage
Data quality assertions | ✔️ | ✖️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | | | | | | | +| dbt | dbt Project Files | Lineage
Row count
Byte count. | ✔️ | | | | | | | | | | | | | +| Great Expectations | Action | Data quality assertions | ✔️ | | | | | ✔️ | ✔️ | | | | | | | +| Spark | SparkListener | Schema
Row count
Column lineage | ✔️ | | | | | | | | | | | | | | +| Snowflake*** | Access History | Lineage | | | | | | | | | | | | | | | + +\* Uses the Rust SQL parser +\*\* Uses the BigQuery API +\*\*\* Uses Snowflake query logs + +## Compatibility matrix + +This matrix shows which data sources are known to work with each integration, along with the minimum versions required in the target system or framework. + +| Platform | Version | Data Sources | +|:-------------------|:-------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Apache Airflow | 1.10+<br />
2.0+ | PostgreSQL
MySQL
Snowflake
Amazon Athena
Amazon Redshift
Amazon SageMaker
Amazon S3 Copy and Transform
Google BigQuery
Google Cloud Storage
Great Expectations
SFTP
FTP | +| Apache Spark | 2.4+ | JDBC
HDFS
Google Cloud Storage
Google BigQuery
Amazon S3
Azure Blob Storage
Azure Data Lake Gen2
Azure Synapse | +| dbt | 0.20+ | Snowflake
Google BigQuery | + +## Integration strategies + +:::info +This section could use some more detail! You're welcome to contribute using the Edit link at the bottom. +::: + +### Integrating with pipelines + +![Integrating with Pipelines](integrate-pipelines.svg) + +### Integrating with data sources + +![Integrating with Data Sources](integrate-datasources.svg) diff --git a/versioned_docs/version-1.26.0/integrations/airflow/_category_.json b/versioned_docs/version-1.26.0/integrations/airflow/_category_.json new file mode 100644 index 0000000..e836aa5 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/airflow/_category_.json @@ -0,0 +1,8 @@ +{ + "label": "Apache Airflow", + "position": 4, + "link": { + "type": "doc", + "id": "airflow" + } +} \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/integrations/airflow/af-schematic.svg b/versioned_docs/version-1.26.0/integrations/airflow/af-schematic.svg new file mode 100644 index 0000000..c1e7b36 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/airflow/af-schematic.svg @@ -0,0 +1,229 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/versioned_docs/version-1.26.0/integrations/airflow/airflow.md b/versioned_docs/version-1.26.0/integrations/airflow/airflow.md new file mode 100644 index 0000000..9c0cdbe --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/airflow/airflow.md @@ -0,0 +1,52 @@ +--- +sidebar_position: 1 +title: Apache Airflow +--- + +:::caution +This page is about Airflow's external integration that works mainly for Airflow versions \<2.7. +[If you're using Airflow 2.7+, look at native Airflow OpenLineage provider documentation.](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html)

+ +The ongoing development and enhancements will be focused on the `apache-airflow-providers-openlineage` package, +while the `openlineage-airflow` will primarily be updated for bug fixes. See [all Airflow versions supported by this integration](older.md#supported-airflow-versions) +::: + + +**Airflow** is a widely-used workflow automation and scheduling platform that can be used to author and manage data pipelines. Airflow uses workflows made of directed acyclic graphs (DAGs) of tasks. To learn more about Airflow, check out the Airflow [documentation](https://airflow.apache.org/docs/apache-airflow/stable/index.html). + +## How does Airflow work with OpenLineage? + +Understanding complex inter-DAG dependencies and providing up-to-date runtime visibility into DAG execution can be challenging. OpenLineage integrates with Airflow to collect DAG lineage metadata so that inter-DAG dependencies are easily maintained and viewable via a lineage graph, while also keeping a catalog of historical runs of DAGs. + +![image](./af-schematic.svg) + + +The DAG metadata collected can answer questions like: + +* Why has a DAG failed? +* Why has the DAG runtime increased after a code change? +* What are the upstream dependencies of a DAG? + + +## How can I use this integration? + +To instrument your Airflow instance with OpenLineage, follow [these instructions](usage.md). + +## How to add lineage coverage for more operators? + +OpenLineage provides a set of `extractors` that extract lineage from operators. + +If you want to add lineage coverage for your own custom operators, follow these [instructions to add lineage to operators](default-extractors.md). + +If you want to add coverage for operators you can not modify, follow [instructions to add custom extractors](extractors/custom-extractors.md). + +If you want to expose lineage as a one off in your workflow, [you can also manually annotate the tasks in your DAG](manual.md). + +## Where can I learn more? + +* Take a look at Marquez's Airflow [example](https://github.com/MarquezProject/marquez/tree/main/examples/airflow) to learn how to enable OpenLineage metadata collection for Airflow DAGs and troubleshoot failing DAGs using Marquez. +* Watch [Data Lineage with OpenLineage and Airflow](https://www.youtube.com/watch?v=2s013GQy1Sw) + +## Feedback + +You can reach out to us on [slack](https://join.slack.com/t/openlineage/shared_invite/zt-2u4oiyz5h-TEmqpP4fVM5eCdOGeIbZvA) and leave us feedback! diff --git a/versioned_docs/version-1.26.0/integrations/airflow/default-extractors.md b/versioned_docs/version-1.26.0/integrations/airflow/default-extractors.md new file mode 100644 index 0000000..8cb012b --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/airflow/default-extractors.md @@ -0,0 +1,369 @@ +--- +sidebar_position: 4 +title: Exposing Lineage in Airflow Operators +--- + +:::caution +This page is about Airflow's external integration that works mainly for Airflow versions \<2.7. +[If you're using Airflow 2.7+, look at native Airflow OpenLineage provider documentation.](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html)

+ +The ongoing development and enhancements will be focused on the `apache-airflow-providers-openlineage` package, +while the `openlineage-airflow` will primarily be updated for bug fixes. See [all Airflow versions supported by this integration](older.md#supported-airflow-versions) +::: + +OpenLineage 0.17.0+ makes adding lineage to your data pipelines easy through support of direct modification of Airflow operators. This means that custom operators—built in-house or forked from another project—can provide you and your team with lineage data without requiring modification of the OpenLineage project. The data will still go to your lineage backend of choice, most commonly using the `OPENLINEAGE_URL` environment variable. + +Lineage extraction works a bit differently under the hood starting with OpenLineage 0.17.0. While extractors in the OpenLineage project have a getter method for operator names that they’re associated with, the default extractor looks for two specific methods in the operator itself and calls them directly if found. This means that implementation now consists of just two methods in your operator. + +Those methods are `get_openlineage_facets_on_start()` and `get_openlineage_facets_on_complete()`, called when the operator is first scheduled to run and when the operator has finished execution respectively. Either, or both, of the methods may be implemented by the operator. + +In the rest of this doc, you will see how to write these methods within an operator class called `DfToGcsOperator`. This operator moves a Dataframe from an arbitrary source table using a supplied Python callable to a specified path in GCS. Thorough understanding of the `__init__()` and `execute()` methods of the operator is not required, but an abbreviated version of each method is given below for context. The final two methods in the class are `get_openlineage_facets_on_start()` and `get_openlineage_facets_on_complete()`, which we will be implementing piece-by-piece in the rest of the doc. They are provided here in their entirety for completeness. + +```python +from openlineage.airflow.extractors.base import OperatorLineage +from openlineage.client.facet import ( + DataSourceDatasetFacet, + DocumentationJobFacet, + OwnershipJobFacet, + OwnershipJobFacetOwners, + SchemaDatasetFacet, + SchemaField, +) +from openlineage.client.run import Dataset + + +class DfToGcsOperator(): + def __init__( + self, + task_id, + python_callable, + data_source, + bucket=None, + table=None, + security_group, + pipeline_phase, + col_types=None, + check_cols=True, + **kwargs, + ): + """Initialize a DfToGcsOperator.""" + super().__init__(task_id=task_id, **kwargs) + self.python_callable = python_callable + self.data_source = data_source + self.table = table if table is not None else task_id + self.bucket = bucket + self.security_group = security_group + self.pipeline_phase = pipeline_phase + # col_types is a dict that stores expected column names and types, + self.col_types = col_types + self.check_cols = check_cols + + self.base_path = "/".join( + [self.security_group, self.pipeline_phase, self.data_source, self.table] + ) + # Holds meta information about the dataframe, col names and col types, + # that are used in the extractor. + self.df_meta = None + + def execute(self, context): + """ + Run a DfToGcs task. + + The task will run the python_callable and save + the resulting dataframe to GCS under the proper object path + ////. + """ + ... 
+ + df = get_python_callable_result(self.python_callable, context) + if len(df) > 0: + df.columns = [clean_column_name(c) for c in df.columns] + if self.col_types and self.check_cols: + check_cols = [c.lower().strip() for c in self.col_types.keys()] + missing = [m for m in check_cols if m not in df.columns] + assert ( + len(missing) == 0 + ), "Columns present in col_types but not in DataFrame: " + ",".join( + missing + ) + + # ----------- # + # Save to GCS # + # ----------- # + + # Note: this is an imported helper function. + df_to_gcs(df, self.bucket, save_to_path) + + # ----------- # + # Return Data # + # ----------- # + + # Allow us to extract additional lineage information + # about all of the fields available in the dataframe + self.df_meta = extract_df_fields(df) + else: + print("Empty dataframe, no artifact saved to GCS.") + + def extract_df_fields(df): + from openlineage.common.dataset import SchemaField + """Extract a list of SchemaFields from a DataFrame.""" + fields = [] + for (col, dtype) in zip(df.columns, df.dtypes): + fields.append(SchemaField(name=col, type=str(dtype))) + return fields + + def get_openlineage_facets_on_start(self): + """Add lineage to DfToGcsOperator on task start.""" + if not self.bucket: + ol_bucket = get_env_bucket() + else: + ol_bucket = self.bucket + + input_uri = "://".join([self.data_source, self.table]) + input_source = DataSourceDatasetFacet( + name=self.table, + uri=input_uri, + ) + + input_facet = { + "datasource": input_source, + "schema": SchemaDatasetFacet( + fields=[ + SchemaField(name=col_name, type=col_type) + for col_name, col_type in self.col_types.items() + ] + ), + } + + input = Dataset(namespace=self.data_source, name=self.table, facets=input_facet) + + output_namespace = "gs://" + ol_bucket + output_name = self.base_path + output_uri = "/".join( + [ + output_namespace, + output_name, + ] + ) + + output_source = DataSourceDatasetFacet( + name=output_name, + uri=output_uri, + ) + + output_facet = { + "datasource": output_source, + "schema": SchemaDatasetFacet( + fields=[ + SchemaField(name=col_name, type=col_type) + for col_name, col_type in self.col_types.items() + ] + ), + } + + output = Dataset( + namespace=output_namespace, + name=output_name, + facets=output_facet, + ) + + return OperatorLineage( + inputs=[input], + outputs=[output], + run_facets={}, + job_facets={ + "documentation": DocumentationJobFacet( + description=f""" + Takes data from the data source {input_uri} + and puts it in GCS at the path: {output_uri} + """ + ), + "ownership": OwnershipJobFacet( + owners=[OwnershipJobFacetOwners(name=self.owner, type=self.email)] + ), + } + ) + + def get_openlineage_facets_on_complete(self, task_instance): + """Add lineage to DfToGcsOperator on task completion.""" + starting_facets = self.get_openlineage_facets_on_start() + if task_instance.task.df_meta is not None: + for i in starting_facets.inputs: + i.facets["SchemaDatasetFacet"].fields = task_instance.task.df_meta + else: + starting_facets.run_facets = { + "errorMessage": ErrorMessageRunFacet( + message="Empty dataframe, no artifact saved to GCS.", + programmingLanguage="python" + ) + } + return starting_facets +``` + +## Implementing lineage in an operator + +Not surprisingly, you will need an operator class to implement lineage collection in an operator. Here, we’ll use the `DfToGcsOperator`, a custom operator created by the Astronomer Data team to load arbitrary dataframes to our GCS bucket. 
We’ll implement both `get_openlineage_facets_on_start()` and `get_openlineage_facets_on_complete()` for our custom operator. The specific details of the implementation will vary from operator to operator, but there will always be five basic steps that these functions will share. + +Both the methods return an `OperatorLineage` object, which itself is a collection of facets. Four of the five steps mentioned above are creating these facets where necessary, and the fifth is creating the `DataSourceDatasetFacet`. First, though, we’ll need to import some OpenLineage objects: + +```python +from openlineage.airflow.extractors.base import OperatorLineage +from openlineage.client.facet import ( + DataSourceDatasetFacet, + SchemaDatasetFacet, + SchemaField, +) +from openlineage.client.run import Dataset +``` + +Now, we’ll start building the facets for the `OperatorLineage` object in the `get_openlineage_facets_on_start()` method. + +### 1. `DataSourceDatasetFacet` + +The `DataSourceDatasestFacet` is a simple object, containing two fields, `name` and `uri`, which should be populated with the unique name of the data source and the URI. We’ll make two of these objects, an `input_source` to specify where the data came from and an `output_source` to specify where the data is going. + +A quick note about the philosophy behind the `name` and `uri` in the OpenLineage spec: the `uri` is built from the `namespace` and the `name`, and each is expected to be unique with respect to its environment. This means a `namespace` should be globally unique in the OpenLineage universe, and the `name` unique within the `namespace`. The two are then concatenated to form the `uri`, so that `uri = namespace + name`. The full naming spec can be found [here](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md). + +In our case, the input `name` will be the table we are pulling data from, `self.table`, and the `namespace` will be our `self.data_source`. + +```python +input_source = DataSourceDatasetFacet( + name=self.table, + uri="://".join([self.data_source, self.table]), +) +``` + +The output data source object’s `name` will always be the base path given to the operator, `self.base_path`. The `namespace` is always in GCS, so we use the OpenLineage spec’s `gs://` as the scheme and our bucket as the authority, giving us `gs://{ol_bucket}`. The `uri` is simply the concatenation of the two. + +```python +if not self.bucket: + ol_bucket = get_env_bucket() +else: + ol_bucket = self.bucket + +output_namespace = "gs://" + ol_bucket +output_name = self.base_path +output_uri = "/".join( + [ + output_namespace, + output_name, + ] +) + +output_source = DataSourceDatasetFacet( + name=output_name, + uri=output_uri, +) +``` + +### 2. Inputs + +Next we’ll create the input dataset object. As we are moving data from a dataframe to GCS in this operator, we’ll make sure that we are capturing all the info in the dataframe being extracted in a `Dataset`. To create the `Dataset` object, we’ll need `namespace`, `name`, and `facets` objects. The first two are strings, and `facets` is a dictionary. + +Our `namespace` will come from the operator, where we use `self.data_source` again. The `name` parameter for this facet will be the table, again coming from the operator’s parameter list. The `facets` will contain two entries, the first being our `DataSourceDatasetFacet` with the key "datasource" coming from the previous step and `input_source` being the value. 
The second has the key "schema", with the value being a `SchemaDatasetFacet`, which itself is a collection of `SchemaField` objects, one for each column, created via a list comprehension over the operator's `self.col_types` parameter. + +The `inputs` parameter to `OperatorLineage` is a list of `Dataset` objects, so we’ll end up adding a single `Dataset` object to the list later. The creation of the `Dataset` object looks like the following: + +```python +input_facet = { + "datasource": input_source, + "schema": SchemaDatasetFacet( + fields=[ + SchemaField(name=col_name, type=col_type) + for col_name, col_type in self.col_types.items() + ] + ), +} + +input = Dataset(namespace=self.data_source, name=self.table, facets=input_facet) +``` + +### 3. Outputs + +Our output facet will closely resemble the input facet, except it will use the `output_source` we previously created, and will also have a different `namespace`. Our output facet object will be built as follows: + +```python +output_facet = { + "datasource": output_source, + "schema": SchemaDatasetFacet( + fields=[ + SchemaField(name=col_name, type=col_type) + for col_name, col_type in self.col_types.items() + ] + ), +} + +output = Dataset( + namespace=output_namespace, + name=output_name, + facets=output_facet, +) +``` + +### 4. Job facets + +A Job in OpenLineage is a process definition that consumes and produces datasets. The Job evolves over time, and this change is captured when the Job runs. This means the facets we would want to capture in the Job level are independent of the state of the Job. Custom facets can be created to capture this Job data. For our operator, we went with pre-existing job facets, the `DocumentationJobFacet` and the `OwnershipJobFacet`: + +```python +job_facets = { + "documentation": DocumentationJobFacet( + description=f""" + Takes data from the data source {input_uri} + and puts it in GCS at the path: {output_uri} + """ + ), + "ownership": OwnershipJobFacet( + owners=[OwnershipJobFacetOwners(name=self.owner, type=self.email)] + ) +} +``` + +### 5. Run facets + +A Run is an instance of a Job execution. For example, when an Airflow Operator begins execution, the Run state of the OpenLineage Job transitions to Start, then to Running. When writing an emitter, this means a Run facet should contain information pertinent to the specific instance of the Job, something that could change every Run. + +In this example, we will output an error message when there is an empty dataframe, using the existing `ErrorMessageRunFacet`. + +```python +starting_facets.run_facets = { + "errorMessage": ErrorMessageRunFacet( + message="Empty dataframe, no artifact saved to GCS.", + programmingLanguage="python" + ) +} +``` + +### 6. On complete + +Finally, we’ll implement the `get_openlineage_metadata_on_complete()` method. Most of our work has already been done for us, so we will start by calling `get_openlineage_metadata_on_start()` and then modifying the returned object slightly before returning it again. The two main additions here are replacing the original `SchemaDatasetFacet` fields and adding a potential error message to the `run_facets`. + +For the `SchemaDatasetFacet` update, we replace the old fields facet with updated ones based on the now-filled-out `df_meta` dict, which is populated during the operator’s `execute()` method and is therefore unavailable to `get_openlineage_metadata_on_start()`. Because `df_meta` is already a list of `SchemaField` objects, we can set the property directly. 
Although we use a for loop here, the operator ensures only one dataframe will ever be extracted per execution, so the for loop will only ever run once and we therefore do not have to worry about multiple input dataframes updating. + +The `run_facets` update is performed only if there is an error, which is a mutually exclusive event to updating the fields facets. We pass the same message to this facet that is printed in the `execute()` method when an empty dataframe is found. This error message does not halt operator execution, as it gets added *****after***** execution, but it does create an alert in the Marquez UI. + +```python +def get_openlineage_facets_on_complete(self, task_instance): + """Add lineage to DfToGcsOperator on task completion.""" + starting_facets = self.get_openlineage_facets_on_start() + if task_instance.task.df_meta is not None: + for i in starting_facets.inputs: + i.facets["SchemaDatasetFacet"].fields = task_instance.task.df_meta + else: + starting_facets.run_facets = { + "errorMessage": ErrorMessageRunFacet( + message="Empty dataframe, no artifact saved to GCS.", + programmingLanguage="python" + ) + } + return starting_facets +``` + +And with that final piece of the puzzle, we have a working implementation of lineage extraction from our custom operator! + +### Custom Facets + +The OpenLineage spec might not contain all the facets you need to write your extractor, in which case you will have to make your own [custom facets](https://openlineage.io/docs/spec/facets/custom-facets). More on creating custom facets can be found [here](https://openlineage.io/blog/extending-with-facets/). + +### Testing + +For information about testing your implementation, see the doc on [testing custom extractors](https://openlineage.io/docs/integrations/airflow/extractors/extractor-testing). diff --git a/versioned_docs/version-1.26.0/integrations/airflow/extractors/_category_.json b/versioned_docs/version-1.26.0/integrations/airflow/extractors/_category_.json new file mode 100644 index 0000000..0064691 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/airflow/extractors/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Extractors", + "position": 6 +} diff --git a/versioned_docs/version-1.26.0/integrations/airflow/extractors/custom-extractors.md b/versioned_docs/version-1.26.0/integrations/airflow/extractors/custom-extractors.md new file mode 100644 index 0000000..9bb1280 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/airflow/extractors/custom-extractors.md @@ -0,0 +1,121 @@ +--- +sidebar_position: 1 +title: Custom Extractors +--- + +:::caution +This page is about Airflow's external integration that works mainly for Airflow versions \<2.7. +[If you're using Airflow 2.7+, look at native Airflow OpenLineage provider documentation.](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html)

+ +The ongoing development and enhancements will be focused on the `apache-airflow-providers-openlineage` package, +while the `openlineage-airflow` will primarily be updated for bug fixes. See [all Airflow versions supported by this integration](../older.md#supported-airflow-versions). +::: + +This integration works by detecting which Airflow operators your DAG is using, and extracting lineage data from them using corresponding extractors. + +However, not all operators are covered. In particular, operators from third-party providers may not be. To handle this situation, OpenLineage allows you to provide custom extractors for any operator that does not have one built in. + +If you want to extract lineage from your own Operators, you may prefer directly implementing [lineage support as described here](../default-extractors.md). + + +## Interface + +Custom extractors have to derive from `BaseExtractor`. + +Extractors have three methods to implement: `extract`, `extract_on_complete` and `get_operator_classnames`. +The last one is a classmethod that provides the list of operators your extractor can get lineage from. + +For example: + +```python +@classmethod +def get_operator_classnames(cls) -> List[str]: +    return ['PostgresOperator'] +``` + +If the name of the operator matches one of the names on the list, the extractor will be instantiated - with the operator +provided in the extractor's `self.operator` property - and both the `extract` and `extract_on_complete` methods will be called. +They are used to provide the actual lineage data. The difference is that `extract` is called before the operator's `execute` +method, while `extract_on_complete` is called after it. The latter can be used to extract any additional information that the operator +sets on its own properties. A good example is `SnowflakeOperator`, which sets `query_ids` after execution. + +Both methods return a `TaskMetadata` structure: + +```python +@attr.s +class TaskMetadata: +    name: str = attr.ib()  # deprecated +    inputs: List[Dataset] = attr.ib(factory=list) +    outputs: List[Dataset] = attr.ib(factory=list) +    run_facets: Dict[str, BaseFacet] = attr.ib(factory=dict) +    job_facets: Dict[str, BaseFacet] = attr.ib(factory=dict) +``` + +Inputs and outputs are lists of plain [OpenLineage datasets](../../../client/python.md). + +`run_facets` and `job_facets` are dictionaries of optional [RunFacets](../../../client/python.md) and [JobFacets](../../../client/python.md) that will be attached to the job - for example, +you might want to attach a `SqlJobFacet` if your operator is executing SQL. + +To learn more about facets in OpenLineage, please visit this [section](../../../spec/facets). + + +## Registering custom extractor + +The OpenLineage integration does not know that you've provided an extractor unless you register it. + +The way to do that is to add it to the `OPENLINEAGE_EXTRACTORS` environment variable. +``` +OPENLINEAGE_EXTRACTORS=full.path.to.ExtractorClass +``` + +If you have multiple custom extractors, separate the paths with a semicolon (`;`): +``` +OPENLINEAGE_EXTRACTORS=full.path.to.ExtractorClass;full.path.to.AnotherExtractorClass +``` + +Optionally, you can also separate them with whitespace, which is useful if you're providing them as part of some YAML file. + +``` +OPENLINEAGE_EXTRACTORS: >- +  full.path.to.FirstExtractor; +  full.path.to.SecondExtractor +``` + +Remember to make sure that the path is importable for both the scheduler and the worker.
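+Putting the interface and registration together, a minimal custom extractor could look like the following sketch. The operator name (`MyOperator`) and the dataset names are hypothetical placeholders, and a real extractor would derive them from the operator's attributes: + +```python +from typing import List, Optional + +from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata +from openlineage.client.run import Dataset + + +class MyOperatorExtractor(BaseExtractor): +    @classmethod +    def get_operator_classnames(cls) -> List[str]: +        # This extractor is used for every task whose operator class name is listed here. +        return ["MyOperator"] + +    def extract(self) -> Optional[TaskMetadata]: +        # Runs before the operator's execute(); self.operator is the operator instance. +        return TaskMetadata( +            name=f"{self.operator.dag_id}.{self.operator.task_id}", +            inputs=[Dataset(namespace="postgres://my-host:5432", name="public.source_table")], +            outputs=[Dataset(namespace="postgres://my-host:5432", name="public.target_table")], +        ) + +    def extract_on_complete(self, task_instance) -> Optional[TaskMetadata]: +        # Runs after execute(); use it when metadata is only known once the task has run. +        return self.extract() +``` + +Registering it is then a matter of setting `OPENLINEAGE_EXTRACTORS=full.path.to.MyOperatorExtractor`.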
+ +## Adding extractor to OpenLineage Airflow integration package + +All Openlineage extractors are defined in [this path](https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors). +In order to add new extractor you should put your code in this directory. Additionally, you need to add the class to `_extractors` list in [extractors.py](https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/extractors.py), e.g.: + +```python +_extractors = list( + filter( + lambda t: t is not None, + [ + try_import_from_string( + 'openlineage.airflow.extractors.postgres_extractor.PostgresExtractor' + ), + ... # other extractors are listed here ++ try_import_from_string( ++ 'openlineage.airflow.extractors.new_extractor.ExtractorClass' ++ ), + ] + ) +) +``` + +## Debugging issues + +There are two common problems associated with custom extractors. +First, is wrong path provided to `OPENLINEAGE_EXTRACTORS`. +The path needs to be exactly the same as one you'd use from your code. If the path is wrong or non-importable from worker, +plugin will fail to load the extractors and proper OpenLineage events for that operator won't be emitted. + +Second one, and maybe more insidious, are imports from Airflow. Due to the fact that OpenLineage code gets instantiated when +Airflow worker itself starts, any import from Airflow can be unnoticeably cyclical. This causes OpenLineage extraction to fail. + +To avoid this issue, import from Airflow only locally - in `extract` or `extract_on_complete` methods. If you need imports for +type checking, guard them behind `typing.TYPE_CHECKING`. + +You can also check [Development section](../../../development/developing/) to learn more about how to setup development environment and create tests. \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/integrations/airflow/extractors/extractor-testing.md b/versioned_docs/version-1.26.0/integrations/airflow/extractors/extractor-testing.md new file mode 100644 index 0000000..0a891cf --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/airflow/extractors/extractor-testing.md @@ -0,0 +1,109 @@ +--- +sidebar_position: 2 +title: Testing Custom Extractors +--- + +:::caution +This page is about Airflow's external integration that works mainly for Airflow versions \<2.7. +[If you're using Airflow 2.7+, look at native Airflow OpenLineage provider documentation.](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html)

+ +The ongoing development and enhancements will be focused on the `apache-airflow-providers-openlineage` package, +while the `openlineage-airflow` will primarily be updated for bug fixes. See [all Airflow versions supported by this integration](../older.md#supported-airflow-versions) +::: + +OpenLineage comes with a variety of extractors for Airflow operators out of the box, but not every operator is covered. And if you are using a custom operator you or your team wrote, you'll certainly need to write a custom extractor. This guide will walk you through how to set up testing in a local dev environment, the most important data structures to write tests for, unit testing private functions, and some notes on troubleshooting. + +We assume prior knowledge of writing custom extractors. For details on multiple ways to write extractors, check out the Astronomer blog on [extractors](https://www.astronomer.io/blog/3-ways-to-extract-data-lineage-from-airflow/#using-custom-extractors-for-airflow-operators). This post builds on [Pursuing Lineage from Airflow using Custom Extractors](https://openlineage.io/blog/extractors/), and it is recommended to read that post first. To learn more about how Operators and Extractors work together under the hood, check out this [guide](https://openlineage.io/blog/operators-and-extractors-technical-deep-dive/). + +## Testing set-up + +We’ll use the same extractor that we built in the blog post, the `RedshiftDataExtractor`. When testing an extractor, we want to verify a few different sets of assumptions. The first set of assumptions are about the `TaskMetadata` object being created, specifically verifying that the object is being built with the correct input and output datasets and relevant facets. This is done in OpenLineage via pytest, with appropriate mocking and patching for connections and objects. In the OpenLineage repository, extractor unit tests are found in under `[integration/airflow/tests](https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/tests)`. For custom extractors, these tests should go under a `tests` directory at the top level of your project hierarchy. + +![An Astro project directory structure, with extractors in an `extractors`/ folder under `include/`, and tests under a top-level `tests/` folder.](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/95581136-2c1e-496a-ba51-a9b70256e004/Untitled.png) + +An Astro project directory structure, with extractors in an `extractors`/ folder under `include/`, and tests under a top-level `tests/` folder. + +### Testing the TaskMetadata object + +For the `RedshiftDataExtractor`, this core extract test is actually run on `extract_on_complete()`, as the `extract()` method is empty. We’ll walk through a test function to see how we can ensure the output dataset is being built as expected (full test code [here](https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/tests/extractors/test_redshift_data_extractor.py)) + +```python +# First, we add patching to mock our connection to Redshift. +@mock.patch( + "airflow.providers.amazon.aws.operators.redshift_data.RedshiftDataOperator.hook", + new_callable=PropertyMock, +) +@mock.patch("botocore.client") +def test_extract_e2e(self, mock_client, mock_hook): + # Mock the descriptions we can expect from a real call. 
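+    # The JSON fixtures below stand in for the payloads returned by the Redshift Data API's +    # describe_statement and describe_table calls, so the extractor parses them exactly +    # as it would real responses.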
+ mock_client.describe_statement.return_value = self.read_file_json( + "tests/extractors/redshift_statement_details.json" + ) + mock_client.describe_table.return_value = self.read_file_json( + "tests/extractors/redshift_table_details.json" + ) + # Finish setting mock objects' expected values. + job_id = "test_id" + mock_client.execute_statement.return_value = {"Id": job_id} + mock_hook.return_value.conn = mock_client + + # Set the extractor and ensure that the extract() method is not returning anything, as expected. + extractor = RedshiftDataExtractor(self.task) + task_meta_extract = extractor.extract() + assert task_meta_extract is None + + # Run an instance of RedshiftDataOperator with the predefined test values. + self.ti.run() + + # Run extract_on_complete() with the task instance object. + task_meta = extractor.extract_on_complete(self.ti) + + # Assert that the correct job_id was used in the client call. + mock_client.describe_statement.assert_called_with(Id=job_id) + + # Assert there is a list of output datasets. + assert task_meta.outputs + # Assert there is only dataset in the list. + assert len(task_meta.outputs) == 1 + # Assert the output dataset name is the same as the table created by the operator query. + assert task_meta.outputs[0].name == "dev.public.fruit" + # Assert the output dataset has a parsed schema. + assert task_meta.outputs[0].facets["schema"].fields is not None + # Assert the datasource is the correct Redshift URI. + assert ( + task_meta.outputs[0].facets["dataSource"].name + == f"redshift://{CLUSTER_IDENTIFIER}.{REGION_NAME}:5439" + ) + # Assert the uri is None (as it already exists in dataSource). + assert task_meta.outputs[0].facets["dataSource"].uri is None + # Assert the schema fields match the number of fields of the table created by the operator query. + assert len(task_meta.outputs[0].facets["schema"].fields) == 3 + # Assert the output statistics match the results of the operator query. + assert ( + OutputStatisticsOutputDatasetFacet( + rowCount=1, + size=11, + ) == task_meta.outputs[0].facets['stats'] + ) +``` + +Most of the assertions above are straightforward, yet all are important in ensuring that no unexpected behavior occurs when building the metadata object. Testing each facet is important, as data or graphs in the UI can render incorrectly if the facets are wrong. For example, if the `task_meta.outputs[0].facets["dataSource"].name` is created incorrectly in the extractor, then the operator’s task will not show up in the lineage graph, creating a gap in pipeline observability. + +### Testing private functions + +Private functions with any complexity beyond returning a string should be unit tested as well. An example of this is the `_get_xcom_redshift_job_id()` private function in the `RedshiftDataExtractor`. The unit test is shown below: + +```python +@mock.patch("airflow.models.TaskInstance.xcom_pull") +def test_get_xcom_redshift_job_id(self, mock_xcom_pull): + self.extractor._get_xcom_redshift_job_id(self.ti) + mock_xcom_pull.assert_called_once_with(task_ids=self.ti.task_id) +``` + +Unit tests do not have to be particularly complex, and in this instance the single assertion is enough to cover the expected behavior that the function was called only once. + +### Troubleshooting + +Even with unit tests, an extractor may still not be operating as expected. The easiest way to tell if data isn’t coming through correctly is if the UI elements are not showing up correctly in the Lineage tab. 
+ +When testing code locally, Marquez can be used to inspect the data being emitted—or ***not*** being emitted. Using Marquez will allow you to figure out if the error is being caused by the extractor or the API. If data is being emitted from the extractor as expected but isn’t making it to the UI, then the extractor is fine and an issue should be opened up in OpenLineage. However, if data is not being emitted properly, it is likely that more unit tests are needed to cover extractor behavior. Marquez can help you pinpoint which facets are not being formed properly so you know where to add test coverage. diff --git a/versioned_docs/version-1.26.0/integrations/airflow/job-hierarchy.md b/versioned_docs/version-1.26.0/integrations/airflow/job-hierarchy.md new file mode 100644 index 0000000..c8491ad --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/airflow/job-hierarchy.md @@ -0,0 +1,22 @@ +--- +sidebar_position: 6 +title: Job Hierarchy +--- + +:::caution +This page is about Airflow's external integration that works mainly for Airflow versions \<2.7. +[If you're using Airflow 2.7+, look at native Airflow OpenLineage provider documentation.](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html)

+ +The ongoing development and enhancements will be focused on the `apache-airflow-providers-openlineage` package, +while the `openlineage-airflow` will primarily be updated for bug fixes. See [all Airflow versions supported by this integration](older.md#supported-airflow-versions) +::: + +## Job Hierarchy + +Apache Airflow features an inherent job hierarchy: DAGs, large and independently schedulable units, comprise smaller, executable tasks. + +OpenLineage reflects this structure in its Job Hierarchy model. +Upon DAG scheduling, a START event is emitted. +Subsequently, each task triggers START events at TaskInstance start and COMPLETE/FAILED events upon completion, following Airflow's task order. +Finally, upon DAG termination, a completion event (COMPLETE or FAILED) is emitted. +TaskInstance events' ParentRunFacet references the originating DAG run. \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/integrations/airflow/manual.md b/versioned_docs/version-1.26.0/integrations/airflow/manual.md new file mode 100644 index 0000000..b3a40d0 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/airflow/manual.md @@ -0,0 +1,101 @@ +--- +sidebar_position: 5 +title: Manually Annotated Lineage +--- + +:::caution +This page is about Airflow's external integration that works mainly for Airflow versions \<2.7. +[If you're using Airflow 2.7+, look at native Airflow OpenLineage provider documentation.](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html)

+ +The ongoing development and enhancements will be focused on the `apache-airflow-providers-openlineage` package, +while the `openlineage-airflow` will primarily be updated for bug fixes. See [all Airflow versions supported by this integration](older.md#supported-airflow-versions). +::: + +:::caution +This feature is only supported with Airflow versions greater than 2.1.0. +::: + +Airflow allows operators to track lineage by specifying the inputs and outputs of the operators via inlets and outlets. OpenLineage tries to find the input and output datasets of the Airflow job via the provided extractors or custom extractors. As a fallback, if it fails to find any input or output datasets, OpenLineage defaults to the inlets and outlets of Airflow jobs. + + +OpenLineage supports automated lineage extraction only for selected operators. For other operators and custom-defined ones, users need to write their own custom extractors (by implementing the `extract` / `extract_on_complete` methods) that indicate the input and output datasets of the corresponding task. +This can be circumvented by specifying the input and output datasets using the operator's inlets and outlets. OpenLineage will default to using inlets and outlets as input/output datasets if it cannot find any successful extraction from the extractors. + +When defining the DAG, inlets and outlets can be provided as lists of Tables for every operator. + +:::note +Airflow allows inlets and outlets to be a Table, Column, File or User entity. However, OpenLineage currently only extracts lineage via the Table entity. +::: + +## Example + +An operator inside the Airflow DAG can be annotated with inlets and outlets like this: + +```python +"""Example DAG demonstrating the usage of the extraction via Inlets and Outlets.""" + +import pendulum +import datetime + +from airflow import DAG +from airflow.operators.bash import BashOperator +from airflow.lineage.entities import Table, File + + +def create_table(cluster, database, name): +    return Table( +        database=database, +        cluster=cluster, +        name=name, +    ) + + +t1 = create_table("c1", "d1", "t1") +t2 = create_table("c1", "d1", "t2") +t3 = create_table("c1", "d1", "t3") +t4 = create_table("c1", "d1", "t4") +f1 = File(url="http://randomfile") + +with DAG( +    dag_id='example_operator', +    schedule_interval='0 0 * * *', +    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"), +    dagrun_timeout=datetime.timedelta(minutes=60), +    params={"example_key": "example_value"}, +) as dag: +    task1 = BashOperator( +        task_id='task_1_with_inlet_outlet', +        bash_command='echo "{{ task_instance_key_str }}" && sleep 1', +        inlets=[t1, t2], +        outlets=[t3], +    ) + +    task2 = BashOperator( +        task_id='task_2_with_inlet_outlet', +        bash_command='echo "{{ task_instance_key_str }}" && sleep 1', +        inlets=[t3, f1], +        outlets=[t4], +    ) + +    task1 >> task2 + +if __name__ == "__main__": +    dag.cli() +``` + +--- + +The corresponding lineage graph will be: + + +marquez_lineage + +(The image is shown in the **Marquez** UI, the metadata collector of OpenLineage events. More info can be found [here](https://marquezproject.github.io/marquez/).) + +Also note that the *File* entity is currently not captured by the lineage event. + +--- + +## Conversion from Airflow Table entity to OpenLineage Dataset + +The naming convention followed here is: +1. The `CLUSTER` of the table entity becomes the namespace of OpenLineage's Dataset +2.
The name of the dataset is formed by `{{DATABASE}}.{{NAME}}` where `DATABASE` and `NAME` are attributes specified by Airflow's Table entity. diff --git a/versioned_docs/version-1.26.0/integrations/airflow/older.md b/versioned_docs/version-1.26.0/integrations/airflow/older.md new file mode 100644 index 0000000..2054a5b --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/airflow/older.md @@ -0,0 +1,49 @@ +--- +sidebar_position: 2 +title: Supported Airflow versions +--- + +:::caution +This page is about Airflow's external integration that works mainly for Airflow versions \<2.7. +[If you're using Airflow 2.7+, look at native Airflow OpenLineage provider documentation.](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html)

+ +The ongoing development and enhancements will be focused on the `apache-airflow-providers-openlineage` package, +while the `openlineage-airflow` will primarily be updated for bug fixes. +::: + +#### SUPPORTED AIRFLOW VERSIONS + +##### Airflow 2.7+ + +This package **should not** be used starting with Airflow 2.7.0 and **can not** be used with Airflow 2.8+. +It was designed as Airflow's external integration that works mainly for Airflow versions \<2.7. +For Airflow 2.7+ use the native Airflow OpenLineage provider +[package](https://airflow.apache.org/docs/apache-airflow-providers-openlineage) `apache-airflow-providers-openlineage`. + +##### Airflow 2.3 - 2.6 + +The integration automatically registers itself starting from Airflow 2.3 if it's installed on the Airflow worker's Python. +This means you don't have to do anything besides configuring where the events are sent, which is described in the [configuration](#configuration) section. + +##### Airflow 2.1 - 2.2 + +> **_NOTE:_** The last version of openlineage-airflow to support Airflow versions 2.1-2.2 is **1.14.0** + +
+ +Integration for those versions has limitations: it does not support tracking failed jobs, +and job starts are registered only when a job ends (a `LineageBackend`-based approach collects all metadata +for a task on each task's completion). + +To make OpenLineage work, in addition to installing `openlineage-airflow` you need to set your `LineageBackend` +in your [airflow.cfg](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html) or via environmental variable `AIRFLOW__LINEAGE__BACKEND` to + +``` +openlineage.lineage_backend.OpenLineageBackend +``` + +The OpenLineageBackend does not take into account manually configured inlets and outlets. + +##### Airflow \<2.1 + +OpenLineage does not work with versions older than Airflow 2.1. diff --git a/versioned_docs/version-1.26.0/integrations/airflow/preflight-check-dag.md b/versioned_docs/version-1.26.0/integrations/airflow/preflight-check-dag.md new file mode 100644 index 0000000..421ec91 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/airflow/preflight-check-dag.md @@ -0,0 +1,338 @@ +--- +sidebar_position: 3 +title: Preflight check DAG +--- +# Preflight Check DAG + +## Purpose + +The preflight check DAG is created to verify the setup of OpenLineage within an Airflow environment. It checks the Airflow version, the version of the installed OpenLineage package, and the configuration settings read by the OpenLineage listener. This validation is crucial because, after setting up OpenLineage with Airflow and configuring necessary environment variables, users need confirmation that the setup is correctly done to start receiving OL events. + +## Configuration Variables + +The DAG introduces two configurable variables that users can set according to their requirements: + +- `BYPASS_LATEST_VERSION_CHECK`: Set this to `True` to skip checking for the latest version of the OpenLineage package. This is useful when accessing the PyPI URL is not possible or if users prefer not to upgrade. +- `LINEAGE_BACKEND`: This variable specifies the backend used for OpenLineage events ingestion. By default, it is set to `MARQUEZ`. Users utilizing a custom backend for OpenLineage should implement custom checks within the `_verify_custom_backend` function. + +## Implementation + +The DAG comprises several key functions, each designed to perform specific validations: + +1. **Version Checks**: It validates the installed OpenLineage package against the latest available version on PyPI, considering the `BYPASS_LATEST_VERSION_CHECK` flag. +2. **Airflow Version Compatibility**: Ensures that the Airflow version is compatible with OpenLineage. OpenLineage requires Airflow version 2.1 or newer. +3. **Transport and Configuration Validation**: Checks if necessary transport settings and configurations are set for OpenLineage to communicate with the specified backend. +4. **Backend Connectivity**: Verifies the connection to the specified `LINEAGE_BACKEND` to ensure that OpenLineage can successfully send events. +5. **Listener Accessibility and OpenLineage Plugin Checks**: Ensures that the OpenLineage listener is accessible and that OpenLineage is not disabled (by [environment variable](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html#:~:text=OPENLINEAGE_DISABLED%20is%20an%20equivalent%20of%20AIRFLOW__OPENLINEAGE__DISABLED.) or [config](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html#disable)). 
+ +### DAG Tasks + +The DAG defines three main tasks that sequentially execute the above validations: + +1. `validate_ol_installation`: Confirms that the OpenLineage installation is correct and up-to-date. +2. `is_ol_accessible_and_enabled`: Checks if OpenLineage is accessible and enabled within Airflow. +3. `validate_connection`: Verifies the connection to the specified lineage backend. + +### Setup and Execution + +To use this DAG: + +1. Ensure that OpenLineage is installed within your Airflow environment. +2. Set the necessary environment variables for OpenLineage, such as the namespace and the URL or transport mechanism using [provider package docs](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html) or [OL docs](https://openlineage.io/docs/integrations/airflow/usage). +3. Configure the `BYPASS_LATEST_VERSION_CHECK` and `LINEAGE_BACKEND` variables as needed. +4. Add the DAG file to your Airflow DAGs folder. +5. Trigger the DAG manually or just enable it and allow it to run once automatically based on its schedule (@once) to perform the preflight checks. + +## Preflight check DAG code +```python +from __future__ import annotations + +import logging +import os +import attr + +from packaging.version import Version + +from airflow import DAG +from airflow.configuration import conf +from airflow import __version__ as airflow_version +from airflow.operators.python import PythonOperator +from airflow.utils.dates import days_ago + +# Set this to True to bypass the latest version check for OpenLineage package. +# Version check will be skipped if unable to access PyPI URL +BYPASS_LATEST_VERSION_CHECK = False +# Update this to `CUSTOM` if using any other backend for OpenLineage events ingestion +# When using custom transport - implement custom checks in _verify_custom_backend function +LINEAGE_BACKEND = "MARQUEZ" + +log = logging.getLogger(__name__) + + +def _get_latest_package_version(library_name: str) -> Version | None: + try: + import requests + + response = requests.get(f"https://pypi.org/pypi/{library_name}/json") + response.raise_for_status() + version_string = response.json()["info"]["version"] + return Version(version_string) + except Exception as e: + log.error(f"Failed to fetch latest version for `{library_name}` from PyPI: {e}") + return None + + +def _get_installed_package_version(library_name) -> Version | None: + try: + from importlib.metadata import version + + return Version(version(library_name)) + except Exception as e: + raise ModuleNotFoundError(f"`{library_name}` is not installed") from e + + +def _provider_can_be_used() -> bool: + parsed_version = Version(airflow_version) + if parsed_version < Version("2.1"): + raise RuntimeError("OpenLineage is not supported in Airflow versions <2.1") + elif parsed_version >= Version("2.7"): + return True + return False + + +def validate_ol_installation() -> None: + library_name = "openlineage-airflow" + if _provider_can_be_used(): + library_name = "apache-airflow-providers-openlineage" + + library_version = _get_installed_package_version(library_name) + if Version(airflow_version) >= Version("2.10.0") and library_version < Version("1.8.0"): + raise ValueError( + f"Airflow version `{airflow_version}` requires `{library_name}` version >=1.8.0. 
" + f"Installed version: `{library_version}` " + f"Please upgrade the package using `pip install --upgrade {library_name}`" + ) + if BYPASS_LATEST_VERSION_CHECK: + log.info(f"Bypassing the latest version check for `{library_name}`") + return + + latest_version = _get_latest_package_version(library_name) + if latest_version is None: + log.warning(f"Failed to fetch the latest version for `{library_name}`. Skipping version check.") + return + + if library_version < latest_version: + raise ValueError( + f"`{library_name}` is out of date. " + f"Installed version: `{library_version}`, " + f"Required version: `{latest_version}`" + f"Please upgrade the package using `pip install --upgrade {library_name}` or set BYPASS_LATEST_VERSION_CHECK to True" + ) + + +def _is_transport_set() -> None: + transport = conf.get("openlineage", "transport", fallback="") + if transport: + raise ValueError( + "Transport value found: `%s`\n" + "Please check the format at " + "https://openlineage.io/docs/client/python/#built-in-transport-types", + transport, + ) + log.info("Airflow OL transport is not set.") + return + + +def _is_config_set(provider: bool = True) -> None: + if provider: + config_path = conf.get("openlineage", "config_path", fallback="") + else: + config_path = os.getenv("OPENLINEAGE_CONFIG", "") + + if config_path and not _check_openlineage_yml(config_path): + raise ValueError( + "Config file is empty or does not exist: `%s`", + config_path, + ) + + log.info("OL config is not set.") + return + + +def _check_openlineage_yml(file_path) -> bool: + file_path = os.path.expanduser(file_path) + if os.path.exists(file_path): + with open(file_path, "r") as file: + content = file.read() + if not content: + raise ValueError(f"Empty file: `{file_path}`") + raise ValueError( + f"File found at `{file_path}` with the following content: `{content}`. " + "Make sure there the configuration is correct." + ) + log.info("File not found: `%s`", file_path) + return False + + +def _check_http_env_vars() -> None: + from urllib.parse import urljoin + + final_url = urljoin(os.getenv("OPENLINEAGE_URL", ""), os.getenv("OPENLINEAGE_ENDPOINT", "")) + if final_url: + raise ValueError("OPENLINEAGE_URL and OPENLINEAGE_ENDPOINT are set to: %s", final_url) + log.info( + "OPENLINEAGE_URL and OPENLINEAGE_ENDPOINT are not set. 
" + "Please set up OpenLineage using documentation at " + "https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html" + ) + return + + +def _debug_missing_transport(): + if _provider_can_be_used(): + _is_config_set(provider=True) + _is_transport_set() + _is_config_set(provider=False) + _check_openlineage_yml("openlineage.yml") + _check_openlineage_yml("~/.openlineage/openlineage.yml") + _check_http_env_vars() + raise ValueError("OpenLineage is missing configuration, please refer to the OL setup docs.") + + +def _is_listener_accessible(): + if _provider_can_be_used(): + try: + from airflow.providers.openlineage.plugins.openlineage import OpenLineageProviderPlugin as plugin + except ImportError as e: + raise ValueError("OpenLineage provider is not accessible") from e + else: + try: + from openlineage.airflow.plugin import OpenLineagePlugin as plugin + except ImportError as e: + raise ValueError("OpenLineage is not accessible") from e + + if len(plugin.listeners) == 1: + return True + + return False + + +def _is_ol_disabled(): + if _provider_can_be_used(): + try: + # apache-airflow-providers-openlineage >= 1.7.0 + from airflow.providers.openlineage.conf import is_disabled + except ImportError: + # apache-airflow-providers-openlineage < 1.7.0 + from airflow.providers.openlineage.plugins.openlineage import _is_disabled as is_disabled + else: + from openlineage.airflow.plugin import _is_disabled as is_disabled + + if is_disabled(): + if _provider_can_be_used() and conf.getboolean("openlineage", "disabled", fallback=False): + raise ValueError("OpenLineage is disabled in airflow.cfg: openlineage.disabled") + elif os.getenv("OPENLINEAGE_DISABLED", "false").lower() == "true": + raise ValueError( + "OpenLineage is disabled due to the environment variable OPENLINEAGE_DISABLED" + ) + raise ValueError( + "OpenLineage is disabled because required config/env variables are not set. 
" + "Please refer to " + "https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html" + ) + return False + + +def _get_transport(): + if _provider_can_be_used(): + from airflow.providers.openlineage.plugins.openlineage import OpenLineageProviderPlugin + transport = OpenLineageProviderPlugin().listeners[0].adapter.get_or_create_openlineage_client().transport + else: + from openlineage.airflow.plugin import OpenLineagePlugin + transport = ( + OpenLineagePlugin.listeners[0].adapter.get_or_create_openlineage_client().transport + ) + return transport + +def is_ol_accessible_and_enabled(): + if not _is_listener_accessible(): + _is_ol_disabled() + + try: + transport = _get_transport() + except Exception as e: + raise ValueError("There was an error when trying to build transport.") from e + + if transport is None or transport.kind in ("noop", "console"): + _debug_missing_transport() + + +def validate_connection(): + transport = _get_transport() + config = attr.asdict(transport.config) + verify_backend(LINEAGE_BACKEND, config) + + +def verify_backend(backend_type: str, config: dict): + backend_type = backend_type.lower() + if backend_type == "marquez": + return _verify_marquez_http_backend(config) + elif backend_type == "atlan": + return _verify_atlan_http_backend(config) + elif backend_type == "custom": + return _verify_custom_backend(config) + raise ValueError(f"Unsupported backend type: {backend_type}") + + +def _verify_marquez_http_backend(config): + log.info("Checking Marquez setup") + ol_url = config["url"] + ol_endpoint = config["endpoint"] # "api/v1/lineage" + marquez_prefix_path = ol_endpoint[: ol_endpoint.rfind("/") + 1] # "api/v1/" + list_namespace_url = ol_url + "/" + marquez_prefix_path + "namespaces" + import requests + + try: + response = requests.get(list_namespace_url) + response.raise_for_status() + except Exception as e: + raise ConnectionError(f"Failed to connect to Marquez at `{list_namespace_url}`") from e + log.info("Airflow is able to access the URL") + + +def _verify_atlan_http_backend(config): + raise NotImplementedError("This feature is not implemented yet") + + +def _verify_custom_backend(config): + raise NotImplementedError("This feature is not implemented yet") + + +with DAG( + dag_id="openlineage_preflight_check_dag", + start_date=days_ago(1), + description="A DAG to check OpenLineage setup and configurations", + schedule_interval="@once", +) as dag: + validate_ol_installation_task = PythonOperator( + task_id="validate_ol_installation", + python_callable=validate_ol_installation, + ) + + is_ol_accessible_and_enabled_task = PythonOperator( + task_id="is_ol_accessible_and_enabled", + python_callable=is_ol_accessible_and_enabled, + ) + + validate_connection_task = PythonOperator( + task_id="validate_connection", + python_callable=validate_connection, + ) + + validate_ol_installation_task >> is_ol_accessible_and_enabled_task + is_ol_accessible_and_enabled_task >> validate_connection_task +``` + +## Conclusion + +The OpenLineage Preflight Check DAG serves as a vital tool for ensuring that the OpenLineage setup within Airflow is correct and fully operational. By following the instructions and configurations documented here, users can confidently verify their setup and start utilizing OpenLineage for monitoring and managing data lineage within their Airflow workflows. 
diff --git a/versioned_docs/version-1.26.0/integrations/airflow/usage.md b/versioned_docs/version-1.26.0/integrations/airflow/usage.md new file mode 100644 index 0000000..4815b21 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/airflow/usage.md @@ -0,0 +1,83 @@ +--- +sidebar_position: 1 +title: Using the Airflow Integration +--- + +:::caution +This page is about Airflow's external integration that works mainly for Airflow versions \<2.7. +[If you're using Airflow 2.7+, look at native Airflow OpenLineage provider documentation.](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html)

+ +The ongoing development and enhancements will be focused on the `apache-airflow-providers-openlineage` package, +while the `openlineage-airflow` will primarily be updated for bug fixes. See [all Airflow versions supported by this integration](older.md#supported-airflow-versions). +::: + +#### PREREQUISITES + +- [Python 3.8](https://www.python.org/downloads) +- [Airflow >= 2.1,\<2.8](https://pypi.org/project/apache-airflow) + +To use the OpenLineage Airflow integration, you'll need a running [Airflow instance](https://airflow.apache.org/docs/apache-airflow/stable/start.html). You'll also need an OpenLineage-compatible [backend](https://github.com/OpenLineage/OpenLineage#scope). + +#### INSTALLATION + +Before installing, check the [supported Airflow versions](older.md#supported-airflow-versions). + +To download and install the latest `openlineage-airflow` library, run: + +```bash +$ pip3 install openlineage-airflow +``` + +You can also add `openlineage-airflow` to your `requirements.txt` for Airflow. + +To install from source, run: + +```bash +$ python3 setup.py install +``` + +#### CONFIGURATION + +Next, specify where you want OpenLineage to send events. + +We recommend configuring the client with an `openlineage.yml` file that tells the client how to connect to an OpenLineage backend. +[See how to do it.](../../client/python.md#configuration) + +The simplest option, limited to the HTTP client, is to use environment variables. +For example, to send OpenLineage events to a local instance of [Marquez](https://github.com/MarquezProject/marquez), use: + +```bash +OPENLINEAGE_URL=http://localhost:5000 +OPENLINEAGE_ENDPOINT=api/v1/lineage # This is the default value when this variable is not set, it can be omitted in this example +OPENLINEAGE_API_KEY=secret_token # This is only required if authentication headers are required, it can be omitted in this example +``` + +To set up an additional configuration, or to send events to targets other than an HTTP server (e.g., a Kafka topic), [configure a client.](../../client/python.md#configuration) + +> **_NOTE:_** If you use a version of Airflow older than 2.3.0, [additional configuration is required](older.md#airflow-21---22). + +##### Environment Variables + +The following environment variables are available specifically for the Airflow integration, in addition to the [Python client variables](../../client/python.md#environment-variables). + +| Name | Description | Example | +|-----------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------| +| OPENLINEAGE_AIRFLOW_DISABLE_SOURCE_CODE | Set to `True` if you do NOT want the source code of callables provided in PythonOperator or BashOperator to be included in OpenLineage events. | False | +| OPENLINEAGE_EXTRACTORS | The optional list of extractor classes (as a semicolon-separated string) in case you need to use custom extractors. | full.path.to.ExtractorClass;full.path.to.AnotherExtractorClass | +| OPENLINEAGE_NAMESPACE | The optional namespace that the lineage data belongs to. If not specified, defaults to `default`. | my_namespace | +| OPENLINEAGE_AIRFLOW_LOGGING | Logging level of the OpenLineage client in Airflow (the OPENLINEAGE_CLIENT_LOGGING variable from the Python client has no effect here).
| DEBUG | + +For backwards compatibility, `openlineage-airflow` also supports configuration via +`MARQUEZ_NAMESPACE`, `MARQUEZ_URL` and `MARQUEZ_API_KEY` variables, instead of standard +`OPENLINEAGE_NAMESPACE`, `OPENLINEAGE_URL` and `OPENLINEAGE_API_KEY`. +Variables with different prefix should not be mixed together. + + +#### USAGE + +When enabled, the integration will: + +* On TaskInstance **start**, collect metadata for each task. +* Collect task input / output metadata (source, schema, etc.). +* Collect task run-level metadata (execution time, state, parameters, etc.) +* On TaskInstance **complete**, also mark the task as complete in Marquez. diff --git a/versioned_docs/version-1.26.0/integrations/dbt.md b/versioned_docs/version-1.26.0/integrations/dbt.md new file mode 100644 index 0000000..beca93c --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/dbt.md @@ -0,0 +1,61 @@ +--- +sidebar_position: 5 +title: dbt +--- + +dbt (data build tool) is a powerful transformation engine. It operates on data already within a warehouse, making it easy for data engineers to build complex pipelines from the comfort of their laptops. While it doesn’t perform extraction and loading of data, it’s extremely powerful at transformations. + +To learn more about dbt, visit the [documentation site](https://docs.getdbt.com) or run through the [getting started tutorial](https://docs.getdbt.com/tutorial/setting-up). + +## How does dbt work with OpenLineage? + +Fortunately, dbt already collects a lot of the data required to create and emit OpenLineage events. When it runs, it creates a `target/manifest.json` file containing information about jobs and the datasets they affect, and a `target/run_results.json` file containing information about the run-cycle. These files can be used to trace lineage and job performance. In addition, by using the `create catalog` command, a user can instruct dbt to create a `target/catalog.json` file containing information about dataset schemas. + +These files contain everything needed to trace lineage. However, the `target/manifest.json` and `target/run_results.json` files are only populated with comprehensive metadata after completion of a run-cycle. + +This integration is implemented as a wrapper script, `dbt-ol`, that calls `dbt` and, after the run has completed, collects information from the three json files and calls the OpenLineage API accordingly. For most users, enabling OpenLineage metadata collection can be accomplished by simply substituting `dbt-ol` for `dbt` when performing a run. + +## Preparing a dbt project for OpenLineage + +First, we need to install the integration: + +```bash +pip3 install openlineage-dbt +``` + +Next, we specify where we want dbt to send OpenLineage events by setting the `OPENLINEAGE_URL` environment variable. For example, to send OpenLineage events to a local instance of Marquez, use: + +```bash +OPENLINEAGE_URL=http://localhost:5000 +``` + +Finally, we can optionally specify a namespace where the lineage events will be stored. For example, to use the namespace "dev": + +```bash +OPENLINEAGE_NAMESPACE=dev +``` + +## Running dbt with OpenLineage + +To run your dbt project with OpenLineage collection, simply replace `dbt` with `dbt-ol`: + +```bash +dbt-ol run +``` + +The `dbt-ol` wrapper supports all of the standard `dbt` subcommands, and is safe to use as a substitutuon (i.e., in an alias). 
Once the run has completed, you will see output containing the number of events sent via the OpenLineage API: + +```bash +Completed successfully + +Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2 +Emitted 4 openlineage events +``` + +## Where can I learn more? + +* Watch [a short demonstration of the integration in action](https://youtu.be/7caHXLDKacg) + +## Feedback + +What did you think of this guide? You can reach out to us on [slack](https://join.slack.com/t/openlineage/shared_invite/zt-2u4oiyz5h-TEmqpP4fVM5eCdOGeIbZvA) and leave us feedback! \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/integrations/flink.md b/versioned_docs/version-1.26.0/integrations/flink.md new file mode 100644 index 0000000..871b94e --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/flink.md @@ -0,0 +1,138 @@ +--- +sidebar_position: 4 +title: Apache Flink +--- + + +**Apache Flink** is one of the most popular stream processing frameworks. Apache Flink jobs run on clusters, +which are composed of two types of nodes: `TaskManagers` and `JobManagers`. While clusters typically consist of +multiple `TaskManagers`, the only reason to run multiple `JobManagers` is high availability. Jobs are _submitted_ +to the `JobManager` by the `JobClient`, which compiles the user application into a dataflow graph that the `JobManager` can understand. +The `JobManager` then coordinates job execution: it distributes the parallel units of a job +to the `TaskManagers`, manages heartbeats, triggers checkpoints, reacts to failures and much more. + +Apache Flink has multiple deployment modes - Session Mode, Application Mode and Per-Job Mode. The most popular +are Session Mode and Application Mode. In Session Mode, a single `JobManager` manages multiple jobs that share one +Flink cluster. In this mode, the `JobClient` is executed on the machine that submits the job to the cluster. + +Application Mode is used where a cluster is utilized for a single job. In this mode, the `JobClient`, where the main method runs, +is executed on the `JobManager`. + +Flink jobs read data from `Sources` and write data to `Sinks`. In contrast to systems like Apache Spark, Flink jobs can write +data to multiple places - they can have multiple `Sinks`. + +## Getting lineage from Flink + +OpenLineage utilizes Flink's `JobListener` interface. This interface is used by Flink to notify the user of job submission, +successful completion of a job, or job failure. Implementations of this interface are executed on the `JobClient`. + +When the OpenLineage listener receives the information that a job was submitted, it extracts `Transformations` from the job's +`ExecutionEnvironment`. The `Transformations` represent logical operations in the dataflow graph; they are composed +of both Flink's built-in operators and user-provided `Sources`, `Sinks` and functions. To get the lineage, +the OpenLineage integration processes the dataflow graph. Currently, OpenLineage is interested only in information contained +in `Sources` and `Sinks`, as they are the places where Flink interacts with external systems. + +After job submission, the OpenLineage integration starts actively listening to checkpoints - this gives insight into +whether the job runs properly. + +## Limitations + +Currently, OpenLineage's Flink integration is limited to getting information from jobs running in Application Mode. + +The OpenLineage integration extracts lineage only from the following `Sources` and `Sinks`: +
+| Sources | Sinks | +| --- | --- | +| KafkaSource | KafkaSink (1) | +| FlinkKafkaConsumer | FlinkKafkaProducer | +| IcebergFlinkSource | IcebergFlinkSink |
+ +We expect this list to grow as we add support for more connectors. + +(1) KafkaSink supports sinks that write to a single topic as well as multi-topic sinks. The +limitation for multi-topic sinks is that the topics need to have the same schema and the implementation +of `KafkaRecordSerializationSchema` must extend `KafkaTopicsDescriptor`. +The `isFixedTopics` and `getFixedTopics` methods from `KafkaTopicsDescriptor` are used to extract multiple topics +from a sink. + +## Usage + +In your job, you need to set up `OpenLineageFlinkJobListener`. + +For example: +```java +JobListener listener = OpenLineageFlinkJobListener.builder() +    .executionEnvironment(streamExecutionEnvironment) +    .build(); +streamExecutionEnvironment.registerJobListener(listener); +``` + +Also, OpenLineage needs certain parameters to be set in `flink-conf.yaml`: +
| Configuration Key | Description | Expected Value | Default |
|---------------------|--------------------------------------------------------------------------------|----------------|---------|
| execution.attached | This setting needs to be true if OpenLineage is to detect job start and failure | true | false |
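For orientation, below is a minimal end-to-end sketch of a job class that registers the listener from the Usage section. The `io.openlineage.flink` import path and the programmatic `execution.attached` setting are assumptions shown for illustration only — in Application Mode this flag is normally supplied through `flink-conf.yaml` as described above.

```java
import io.openlineage.flink.OpenLineageFlinkJobListener;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.execution.JobListener;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LineageEnabledJob {
  public static void main(String[] args) throws Exception {
    // Illustration only: execution.attached is normally set in flink-conf.yaml,
    // but it must be true for OpenLineage to observe job start and failure.
    Configuration conf = new Configuration();
    conf.setBoolean("execution.attached", true);

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);

    // Register the OpenLineage listener exactly as in the Usage section above.
    JobListener listener = OpenLineageFlinkJobListener.builder()
        .executionEnvironment(env)
        .build();
    env.registerJobListener(listener);

    // ... define sources, transformations and sinks here, then:
    // env.execute("lineage-enabled-job");
  }
}
```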
+ +OpenLineage jar needs to be present on `JobManager`. + +When the `JobListener` is configured, you need to point the OpenLineage integration where the events should end up. +If you're using `Marquez`, simplest way to do that is to set up `OPENLINEAGE_URL` environment +variable to `Marquez` URL. More advanced settings are [in the client documentation.](../client/java/java.md). + +## Configuring Openlineage connector + +Flink Openlineage connector utilizes standard [Java client for Openlineage](https://github.com/OpenLineage/OpenLineage/tree/main/client/java) +and allows all the configuration features present there to be used. The configuration can be passed with: + * `openlineage.yml` file with a environment property `OPENLINEAGE_CONFIG` being set and pointing to configuration file. File structure and allowed options are described [here](https://github.com/OpenLineage/OpenLineage/tree/main/client/java#configuration). + * Standard Flink configuration with the parameters defined below. + +### Flink Configuration parameters + +The following parameters can be specified: + +| Parameter | Definition | Example | +|-------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------| +| openlineage.transport.type | The transport type used for event emit, default type is `console` | http | +| openlineage.facets.disabled | List of facets to disable, enclosed in `[]` (required from 0.21.x) and separated by `;`, default is `[spark_unknown;spark.logicalPlan;]` (currently must contain `;`) | \[some_facet1;some_facet1\] | +| openlineage.job.owners.\ | Specifies ownership of the job. Multiple entries with different types are allowed. Config key name and value are used to create job ownership type and name (available since 1.13). | openlineage.job.owners.team="Some Team" | + +## Transports + +import Transports from '@site/docs/client/java/partials/java_transport.md'; + + + +## Circuit Breakers + +import CircuitBreakers from '@site/docs/client/java/partials/java_circuit_breaker.md'; + + + diff --git a/versioned_docs/version-1.26.0/integrations/great-expectations.md b/versioned_docs/version-1.26.0/integrations/great-expectations.md new file mode 100644 index 0000000..e371556 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/great-expectations.md @@ -0,0 +1,86 @@ +--- +sidebar_position: 6 +title: Great Expectations +--- + +Great Expectations is a robust data quality tool. It runs suites of checks, called expectations, over a defined dataset. This dataset can be a table in a database, or a Spark or Pandas dataframe. Expectations are run by checkpoints, which are configuration files that describe not just the expectations to use, but also any batching, runtime configurations, and, importantly, the action list of actions run after the expectation suite completes. + +To learn more about Great Expectations, visit their [documentation site](https://docs.greatexpectations.io/docs/). + +## How does Great Expectations work with OpenLineage? + +Great Expecations integrates with OpenLineage through the action list in a checkpoint. An OpenLineage action can be specified, which is triggered when all expectations are run. Data from the checkpoint is sent to OpenLineage, which can then be viewed in Marquez or Datakin. 
+ +## Preparing a Great Expectations project for OpenLineage + +First, we specify where we want Great Expectations to send OpenLineage events by setting the `OPENLINEAGE_URL` environment variable. For example, to send OpenLineage events to a local instance of Marquez, use: + +```bash +OPENLINEAGE_URL=http://localhost:5000 +``` + +If data is being sent to an endpoint with an API key, then that key must be supplied as well: + +```bash +OPENLINEAGE_API_KEY=123456789 +``` + +We can optionally specify a namespace where the lineage events will be stored. For example, to use the namespace "dev": + +```bash +OPENLINEAGE_NAMESPACE=dev +``` + +With these environment variables set, we can add the OpenLineage action to the action list of the Great Expecations checkpoint. +Note: this must be done for *each* checkpoint. +Note: when using the `GreatExpectationsOperator>=0.2.0` in Airflow, there is a boolean parameter, defaulting to `True`, that will automatically create this action list item when it detects the OpenLineage environment specified in the previous step. + + +In a python checkpoint, this looks like: + +```python +action_list = [ + { + "name": "store_validation_result", + "action": {"class_name": "StoreValidationResultAction"}, + }, + { + "name": "store_evaluation_params", + "action": {"class_name": "StoreEvaluationParametersAction"}, + }, + { + "name": "update_data_docs", + "action": {"class_name": "UpdateDataDocsAction", "site_names": []}, + }, + { + "name": "open_lineage", + "action": { + "class_name": "OpenLineageValidationAction", + "module_name": "openlineage.common.provider.great_expectations", + "openlineage_host": os.getenv("OPENLINEAGE_URL"), + "openlineage_apiKey": os.getenv("OPENLINEAGE_API_KEY"), + "openlineage_namespace": oss.getenv("OPENLINEAGE_NAMESPACE"), + "job_name": "openlineage_job", + }, + } +] +``` + +And in yaml: + +```yaml +name: openlineage + action: + class_name: OpenLineageValidationAction + module_name: openlineage.common.provider.great_expectations + openlineage_host: + openlineage_apiKey: + openlineage_namespace: # Replace with your job namespace; we recommend a meaningful namespace like `dev` or `prod`, etc. + job_name: validate_my_dataset +``` + +Then run your Great Expecations checkpoint with the CLI or your integration of choice. + +## Feedback + +What did you think of this guide? You can reach out to us on [slack](https://join.slack.com/t/openlineage/shared_invite/zt-2u4oiyz5h-TEmqpP4fVM5eCdOGeIbZvA) and leave us feedback! 
diff --git a/versioned_docs/version-1.26.0/integrations/integrate-datasources.svg b/versioned_docs/version-1.26.0/integrations/integrate-datasources.svg new file mode 100644 index 0000000..9cc12a5 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/integrate-datasources.svg @@ -0,0 +1,127 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/versioned_docs/version-1.26.0/integrations/integrate-pipelines.svg b/versioned_docs/version-1.26.0/integrations/integrate-pipelines.svg new file mode 100644 index 0000000..8b75824 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/integrate-pipelines.svg @@ -0,0 +1,130 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/versioned_docs/version-1.26.0/integrations/spark/_category_.json b/versioned_docs/version-1.26.0/integrations/spark/_category_.json new file mode 100644 index 0000000..07472be --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Apache Spark", + "position": 3 +} diff --git a/versioned_docs/version-1.26.0/integrations/spark/configuration/_category_.json b/versioned_docs/version-1.26.0/integrations/spark/configuration/_category_.json new file mode 100644 index 0000000..918fb7c --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/configuration/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Configuration", + "position": 3 +} diff --git a/versioned_docs/version-1.26.0/integrations/spark/configuration/airflow.md b/versioned_docs/version-1.26.0/integrations/spark/configuration/airflow.md new file mode 100644 index 0000000..086e15e --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/configuration/airflow.md @@ -0,0 +1,61 @@ +--- +sidebar_position: 4 +title: Scheduling from Airflow +--- + + +The same parameters that are passed to `spark-submit` can also be supplied directly from **Airflow** +and other schedulers, allowing for seamless configuration and execution of Spark jobs. + +When using the [`OpenLineage Airflow`](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html) +integration with operators that submit Spark jobs, the entire Spark OpenLineage integration can be configured +directly within Airflow. + +### Preserving Job Hierarchy + +To establish a correct job hierarchy in lineage tracking, the Spark application and lineage backend require +identifiers of the parent job that triggered the Spark job. These identifiers allow the Spark integration +to automatically add a `parentRunFacet` to the application-level OpenLineage event, facilitating the linkage +of the Spark job to its originating (Airflow) job in the lineage graph. + +The following properties are necessary for the automatic creation of the `parentRunFacet`: + +- `spark.openlineage.parentJobNamespace` +- `spark.openlineage.parentJobName` +- `spark.openlineage.parentRunId` + +Refer to the [Spark Configuration](spark_conf.md) documentation for more information on these properties. 
+ +OpenLineage Airflow integration provides powerful [macros](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/macros.html) +that can be used to dynamically generate these identifiers. + +### Example + +Below is an example of a `DataprocSubmitJobOperator` that submits a PySpark application to Dataproc cluster: + +```python +t1 = DataprocSubmitJobOperator( + task_id="task_id", + project_id="project_id", + region='eu-central2', + job={ + "reference": {"project_id": "project_id"}, + "placement": {"cluster_name": "cluster_name"}, + "pyspark_job": { + "main_python_file_uri": "gs://bucket/your-prog.py", + "properties": { + "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener", + "spark.jars.packages": "io.openlineage:openlineage-spark_${SCALA_BINARY_VERSION}:{{PREPROCESSOR:OPENLINEAGE_VERSION}}", + "spark.openlineage.transport.url": openlineage_url, + "spark.openlineage.transport.auth.apiKey": api_key, + "spark.openlineage.transport.auth.type": "apiKey", + "spark.openlineage.namespace": openlineage_spark_namespace, + "spark.openlineage.parentJobNamespace": "{{ macros.OpenLineageProviderPlugin.lineage_job_namespace() }}", + "spark.openlineage.parentJobName": "{{ macros.OpenLineageProviderPlugin.lineage_job_name(task_instance) }}", + "spark.openlineage.parentRunId": "{{ macros.OpenLineageProviderPlugin.lineage_run_id(task_instance) }}", + } + }, + }, + dag=dag +) +``` \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/integrations/spark/configuration/circuit_breaker.md b/versioned_docs/version-1.26.0/integrations/spark/configuration/circuit_breaker.md new file mode 100644 index 0000000..1f723a5 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/configuration/circuit_breaker.md @@ -0,0 +1,8 @@ +--- +sidebar_position: 3 +title: Circuit Breaker +--- + +import CircuitBreakers from '@site/docs/client/java/partials/java_circuit_breaker.md'; + + diff --git a/versioned_docs/version-1.26.0/integrations/spark/configuration/spark_conf.md b/versioned_docs/version-1.26.0/integrations/spark/configuration/spark_conf.md new file mode 100644 index 0000000..dfc4182 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/configuration/spark_conf.md @@ -0,0 +1,27 @@ +--- +sidebar_position: 2 +title: Spark Config Parameters +--- + + +The following parameters can be specified: + +| Parameter | Definition | Example | +|-------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------| +| spark.openlineage.transport.type | The transport type used for event emit, default type is `console` | http | +| spark.openlineage.namespace | The default namespace to be applied for any jobs submitted | MyNamespace | +| spark.openlineage.parentJobNamespace | The job namespace to be used for the parent job facet | ParentJobNamespace | +| spark.openlineage.parentJobName | The job name to be used for the parent job facet | ParentJobName | +| spark.openlineage.parentRunId | The RunId of the parent job that initiated this Spark job | xxxx-xxxx-xxxx-xxxx | +| spark.openlineage.appName | Custom value overwriting Spark app name in events | AppName | +| spark.openlineage.facets.disabled | **Deprecated: Use the property 
`spark.openlineage.facets.disabled` instead**. List of facets to filter out from the events, enclosed in `[]` (required from 0.21.x) and separated by `;`, default is `[]` | \[columnLineage;\] | +| spark.openlineage.facets.<facet name>.disabled | If set to true, it disables the specific facet. The default value is `false`. The name of the facet can be hierarchical. The facets disabled by default are `debug`, `spark.logicalPlan` and `spark_unknown`. You have to switch the flag to `false` to enable them. | true | +| spark.openlineage.facets.variables | List of environment variables (System.getenv() | \[columnLineage;\] | +| spark.openlineage.capturedProperties | comma separated list of properties to be captured in spark properties facet (default `spark.master`, `spark.app.name`) | "spark.example1,spark.example2" | +| spark.openlineage.dataset.removePath.pattern | Java regular expression that removes `?` named group from dataset path. Can be used to last path subdirectories from paths like `s3://my-whatever-path/year=2023/month=04` | `(.*)(?\/.*\/.*)` | +| spark.openlineage.jobName.appendDatasetName | Decides whether output dataset name should be appended to job name. By default `true`. | false | +| spark.openlineage.jobName.replaceDotWithUnderscore | Replaces dots in job name with underscore. Can be used to mimic legacy behaviour on Databricks platform. By default `false`. | false | +| spark.openlineage.debugFacet | Determines whether debug facet shall be generated and included within the event. Set `enabled` to turn it on. By default, facet is disabled. | enabled | +| spark.openlineage.job.owners.\ | Specifies ownership of the job. Multiple entries with different types are allowed. Config key name and value are used to create job ownership type and name (available since 1.13). | spark.openlineage.job.owners.team="Some Team" | +| spark.openlineage.columnLineage.datasetLineageEnabled | Makes the dataset dependencies to be included in their own property `dataset` in the column lineage pattern. If this flag is set to `false`, then the dataset dependencies are merged into `fields` property. The default value is `false`. **It is recommended to set it to `true`** | true | +| spark.openlineage.vendors.iceberg.metricsReporterDisabled | Disables metrics reporter for Iceberg which turns off mechanism to collect scan and commit reports. | false | diff --git a/versioned_docs/version-1.26.0/integrations/spark/configuration/transport.md b/versioned_docs/version-1.26.0/integrations/spark/configuration/transport.md new file mode 100644 index 0000000..49c0369 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/configuration/transport.md @@ -0,0 +1,8 @@ +--- +sidebar_position: 2 +title: Transport +--- + +import Transports from '@site/docs/client/java/partials/java_transport.md'; + + \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/integrations/spark/configuration/usage.md b/versioned_docs/version-1.26.0/integrations/spark/configuration/usage.md new file mode 100644 index 0000000..6ef8cfa --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/configuration/usage.md @@ -0,0 +1,197 @@ +--- +sidebar_position: 1 +title: Usage +--- + + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +Configuring the OpenLineage Spark integration is straightforward. It uses built-in Spark configuration mechanisms. 
However, for **Databricks users**, special considerations are required to ensure compatibility and avoid breaking the Spark UI after a cluster shutdown. + +Your options are: + +1. [Setting the properties directly in your application](#setting-the-properties-directly-in-your-application). +2. [Using `--conf` options with the CLI](#using---conf-options-with-the-cli). +3. [Adding properties to the `spark-defaults.conf` file in the `${SPARK_HOME}/conf` directory](#adding-properties-to-the-spark-defaultsconf-file-in-the-spark_homeconf-directory). + +#### Setting the properties directly in your application + +The below example demonstrates how to set the properties directly in your application when +constructing +a `SparkSession`. + +:::warning +The setting `config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")` is +**extremely important**. Without it, the OpenLineage Spark integration will not be invoked, rendering +the integration ineffective. +::: + +:::note +Databricks For Databricks users, you must include `com.databricks.backend.daemon.driver.DBCEventLoggingListener` in addition to `io.openlineage.spark.agent.OpenLineageSparkListener` in the `spark.extraListeners` setting. Failure to do so will make the Spark UI inaccessible after a cluster shutdown. +::: + + + + +```scala +import org.apache.spark.sql.SparkSession + +object OpenLineageExample extends App { + val spark = SparkSession.builder() + .appName("OpenLineageExample") + // This line is EXTREMELY important + .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener") + .config("spark.openlineage.transport.type", "http") + .config("spark.openlineage.transport.url", "http://localhost:5000") + .config("spark.openlineage.namespace", "spark_namespace") + .config("spark.openlineage.parentJobNamespace", "airflow_namespace") + .config("spark.openlineage.parentJobName", "airflow_dag.airflow_task") + .config("spark.openlineage.parentRunId", "xxxx-xxxx-xxxx-xxxx") + .getOrCreate() + + // ... your code + + spark.stop() +} + +// For Databricks +import org.apache.spark.sql.SparkSession + +object OpenLineageExample extends App { + val spark = SparkSession.builder() + .appName("OpenLineageExample") + // This line is EXTREMELY important + .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener,com.databricks.backend.daemon.driver.DBCEventLoggingListener") + .config("spark.openlineage.transport.type", "http") + .config("spark.openlineage.transport.url", "http://localhost:5000") + .config("spark.openlineage.namespace", "spark_namespace") + .config("spark.openlineage.parentJobNamespace", "airflow_namespace") + .config("spark.openlineage.parentJobName", "airflow_dag.airflow_task") + .config("spark.openlineage.parentRunId", "xxxx-xxxx-xxxx-xxxx") + .getOrCreate() + + // ... your code + + spark.stop() +} +``` + + + + +```python +from pyspark.sql import SparkSession + +spark = SparkSession.builder + .appName("OpenLineageExample") + .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener") + .config("spark.openlineage.transport.type", "http") + .config("spark.openlineage.transport.url", "http://localhost:5000") + .config("spark.openlineage.namespace", "spark_namespace") + .config("spark.openlineage.parentJobNamespace", "airflow_namespace") + .config("spark.openlineage.parentJobName", "airflow_dag.airflow_task") + .config("spark.openlineage.parentRunId", "xxxx-xxxx-xxxx-xxxx") + .getOrCreate() + +# ... 
your code + +spark.stop() + +# For Databricks +from pyspark.sql import SparkSession + +spark = SparkSession.builder + .appName("OpenLineageExample") + .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener,com.databricks.backend.daemon.driver.DBCEventLoggingListener") + .config("spark.openlineage.transport.type", "http") + .config("spark.openlineage.transport.url", "http://localhost:5000") + .config("spark.openlineage.namespace", "spark_namespace") + .config("spark.openlineage.parentJobNamespace", "airflow_namespace") + .config("spark.openlineage.parentJobName", "airflow_dag.airflow_task") + .config("spark.openlineage.parentRunId", "xxxx-xxxx-xxxx-xxxx") + .getOrCreate() + +# ... your code + +spark.stop() +``` + + + + +#### Using `--conf` options with the CLI + +The below example demonstrates how to use the `--conf` option with `spark-submit`. + +:::note +Databricks Remember to include `com.databricks.backend.daemon.driver.DBCEventLoggingListener` along with the OpenLineage listener. +::: + +```bash +spark-submit \ + --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \ + --conf "spark.openlineage.transport.type=http" \ + --conf "spark.openlineage.transport.url=http://localhost:5000" \ + --conf "spark.openlineage.namespace=spark_namespace" \ + --conf "spark.openlineage.parentJobNamespace=airflow_namespace" \ + --conf "spark.openlineage.parentJobName=airflow_dag.airflow_task" \ + --conf "spark.openlineage.parentRunId=xxxx-xxxx-xxxx-xxxx" \ + # ... other options + +# For Databricks +spark-submit \ + --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener,com.databricks.backend.daemon.driver.DBCEventLoggingListener" \ + --conf "spark.openlineage.transport.type=http" \ + --conf "spark.openlineage.transport.url=http://localhost:5000" \ + --conf "spark.openlineage.namespace=spark_namespace" \ + --conf "spark.openlineage.parentJobNamespace=airflow_namespace" \ + --conf "spark.openlineage.parentJobName=airflow_dag.airflow_task" \ + --conf "spark.openlineage.parentRunId=xxxx-xxxx-xxxx-xxxx" \ + # ... other options +``` + +#### Adding properties to the `spark-defaults.conf` file in the `${SPARK_HOME}/conf` directory + +:::warning +You may need to create this file if it does not exist. If it does exist, **we strongly suggest that +you back it up before making any changes**, particularly if you are not the only user of the Spark +installation. A misconfiguration here can have devastating effects on the operation of your Spark +installation, particularly in a shared environment. +::: + +The below example demonstrates how to add properties to the `spark-defaults.conf` file. + +:::note +Databricks For Databricks users, include `com.databricks.backend.daemon.driver.DBCEventLoggingListener` in the `spark.extraListeners` property. +::: + +```properties +spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener +spark.openlineage.transport.type=http +spark.openlineage.transport.url=http://localhost:5000 +spark.openlineage.namespace=MyNamespace +``` + +For Databricks, +```properties +spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener,com.databricks.backend.daemon.driver.DBCEventLoggingListener +spark.openlineage.transport.type=http +spark.openlineage.transport.url=http://localhost:5000 +spark.openlineage.namespace=MyNamespace +``` + +:::info +The `spark.extraListeners` configuration parameter is **non-additive**. 
This means that if you set +`spark.extraListeners` via the CLI or via `SparkSession#config`, it will **replace** the value +in `spark-defaults.conf`. This is important to remember if you are using `spark-defaults.conf` to +set a default value for `spark.extraListeners` and then want to override it for a specific job. +::: + +:::info +When it comes to configuration parameters like `spark.openlineage.namespace`, a default value can +be supplied in the `spark-defaults.conf` file. This default value can be overridden by the +application at runtime, via the previously detailed methods. However, it is **strongly** recommended +that more dynamic or quickly changing parameters like `spark.openlineage.parentRunId` or +`spark.openlineage.parentJobName` be set at runtime via the CLI or `SparkSession#config` methods. +::: diff --git a/versioned_docs/version-1.26.0/integrations/spark/dataset_metrics.md b/versioned_docs/version-1.26.0/integrations/spark/dataset_metrics.md new file mode 100644 index 0000000..0480b58 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/dataset_metrics.md @@ -0,0 +1,120 @@ +--- +sidebar_position: 7 +title: Dataset Metrics +--- + +Input and output facets in OpenLineage specification describe datasets in the context of a given +run. For example, an amount of rows read is not a dataset facet as it does not describe the dataset. +For the convenience, OpenLineage events contain this information under `inputFacets` and `outputFacets` +fields of input and output datasets respectively. + +## Standard Input / Output dataset statistics + +OpenLineage specification comes with: + * [InputStatisticsInputDatasetFacet](../../spec/facets/dataset-facets/input-dataset-facets/input_statistics.md) + * [OutputStatisticsOutputDatasetFacet](../../spec/facets/dataset-facets/output-dataset-facets/output_statistics.md) + +which are collected by the Spark integration. Those facets basically contain: + * amount rows read/written, + * amount of bytes read/written, + * amount of files read/written. + +As a limitation to this, a row count for input datasets is collected only +for DataSourceV2 api datasets. + +## Iceberg specific metrics reports + +Even more extensive metrics are collected for Iceberg tables, as +the library exposes [MetricReport API](https://iceberg.apache.org/docs/latest/metrics-reporting/?h=metrics). +Two report types are currently supported: + * `ScanReport` - carries metrics being collected during scan planning against a given table. +Amongst some general information about the involved table, such as the snapshot id or the table +name, it includes metrics like: + * total scan planning duration + * number of data/delete files included in the result + * number of data/delete manifests scanned/skipped + * number of data/delete files scanned/skipped + * number of equality/positional delete files scanned + * `CommitReport` - carries metrics being collected after committing changes to a table (aka producing a snapshot). +Amongst some general information about the involved table, such as the snapshot +id or the table name, it includes metrics like: + * total duration + * number of attempts required for the commit to succeed + * number of added/removed data/delete files + * number of added/removed equality/positional delete files + * number of added/removed equality/positional deletes + +At the bottom of the page, we list example facets generated by Spark integration. 
+ +This feature is delivered by implementing custom `OpenLineageMetricsReporter` class +as Iceberg metrics reporter and injecting it automatically into Iceberg catalog. If any other +custom reporter is present, `OpenLineageMetricsReporter` will overwrite it, but it will still +report metrics to it. + +In case of any issues, a spark config flag: +`spark.openlineage.vendors.iceberg.metricsReporterDisabled=true` can be used to disable this feature. + +```json +"icebergScanReport": { + "_producer":"https://github.com/OpenLineage/OpenLineage/tree/1.26.0-SNAPSHOT/integration/spark", + "_schemaURL":"https://openlineage.io/spec/facets/1-0-0/IcebergScanReportInputDatasetFacet.json", + "snapshotId":4115428054613373118, + "filterDescription":"", + "projectedFieldNames":[ + "a", + "b" + ], + "scanMetrics":{ + "totalPlanningDuration":21, + "resultDataFiles":1, + "resultDeleteFiles":0, + "totalDataManifests":1, + "totalDeleteManifests":0, + "scannedDataManifests":1, + "skippedDataManifests":0, + "totalFileSizeInBytes":676, + "totalDeleteFileSizeInBytes":0, + "skippedDataFiles":0, + "skippedDeleteFiles":0, + "scannedDeleteManifests":0, + "skippedDeleteManifests":0, + "indexedDeleteFiles":0, + "equalityDeleteFiles":0, + "positionalDeleteFiles":0 + }, + "metadata":{ + "engine-version":"3.3.4", + "iceberg-version":"Apache Iceberg 1.6.0 (commit 229d8f6fcd109e6c8943ea7cbb41dab746c6d0ed)", + "app-id":"local-1733228790932", + "engine-name":"spark" + } +} +``` + +```json +"icebergCommitReport": { + "snapshotId":3131594900391425696, + "sequenceNumber":2, + "operation":"append", + "commitMetrics":{ + "totalDuration":87, + "attempts":1, + "addedDataFiles":1, + "totalDataFiles":2, + "totalDeleteFiles":0, + "addedRecords":1, + "totalRecords":4, + "addedFilesSizeInBytes":651, + "totalFilesSizeInBytes":1343, + "totalPositionalDeletes":0, + "totalEqualityDeletes":0 + }, + "metadata":{ + "engine-version":"3.3.4", + "app-id":"local-1733228862465", + "engine-name":"spark", + "iceberg-version":"Apache Iceberg 1.6.0 (commit 229d8f6fcd109e6c8943ea7cbb41dab746c6d0ed)" + } +} +``` + diff --git a/versioned_docs/version-1.26.0/integrations/spark/extending.md b/versioned_docs/version-1.26.0/integrations/spark/extending.md new file mode 100644 index 0000000..4eb04f9 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/extending.md @@ -0,0 +1,232 @@ +--- +sidebar_position: 9 +title: Extending +--- + +The Spark library is intended to support extension via custom implementations of a handful +of interfaces. Nearly every extension interface extends or mimics Scala's `PartialFunction`. The +`isDefinedAt(Object x)` method determines whether a given input is a valid input to the function. +A default implementation of `isDefinedAt(Object x)` is provided, which checks the generic type +arguments of the concrete class, if concrete type arguments are given, and determines if the input +argument matches the generic type. For example, the following class is automatically defined for an +input argument of type `MyDataset`. + +``` +class MyDatasetDetector extends QueryPlanVisitor { +} +``` + +## API +The following APIs are still evolving and may change over time based on user feedback. + +### [`OpenLineageEventHandlerFactory`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/OpenLineageEventHandlerFactory.java) +This interface defines the main entrypoint to the extension codebase. 
Custom implementations +are registered by following Java's [`ServiceLoader` conventions](https://docs.oracle.com/javase/8/docs/api/java/util/ServiceLoader.html). +A file called `io.openlineage.spark.api.OpenLineageEventHandlerFactory` must exist in the +application or jar's `META-INF/service` directory. Each line of that file must be the fully +qualified class name of a concrete implementation of `OpenLineageEventHandlerFactory`. More than one +implementation can be present in a single file. This might be useful to separate extensions that +are targeted toward different environments - e.g., one factory may contain Azure-specific extensions, +while another factory may contain GCP extensions. + +The `OpenLineageEventHandlerFactory` interface makes heavy use of default methods. Implementations +may override any or all of the following methods +```java +/** + * Return a collection of QueryPlanVisitors that can generate InputDatasets from a LogicalPlan node + */ +Collection>> createInputDatasetQueryPlanVisitors(OpenLineageContext context); + +/** + * Return a collection of QueryPlanVisitors that can generate OutputDatasets from a LogicalPlan node + */ +Collection>> createOutputDatasetQueryPlanVisitors(OpenLineageContext context); + +/** + * Return a collection of PartialFunctions that can generate InputDatasets from one of the + * pre-defined Spark types accessible from SparkListenerEvents (see below) + */ +Collection>> createInputDatasetBuilder(OpenLineageContext context); + +/** + * Return a collection of PartialFunctions that can generate OutputDatasets from one of the + * pre-defined Spark types accessible from SparkListenerEvents (see below) + */ +Collection>> createOutputDatasetBuilder(OpenLineageContext context); + +/** + * Return a collection of CustomFacetBuilders that can generate InputDatasetFacets from one of the + * pre-defined Spark types accessible from SparkListenerEvents (see below) + */ +Collection> createInputDatasetFacetBuilders(OpenLineageContext context); + +/** + * Return a collection of CustomFacetBuilders that can generate OutputDatasetFacets from one of the + * pre-defined Spark types accessible from SparkListenerEvents (see below) + */ +Collection>createOutputDatasetFacetBuilders(OpenLineageContext context); + +/** + * Return a collection of CustomFacetBuilders that can generate DatasetFacets from one of the + * pre-defined Spark types accessible from SparkListenerEvents (see below) + */ +Collection> createDatasetFacetBuilders(OpenLineageContext context); + +/** + * Return a collection of CustomFacetBuilders that can generate RunFacets from one of the + * pre-defined Spark types accessible from SparkListenerEvents (see below) + */ +Collection> createRunFacetBuilders(OpenLineageContext context); + +/** + * Return a collection of CustomFacetBuilders that can generate JobFacets from one of the + * pre-defined Spark types accessible from SparkListenerEvents (see below) + */ +Collection> createJobFacetBuilders(OpenLineageContext context); +``` + +See the [`OpenLineageEventHandlerFactory` javadocs](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/OpenLineageEventHandlerFactory.java) +for specifics on each method. 
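As a concrete (if minimal) sketch, a custom factory can lean entirely on the interface's default methods and override only what it needs. The class and package names below are hypothetical:

```java
package com.example.lineage;

import io.openlineage.spark.api.OpenLineageEventHandlerFactory;

/**
 * Hypothetical extension entrypoint. Because the interface relies on default
 * methods, an empty implementation compiles; override only the create* methods
 * your extension actually needs.
 *
 * Register it by listing the fully qualified class name
 * (com.example.lineage.MyEventHandlerFactory) in the ServiceLoader file
 * described above.
 */
public class MyEventHandlerFactory implements OpenLineageEventHandlerFactory {
  // e.g., override createRunFacetBuilders(OpenLineageContext context) here
}
```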
+ + +### [`QueryPlanVisitor`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/QueryPlanVisitor.java) +QueryPlanVisitors evaluate nodes of a Spark `LogicalPlan` and attempt to generate `InputDataset`s or +`OutputDataset`s from the information found in the `LogicalPlan` nodes. This is the most common +abstraction present in the OpenLineage Spark library, and many examples can be found in the +`io.openlineage.spark.agent.lifecycle.plan` package - examples include the +[`BigQueryNodeVisitor`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/lifecycle/plan/BigQueryNodeVisitor.java), +the [`KafkaRelationVisitor`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/lifecycle/plan/KafkaRelationVisitor.java) +and the [`InsertIntoHiveTableVisitor`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/lifecycle/plan/InsertIntoHiveTableVisitor.java). + +`QueryPlanVisitor`s implement Scala's `PartialFunction` interface and are tested against every node +of a Spark query's optimized `LogicalPlan`. Each invocation will expect either an `InputDataset` +or an `OutputDataset`. If a node can be either an `InputDataset` or an `OutputDataset`, the +constructor should accept a `DatasetFactory` so that the correct dataset type is generated at +runtime. + +`QueryPlanVisitor`s can attach facets to the Datasets created, e.g., `SchemaDatasetFacet` and +`DatasourceDatasetFacet` are typically attached to the dataset when it is created. Custom facets +can also be attached, though `CustomFacetBuilder`s _may_ override facets attached directly to the +dataset. + +Spark job's naming logic appends output dataset's identifier as job suffix. In order to provide a job suffix, a `QueryPlanVisitor` +needs to implement [`JobNameSuffixProvider`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/JobNameSuffixProvider.java) +interface. Otherwise no suffix will be appended. Job suffix should contain human-readable name +of the dataset so that consumers of OpenLineage events can correlate events with particular +Spark actions within their code. The logic to extract dataset name should not depend on the existence +of the dataset as in case of creating new dataset it may not exist at the moment of assigning job suffix. +In most cases, the suffix should contain spark catalog, database and table separated by `.` which shall be +extracted from LogicalPlan nodes properties. + +### [`InputDatasetBuilder`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/AbstractInputDatasetBuilder.java) and [`OutputDatasetBuilder`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/common/java/io/openlineage/spark/api/AbstractOutputDatasetBuilder.java) +Similar to the `QueryPlanVisitor`s, `InputDatasetBuilder`s and `OutputDatasetBuilder`s are +`PartialFunction`s defined for a specific input (see below for the list of Spark listener events and +scheduler objects that can be passed to a builder) that can generate either an `InputDataset` or an +`OutputDataset`. 
Though not strictly necessary, the abstract base classes +[`AbstractInputDatasetBuilder`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/AbstractInputDatasetBuilder.java) +and [`AbstractOutputDatasetBuilder`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/AbstractOutputDatasetBuilder.java) +are available for builders to extend. + +Spark job's naming logic appends output dataset's identifier as job suffix. +In order to provide a job suffix, a `OutputDatasetBuilder` needs to implement [`JobNameSuffixProvider`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/JobNameSuffixProvider.java) +interface. Otherwise no suffix will be appended. Job suffix should contain human-readable name +of the dataset so that consumers of OpenLineage events can correlate events with particular +Spark actions within their code. The logic to extract dataset name should not depend on the existence +of the dataset as in case of creating new dataset it may not exist at the moment of assigning job suffix. +In most cases, the suffix should contain spark catalog, database and table separated by `.` which shall be +extracted from LogicalPlan nodes properties. + +### [`CustomFacetBuilder`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/CustomFacetBuilder.java) +`CustomFacetBuilders` evaluate Spark event types and scheduler objects (see below) to construct custom +facets. `CustomFacetBuilders` are used to create `InputDatsetFacet`s, `OutputDatsetFacet`s, +`DatsetFacet`s, `RunFacet`s, and `JobFacet`s. A few examples can be found in the +[`io.openlineage.spark.agent.facets.builder`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/facets/builder) +package, including the [`ErrorFacetBuilder`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/facets/builder/ErrorFacetBuilder.java) +and the [`LogicalPlanRunFacetBuilder`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/facets/builder/LogicalPlanRunFacetBuilder.java). +`CustomFacetBuilder`s are not `PartialFunction` implementations, but do define the `isDefinedAt(Object)` +method to determine whether a given input is valid for the function. They implement the `BiConsumer` +interface, accepting the valid input argument, and a `BiConsumer` consumer, which +accepts the name and value of any custom facet that should be attached to the OpenLineage run. +There is no limit to the number of facets that can be reported by a given `CustomFacetBuilder`. +Facet names that conflict will overwrite previously reported facets if they are reported for the +same Spark event. 
+Though not strictly necessary, the following abstract base classes are available for extension: +* [`AbstractJobFacetBuilder`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/AbstractJobFacetBuilder.java) +* [`AbstractRunFacetBuilder`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/AbstractRunFacetBuilder.java) +* [`AbstractInputDatasetFacetBuilder`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/AbstractInputDatasetFacetBuilder.java) +* [`AbstractOutputDatasetFacetBuilder`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/AbstractOutputDatasetFacetBuilder.java) +* [`AbstractDatasetFacetBuilder`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/AbstractDatasetFacetBuilder.java) + +Input/Output/Dataset facets returned are attached to _any_ Input/Output Dataset found for a given +Spark event. Typically, a Spark job only has one `OutputDataset`, so any `OutputDatasetFacet` +generated will be attached to that `OutputDataset`. However, Spark jobs often have multiple +`InputDataset`s. Typically, an `InputDataset` is read within a single Spark `Stage`, and any metrics +pertaining to that dataset may be present in the `StageInfo#taskMetrics()` for that `Stage`. +Accumulators pertaining to a dataset should be reported in the task metrics for a stage so that the +`CustomFacetBuilder` can match against the `StageInfo` and retrieve the task metrics for that stage +when generating the `InputDatasetFacet`. Other facet information is often found by analyzing the +`RDD` that reads the raw data for a dataset. `CustomFacetBuilder`s that generate these facets should +be defined for the specific subclass of `RDD` that is used to read the target dataset - e.g., +`HadoopRDD`, `BigQueryRDD`, or `JdbcRDD`. + +### Function Argument Types +`CustomFacetBuilder`s and dataset builders can be defined for the following set of Spark listener +event types and scheduler types: + +* `org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart` +* `org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionEnd` +* `org.apache.spark.scheduler.SparkListenerJobStart` +* `org.apache.spark.scheduler.SparkListenerJobEnd` +* `org.apache.spark.rdd.RDD` +* `org.apache.spark.scheduler.Stage` +* `org.apache.spark.scheduler.StageInfo` +* `org.apache.spark.scheduler.ActiveJob` + +Note that `RDD`s are "unwrapped" prior to being evaluated by builders, so there's no need to, e.g., +check a `MapPartitionsRDD`'s dependencies. The `RDD` for each `Stage` can be evaluated when a +`org.apache.spark.scheduler.SparkListenerStageCompleted` event occurs. When a +`org.apache.spark.scheduler.SparkListenerJobEnd` event is encountered, the last `Stage` for the +`ActiveJob` can be evaluated. + +## Spark extensions' built-in lineage extraction + +Spark ecosystem comes with a plenty of pluggable extensions like iceberg, delta or spark-bigquery-connector +to name a few. Extensions modify logical plan of the job and inject its own classes from which lineage shall be +extracted. This is adding extra complexity, as it makes `openlineage-spark` codebase +dependent on the extension packages. The complexity grows more when multiple versions +of the same extension need to be supported. 
+ +### Spark DataSource V2 API Extensions + +Some extensions rely on Spark DataSource V2 API and implement TableProvider, Table, ScanBuilder etc. +that are used within Spark to create `DataSourceV2Relation` instances. + +A logical plan node `DataSourceV2Relation` contains `Table` field with a properties map of type +`Map`. `openlineage-spark` uses this map to extract dataset information for lineage +event from `DataSourceV2Relation`. It is checking for the properties `openlineage.dataset.name` and +`openlineage.dataset.namespace`. If they are present, it uses them to identify a dataset. Please +be aware that namespace and name need to conform to [naming convention](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md). + +Properties can be also used to pass any dataset facet. For example: +``` +openlineage.dataset.facets.customFacet={"property1": "value1", "property2": "value2"} +``` +will enrich dataset with `customFacet`: +```json +"inputs": [{ + "name": "...", + "namespace": "...", + "facets": { + "customFacet": { + "property1": "value1", + "property2": "value2", + "_producer": "..." + }, + "schema": { } +}] +``` + +The approach can be used for standard facets +from OpenLineage spec as well. `schema` does not need to be passed through the properties as +it is derived within `openlineage-spark` from `DataSourceV2Relation`. Custom facets are automatically +filled with `_producer` field. \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/integrations/spark/installation.md b/versioned_docs/version-1.26.0/integrations/spark/installation.md new file mode 100644 index 0000000..bf187f0 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/installation.md @@ -0,0 +1,258 @@ +--- +sidebar_position: 2 +title: Installation +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +:::warning + +* Version `1.8.0` and earlier only supported Scala 2.12 variants of Apache Spark. +* Version `1.9.1` and later support both Scala 2.12 and 2.13 variants of Apache Spark. + +The above necessitates a change in the artifact identifier for `io.openlineage:openlineage-spark`. +After version `1.8.0`, the artifact identifier has been updated. For subsequent versions, utilize: +`io.openlineage:openlineage-spark_${SCALA_BINARY_VERSION}:{{PREPROCESSOR:OPENLINEAGE_VERSION}}`. +::: + +To integrate OpenLineage Spark with your application, you can: + +- [Bundle the package with your Apache Spark application project](#bundle-the-package-with-your-apache-spark-application-project). +- [Place the JAR in your `${SPARK_HOME}/jars` directory](#place-the-jar-in-your-spark_homejars-directory) +- [Use the `--jars` option with `spark-submit / spark-shell / pyspark`](#use-the---jars-option-with-spark-submit--spark-shell--pyspark) +- [Use the `--packages` option with `spark-submit / spark-shell / pyspark`](#use-the---packages-option-with-spark-submit--spark-shell--pyspark) + +#### Bundle the package with your Apache Spark application project + +:::info +This approach does not demonstrate how to configure the `OpenLineageSparkListener`. +Please refer to the [Configuration](configuration/usage.md) section. 
+::: + +For Maven, add the following to your `pom.xml`: + + + + +```xml + + io.openlineage + openlineage-spark_${SCALA_BINARY_VERSION} + {{PREPROCESSOR:OPENLINEAGE_VERSION}} + +``` + + + + +```xml + + io.openlineage + openlineage-spark + ${OPENLINEAGE_SPARK_VERSION} + +``` + + + + +For Gradle, add this to your `build.gradle`: + + + + +```groovy +implementation("io.openlineage:openlineage-spark_${SCALA_BINARY_VERSION}:{{PREPROCESSOR:OPENLINEAGE_VERSION}}") +``` + + + + +```groovy +implementation("io.openlineage:openlineage-spark:{{PREPROCESSOR:OPENLINEAGE_VERSION}}") +``` + + + + +#### Place the JAR in your `${SPARK_HOME}/jars` directory + +:::info +This approach does not demonstrate how to configure the `OpenLineageSparkListener`. +Please refer to the [Configuration](#configuration) section. +::: + +1. Download the JAR and its checksum from Maven Central. +2. Verify the JAR's integrity using the checksum. +3. Upon successful verification, move the JAR to `${SPARK_HOME}/jars`. + +This script automates the download and verification process: + + + + +```bash +#!/usr/bin/env bash + +if [ -z "$SPARK_HOME" ]; then + echo "SPARK_HOME is not set. Please define it as your Spark installation directory." + exit 1 +fi + +OPENLINEAGE_SPARK_VERSION='{{PREPROCESSOR:OPENLINEAGE_VERSION}}' +SCALA_BINARY_VERSION='2.13' # Example Scala version +ARTIFACT_ID="openlineage-spark_${SCALA_BINARY_VERSION}" +JAR_NAME="${ARTIFACT_ID}-${OPENLINEAGE_SPARK_VERSION}.jar" +CHECKSUM_NAME="${JAR_NAME}.sha512" +BASE_URL="https://repo1.maven.org/maven2/io/openlineage/${ARTIFACT_ID}/${OPENLINEAGE_SPARK_VERSION}" + +curl -O "${BASE_URL}/${JAR_NAME}" +curl -O "${BASE_URL}/${CHECKSUM_NAME}" + +echo "$(cat ${CHECKSUM_NAME}) ${JAR_NAME}" | sha512sum -c + +if [ $? -eq 0 ]; then + mv "${JAR_NAME}" "${SPARK_HOME}/jars" +else + echo "Checksum verification failed." + exit 1 +fi +``` + + + + +```bash +#!/usr/bin/env bash + +if [ -z "$SPARK_HOME" ]; then + echo "SPARK_HOME is not set. Please define it as your Spark installation directory." + exit 1 +fi + +OPENLINEAGE_SPARK_VERSION='1.8.0' # Example version +ARTIFACT_ID="openlineage-spark" +JAR_NAME="${ARTIFACT_ID}-${OPENLINEAGE_SPARK_VERSION}.jar" +CHECKSUM_NAME="${JAR_NAME}.sha512" +BASE_URL="https://repo1.maven.org/maven2/io/openlineage/${ARTIFACT_ID}/${OPENLINEAGE_SPARK_VERSION}" + +curl -O "${BASE_URL}/${JAR_NAME}" +curl -O "${BASE_URL}/${CHECKSUM_NAME}" + +echo "$(cat ${CHECKSUM_NAME}) ${JAR_NAME}" | sha512sum -c + +if [ $? -eq 0 ]; then + mv "${JAR_NAME}" "${SPARK_HOME}/jars" +else + echo "Checksum verification failed." + exit 1 +fi +``` + + + + +#### Use the `--jars` option with `spark-submit / spark-shell / pyspark` + +:::info +This approach does not demonstrate how to configure the `OpenLineageSparkListener`. +Please refer to the [Configuration](#configuration) section. +::: + +1. Download the JAR and its checksum from Maven Central. +2. Verify the JAR's integrity using the checksum. +3. Upon successful verification, submit a Spark application with the JAR using the `--jars` option. 
+ +This script demonstrate this process: + + + + +```bash +#!/usr/bin/env bash + +OPENLINEAGE_SPARK_VERSION='{{PREPROCESSOR:OPENLINEAGE_VERSION}}' +SCALA_BINARY_VERSION='2.13' # Example Scala version +ARTIFACT_ID="openlineage-spark_${SCALA_BINARY_VERSION}" +JAR_NAME="${ARTIFACT_ID}-${OPENLINEAGE_SPARK_VERSION}.jar" +CHECKSUM_NAME="${JAR_NAME}.sha512" +BASE_URL="https://repo1.maven.org/maven2/io/openlineage/${ARTIFACT_ID}/${OPENLINEAGE_SPARK_VERSION}" + +curl -O "${BASE_URL}/${JAR_NAME}" +curl -O "${BASE_URL}/${CHECKSUM_NAME}" + +echo "$(cat ${CHECKSUM_NAME}) ${JAR_NAME}" | sha512sum -c + +if [ $? -eq 0 ]; then + spark-submit --jars "path/to/${JAR_NAME}" \ + # ... other options +else + echo "Checksum verification failed." + exit 1 +fi +``` + + + + +```bash +#!/usr/bin/env bash + +OPENLINEAGE_SPARK_VERSION='1.8.0' # Example version +ARTIFACT_ID="openlineage-spark" +JAR_NAME="${ARTIFACT_ID}-${OPENLINEAGE_SPARK_VERSION}.jar" +CHECKSUM_NAME="${JAR_NAME}.sha512" +BASE_URL="https://repo1.maven.org/maven2/io/openlineage/${ARTIFACT_ID}/${OPENLINEAGE_SPARK_VERSION}" + +curl -O "${BASE_URL}/${JAR_NAME}" +curl -O "${BASE_URL}/${CHECKSUM_NAME}" + +echo "$(cat ${CHECKSUM_NAME}) ${JAR_NAME}" | sha512sum -c + +if [ $? -eq 0 ]; then + spark-submit --jars "path/to/${JAR_NAME}" \ + # ... other options +else + echo "Checksum verification failed." + exit 1 +fi +``` + + + + +#### Use the `--packages` option with `spark-submit / spark-shell / pyspark` + +:::info +This approach does not demonstrate how to configure the `OpenLineageSparkListener`. +Please refer to the [Configuration](#configuration) section. +::: + +Spark allows you to add packages at runtime using the `--packages` option with `spark-submit`. This +option automatically downloads the package from Maven Central (or other configured repositories) +during runtime and adds it to the classpath of your Spark application. + + + + +```bash +OPENLINEAGE_SPARK_VERSION='{{PREPROCESSOR:OPENLINEAGE_VERSION}}' +SCALA_BINARY_VERSION='2.13' # Example Scala version + +spark-submit --packages "io.openlineage:openlineage-spark_${SCALA_BINARY_VERSION}:{{PREPROCESSOR:OPENLINEAGE_VERSION}}" \ + # ... other options +``` + + + + +```bash +OPENLINEAGE_SPARK_VERSION='1.8.0' # Example version + +spark-submit --packages "io.openlineage:openlineage-spark::{{PREPROCESSOR:OPENLINEAGE_VERSION}}" \ + # ... other options +``` + + + \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/integrations/spark/job-hierarchy.md b/versioned_docs/version-1.26.0/integrations/spark/job-hierarchy.md new file mode 100644 index 0000000..8929759 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/job-hierarchy.md @@ -0,0 +1,20 @@ +--- +sidebar_position: 5 +title: Job Hierarchy +--- + +:::info +Please get familiar with [OpenLineage Job Hierarchy concept](../../spec/job-hierarchy.md) before reading this. +::: + +In contrast to some other systems, Spark's job hierarchy is more opaque. +While you might schedule "Spark jobs" through code or notebooks, these represent an entirely different concept than what Spark sees internally. +For Spark, the true job is an action, a single computation unit initiated by the driver. +These actions materialize data only when you, the user, instruct them to write to a data sink or visualize it. +This means what you perceive as a single job can, in reality, be multiple execution units within Spark. +OpenLineage follows Spark execution model, and emits START/COMPLETE (and RUNNING) events +for each action. 
However, those are not the only events we emit. + +Recognizing the disconnect between your understanding and Spark's internal workings, +OpenLineage introduces application-level events that mark the start and end of a Spark application. +Each action-level run then points its [ParentRunFacet](../../spec/facets/run-facets/parent_run.md) to the corresponding Spark application run, providing a complete picture of the lineage. \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/integrations/spark/main_concept.md b/versioned_docs/version-1.26.0/integrations/spark/main_concept.md new file mode 100644 index 0000000..31ba4ff --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/main_concept.md @@ -0,0 +1,33 @@ +--- +sidebar_position: 1 +title: Main Concepts +--- + +Spark jobs typically run on clusters of machines. A single machine hosts the "driver" application, +which constructs a graph of jobs - e.g., reading data from a source, filtering, transforming, and +joining records, and writing results to some sink- and manages execution of those jobs. Spark's +fundamental abstraction is the Resilient Distributed Dataset (RDD), which encapsulates distributed +reads and modifications of records. While RDDs can be used directly, it is far more common to work +with Spark Datasets or Dataframes, which is an API that adds explicit schemas for better performance +and the ability to interact with datasets using SQL. The Dataframe's declarative API enables Spark +to optimize jobs by analyzing and manipulating an abstract query plan prior to execution. + +## Collecting Lineage in Spark + +Collecting lineage requires hooking into Spark's `ListenerBus` in the driver application and +collecting and analyzing execution events as they happen. Both raw RDD and Dataframe jobs post events +to the listener bus during execution. These events expose the structure of the job, including the +optimized query plan, allowing the Spark integration to analyze the job for datasets consumed and +produced, including attributes about the storage, such as location in GCS or S3, table names in a +relational database or warehouse, such as Redshift or Bigquery, and schemas. In addition to dataset +and job lineage, Spark SQL jobs also report logical plans, which can be compared across job runs to +track important changes in query plans, which may affect the correctness or speed of a job. + +A single Spark application may execute multiple jobs. The Spark OpenLineage integration maps one +Spark job to a single OpenLineage Job. The application will be assigned a Run id at startup and each +job that executes will report the application's Run id as its parent job run. Thus, an application +that reads one or more source datasets, writes an intermediate dataset, then transforms that +intermediate dataset and writes a final output dataset will report three jobs- the parent application +job, the initial job that reads the sources and creates the intermediate dataset, and the final job +that consumes the intermediate dataset and produces the final output. 
As an image: +![image](./spark-job-creation.dot.png) diff --git a/versioned_docs/version-1.26.0/integrations/spark/metrics.md b/versioned_docs/version-1.26.0/integrations/spark/metrics.md new file mode 100644 index 0000000..ddfa25b --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/metrics.md @@ -0,0 +1,28 @@ +--- +title: Spark Integration Metrics +sidebar_position: 8 +--- + +# Spark Integration Metrics + +The OpenLineage integration with Spark not only utilizes the Java client's metrics but also introduces its own set of metrics specific to Spark operations. Below is a list of these metrics. + +## Metrics Overview + +The following table provides the metrics added by the Spark integration, along with their definitions and types: + +| Metric | Definition | Type | +|--------------------------------------------------|------------------------------------------------------------------------|---------| +| `openlineage.spark.event.sql.start` | Number of SparkListenerSQLExecutionStart events received | Counter | +| `openlineage.spark.event.sql.end` | Number of SparkListenerSQLExecutionEnd events received | Counter | +| `openlineage.spark.event.job.start` | Number of SparkListenerJobStart events received | Counter | +| `openlineage.spark.event.job.end` | Number of SparkListenerJobEnd events received | Counter | +| `openlineage.spark.event.app.start` | Number of SparkListenerApplicationStart events received | Counter | +| `openlineage.spark.event.app.end` | Number of SparkListenerApplicationEnd events received | Counter | +| `openlineage.spark.event.app.start.memoryusage` | Percentage of used memory at the start of the application | Counter | +| `openlineage.spark.event.app.end.memoryusage` | Percentage of used memory at the end of the application | Counter | +| `openlineage.spark.unknownFacet.time` | Time spent building the UnknownEntryRunFacet | Timer | +| `openlineage.spark.dataset.input.execution.time` | Time spent constructing input datasets for execution | Timer | +| `openlineage.spark.facets.job.execution.time` | Time spent building job-specific facets | Timer | +| `openlineage.spark.facets.run.execution.time` | Time spent constructing run-specific facets | Timer | + diff --git a/versioned_docs/version-1.26.0/integrations/spark/quickstart/_category_.json b/versioned_docs/version-1.26.0/integrations/spark/quickstart/_category_.json new file mode 100644 index 0000000..1fc00be --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/quickstart/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Quickstart", + "position": 4 +} diff --git a/versioned_docs/version-1.26.0/integrations/spark/quickstart/glue_settings.png b/versioned_docs/version-1.26.0/integrations/spark/quickstart/glue_settings.png new file mode 100644 index 0000000..6ac3e5f Binary files /dev/null and b/versioned_docs/version-1.26.0/integrations/spark/quickstart/glue_settings.png differ diff --git a/versioned_docs/version-1.26.0/integrations/spark/quickstart/jupyter_home.png b/versioned_docs/version-1.26.0/integrations/spark/quickstart/jupyter_home.png new file mode 100644 index 0000000..f8ec0db Binary files /dev/null and b/versioned_docs/version-1.26.0/integrations/spark/quickstart/jupyter_home.png differ diff --git a/versioned_docs/version-1.26.0/integrations/spark/quickstart/jupyter_new_notebook.png b/versioned_docs/version-1.26.0/integrations/spark/quickstart/jupyter_new_notebook.png new file mode 100644 index 0000000..531794b Binary files /dev/null and 
b/versioned_docs/version-1.26.0/integrations/spark/quickstart/jupyter_new_notebook.png differ diff --git a/versioned_docs/version-1.26.0/integrations/spark/quickstart/quickstart_databricks.md b/versioned_docs/version-1.26.0/integrations/spark/quickstart/quickstart_databricks.md new file mode 100644 index 0000000..5f055f0 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/quickstart/quickstart_databricks.md @@ -0,0 +1,70 @@ +--- +sidebar_position: 2 +title: Quickstart with Databricks +--- + +OpenLineage's [Spark Integration](https://github.com/OpenLineage/OpenLineage/blob/a2d39a7a6f02474b2dfd1484f3a6d2810a5ffe30/integration/spark/README.md) can be installed on Databricks by leveraging `init` scripts. Please note that Databricks on Google Cloud does not currently support the DBFS CLI, so the proposed solution will not work on Google Cloud until that feature is enabled. + +* [Azure Databricks Init Scripts](https://docs.microsoft.com/en-us/azure/databricks/clusters/init-scripts) +* [GCP Databricks Init Scripts](https://docs.gcp.databricks.com/clusters/init-scripts.html) +* [AWS Databricks Init Scripts](https://docs.databricks.com/clusters/init-scripts.html) + +## Enable OpenLineage + +Follow the steps below to enable OpenLineage on Databricks. + +* Build the jar via Gradle or download the [latest release](https://mvnrepository.com/artifact/io.openlineage/openlineage-spark). +* Configure the Databricks CLI with your desired workspace: + * [Azure Databricks CLI](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/cli/) + * [GCP Databricks CLI](https://docs.gcp.databricks.com/dev-tools/cli/index.html) + * [AWS Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html) +* Run [`upload-to-databricks.sh`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/databricks/upload-to-databricks.sh) or [`upload-to-databricks.ps1`](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/databricks/upload-to-databricks.ps1). This will: + * create a folder in DBFS to store the OpenLineage jar, + * copy the jar to the DBFS folder, + * copy the `init` script to the DBFS folder. +* Create an interactive or job cluster with the relevant Spark configs: + ``` + spark.openlineage.transport.type console + spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener + spark.openlineage.version v1 + ``` +* Manually create `open-lineage-init-script.sh` in the **Workspace** section of the Databricks UI. Paste the script content from [this file](https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/databricks/open-lineage-init-script.sh). +* Point the cluster init script to the previously created file. For example, if you create `open-lineage-init-script.sh` within **Shared**, the init script path should be `/Shared/open-lineage-init-script.sh`. A user's workspace may be used as well, or the init script can be located in S3. Note that **DBFS**-located init scripts are no longer supported (as of September 2023). + +:::info +Please note that the `init` script approach is currently required to install OpenLineage on Databricks. The OpenLineage integration relies on providing a custom extra listener class, `io.openlineage.spark.agent.OpenLineageSparkListener`, that has to be available on the classpath at driver startup. Providing it with `spark.jars.packages` does not work on the Databricks platform as of August 2022.
+::: + +## Verify Initialization + +A successful initialization will emit logs in the `Log4j output` that look similar to the following: + +``` +YY/MM/DD HH:mm:ss INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener + +YY/MM/DD HH:mm:ss INFO OpenLineageContext: Init OpenLineageContext: Args: ArgumentParser(host=https://YOURHOST, version=v1, namespace=YOURNAMESPACE, jobName=default, parentRunId=null, apiKey=Optional.empty) URI: https://YOURHOST/api/v1/lineage + +YY/MM/DD HH:mm:ss INFO AsyncEventQueue: Process of event SparkListenerApplicationStart(Databricks Shell,Some(app-XXX-0000),YYYY,root,None,None,None) by listener OpenLineageSparkListener took Xs. +``` + +## Create a Dataset + +Open a notebook and create an example dataset with: +```python +spark.createDataFrame([ + {'a': 1, 'b': 2}, + {'a': 3, 'b': 4} +]).write.mode("overwrite").saveAsTable("default.temp") +``` + +## Observe OpenLineage Events + +To troubleshoot or observe OpenLineage information in Databricks, see the `Log4j output` in the Cluster definition's `Driver Logs`. + +The `Log4j output` should contain entries starting with a message `INFO ConsoleTransport` that contain generated OpenLineage events: + +``` +{"eventType":"COMPLETE","eventTime":"2022-08-01T08:36:21.633Z","run":{"runId":"64537bbd-00ac-498d-ad49-1c77e9c2aabd","facets":{"spark_unknown":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","inputs":[{"description":{"@class":"org.apache.spark.sql.catalyst.analysis.ResolvedTableName","id":1,"traceEnabled":false,"streaming":false,"cacheId":{"id":2,"empty":true,"defined":false},"canonicalizedPlan":false,"defaultTreePatternBits":{"id":3}},"inputAttributes":[],"outputAttributes":[]},{"description":{"@class":"org.apache.spark.sql.execution.LogicalRDD","id":1,"streaming":false,"traceEnabled":false,"cacheId":{"id":2,"empty":true,"defined":false},"canonicalizedPlan":false,"defaultTreePatternBits":{"id":3}},"inputAttributes":[],"outputAttributes":[{"name":"a","type":"long","metadata":{}},{"name":"b","type":"long","metadata":{}}]}]},"spark.logicalPlan":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ReplaceTableAsSelect","num-children":2,"name":0,"partitioning":[],"query":1,"tableSpec":null,"writeOptions":null,"orCreate":true},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedTableName","num-children":0,"catalog":null,"ident":null},{"class":"org.apache.spark.sql.execution.LogicalRDD","num-children":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"a","dataType":"long","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":18,"jvmId":"481bebf6-f861-400e-bb00-ea105ed8afef"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"b","dataType":"long","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":19,"jvmId":"481bebf6-f861-400e-bb00-ea105ed8afef"},"qualifier":[]}]],"rdd":null,"outputPartitioning":{"product-class":"org.apache.spark.sql.catalyst.plans.physical.UnknownPartitioning","numPartitions":0},"outputOrdering":[],"isStreaming":false,
"session":null}]},"spark_version":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","spark-version":"3.2.1","openlineage-spark-version":"0.12.0-SNAPSHOT"}}},"job":{"namespace":"default","name":"databricks_shell.atomic_replace_table_as_select","facets":{}},"inputs":[],"outputs":[{"namespace":"dbfs","name":"/user/hive/warehouse/temp","facets":{"dataSource":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet","name":"dbfs","uri":"dbfs"},"schema":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet","fields":[{"name":"a","type":"long"},{"name":"b","type":"long"}]},"storage":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/StorageDatasetFacet.json#/$defs/StorageDatasetFacet","storageLayer":"delta","fileFormat":"parquet"},"lifecycleStateChange":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet","lifecycleStateChange":"OVERWRITE"}},"outputFacets":{}}],"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"} +``` + +The generated JSON contains the output dataset name and location `{"namespace":"dbfs","name":"/user/hive/warehouse/temp""` metadata, schema fields `[{"name":"a","type":"long"},{"name":"b","type":"long"}]`, and more. diff --git a/versioned_docs/version-1.26.0/integrations/spark/quickstart/quickstart_glue.md b/versioned_docs/version-1.26.0/integrations/spark/quickstart/quickstart_glue.md new file mode 100644 index 0000000..88cde71 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/quickstart/quickstart_glue.md @@ -0,0 +1,45 @@ +--- +sidebar_position: 2 +title: Quickstart with AWS Glue +--- + +:::info +The `DynamicFrames` API is currently not supported. Use `DataFrames`, `DataSets` or `RDD` instead. +::: + +## Enable OpenLineage + +:::caution +The configuration must be specified in the **Job details** tab. AWS Glue may ignore the properties if they are specified in the application source code. +::: + +Follow these steps to enable OpenLineage on AWS Glue: + +1. **Specify the OpenLineage JAR URL** + + In the **Job details** tab, navigate to **Advanced properties** → **Libraries** → **Dependent Jars path** + * Use the URL directly from **[Maven Central openlineage-spark](https://mvnrepository.com/artifact/io.openlineage/openlineage-spark)** + * Ensure you select the version for **Scala 2.12**, as Glue Spark is compiled with Scala 2.12 and version 2.13 won't be compatible. + * On the page for the specific OpenLineage version for Scala 2.12, copy the URL of the jar file from the Files row and use it in Glue. + * **Alternatively**, upload the jar to an **S3 bucket** and use its URL. The URL should use the `s3` scheme: `s3:///path/to/openlineage-spark_2.12-.jar` +2. 
**Add OpenLineage configuration in Job Parameters** + + In the same **Job details** tab, add a new property under **Job parameters**: + * Use the format **`param1=value1 --conf param2=value2 ... --conf paramN=valueN`**. + * Make sure every parameter except the first has an extra **`--conf`** in front of it. + * Example: `spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=http --conf spark.openlineage.transport.url=http://example.com --conf spark.openlineage.transport.endpoint=/api/v1/lineage --conf spark.openlineage.transport.auth.type=api_key --conf spark.openlineage.transport.auth.apiKey=aaaaa-bbbbb-ccccc-ddddd` + +3. **Set User Jars First Parameter** + + Add the **`--user-jars-first`** parameter and set its value to **`true`**. + +![glue_settings.png](glue_settings.png) + +## Verification + +To confirm that OpenLineage registration has been successful, check the logs for the following entry: +``` +INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener +``` + +If you see this log message, it indicates that OpenLineage has been correctly registered with your AWS Glue job. diff --git a/versioned_docs/version-1.26.0/integrations/spark/quickstart/quickstart_local.md b/versioned_docs/version-1.26.0/integrations/spark/quickstart/quickstart_local.md new file mode 100644 index 0000000..5cd41de --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/quickstart/quickstart_local.md @@ -0,0 +1,71 @@ +--- +sidebar_position: 1 +title: Quickstart with Jupyter +--- + +Trying out the Spark integration is super easy if you already have Docker Desktop and git installed. + +:::info +If you're on macOS Monterey (macOS 12) you'll have to release port 5000 before beginning by disabling the [AirPlay Receiver](https://developer.apple.com/forums/thread/682332). +::: + +Check out the OpenLineage project into your workspace with: +``` +git clone https://github.com/OpenLineage/OpenLineage +``` + +From the spark integration directory ($OPENLINEAGE_ROOT/integration/spark), execute +```bash +docker-compose up +``` +This will start Marquez (an OpenLineage-compatible metadata backend) and a Jupyter Spark notebook on localhost:8888. On startup, the notebook container logs will show a list of URLs +including an access token, such as +```bash +notebook_1 | To access the notebook, open this file in a browser: +notebook_1 | file:///home/jovyan/.local/share/jupyter/runtime/nbserver-9-open.html +notebook_1 | Or copy and paste one of these URLs: +notebook_1 | http://abc12345d6e:8888/?token=XXXXXX +notebook_1 | or http://127.0.0.1:8888/?token=XXXXXX +``` + +Copy the URL with 127.0.0.1 as the hostname from your own log (the token will be different for each run) and paste it into your browser window. You should have a blank Jupyter notebook environment ready to go. + +![image](jupyter_home.png) + +Once your notebook environment is ready, click on the notebooks directory, then click on the New button to create a new Python 3 notebook.
+ +![image](jupyter_new_notebook.png) + +In the first cell in the window paste the following text: + +```python +from pyspark.sql import SparkSession + +spark = (SparkSession.builder.master('local') + .appName('sample_spark') + .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener') + .config('spark.jars.packages', 'io.openlineage:openlineage-spark:{{PREPROCESSOR:OPENLINEAGE_VERSION}}') + .config('spark.openlineage.transport.type', 'console') + .getOrCreate()) +``` +Once the Spark context is started, we adjust logging level to `INFO` with: +```python +spark.sparkContext.setLogLevel("INFO") +``` +and create some Spark table with: +```python +spark.createDataFrame([ + {'a': 1, 'b': 2}, + {'a': 3, 'b': 4} +]).write.mode("overwrite").saveAsTable("temp") +``` + +The command should output OpenLineage event in a form of log: +``` +22/08/01 06:15:49 INFO ConsoleTransport: {"eventType":"START","eventTime":"2022-08-01T06:15:49.671Z","run":{"runId":"204d9c56-6648-4d46-b6bd-f4623255d324","facets":{"spark_unknown":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","inputs":[{"description":{"@class":"org.apache.spark.sql.execution.LogicalRDD","id":1,"streaming":false,"traceEnabled":false,"canonicalizedPlan":false},"inputAttributes":[],"outputAttributes":[{"name":"a","type":"long","metadata":{}},{"name":"b","type":"long","metadata":{}}]}]},"spark.logicalPlan":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","plan":[{"class":"org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand","num-children":1,"table":{"product-class":"org.apache.spark.sql.catalyst.catalog.CatalogTable","identifier":{"product-class":"org.apache.spark.sql.catalyst.TableIdentifier","table":"temp"},"tableType":{"product-class":"org.apache.spark.sql.catalyst.catalog.CatalogTableType","name":"MANAGED"},"storage":{"product-class":"org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat","compressed":false,"properties":null},"schema":{"type":"struct","fields":[]},"provider":"parquet","partitionColumnNames":[],"owner":"","createTime":1659334549656,"lastAccessTime":-1,"createVersion":"","properties":null,"unsupportedFeatures":[],"tracksPartitionsInCatalog":false,"schemaPreservesCase":true,"ignoredProperties":null},"mode":null,"query":0,"outputColumnNames":"[a, 
b]"},{"class":"org.apache.spark.sql.execution.LogicalRDD","num-children":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"a","dataType":"long","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":6,"jvmId":"6a1324ac-917e-4e22-a0b9-84a5f80694ad"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"b","dataType":"long","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":7,"jvmId":"6a1324ac-917e-4e22-a0b9-84a5f80694ad"},"qualifier":[]}]],"rdd":null,"outputPartitioning":{"product-class":"org.apache.spark.sql.catalyst.plans.physical.UnknownPartitioning","numPartitions":0},"outputOrdering":[],"isStreaming":false,"session":null}]},"spark_version":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","_schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet","spark-version":"3.1.2","openlineage-spark-version":"0.12.0-SNAPSHOT"}}},"job":{"namespace":"default","name":"sample_spark.execute_create_data_source_table_as_select_command","facets":{}},"inputs":[],"outputs":[{"namespace":"file","name":"/home/jovyan/notebooks/spark-warehouse/temp","facets":{"dataSource":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet","name":"file","uri":"file"},"schema":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet","fields":[{"name":"a","type":"long"},{"name":"b","type":"long"}]},"lifecycleStateChange":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet","lifecycleStateChange":"CREATE"}},"outputFacets":{}}],"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.12.0-SNAPSHOT/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"} +``` + +Generated JSON contains output dataset name and location `{"namespace":"file","name":"/home/jovyan/notebooks/spark-warehouse/temp"`, schema fields `[{"name":"a","type":"long"},{"name":"b","type":"long"}]`, etc. + + +More comprehensive demo, that integrates Spark events with Marquez backend can be found on our blog [Tracing Data Lineage with OpenLineage and Apache Spark](https://openlineage.io/blog/openlineage-spark/) diff --git a/versioned_docs/version-1.26.0/integrations/spark/spark-job-creation.dot.png b/versioned_docs/version-1.26.0/integrations/spark/spark-job-creation.dot.png new file mode 100644 index 0000000..87e6670 Binary files /dev/null and b/versioned_docs/version-1.26.0/integrations/spark/spark-job-creation.dot.png differ diff --git a/versioned_docs/version-1.26.0/integrations/spark/spark.md b/versioned_docs/version-1.26.0/integrations/spark/spark.md new file mode 100644 index 0000000..fe48285 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/spark.md @@ -0,0 +1,16 @@ +--- +sidebar_position: 1 +title: Apache Spark +--- + +:::info +This integration is known to work with latest Spark versions as well as Apache Spark 2.4. 
+Please refer [here](https://github.com/OpenLineage/OpenLineage/tree/main/integration#openlineage-integrations) +for up-to-date information on versions supported. +::: + +This integration employs the `SparkListener` interface through `OpenLineageSparkListener`, offering +a comprehensive monitoring solution. It examines SparkContext-emitted events to extract metadata +associated with jobs and datasets, utilizing the RDD and DataFrame dependency graphs. This method +effectively gathers information from various data sources, including filesystem sources (e.g., S3 +and GCS), JDBC backends, and data warehouses such as Redshift and BigQuery. diff --git a/versioned_docs/version-1.26.0/integrations/spark/spark_column_lineage.md b/versioned_docs/version-1.26.0/integrations/spark/spark_column_lineage.md new file mode 100644 index 0000000..5e53a76 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/spark_column_lineage.md @@ -0,0 +1,94 @@ +--- +sidebar_position: 7 +title: Column-Level Lineage +--- + +:::warning + +Column-level lineage works only with Spark 3. +::: + +:::info +Column-level lineage for Spark is turned on by default and requires no additional work to be done. The following documentation describes its internals. +::: + +:::info +Lineage contains information not only about which fields were used to create or influence a field, but also how; see [Transformation Types](spec/facets/dataset-facets/column_lineage_facet.md#transformation-type). +::: + +Column-level lineage provides fine-grained information on dataset dependencies. Not only do we know the dependency exists, but we are also able to understand which input columns are used to produce output columns. This allows for answering questions like *Which root input columns are used to construct column x?* + +## Standard specification + +Collected information is sent in the OpenLineage event within the `columnLineage` dataset facet described [here](spec/facets/dataset-facets/column_lineage_facet.md). + +## Code architecture and its mechanics + +Column-level lineage has been implemented separately from the rest of the builders and visitors extracting lineage information from Spark logical plans. As a result, the code is stored in the `io.openlineage.spark3.agent.lifecycle.plan.columnLineage` package, within classes responsible only for this feature. + +* Class `ColumnLevelLineageUtils.java` is the entry point to run the mechanism and is used within `OpenLineageRunEventBuilder`. + +* Classes `ColumnLevelLineageUtilsNonV2CatalogTest` and `ColumnLevelLineageUtilsV2CatalogTest` contain real-life test cases which run Spark jobs and get access to the last query plan executed. + They evaluate column-level lineage based on the plan and the expected output schema. + Then, they verify if this meets the requirements. + This allows testing column-level lineage behavior in real scenarios. The more tests and scenarios put here, the better. + +* Class `ColumnLevelLineageBuilder` contains both the logic for building the output facet (`ColumnLineageDatasetFacetFields`) +and the data structures holding the necessary information: + * schema - `SchemaDatasetFacet` containing information about the output schema + * inputs - map pointing from `ExprId` to the column name and the `DatasetIdentifier` identifying the datasource + * outputs - map pointing from the output field name to its `ExprId` + * exprDependencies - map pointing from `ExprId` to the set of its `Dependency` objects containing an `ExprId` and information about the type of the dependency + * datasetDependencies - list of `ExprId` entries for pseudo-expressions representing operations like `filter`, `join`, etc. + * externalExpressionMappings - map pointing from `ColumnMeta` objects to the `ExprId` used for dependencies extracted by the `sql-parser` + + +* Class `ColumnLevelLineageBuilder` is used when traversing logical plans to store all the information required to produce column-level lineage. + It allows storing input/output columns. It also stores dependencies between the expressions contained in the query plan. + Once inputs, outputs and dependencies are filled, the build method is used to produce the output facet (`ColumnLineageDatasetFacetFields`). + +* The `OutputFieldsCollector` class is used to traverse the plan to gather the `outputs`; +even though the information about the output dataset is already in `schema`, it is not coupled with the outputs' `ExprId`. +The collector traverses the plan and matches the outputs existing there, inside `Aggregate` or `Project` objects, with the ones in `schema` by their name. + +* The `InputFieldsCollector` class is used to collect the inputs, which can be extracted from `DataSourceV2Relation`, `DataSourceV2ScanRelation`, `HiveTableRelation` or `LogicalRelation`. +Each input field has its `ExprId` within the plan. Each input is identified by a `DatasetIdentifier`, which means it contains the name and namespace of a dataset and an input field. + +* `ExpressionDependenciesCollector` traverses the plan to identify dependencies between different expressions using their `ExprId`. Dependencies map parent expressions to their dependencies, with additional information about the transformation type. +This is used to evaluate which inputs influenced a certain output and what kind of influence it was. + +### Expression dependency collection process + +For each node in the `LogicalPlan`, the `ExpressionDependencyCollector` attempts to extract the column lineage information based on its type. +First, it goes through the `ColumnLineageVisitors` to check if any apply to the current node; if so, it extracts dependencies from them. +Next, if the node is a `LogicalRelation` and the relation type is `JDBCRelation`, the sql-parser extracts lineage data from the query string itself. + +:::warning + +Because the SQL parser only parses the query string in `JDBCRelation`, it does not collect information about input field types or transformation types. +The only info collected is the name of the table/view and field, as it is mentioned in the query. +::: + +After that, all that's left are the following types of nodes: `Project`, `Aggregate`, `Join`, `Filter`, `Sort`. +Each of them contains dependency expressions that can be added to one of the lists `expressions` or `datasetDependencies`. + +When a node is an `Aggregate`, `Join`, `Filter` or `Sort`, it contains dependencies that don't affect one single output but all the outputs, +so they need to be treated differently from normal dependencies. +For each of those nodes a new `ExprId` is created to represent "all outputs"; all its dependencies will be of `INDIRECT` type. + +For each of the `expressions`, the collector goes through it and its possible child expressions and adds them to the `exprDependencies` map with the appropriate transformation type and `masking` flag. +Most of the expressions represent a `DIRECT` transformation; the only exceptions are `If` and `CaseWhen`, which contain condition expressions.
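To make the end result concrete, below is a minimal hand-written sketch of the `columnLineage` dataset facet that this mechanism ultimately produces for a single output column; the dataset namespace, table and field names are purely illustrative:

```json
{
  "columnLineage": {
    "fields": {
      "order_total": {
        "inputFields": [
          {
            "namespace": "postgres://warehouse:5432",
            "name": "public.orders",
            "field": "amount",
            "transformations": [
              { "type": "DIRECT", "subtype": "AGGREGATION", "description": "", "masking": false }
            ]
          }
        ]
      }
    }
  }
}
```

Each output field lists the input dataset fields it was derived from, together with the transformation type information described above; the next section covers how this structure is assembled.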
+ +### Facet building process + +For each of the outputs, `ColumnLevelLineageBuilder` goes through the `exprDependencies` to build the list of final dependencies, then uses `inputs` to map them to fields in datasets. +During the process it also unravels the transformation type between the input and output. +To unravel two dependencies, the following logic is applied: +- if the current type is `INDIRECT`, the result takes the type and subtype from the current dependency +- if the current type is `DIRECT` and the other one is null, the result is null +- if the current type is `DIRECT` and the other is `INDIRECT`, the result takes the type and subtype from the other +- if both are `DIRECT`, the result is of type `DIRECT` and the subtype is the first existing one in the order `AGGREGATION`, `TRANSFORMATION`, `IDENTITY` +- if any of the transformations is masking, the result is masking + +The inputs are also mapped for all dataset dependencies. The result is added to each output. +Finally, the list of outputs with all their inputs is mapped to a `ColumnLineageDatasetFacetFields` object. diff --git a/versioned_docs/version-1.26.0/integrations/spark/testing.md b/versioned_docs/version-1.26.0/integrations/spark/testing.md new file mode 100644 index 0000000..0100317 --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/spark/testing.md @@ -0,0 +1,94 @@ +--- +title: Testing +sidebar_position: 8 +--- + +# Testing + +## Configurable Integration Test + +Starting with version 1.17, the OpenLineage Spark integration provides command-line tooling to help +create custom integration tests. The `configurable-test.sh` script can be used to build +`openlineage-spark` from the current directory; script arguments are used to pass the Spark +job. Then, the emitted OpenLineage events are validated against JSON files with the expected events' fields. The build process and +the integration test run itself are performed within a Docker environment, which makes the command +Java-environment agnostic. + +:::info +Quickstart: try running the following command from the OpenLineage project root directory: +```bash +./integration/spark/cli/configurable-test.sh --spark ./integration/spark/cli/spark-conf.yml --test ./integration/spark/cli/tests +``` +This should run the four integration tests in `./integration/spark/cli/tests` and store their output in +`./integration/spark/cli/runs`. Feel free to add extra test directories with custom tests. +::: + +What happens when running the `configurable-test.sh` command? + * First, a docker container with Java 11 is created. It builds a docker image `openlineage-test:$OPENLINEAGE_VERSION`. During the build process, all the internal dependencies (like `openlineage-java`) are added to the image. This is done because we don't want to rebuild them on each run, which speeds up a single command run. In case of subproject changes, a new image has to be built. + * Once the docker image is built, a docker container is started and runs the gradle `configurableIntegrationTest` task. The task depends on `shadowJar` to build the `openlineage-spark` jar. The built jar should also be available on the host machine. + * The Gradle test task spawns additional Spark containers which run the Spark job and emit OpenLineage events to a local file. The gradle test code has access to the mounted event file location, fetches the emitted events and verifies them against the expected JSON events. Matching is done through MockServer JSON body matching with the `ONLY_MATCHING_FIELDS` flag set, as in the other integration tests. + * Test output is written into the `./integration/spark/cli/runs` directory, with subdirectories containing the test definition and a file with the events that were emitted. + +:::info +Please be aware that the first run of the command will download several gigabytes of docker images +as well as the gradle dependencies required to build the JAR from source code. All of them are stored +within Docker volumes, which makes consecutive runs much faster. +::: + +### Command details + +It is important to run the command from the project root directory. This is the only way to let the +created Docker containers mount volumes containing the spark integration code, java client code and +sql integration code. The command has an extra check to verify that the working directory is correct. + +Try running: +```bash +./integration/spark/cli/configurable-test.sh --help +``` +to see all the options available within your version. These should include: + * `--spark` - to define the spark environment configuration file, + * `--test` - the location of the directory containing tests, + * `--clean` - a flag marking the docker image to be rebuilt from scratch. + +### Spark configuration file + +This is an example Spark environment configuration file: +```yaml +appName: "CLI test application" +sparkVersion: 3.3.4 +scalaBinaryVersion: 2.12 +enableHiveSupport: true +packages: + - org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.5.2 +sparkConf: + spark.openlineage.debugFacet: enabled +``` + +* `sparkVersion` and `scalaBinaryVersion` are used to determine the Spark and Scala versions to be tested. Spark is run on docker from the images available at +[https://quay.io/repository/openlineage/spark?tab=tags](https://quay.io/repository/openlineage/spark?tab=tags). The combination of Spark and Scala versions provided within +the config has to match an available image. +* `appName` and `enableHiveSupport` parameters are used when starting the Spark session. +* `sparkConf` can be used to pass any spark configuration entries. The OpenLineage transport is file-based with a specified file location and is set within the test being run. Those settings should not be overridden. +* `packages` lets you define custom jar packages to be installed with the `spark-submit` command. + +As of version 1.18, the Spark configuration can accept, instead of `sparkVersion`, configuration +entries that determine the Docker image to run on: +```yaml +appName: "CLI test application" +docker: + image: "apache/spark:3.3.3-scala2.12-java11-python3-ubuntu" + sparkSubmit: /opt/spark/bin/spark-submit + waitForLogMessage: ".*ShutdownHookManager: Shutdown hook called.*" +scalaBinaryVersion: 2.12 +``` +where: + * `image` specifies the docker image to be used to run the Spark job, + * `sparkSubmit` is the file location of the `spark-submit` command, + * `waitForLogMessage` is a regex for the log entry determining that the Spark job is finished. + +### Tests definition directories + + * The specified test directory should contain one or more directories, and each of the subdirectories contains a separate test definition. + * Each test directory should contain a single `.sql` or `.py` pySpark code file containing a job definition. For a `.sql` file, each line of the file is decorated with `spark.sql()` and transformed into a pySpark script. +For pySpark scripts, a user should instantiate a SparkSession with OpenLineage parameters configured properly. Please refer to existing tests for usage examples. + * Each test directory should contain one or more event definition files with the `.json` extension, defining the expected content of any of the events emitted by the job run (see the sketch below).
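For example, a minimal expected-event file could assert only the job name and the output dataset, as in the hand-written sketch below (all names and paths are illustrative); thanks to the `ONLY_MATCHING_FIELDS` matching, any extra fields present in the emitted event are simply ignored:

```json
{
  "eventType": "COMPLETE",
  "job": {
    "namespace": "default",
    "name": "cli_test_application.execute_insert_into_hadoop_fs_relation_command"
  },
  "outputs": [
    {
      "namespace": "file",
      "name": "/tmp/cli_tests/output_table"
    }
  ]
}
```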
diff --git a/versioned_docs/version-1.26.0/integrations/trino.md b/versioned_docs/version-1.26.0/integrations/trino.md new file mode 100644 index 0000000..b222c2f --- /dev/null +++ b/versioned_docs/version-1.26.0/integrations/trino.md @@ -0,0 +1,60 @@ +--- +sidebar_position: 7 +title: Trino +--- + +:::info +This integration is known to work with Trino 450 and later. +::: + +Trino is a distributed SQL query engine targeted at big data analytical workloads. Trino queries are typically run on a +Trino `cluster`, where a distributed set of Trino `workers` provides compute power and a Trino `coordinator` is responsible +for query submission. Thanks to a rich set of available connectors, you can use Trino to execute SQL queries with exactly the same +syntax [on different underlying systems](https://trino.io/docs/current/connector.html) - such as RDBMS databases, a Hive metastore, S3 and others. + +Trino enables running queries to fetch data as well as to create new structures - such as tables, views or materialized views. + +To learn more about Trino, visit their [documentation site](https://trino.io/docs/current/). + +## How does Trino work with OpenLineage? + +Collecting lineage in Trino requires configuring a `plugin`, which uses Trino's `EventListener` interface to extract +lineage information from the metadata available to this interface. + +The Trino OpenLineage Event Listener plugin will yield 2 events for each executed query - one for the STARTED and one for the +SUCCEEDED/FAILED query. While the first one already provides us with new job information, the actual lineage information +(inlets/outlets) will be available in the latter event. + +This plugin supports both table and column level lineage. + +## Configuring Trino OpenLineage plugin + +1. Create a configuration file named `openlineage-event-listener.properties`: + +```properties +event-listener.name=openlineage +openlineage-event-listener.transport.type=HTTP +openlineage-event-listener.transport.url=__OPENLINEAGE_URL__ +openlineage-event-listener.trino.uri=__TRINO_URI__ +``` + +Make sure to set: +- `__OPENLINEAGE_URL__` - the address where the OpenLineage API is reachable, so the plugin can post lineage information. +- `__TRINO_URI__` - the address (preferably DNS) of the Trino cluster. It will be used for rendering the dataset namespace. + +2. Extend the properties file used to configure the Trino **coordinator** with the following line: + +```properties +event-listener.config-files=etc/openlineage-event-listener.properties +``` + +Make sure that the path set in `event-listener.config-files` is recognizable by the Trino coordinator. + +### Official documentation + +Current documentation on the Trino OpenLineage Event Listener, with a full list of available configuration options, +[is maintained here](https://trino.io/docs/current/admin/event-listeners-openlineage.html). + +## Feedback + +What did you think of this guide? You can reach out to us on [slack](https://join.slack.com/t/openlineage/shared_invite/zt-2u4oiyz5h-TEmqpP4fVM5eCdOGeIbZvA) and leave us feedback!
diff --git a/versioned_docs/version-1.26.0/model.png b/versioned_docs/version-1.26.0/model.png new file mode 100644 index 0000000..a552980 Binary files /dev/null and b/versioned_docs/version-1.26.0/model.png differ diff --git a/versioned_docs/version-1.26.0/model.svg b/versioned_docs/version-1.26.0/model.svg new file mode 100644 index 0000000..39f0ac6 --- /dev/null +++ b/versioned_docs/version-1.26.0/model.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_10_0.md b/versioned_docs/version-1.26.0/releases/0_10_0.md new file mode 100644 index 0000000..37df489 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_10_0.md @@ -0,0 +1,29 @@ +--- +title: 0.10.0 +sidebar_position: 9983 +--- + +# 0.10.0 - 2022-06-24 + +### Added + +* Add static code anlalysis tool [mypy](http://mypy-lang.org) to run in CI for against all python modules ([`#802`](https://github.com/openlineage/openlineage/issues/802)) [@howardyoo](https://github.com/howardyoo) +* Extend `SaveIntoDataSourceCommandVisitor` to extract schema from `LocalRelaiton` and `LogicalRdd` in spark integration ([`#794`](https://github.com/OpenLineage/OpenLineage/pull/794)) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Add `InMemoryRelationInputDatasetBuilder` for `InMemory` datasets to Spark integration ([`#818`](https://github.com/OpenLineage/OpenLineage/pull/818)) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Add copyright to source files [`#755`](https://github.com/OpenLineage/OpenLineage/pull/755) [@merobi-hub](https://github.com/merobi-hub) +* Add `SnowflakeOperatorAsync` extractor support to Airflow integration [`#869`](https://github.com/OpenLineage/OpenLineage/pull/869) [@merobi-hub](https://github.com/merobi-hub) +* Add PMD analysis to proxy project ([`#889`](https://github.com/OpenLineage/OpenLineage/pull/889)) [@howardyoo](https://github.com/howardyoo) + +### Changed + +* Skip `FunctionRegistry.class` serialization in Spark integration ([`#828`](https://github.com/OpenLineage/OpenLineage/pull/828)) [@mobuchowski](https://github.com/mobuchowski) +* Install new `rust`-based SQL parser by default in Airflow integration ([`#835`](https://github.com/OpenLineage/OpenLineage/pull/835)) [@mobuchowski](https://github.com/mobuchowski) +* Improve overall `pytest` and integration tests for Airflow integration ([`#851`](https://github.com/OpenLineage/OpenLineage/pull/851),[`#858`](https://github.com/OpenLineage/OpenLineage/pull/858)) [@denimalpaca](https://github.com/denimalpaca) +* Reduce OL event payload size by excluding local data and including output node in start events ([`#881`](https://github.com/OpenLineage/OpenLineage/pull/881)) [@collado-mike](https://github.com/collado-mike) +* Split spark integration into submodules ([`#834`](https://github.com/OpenLineage/OpenLineage/pull/834), [`#890`](https://github.com/OpenLineage/OpenLineage/pull/890)) [@tnazarew](https://github.com/tnazarew) [@mobuchowski](https://github.com/mobuchowski) + +### Fixed + +* Conditionally import `sqlalchemy` lib for Great Expectations integration ([`#826`](https://github.com/OpenLineage/OpenLineage/pull/826)) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Add check for missing **class** `org.apache.spark.sql.catalyst.plans.logical.CreateV2Table` in Spark integration ([`#866`](https://github.com/OpenLineage/OpenLineage/pull/866)) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Fix static code analysis issues 
([`#867`](https://github.com/OpenLineage/OpenLineage/pull/867),[`#874`](https://github.com/OpenLineage/OpenLineage/pull/874)) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_11_0.md b/versioned_docs/version-1.26.0/releases/0_11_0.md new file mode 100644 index 0000000..7a422a2 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_11_0.md @@ -0,0 +1,24 @@ +--- +title: 0.11.0 +sidebar_position: 9982 +--- + +# 0.11.0 - 2022-07-07 + +### Added + +* HTTP option to override timeout and properly close connections in `openlineage-java` lib. [`#909`](https://github.com/OpenLineage/OpenLineage/pull/909) [@mobuchowski](https://github.com/mobuchowski) +* Dynamic mapped tasks support to Airflow integration [`#906`](https://github.com/OpenLineage/OpenLineage/pull/906) [@JDarDagran](https://github.com/JDarDagran) +* `SqlExtractor` to Airflow integration [`#907`](https://github.com/OpenLineage/OpenLineage/pull/907) [@JDarDagran](https://github.com/JDarDagran) +* [PMD](https://pmd.github.io) to Java and Spark builds in CI [`#898`](https://github.com/OpenLineage/OpenLineage/pull/898) [@merobi-hub](https://github.com/merobi-hub) + +### Changed + +* When testing extractors in the Airflow integration, set the extractor length assertion dynamic [`#882`](https://github.com/OpenLineage/OpenLineage/pull/882) [@denimalpaca](https://github.com/denimalpaca) +* Render templates as start of integration tests for `TaskListener` in the Airflow integration [`#870`](https://github.com/OpenLineage/OpenLineage/pull/870) [@mobuchowski](https://github.com/mobuchowski) + +### Fixed + +* Dependencies bundled with `openlineage-java` lib. [`#855`](https://github.com/OpenLineage/OpenLineage/pull/855) [@collado-mike](https://github.com/collado-mike) +* [PMD](https://pmd.github.io) reported issues [`#891`](https://github.com/OpenLineage/OpenLineage/pull/891) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Spark casting error and session catalog support for `iceberg` in Spark integration [`#856`](https://github.com/OpenLineage/OpenLineage/pull/856) [@wslulciuc](https://github.com/wslulciuc) diff --git a/versioned_docs/version-1.26.0/releases/0_12_0.md b/versioned_docs/version-1.26.0/releases/0_12_0.md new file mode 100644 index 0000000..03a3f59 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_12_0.md @@ -0,0 +1,26 @@ +--- +title: 0.12.0 +sidebar_position: 9981 +--- + +# 0.12.0 - 2022-08-01 + +### Added + +* Add Spark 3.3.0 support [`#950`](https://github.com/OpenLineage/OpenLineage/pull/950) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Add Apache Flink integration [`#951`](https://github.com/OpenLineage/OpenLineage/pull/951) [@mobuchowski](https://github.com/mobuchowski) +* Add ability to extend column level lineage mechanism [`#922`](https://github.com/OpenLineage/OpenLineage/pull/922) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Add ErrorMessageRunFacet [`#897`](https://github.com/OpenLineage/OpenLineage/pull/897) [@mobuchowski](https://github.com/mobuchowski) +* Add SQLCheckExtractors [`#717`](https://github.com/OpenLineage/OpenLineage/pull/717) [@denimalpaca](https://github.com/denimalpaca) +* Add RedshiftSQLExtractor & RedshiftDataExtractor [`#930`](https://github.com/OpenLineage/OpenLineage/pull/930) [@JDarDagran](https://github.com/JDarDagran) +* Add dataset builder for AlterTableCommand [`#927`](https://github.com/OpenLineage/OpenLineage/pull/927) 
[@tnazarew](https://github.com/tnazarew) + +### Changed + +* Limit Delta events [`#905`](https://github.com/OpenLineage/OpenLineage/pull/905) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Airflow integration: allow lineage metadata to flow through inlets and outlets [`#914`](https://github.com/OpenLineage/OpenLineage/pull/914) [@fenil25](https://github.com/fenil25) + +### Fixed + +* Limit size of serialized plan [`#917`](https://github.com/OpenLineage/OpenLineage/pull/917) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Fix noclassdef error [`#942`](https://github.com/OpenLineage/OpenLineage/pull/942) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) diff --git a/versioned_docs/version-1.26.0/releases/0_13_0.md b/versioned_docs/version-1.26.0/releases/0_13_0.md new file mode 100644 index 0000000..47b8705 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_13_0.md @@ -0,0 +1,35 @@ +--- +title: 0.13.0 +sidebar_position: 9980 +--- + +# 0.13.0 - 2022-08-22 + +### Added + +* Add BigQuery check support [`#960`](https://github.com/OpenLineage/OpenLineage/pull/960) [@denimalpaca](https://github.com/denimalpaca) + *Adds logic and support for proper dynamic class inheritance for BigQuery-style operators. (BigQuery's extractor needed additional logic to support the forthcoming `BigQueryColumnCheckOperator` and `BigQueryTableCheckOperator`.)* +* Add `RUNNING` `EventType` in spec and Python client [`#972`](https://github.com/OpenLineage/OpenLineage/pull/972) [@mzareba382](https://github.com/mzareba382) + *Introduces a `RUNNING` event state in the OpenLineage spec to indicate a running task and adds a `RUNNING` event type in the Python API.* +* Use databases & schemas in SQL Extractors [`#974`](https://github.com/OpenLineage/OpenLineage/pull/974) [@JDarDagran](https://github.com/JDarDagran) + *Allows the Airflow integration to differentiate between databases and schemas. (There was no notion of databases and schemas when querying and parsing results from `information_schema` tables.)* +* Implement Event forwarding feature via HTTP protocol [`#995`](https://github.com/OpenLineage/OpenLineage/pull/995) [@howardyoo](https://github.com/howardyoo) + *Adds `HttpLineageStream` to forward a given OpenLineage event to any HTTP endpoint.* +* Introduce `SymlinksDatasetFacet` to spec [`#936`](https://github.com/OpenLineage/OpenLineage/pull/936) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Creates a new facet, the `SymlinksDatasetFacet`, to support the storing of alternative dataset names.* +* Add Azure Cosmos Handler to Spark integration [`#983`](https://github.com/OpenLineage/OpenLineage/pull/983) [@hmoazam](https://github.com/hmoazam) + *Defines a new interface, the `RelationHandler`, to support Spark data sources that do not have `TableCatalog`, `Identifier`, or `TableProperties` set, as is the case with the Azure Cosmos DB Spark connector.* +* Support OL Datasets in manual lineage inputs/outputs [`#1015`](https://github.com/OpenLineage/OpenLineage/pull/1015) [@conorbev](https://github.com/conorbev) + *Allows Airflow users to create OpenLineage Dataset classes directly in DAGs with no conversion necessary. 
(Manual lineage definition required users to create an `airflow.lineage.entities.Table`, which was then converted to an OpenLineage Dataset.)* +* Create ownership facets [`#996`](https://github.com/OpenLineage/OpenLineage/pull/996) [@julienledem](https://github.com/julienledem) + *Adds an ownership facet to both Dataset and Job in the OpenLineage spec to capture ownership of jobs and datasets.* + +### Changed +* Use `RUNNING` EventType in Flink integration for currently running jobs [`#985`](https://github.com/OpenLineage/OpenLineage/pull/985) [@mzareba382](https://github.com/mzareba382) + *Makes use of the new `RUNNING` event type in the Flink integration, changing events sent by Flink jobs from `OTHER` to this new type.* +* Convert task objects to JSON-encodable objects when creating custom Airflow version facets [`#1018`](https://github.com/OpenLineage/OpenLineage/pull/1018) [@fm100](https://github.com/fm100) + *Implements a `to_json_encodable` function in the Airflow integration to make task objects JSON-encodable.* + +### Fixed +* Add support for custom SQL queries in v3 Great Expectations API [`#1025`](https://github.com/OpenLineage/OpenLineage/pull/1025) [@collado-mike](https://github.com/collado-mike) + *Fixes support for custom SQL statements in the Great Expectations provider. (The Great Expectations custom SQL datasource was not applied to the support for the V3 checkpoints API.)* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_13_1.md b/versioned_docs/version-1.26.0/releases/0_13_1.md new file mode 100644 index 0000000..cbdfbf2 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_13_1.md @@ -0,0 +1,12 @@ +--- +title: 0.13.1 +sidebar_position: 9979 +--- + +# 0.13.1 - 2022-08-25 + +### Fixed +* Rename all `parentRun` occurrences to `parent` in Airflow integration [`1037`](https://github.com/OpenLineage/OpenLineage/pull/1037) [@fm100](https://github.com/fm100) + *Changes the `parentRun` property name to `parent` in the Airflow integration to match the spec.* +* Do not change task instance during `on_running` event [`1028`](https://github.com/OpenLineage/OpenLineage/pull/1028) [@JDarDagran](https://github.com/JDarDagran) + *Fixes an issue in the Airflow integration with the `on_running` hook, which was changing the `TaskInstance` object along with the `task` attribute.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_14_0.md b/versioned_docs/version-1.26.0/releases/0_14_0.md new file mode 100644 index 0000000..f53bbeb --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_14_0.md @@ -0,0 +1,32 @@ +--- +title: 0.14.0 +sidebar_position: 9978 +--- + +# 0.14.0 - 2022-09-06 + +### Added +* Support ABFSS and Hadoop Logical Relation in Column-level lineage [`#1008`](https://github.com/OpenLineage/OpenLineage/pull/1008) [@wjohnson](https://github.com/wjohnson) + *Introduces an `extractDatasetIdentifier` that uses similar logic to `InsertIntoHadoopFsRelationVisitor` to pull out the path on the HDFS compliant file system; tested on ABFSS and DBFS (Databricks FileSystem) to prove that lineage could be extracted using non-SQL commands.* +* Add Kusto relation visitor [`#939`](https://github.com/OpenLineage/OpenLineage/pull/939) [@hmoazam](https://github.com/hmoazam) + *Implements a `KustoRelationVisitor` to support lineage for Azure Kusto's Spark connector.* +* Add ColumnLevelLineage facet doc [`#1020`](https://github.com/OpenLineage/OpenLineage/pull/1020) [@julienledem](https://github.com/julienledem) + *Adds 
documentation for the Column-level lineage facet.* +* Include symlinks dataset facet [`#935`](https://github.com/OpenLineage/OpenLineage/pull/935) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Includes the recently introduced `SymlinkDatasetFacet` in generated OpenLineage events.* +* Add support for dbt 1.3 beta's metadata changes [`#1051`](https://github.com/OpenLineage/OpenLineage/pull/1051) [@mobuchowski](https://github.com/mobuchowski) + *Makes projects that are composed of only SQL models work on 1.3 beta (dbt 1.3 renamed the `compiled_sql` field to `compiled_code` to support Python models). Does not provide support for dbt's Python models.* +* Support Flink 1.15 [`#1009`](https://github.com/OpenLineage/OpenLineage/pull/1009) [@mzareba382](https://github.com/mzareba382) + *Adds support for Flink 1.15.* +* Add Redshift dialect to the SQL integration [`#1066`](https://github.com/OpenLineage/OpenLineage/pull/1066) [@mobuchowski](https://github.com/mobuchowski) + *Adds support for Redshift's SQL dialect in OpenLineage's SQL parser, including quirks such as the use of square brackets in JSON paths. (Note, this does not add support for all of Redshift's custom syntax.)* + +### Changed +* Make the timeout configurable in the Spark integration [`#1050`](https://github.com/OpenLineage/OpenLineage/pull/1050) [@tnazarew](https://github.com/tnazarew) + *Makes timeout configurable by the user. (In some cases, the time needed to send events was longer than 5 seconds, which exceeded the timeout value.)* + +### Fixed +* Add a dialect parameter to Great Expectations SQL parser calls [`#1049`](https://github.com/OpenLineage/OpenLineage/pull/1049) [@collado-mike](https://github.com/collado-mike) + *Specifies the dialect name from the SQL engine.* +* Fix Delta 2.1.0 with Spark 3.3.0 [`#1065`](https://github.com/OpenLineage/OpenLineage/pull/1065) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Allows delta support for Spark 3.3 and fixes potential issues. 
(The Openlineage integration for Spark 3.3 was turned on without delta support, as delta did not support Spark 3.3 at that time.)* diff --git a/versioned_docs/version-1.26.0/releases/0_14_1.md b/versioned_docs/version-1.26.0/releases/0_14_1.md new file mode 100644 index 0000000..de94c30 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_14_1.md @@ -0,0 +1,10 @@ +--- +title: 0.14.1 +sidebar_position: 9977 +--- + +# 0.14.1 - 2022-09-07 + +### Fixed +* Fix Spark integration issues including error when no `openlineage.timeout` [`#1069`](https://github.com/OpenLineage/OpenLineage/pull/1069) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *`OpenlineageSparkListener` was failing when no `openlineage.timeout` was provided.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_15_1.md b/versioned_docs/version-1.26.0/releases/0_15_1.md new file mode 100644 index 0000000..e80ecdb --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_15_1.md @@ -0,0 +1,36 @@ +--- +title: 0.15.1 +sidebar_position: 9976 +--- + +# 0.15.1 - 2022-10-05 + +### Added +* Airflow: improve development experience [`#1101`](https://github.com/OpenLineage/OpenLineage/pull/1101) [@JDarDagran](https://github.com/JDarDagran) + *Adds an interactive development environment to the Airflow integration and improves integration testing.* +* Spark: add description for URL parameters in readme, change `overwriteName` to `appName` [`#1130`](https://github.com/OpenLineage/OpenLineage/pull/1130) [@tnazarew](https://github.com/tnazarew) + *Adds more information about passing arguments with `spark.openlineage.url` and changes `overwriteName` to `appName` for clarity.* +* Documentation: update issue templates for proposal & add new integration template [`#1116`](https://github.com/OpenLineage/OpenLineage/pull/1116) [@rossturk](https://github.com/rossturk) + *Adds a YAML issue template for new integrations and fixes a bug in the proposal template.* + +### Changed +* Airflow: lazy load BigQuery client [`#1119`](https://github.com/OpenLineage/OpenLineage/pull/1119) [@mobuchowski](https://github.com/mobuchowski) + *Moves import of the BigQuery client from top level to local level to decrease DAG import time.* + +### Fixed +* Airflow: fix UUID generation conflict for Airflow DAGs with same name [`#1056`](https://github.com/OpenLineage/OpenLineage/pull/1056) [@collado-mike](https://github.com/collado-mike) + *Adds a namespace to the UUID calculation to avoid conflicts caused by DAGs having the same name in different namespaces in Airflow deployments.* +* Spark/BigQuery: fix issue with spark-bigquery-connector >=0.25.0 [`#1111`](https://github.com/OpenLineage/OpenLineage/pull/1111) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Makes the Spark integration compatible with the latest connector.* +* Spark: fix column lineage [`#1069`](https://github.com/OpenLineage/OpenLineage/pull/1069) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Fixes a null pointer exception error and an error when `openlineage.timeout` is not provided.* +* Spark: set log level of `Init OpenLineageContext` to DEBUG [`#1064`](https://github.com/OpenLineage/OpenLineage/pull/1064) [@varuntestaz](https://github.com/varuntestaz) + *Prevents sensitive information from being logged unless debug mode is used.* +* Java client: update version of SnakeYAML [`#1090`](https://github.com/OpenLineage/OpenLineage/pull/1090) [@TheSpeedding](https://github.com/TheSpeedding) + *Bumps the SnakeYAML library 
version to include a key bug fix.* +* dbt: remove requirement for `OPENLINEAGE_URL` to be set [`#1107`](https://github.com/OpenLineage/OpenLineage/pull/1107) [@mobuchowski](https://github.com/mobuchowski) + *Removes erroneous check for `OPENLINEAGE_URL` in the dbt integration.* +* Python client: remove potentially cyclic import [`#1126`](https://github.com/OpenLineage/OpenLineage/pull/1126) [@mobuchowski](https://github.com/mobuchowski) + *Hides imports to remove potentially cyclic import.* +* CI: build macos release package on medium resource class [`#1131`](https://github.com/OpenLineage/OpenLineage/pull/1131) [@mobuchowski](https://github.com/mobuchowski) + *Fixes failing build due to resource class being too large.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_16_1.md b/versioned_docs/version-1.26.0/releases/0_16_1.md new file mode 100644 index 0000000..3fb5ea3 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_16_1.md @@ -0,0 +1,40 @@ +--- +title: 0.16.1 +sidebar_position: 9975 +--- + +# 0.16.1 - 2022-11-03 + +### Added +* Airflow: add `dag_run` information to Airflow version run facet [`#1133`](https://github.com/OpenLineage/OpenLineage/pull/1133) [@fm100](https://github.com/fm100) + *Adds the Airflow DAG run ID to the `taskInfo` facet, making this additional information available to the integration.* +* Airflow: add `LoggingMixin` to extractors [`#1149`](https://github.com/OpenLineage/OpenLineage/pull/1149) [@JDarDagran](https://github.com/JDarDagran) + *Adds a `LoggingMixin` class to the custom extractor to make the output consistent with general Airflow and OpenLineage logging settings.* +* Airflow: add default extractor [`#1162`](https://github.com/OpenLineage/OpenLineage/pull/1162) [@mobuchowski](https://github.com/mobuchowski) + *Adds a `DefaultExtractor` to support the default implementation of OpenLineage for external operators without the need for custom extractors.* +* Airflow: add `on_complete` argument in `DefaultExtractor` [`#1188`](https://github.com/OpenLineage/OpenLineage/pull/1188) [@JDarDagran](https://github.com/JDarDagran) + *Adds support for running another method on `extract_on_complete`.* +* SQL: reorganize the library into multiple packages [`#1167`](https://github.com/OpenLineage/OpenLineage/pull/1167) [@StarostaGit](https://github.com/StarostaGit) [@mobuchowski](https://github.com/mobuchowski) + *Splits the SQL library into a Rust implementation and foreign language bindings, easing the process of adding language interfaces. Also contains CI fix.* + +### Changed +* Airflow: move `get_connection_uri` as extractor's classmethod [`#1169`](https://github.com/OpenLineage/OpenLineage/pull/1169) [@JDarDagran](https://github.com/JDarDagran) + *The `get_connection_uri` method allowed for too many params, resulting in unnecessarily long URIs. 
This changes the logic to whitelisting per extractor.* +* Airflow: change `get_openlineage_facets_on_start/complete` behavior [`#1201`](https://github.com/OpenLineage/OpenLineage/pull/1201) [@JDarDagran](https://github.com/JDarDagran) + *Splits up the method for greater legibility and easier maintenance.* + +### Fixed +* Airflow: always send SQL in `SqlJobFacet` as a string [`#1143`](https://github.com/OpenLineage/OpenLineage/pull/1143) [@mobuchowski](https://github.com/mobuchowski) + *Changes the data type of `query` from array to string to fix an error in the `RedshiftSQLOperator`.* +* Airflow: include `__extra__` case when filtering URI query params [`#1144`](https://github.com/OpenLineage/OpenLineage/pull/1144) [@JDarDagran](https://github.com/JDarDagran) + *Includes the `conn.EXTRA_KEY` in the `get_connection_uri` method to avoid exposing secrets in URIs via the `__extra__` key.* +* Airflow: enforce column casing in `SQLCheckExtractor`s [`#1159`](https://github.com/OpenLineage/OpenLineage/pull/1159) [@denimalpaca](https://github.com/denimalpaca) + *Uses the parent extractor's `_is_uppercase_names` property to determine if the column should be uppercased in the `SQLColumnCheckExtractor`'s `_get_input_facets()` method.* +* Spark: prevent exception when no schema provided [`#1180`](https://github.com/OpenLineage/OpenLineage/pull/1180) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Prevents evaluation of column lineage when the `schemaFacet` is `null`.* +* Great Expectations: add V3 API compatibility [`#1194`](https://github.com/OpenLineage/OpenLineage/pull/1194) [@denimalpaca](https://github.com/denimalpaca) + *Fixes the Pandas datasource to make it V3 API-compatible.* + +### Removed +* Airflow: remove support for Airflow 1.10 [`#1128`](https://github.com/OpenLineage/OpenLineage/pull/1128) [@mobuchowski](https://github.com/mobuchowski) + *Removes the code structures and tests enabling support for Airflow 1.10.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_17_0.md b/versioned_docs/version-1.26.0/releases/0_17_0.md new file mode 100644 index 0000000..e614c09 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_17_0.md @@ -0,0 +1,48 @@ +--- +title: 0.17.0 +sidebar_position: 9974 +--- + +# 0.17.0 - 2022-11-16 + +### Added +* Spark: support latest Spark 3.3.1 [`#1183`](https://github.com/OpenLineage/OpenLineage/pull/1183) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds support for the latest version of Spark.* +* Spark: add Kinesis Transport and support configuring Kinesis in the Spark integration [`#1200`](https://github.com/OpenLineage/OpenLineage/pull/1200) [@yogyang](https://github.com/yogyang) + *Adds support for sending to Kinesis from the Spark integration.* +* Spark: Disable specified facets [`#1271`](https://github.com/OpenLineage/OpenLineage/pull/1271) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds the ability to disable specified facets from generated OpenLineage events.* +* Python: add facets implementation to Python client [`#1233`](https://github.com/OpenLineage/OpenLineage/pull/1233) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds missing facets to the Python client.* +* SQL: add Rust parser interface [`#1172`](https://github.com/OpenLineage/OpenLineage/pull/1172) [@StarostaGit](https://github.com/StarostaGit) [@mobuchowski](https://github.com/mobuchowski) + *Implements a Java interface in the Rust SQL parser, including a build script, native library loading 
mechanism, CI support and build fixes.* +* Proxy: add Helm chart for the proxy backend [`#1068`](https://github.com/OpenLineage/OpenLineage/pull/1068) [@wslulciuc](https://github.com/wslulciuc) + *Adds a Helm chart for deploying the proxy backend on Kubernetes.* +* Spec: include possible facets usage in spec [`#1249`](https://github.com/OpenLineage/OpenLineage/pull/1249) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Extends the `facets` definition with a list of available facets.* +* Website: publish YML version of spec to website [`#1300`](https://github.com/OpenLineage/OpenLineage/pull/1300) [@rossturk](https://github.com/rossturk) + *Adds configuration necessary to make the OpenLineage website auto-generate OpenAPI docs when the spec is published there.* +* Docs: update language on nominating new committers [`#1270`](https://github.com/OpenLineage/OpenLineage/pull/1270) [@rossturk](https://github.com/rossturk) + *Updates the governance language to reflect the new policy on nominating committers.* + +### Changed +* Website: publish spec into new website repo location [`#1295`](https://github.com/OpenLineage/OpenLineage/pull/1295) [@rossturk](https://github.com/rossturk) + *Creates a new deploy key, adds it to CircleCI & GitHub, and makes the necessary changes to the `release.sh` script.* +* Airflow: change how pip installs packages in tox environments [`#1302`](https://github.com/OpenLineage/OpenLineage/pull/1302) [@JDarDagran](https://github.com/JDarDagran) + *Uses the deprecated resolver and constraints files provided by Airflow to avoid potential issues caused by pip's new resolver.* + +### Fixed +* Airflow: fix README for running integration test [`#1238`](https://github.com/OpenLineage/OpenLineage/pull/1238) [@sekikn](https://github.com/sekikn) + *Updates the README for consistency with supported Airflow versions.* +* Airflow: add `task_instance` argument to `get_openlineage_facets_on_complete` [`#1269`](https://github.com/OpenLineage/OpenLineage/pull/1269) [@JDarDagran](https://github.com/JDarDagran) + *Adds the `task_instance` argument to `DefaultExtractor`.* +* Java client: fix up all artifactory paths [`#1290`](https://github.com/OpenLineage/OpenLineage/pull/1290) [@harels](https://github.com/harels) + *Not all artifactory paths were changed in the build CI script in a previous PR.* +* Python client: fix Mypy errors and adjust to PEP 484 [`#1264`](https://github.com/OpenLineage/OpenLineage/pull/1264) [@JDarDagran](https://github.com/JDarDagran) + *Adds a `--no-namespace-packages` argument to the Mypy command and adjusts code to PEP 484.* +* Website: release all specs since `last_spec_commit_id`, not just HEAD~1 [`#1298`](https://github.com/OpenLineage/OpenLineage/pull/1298) [@rossturk](https://github.com/rossturk) + *The script now ships all specs that have changed since `.last_spec_commit_id`.* + +### Removed +* Deprecate HttpTransport.Builder in favor of HttpConfig [`#1287`](https://github.com/OpenLineage/OpenLineage/pull/1287) [@collado-mike](https://github.com/collado-mike) + *Deprecates the Builder in favor of HttpConfig only and replaces the existing Builder implementation by delegating to the HttpConfig.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_18_0.md b/versioned_docs/version-1.26.0/releases/0_18_0.md new file mode 100644 index 0000000..194faae --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_18_0.md @@ -0,0 +1,30 @@ +--- +title: 0.18.0 +sidebar_position: 9973 +--- + +# 0.18.0 - 2022-12-08 + +### Added +* 
Airflow: support `SQLExecuteQueryOperator` [`#1379`](https://github.com/OpenLineage/OpenLineage/pull/1379) [@JDarDagran](https://github.com/JDarDagran) + *Changes the `SQLExtractor` and adds support for the dynamic assignment of extractors based on `conn_type`.* +* Airflow: introduce a new extractor for `SFTPOperator` [`#1263`](https://github.com/OpenLineage/OpenLineage/pull/1263) [@sekikn](https://github.com/sekikn) + *Adds an extractor for tracing file transfers between local file systems.* +* Airflow: add Sagemaker extractors [`#1136`](https://github.com/OpenLineage/OpenLineage/pull/1136) [@fhoda](https://github.com/fhoda) + *Creates extractors for `SagemakerProcessingOperator` and `SagemakerTransformOperator`.* +* Airflow: add S3 extractor for Airflow operators [`#1166`](https://github.com/OpenLineage/OpenLineage/pull/1166) [@fhoda](https://github.com/fhoda) + *Creates an extractor for the `S3CopyObject` in the Airflow integration.* +* Spec: add spec file for `ExternalQueryRunFacet` [`#1262`](https://github.com/OpenLineage/OpenLineage/pull/1262) [@howardyoo](https://github.com/howardyoo) + *Adds a spec file to make this facet available for the Java client. Includes a README.* +* Docs: add a TSC doc [`#1303`](https://github.com/OpenLineage/OpenLineage/pull/1303) [@merobi-hub](https://github.com/merobi-hub) + *Adds a document listing the members of the Technical Steering Committee.* + +### Fixed +* Spark: improve Databricks integration to send better events [`#1330`](https://github.com/OpenLineage/OpenLineage/pull/1330) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Filters unwanted events and provides a meaningful job name.* +* Spark/BigQuery: fix a few of the common errors [`#1377`](https://github.com/OpenLineage/OpenLineage/pull/1377) [@mobuchowski](https://github.com/mobuchowski) + *Fixes a few of the common issues with the Spark/BigQuery integration, adds an integration test, and configures CI.* +* Python: validate `eventTime` field in Python client [`#1355`](https://github.com/OpenLineage/OpenLineage/pull/1355) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Validates the `eventTime` of a `RunEvent` within the client library.* +* Databricks: handle Databricks Runtime 11.3 changes to `DbFsUtils` constructor [`#1351`](https://github.com/OpenLineage/OpenLineage/pull/1351) [@wjohnson](https://github.com/wjohnson) + *Recaptures lost mount point information from the `DatabricksEnvironmentFacetBuilder` and environment-properties facet by looking at the number of parameters in the `DbFsUtils` constructor to determine the runtime version.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_19_2.md b/versioned_docs/version-1.26.0/releases/0_19_2.md new file mode 100644 index 0000000..d0258e0 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_19_2.md @@ -0,0 +1,36 @@ +--- +title: 0.19.2 +sidebar_position: 9971 +--- + +# 0.19.2 - 2023-01-04 + +### Added +* Airflow: add Trino extractor [`#1288`](https://github.com/OpenLineage/OpenLineage/pull/1288) [@sekikn](https://github.com/sekikn) + *Adds a Trino extractor to the Airflow integration.* +* Airflow: add `S3FileTransformOperator` extractor [`#1450`](https://github.com/OpenLineage/OpenLineage/pull/1450) [@sekikn](https://github.com/sekikn) + *Adds an `S3FileTransformOperator` extractor to the Airflow integration.* +* Airflow: add standardized run facet [`#1413`](https://github.com/OpenLineage/OpenLineage/pull/1413) [@JDarDagran](https://github.com/JDarDagran) + *Creates one 
standardized run facet for the Airflow integration.* +* Airflow: add `NominalTimeRunFacet` and `OwnershipJobFacet` [`#1410`](https://github.com/OpenLineage/OpenLineage/pull/1410) [@JDarDagran](https://github.com/JDarDagran) + *Adds `nominalEndTime` and `OwnershipJobFacet` fields to the Airflow integration.* +* dbt: add support for postgres datasources [`#1417`](https://github.com/OpenLineage/OpenLineage/pull/1417) [@julienledem](https://github.com/julienledem) + *Adds the previously unsupported postgres datasource type.* +* Proxy: add client-side proxy (skeletal version) [`#1439`](https://github.com/OpenLineage/OpenLineage/pull/1439) [`#1420`](https://github.com/OpenLineage/OpenLineage/pull/1420) [@fm100](https://github.com/fm100) + *Implements a skeletal version of a client-side proxy.* +* Proxy: add CI job to publish Docker image [`#1086`](https://github.com/OpenLineage/OpenLineage/pull/1086) [@wslulciuc](https://github.com/wslulciuc) + *Includes a script to build and tag the image plus jobs to verify the build on every CI run and publish to Docker Hub.* +* SQL: add `ExtractionErrorRunFacet` [`#1442`](https://github.com/OpenLineage/OpenLineage/pull/1442) [@mobuchowski](https://github.com/mobuchowski) + *Adds a facet to the spec to reflect internal processing errors, especially failed or incomplete parsing of SQL jobs.* +* SQL: add column-level lineage to SQL parser [`#1432`](https://github.com/OpenLineage/OpenLineage/pull/1432) [`#1461`](https://github.com/OpenLineage/OpenLineage/pull/1461) [@mobuchowski](https://github.com/mobuchowski) [@StarostaGit](https://github.com/StarostaGit) + *Adds support for extracting column-level lineage from SQL statements in the parser, including adjustments to Rust-Python and Rust-Java interfaces and the Airflow integration's SQL extractor to make use of the feature. Also includes more tests, removal of the old parser, and removal of the common-build cache in CI (which was breaking the parser).* +* Spark: pass config parameters to the OL client [`#1383`](https://github.com/OpenLineage/OpenLineage/pull/1383) [@tnazarew](https://github.com/tnazarew) + *Adds a mechanism for making new lineage consumers transparent to the integration, easing the process of setting up new types of consumers.* + +### Fixed +* Airflow: fix `collect_ignore`, add flags to Pytest for cleaner output [`#1437`](https://github.com/OpenLineage/OpenLineage/pull/1437) [@JDarDagran](https://github.com/JDarDagran) + *Removes the `extractors` directory from the ignored list, improving unit testing.* +* Spark & Java client: fix README typos [@versaurabh](https://github.com/versaurabh) + *Fixes typos in the SPDX license headers.* + + diff --git a/versioned_docs/version-1.26.0/releases/0_1_0.md b/versioned_docs/version-1.26.0/releases/0_1_0.md new file mode 100644 index 0000000..3af0f4a --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_1_0.md @@ -0,0 +1,15 @@ +--- +title: 0.1.0 +sidebar_position: 10000 +--- + +# 0.1.0 - 2021-08-13 + +OpenLineage is an _Open Standard_ for lineage metadata collection designed to record metadata for a job in execution. The initial public release includes: + +* **An initial specification.** The initial version [`1-0-0`](https://github.com/OpenLineage/OpenLineage/blob/0.1.0/spec/OpenLineage.md) of the OpenLineage specification defines the core model and facets. 
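+For illustration only, the core model described above can be sketched as a minimal `RunEvent` posted to an HTTP backend. The endpoint URL, job namespace and name, run UUID, and timestamp below are placeholder values, not part of this release:
+
+```sh
+# Hypothetical example: emit a START event for a run of "my-job" (all identifiers are placeholders).
+curl -X POST http://localhost:5000/api/v1/lineage \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "eventType": "START",
+    "eventTime": "2021-08-13T00:00:00.000Z",
+    "run":  { "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" },
+    "job":  { "namespace": "my-namespace", "name": "my-job" },
+    "inputs": [],
+    "outputs": [],
+    "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.1.0/client/python"
+  }'
+```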
+* **Integrations** that collect lineage metadata as OpenLineage events: + * [`Apache Airflow`](https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow) with support for BigQuery, Great Expectations, Postgres, Redshift, Snowflake + * [`Apache Spark`](https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark) + * [`dbt`](https://github.com/OpenLineage/OpenLineage/tree/main/integration/dbt) +* **Clients** that send OpenLineage events to an HTTP backend. Both [`java`](https://github.com/OpenLineage/OpenLineage/tree/main/client/java) and [`python`](https://github.com/OpenLineage/OpenLineage/tree/main/client/python) are initially supported. diff --git a/versioned_docs/version-1.26.0/releases/0_20_4.md b/versioned_docs/version-1.26.0/releases/0_20_4.md new file mode 100644 index 0000000..d2811f3 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_20_4.md @@ -0,0 +1,38 @@ +--- +title: 0.20.4 +sidebar_position: 9970 +--- + +# 0.20.4 - 2023-02-07 + +### Added +* Airflow: add new extractor for `GCSToGCSOperator` [`#1495`](https://github.com/OpenLineage/OpenLineage/pull/1495) [@sekikn](https://github.com/sekikn) + *Adds a new extractor for this operator.* +* Flink: resolve topic names from regex, support 1.16.0 [`#1522`](https://github.com/OpenLineage/OpenLineage/pull/1522) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds support for Flink 1.16.0 and makes the integration resolve topic names from Kafka topic patterns.* +* Proxy: implement lineage event validator for client proxy [`#1469`](https://github.com/OpenLineage/OpenLineage/pull/1469) [@fm100](https://github.com/fm100) + *Implements logic in the proxy (which is still in development) for validating and handling lineage events.* + +### Changed +* CI: use `ruff` instead of flake8, isort, etc., for linting and formatting [`#1526`](https://github.com/OpenLineage/OpenLineage/pull/1526) [@mobuchowski](https://github.com/mobuchowski) + *Adopts the `ruff` package, which combines several linters and formatters into one fast binary.* + +### Fixed +* Airflow: make the Trino catalog non-mandatory [`#1572`](https://github.com/OpenLineage/OpenLineage/pull/1572) [@JDarDagran](https://github.com/JDarDagran) + *Makes the Trino catalog optional in the Trino extractor.* +* Common: add explicit SQL dependency [`#1532`](https://github.com/OpenLineage/OpenLineage/pull/1532) [@mobuchowski](https://github.com/mobuchowski) + *Addresses a 0.19.2 breaking change to the GE integration by including the SQL dependency explicitly.* +* DBT: adjust `tqdm` logging in `dbt-ol` [`#1549`](https://github.com/OpenLineage/OpenLineage/pull/1549) [@JDarDagran](https://github.com/JDarDagran) + *Adjusts `tqdm` to show the correct number of iterations and adds START events for parent runs.* +* DBT: fix typo in log output [`#1493`](https://github.com/OpenLineage/OpenLineage/pull/1493) [@denimalpaca](https://github.com/denimalpaca) + *Fixes the 'emittled' typo in log output.* +* Great Expectations/Airflow: follow Snowflake dataset naming rules [`#1527`](https://github.com/OpenLineage/OpenLineage/pull/1527) [@mobuchowski](https://github.com/mobuchowski) + *Normalizes Snowflake dataset and datasource naming rules among DBT/Airflow/GE; canonicalizes old Snowflake account paths, making them fully qualified with account, region, and cloud names.* +* Java and Python Clients: Kafka does not initialize properties if they are empty; check and notify about Confluent-Kafka requirement [`#1556`](https://github.com/OpenLineage/OpenLineage/pull/1556) 
[@mobuchowski](https://github.com/mobuchowski) + *Fixes the failure to initialize `KafkaTransport` in the Java client and adds an exception if the required `confluent-kafka` module is missing from the Python client.* +* Spark: add square brackets for list-based Spark configs [`#1507`](https://github.com/OpenLineage/OpenLineage/pull/1507) [@Varunvaruns9](https://github.com/Varunvaruns9) + *Adds a condition to treat configs with `[]` as lists. Note: `[]` will be required for list-based configs starting with 0.21.0.* +* Spark: fix several Spark/BigQuery-related issues [`#1557`](https://github.com/OpenLineage/OpenLineage/pull/1557) [@mobuchowski](https://github.com/mobuchowski) + *Fixes the assumption that a version is always a number; adds support for `HadoopMapReduceWriteConfigUtil`; makes the integration access `BigQueryUtil` and `getTableId` using reflection, which supports all BigQuery versions; makes logs provide the full serialized LogicalPlan on `debug`.* +* SQL: only report partial failures [`#1479`](https://github.com/OpenLineage/OpenLineage/pull/1479) [@mobuchowski](https://github.com/mobuchowski) + *Changes the parser so it reports partial failures instead of failing the whole extraction.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_20_6.md b/versioned_docs/version-1.26.0/releases/0_20_6.md new file mode 100644 index 0000000..106404d --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_20_6.md @@ -0,0 +1,20 @@ +--- +title: 0.20.6 +sidebar_position: 9969 +--- + +# 0.20.6 - 2023-02-10 + +### Added +* Airflow: add new extractor for `FTPFileTransmitOperator` [`#1603`](https://github.com/OpenLineage/OpenLineage/pull/1603) [@sekikn](https://github.com/sekikn) + *Adds a new extractor for this Airflow operator serving legacy systems.* + +### Changed +* Airflow: make extractors for async operators work [`#1601`](https://github.com/OpenLineage/OpenLineage/pull/1601) [@JDarDagran](https://github.com/JDarDagran) + *Sends a deterministic Run UUID for Airflow runs.* + +### Fixed +* dbt: render actual profile only in profiles.yml [`#1599`](https://github.com/OpenLineage/OpenLineage/pull/1599) [@mobuchowski](https://github.com/mobuchowski) + *Adds an `include_section` argument for the Jinja render method to include only one profile if needed.* +* dbt: make `compiled_code` optional [`#1595`](https://github.com/OpenLineage/OpenLineage/pull/1595) [@JDarDagran](https://github.com/JDarDagran) + *Makes `compiled_code` optional for manifest > v7.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_21_1.md b/versioned_docs/version-1.26.0/releases/0_21_1.md new file mode 100644 index 0000000..2f1a128 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_21_1.md @@ -0,0 +1,44 @@ +--- +title: 0.21.1 +sidebar_position: 9968 +--- + +# 0.21.1 - 2023-03-02 + +### Added +* **Clients: add `DEBUG` logging of events to transports** [`#1633`](https://github.com/OpenLineage/OpenLineage/pull/1633) [@mobuchowski](https://github.com/mobuchowski) + *Ensures that the `DEBUG` loglevel on properly configured loggers will always log events, regardless of the chosen transport.* +* **Spark: add `CustomEnvironmentFacetBuilder` class** [`#1545`](https://github.com/OpenLineage/OpenLineage/pull/1545) ***New contributor*** [@Anirudh181001](https://github.com/Anirudh181001) + *Enables the capture of custom environment variables from Spark.* +* **Spark: introduce the new output visitors `AlterTableAddPartitionCommandVisitor` and 
`AlterTableSetLocationCommandVisitor`** [`#1629`](https://github.com/OpenLineage/OpenLineage/pull/1629) ***New contributor*** [@nataliezeller1](https://github.com/nataliezeller1) + *Adds visitors for extracting table names from the Spark commands `AlterTableAddPartitionCommand` and `AlterTableSetLocationCommand`. The intended use case is a custom transport for the OpenMetadata lineage API.* +* **Spark: add column lineage for JDBC relations** [`#1636`](https://github.com/OpenLineage/OpenLineage/pull/1636) [@tnazarew](https://github.com/tnazarew) + *Adds column lineage information to JDBC events with data extracted from query by the SQL parser.* +* **SQL: add linux-aarch64 native library to Java SQL parser** [`#1664`](https://github.com/OpenLineage/OpenLineage/pull/1664) [@mobuchowski](https://github.com/mobuchowski) + *Adds a Linux-ARM version of the native library. The Java SQL parser interface had only Linux-x64 and MacOS universal binary variants previously.* + +### Changed +* **Airflow: get table database in Athena extractor** [`#1631`](https://github.com/OpenLineage/OpenLineage/pull/1631) ***New contributor*** [@rinzool](https://github.com/rinzool) + *Changes the extractor to get a table's database from the `table.schema` field or the operator default if the field is `None`.* + +### Fixed +* **dbt: add dbt `seed` to the list of dbt-ol events** [`#1649`](https://github.com/OpenLineage/OpenLineage/pull/1649) ***New contributor*** [@pohek321](https://github.com/pohek321) + *Ensures that `dbt-ol test` no longer fails when run against an event seed.* +* **Spark: make column lineage extraction in Spark support caching** [`#1634`](https://github.com/OpenLineage/OpenLineage/pull/1634) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Collect column lineage from Spark logical plans that contain cached datasets.* +* **Spark: add support for a deprecated config** [`#1586`](https://github.com/OpenLineage/OpenLineage/pull/1586) [@tnazarew](https://github.com/tnazarew) + *Maps the deprecated `spark.openlineage.url` to `spark.openlineage.transport.url`.* +* **Spark: add error message in case of null in url** [`#1590`](https://github.com/OpenLineage/OpenLineage/pull/1590) [@tnazarew](https://github.com/tnazarew) + *Improves error logging in the case of undefined URLs.* +* **Spark: collect complete event for really quick Spark jobs** [`#1650`](https://github.com/OpenLineage/OpenLineage/pull/1650) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Improves the collecting of OpenLineage events on SQL complete in the case of quick operations.* +* **Spark: fix input/outputs for one node `LogicalRelation` plans** [`#1668`](https://github.com/OpenLineage/OpenLineage/pull/1668) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *For simple queries like `select col1, col2 from my_db.my_table` that do not write output, + the Spark plan contained just a single node, which was wrongly treated as both + an input and output dataset.* +* **SQL: fix file existence check in build script for openlineage-sql-java** [`#1613`](https://github.com/OpenLineage/OpenLineage/pull/1613) [@sekikn](https://github.com/sekikn) + *Ensures that the build script works if the library is compiled solely for Linux.* + +### Removed +* **Airflow: remove `JobIdMapping` and update macros to better support Airflow version 2+** [`#1645`](https://github.com/OpenLineage/OpenLineage/pull/1645) [@JDarDagran](https://github.com/JDarDagran) + *Updates macros to use `OpenLineageAdapter`'s method to generate 
deterministic run UUIDs because using the `JobIdMapping` utility is incompatible with Airflow 2+.* diff --git a/versioned_docs/version-1.26.0/releases/0_22_0.md b/versioned_docs/version-1.26.0/releases/0_22_0.md new file mode 100644 index 0000000..5282464 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_22_0.md @@ -0,0 +1,36 @@ +--- +title: 0.22.0 +sidebar_position: 9967 +--- + +# 0.22.0 - 2023-04-03 + +### Added +* **Spark: properties facet** [`#1717`](https://github.com/OpenLineage/OpenLineage/pull/1717) [@tnazarew](https://github.com/tnazarew) + *Adds a new facet to capture specified Spark properties.* +* **SQL: SQLParser supports `alter`, `truncate` and `drop` statements** [`#1695`](https://github.com/OpenLineage/OpenLineage/pull/1695) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds support for the statements to the parser.* +* **Common/SQL: provide public interface for openlineage_sql package** [`#1727`](https://github.com/OpenLineage/OpenLineage/pull/1727) [@JDarDagran](https://github.com/JDarDagran) + *Provides a `.pyi` public interface file for providing typing hints.* +* **Java client: add configurable headers to HTTP transport** [`#1718`](https://github.com/OpenLineage/OpenLineage/pull/1718) [@tnazarew](https://github.com/tnazarew) + *Adds custom header handling to `HttpTransport` and the Spark integration.* +* **Python client: create client from dictionary** [`#1745`](https://github.com/OpenLineage/OpenLineage/pull/1745) [@JDarDagran](https://github.com/JDarDagran) + *Adds a new `from_dict` method to the Python client to support creating it from a dictionary.* + +### Changed +* **Spark: remove URL parameters for JDBC namespaces** [`#1708`](https://github.com/OpenLineage/OpenLineage/pull/1708) [@tnazarew](https://github.com/tnazarew) + *Makes the namespace value from an event conform to the naming convention specified in* [Naming.md](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md). 
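+As a rough illustration of the JDBC namespace change above (hypothetical connection details, not taken from this release), the namespace emitted for a JDBC dataset simply loses its URL parameters:
+
+```sh
+# Hypothetical before/after of a JDBC dataset namespace once URL parameters are dropped:
+# before: postgres://db.example.com:5432?user=app&sslmode=require
+# after:  postgres://db.example.com:5432
+```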
+* **Make `OPENLINEAGE_DISABLED` case-insensitive** [`#1705`](https://github.com/OpenLineage/OpenLineage/pull/1705) [@jedcunningham](https://github.com/jedcunningham) + *Makes the environment variable for disabling OpenLineage in the Python client and Airflow integration case-insensitive.* + +### Fixed +* **Spark: fix missing BigQuery class in column lineage** [`#1698`](https://github.com/OpenLineage/OpenLineage/pull/1698) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *The Spark integration now checks if the BigQuery classes are available on the classpath before attempting to use them.* +* **DBT: throw `UnsupportedDbtCommand` when finding unsupported entry in `args.which`** [`#1724`](https://github.com/OpenLineage/OpenLineage/pull/1724) [@JDarDagran](https://github.com/JDarDagran) + *Adjusts the `dbt-ol` script to detect DBT commands in `run_results.json` only.* + +### Removed +* **Spark: remove unnecessary warnings for column lineage** [`#1700`](https://github.com/OpenLineage/OpenLineage/pull/1700) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Removes the warnings about `OneRowRelation` and `LocalRelation` nodes.* +* **Spark: remove deprecated configs** [`#1711`](https://github.com/OpenLineage/OpenLineage/pull/1711) [@tnazarew](https://github.com/tnazarew) + *Removes support for deprecated configs.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_23_0.md b/versioned_docs/version-1.26.0/releases/0_23_0.md new file mode 100644 index 0000000..b465319 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_23_0.md @@ -0,0 +1,28 @@ +--- +title: 0.23.0 +sidebar_position: 9966 +--- + +# 0.23.0 - 2023-04-20 + +### Added +* **SQL: parser improvements to support: `copy into`, `create stage`, `pivot`** [`#1742`](https://github.com/OpenLineage/OpenLineage/pull/1742) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds support for additional syntax available in sqlparser-rs.* +* **dbt: add support for snapshots** [`#1787`](https://github.com/OpenLineage/OpenLineage/pull/1787) [@JDarDagran](https://github.com/JDarDagran) + *Adds support for this special kind of table representing type-2 Slowly Changing Dimensions.* + +### Changed +* **Spark: change custom column lineage visitors** [`#1788`](https://github.com/OpenLineage/OpenLineage/pull/1788) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Makes the `CustomColumnLineageVisitor` interface public to support custom column lineage.* + +### Fixed
+* **Spark: fix null pointer in `JobMetricsHolder`** [`#1786`](https://github.com/OpenLineage/OpenLineage/pull/1786) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds a null check before running `put` to fix an NPE occurring in `JobMetricsHolder`.* +* **SQL: fix query with table generator** [`#1783`](https://github.com/OpenLineage/OpenLineage/pull/1783) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Allows `TableFactor::TableFunction` to support queries containing table functions.* +* **SQL: fix Rust code style bug** [`#1785`](https://github.com/OpenLineage/OpenLineage/pull/1785) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Fixes a minor style issue in `visitor.rs`.* + +### Removed +* **Airflow: remove explicit `pass` from several `extract_on_complete` methods** [`#1771`](https://github.com/OpenLineage/OpenLineage/pull/1771) [@JDarDagran](https://github.com/JDarDagran) + *Removes the code from three extractors.* \ No newline at end of file diff --git 
a/versioned_docs/version-1.26.0/releases/0_24_0.md b/versioned_docs/version-1.26.0/releases/0_24_0.md new file mode 100644 index 0000000..8540a65 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_24_0.md @@ -0,0 +1,22 @@ +--- +title: 0.24.0 +sidebar_position: 9965 +--- + +# 0.24.0 - 2023-05-03 + +### Added +* **Support custom transport types** [`#1795`](https://github.com/OpenLineage/OpenLineage/pull/1795) [@nataliezeller1](https://github.com/nataliezeller1) + *Adds a new interface, `TransportBuilder`, for creating custom transport types without having to modify core components of OpenLineage.* +* **Airflow: dbt Cloud integration** [`#1418`](https://github.com/OpenLineage/OpenLineage/pull/1418) [@howardyoo](https://github.com/howardyoo) + *Adds a new OpenLineage extractor for dbt Cloud that uses the dbt Cloud hook provided by Airflow to communicate with dbt Cloud via its API.* +* **Spark: support dataset name modification using regex** [`#1796`](https://github.com/OpenLineage/OpenLineage/pull/1796) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *It is a common scenario to write Spark output datasets with a location path ending with `/year=2023/month=04`. The Spark parameter `spark.openlineage.dataset.removePath.pattern` introduced here allows for removing certain elements from a path with a regex pattern.* + +### Fixed +* **Spark: catch exception when trying to obtain details of non-existing table.** [`#1798`](https://github.com/OpenLineage/OpenLineage/pull/1798) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *This mostly happens when getting table details on START event while the table is still not created.* +* **Spark: LogicalPlanSerializer** [`#1792`](https://github.com/OpenLineage/OpenLineage/pull/1792) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Changes `LogicalPlanSerializer` to make use of non-shaded Jackson classes in order to serialize `LogicalPlans`. 
Note: class names are no longer serialized.* +* **Flink: fix Flink CI** [`#1801`](https://github.com/OpenLineage/OpenLineage/pull/1801) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Specifies an older image version that succeeds on CI in order to fix the Flink integration.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_25_0.md b/versioned_docs/version-1.26.0/releases/0_25_0.md new file mode 100644 index 0000000..f4ce46a --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_25_0.md @@ -0,0 +1,20 @@ +--- +title: 0.25.0 +sidebar_position: 9964 +--- + +# 0.25.0 - 2023-05-15 + +### Added +* **Spark: add Spark/Delta `merge into` support** [`#1823`](https://github.com/OpenLineage/OpenLineage/pull/1823) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds support for `merge into` queries.* + +### Fixed +* **Spark: fix JDBC query handling** [`#1808`](https://github.com/OpenLineage/OpenLineage/pull/1808) [@nataliezeller1](https://github.com/nataliezeller1) + *Makes query handling more tolerant of variations in syntax and formatting.* +* **Spark: filter Delta adaptive plan events** [`#1830`](https://github.com/OpenLineage/OpenLineage/pull/1830) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Extends the `DeltaEventFilter` class to filter events in cases where rewritten queries in adaptive Spark plans generate extra events.* +* **Spark: fix Java class cast exception** [`#1844`](https://github.com/OpenLineage/OpenLineage/pull/1844) [@Anirudh181001](https://github.com/Anirudh181001) + *Fixes the error caused by the `OpenLineageRunEventBuilder` when casting the Spark scheduler's `ShuffleMapStage` to boolean.* +* **Flink: include missing fields of OpenLineage events** [`#1840`](https://github.com/OpenLineage/OpenLineage/pull/1840) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Enriches Flink events so that missing `eventTime`, `runId` and `job` elements no longer produce errors.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_26_0.md b/versioned_docs/version-1.26.0/releases/0_26_0.md new file mode 100644 index 0000000..9faa898 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_26_0.md @@ -0,0 +1,20 @@ +--- +title: 0.26.0 +sidebar_position: 9963 +--- + +# 0.26.0 - 2023-05-18 + +### Added +* **Proxy: Fluentd proxy support (experimental)** [`#1757`](https://github.com/OpenLineage/OpenLineage/pull/1757) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds a Fluentd data collector as a proxy to buffer OpenLineage events and send them to multiple backends (among many other purposes). Also implements a Fluentd OpenLineage parser to validate incoming HTTP events at the beginning of the pipeline. See the [readme](https://github.com/OpenLineage/OpenLineage/tree/main/proxy/fluentd) file for more details.* + +### Changed +* **Python client: use Hatchling over setuptools to orchestrate Python env setup** [`#1856`](https://github.com/OpenLineage/OpenLineage/pull/1856) [@gaborbernat](https://github.com/gaborbernat) + *Replaces setuptools with Hatchling for building the backend. 
Also includes a number of fixes, including fixes to type definitions in `transport` and elsewhere.* + +### Fixed +* **Spark: support single file datasets** [`#1855`](https://github.com/OpenLineage/OpenLineage/pull/1855) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Fixes the naming of single file datasets so they are no longer named using the parent directory's path: `spark.read.csv('file.csv')`.* +* **Spark: fix `logicalPlan` serialization issue on Databricks** [`#1858`](https://github.com/OpenLineage/OpenLineage/pull/1858) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Disables the `spark_unknown` facet by default to turn off serialization of `logicalPlan`.* diff --git a/versioned_docs/version-1.26.0/releases/0_27_1.md b/versioned_docs/version-1.26.0/releases/0_27_1.md new file mode 100644 index 0000000..5d1cb83 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_27_1.md @@ -0,0 +1,16 @@ +--- +title: 0.27.1 +sidebar_position: 9962 +--- + +# 0.27.1 - 2023-06-05 + +### Added +* **Python client: add emission filtering mechanism and exact, regex filters** [`#1878`](https://github.com/OpenLineage/OpenLineage/pull/1878) [@mobuchowski](https://github.com/mobuchowski) + *Adds configurable job-name filtering to the Python client. Filters can be exact-match- or regex-based. Events will not be sent in the case of matches.* + +### Fixed +* **Spark: fix column lineage for aggregate queries on Databricks** [`#1867`](https://github.com/OpenLineage/OpenLineage/pull/1867) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Aggregate queries on Databricks did not return column lineage.* +* **Airflow: fix unquoted `[` and `]` in Snowflake URIs** [`#1883`](https://github.com/OpenLineage/OpenLineage/pull/1883) [@JDarDagran](https://github.com/JDarDagran) + *Snowflake connections containing one of `[` or `]` were causing `urllib.parse.urlparse` to fail.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_27_2.md b/versioned_docs/version-1.26.0/releases/0_27_2.md new file mode 100644 index 0000000..1c8035c --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_27_2.md @@ -0,0 +1,10 @@ +--- +title: 0.27.2 +sidebar_position: 9961 +--- + +# 0.27.2 - 2023-06-06 + +### Fixed +* **Python client: deprecate `client.from_environment`, do not skip loading config** [`#1908`](https://github.com/OpenLineage/OpenLineage/pull/1908) [@mobuchowski](https://github.com/mobuchowski) + *Deprecates the `OpenLineage.from_environment` method and recommends using the constructor instead.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_28_0.md b/versioned_docs/version-1.26.0/releases/0_28_0.md new file mode 100644 index 0000000..ea7da0a --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_28_0.md @@ -0,0 +1,16 @@ +--- +title: 0.28.0 +sidebar_position: 9960 +--- + +# 0.28.0 - 2023-06-12 + +### Added +* **dbt: add Databricks compatibility** [`#1829`](https://github.com/OpenLineage/OpenLineage/pull/1829) [@Ines70](https://github.com/Ines70) + *Enables launching OpenLineage with a Databricks profile.* + +### Fixed +* **Fix type-checked marker and packaging** [`#1913`](https://github.com/OpenLineage/OpenLineage/pull/1913) [@gaborbernat](https://github.com/gaborbernat) + *The client was not marking itself as type-annotated.* +* **Python client: add `schemaURL` to run event** [`#1917`](https://github.com/OpenLineage/OpenLineage/pull/1917) [@gaborbernat](https://github.com/gaborbernat) + *Adds the missing 
`schemaURL` to the client's `RunState` class.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_29_2.md b/versioned_docs/version-1.26.0/releases/0_29_2.md new file mode 100644 index 0000000..26b63fa --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_29_2.md @@ -0,0 +1,23 @@ +--- +title: 0.29.2 +sidebar_position: 9959 +--- + +# 0.29.2 - 2023-06-30 + +### Added +* **Flink: support Flink version 1.17.1** [`#1947`](https://github.com/OpenLineage/OpenLineage/pull/1947) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Supports Flink versions 1.15.4, 1.16.2 and 1.17.1.* +* **Spark: support Spark 3.4** [`#1790`](https://github.com/OpenLineage/OpenLineage/pull/1790) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Introduces support for the latest Spark version, 3.4.0, along with 3.2.4 and 3.3.2.* +* **Spark: add Databricks platform integration test** [`#1928`](https://github.com/OpenLineage/OpenLineage/pull/1928) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds a Spark integration test that verifies behavior on the Databricks platform, to be run manually in CircleCI when needed.* +* **Spec: add static lineage event types** [`#1880`](https://github.com/OpenLineage/OpenLineage/pull/1880) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *As a first step in implementing static lineage, this adds new `DatasetEvent` and `JobEvent` types to the spec, along with support for the new types in the Python client.* + +### Removed +* **Proxy: remove unused Golang client approach** [`#1926`](https://github.com/OpenLineage/OpenLineage/pull/1926) [@mobuchowski](https://github.com/mobuchowski) + *Removes the unused Golang proxy, rendered redundant by the Fluentd proxy.* +* **Req: bump minimum supported Python version to 3.8** [`#1950`](https://github.com/OpenLineage/OpenLineage/pull/1950) [@mobuchowski](https://github.com/mobuchowski) + *Python 3.7 is at EOL. This bumps the minimum supported version to 3.8 to keep the project aligned with the Python EOL schedule.* + diff --git a/versioned_docs/version-1.26.0/releases/0_2_0.md b/versioned_docs/version-1.26.0/releases/0_2_0.md new file mode 100644 index 0000000..3f284fe --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_2_0.md @@ -0,0 +1,26 @@ +--- +title: 0.2.0 +sidebar_position: 9999 +--- + +# 0.2.0 - 2021-08-23 + +### Added + +* Parse dbt command line arguments when invoking `dbt-ol` [@mobuchowski](https://github.com/mobuchowski). 
For example: + + ``` + $ dbt-ol run --project-dir path/to/dir + ``` + +* Set `UnknownFacet` for Spark (captures metadata about unvisited nodes from the Spark plan that are not yet supported) [@OleksandrDvornik](https://github.com/OleksandrDvornik) + +### Changed + +* Remove `model` from dbt job name [@mobuchowski](https://github.com/mobuchowski) +* Default dbt job namespace to output dataset namespace [@mobuchowski](https://github.com/mobuchowski) +* Rename `openlineage.spark.*` to `io.openlineage.spark.*` [@OleksandrDvornik](https://github.com/OleksandrDvornik) + +### Fixed + +* Remove instance references to extractors from DAG and avoid copying log property for serializability [@collado-mike](https://github.com/collado-mike) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_2_1.md b/versioned_docs/version-1.26.0/releases/0_2_1.md new file mode 100644 index 0000000..c82039b --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_2_1.md @@ -0,0 +1,10 @@ +--- +title: 0.2.1 +sidebar_position: 9998 +--- + +# 0.2.1 - 2021-08-27 + +### Fixed + +* dbt: default `--project-dir` argument to current directory in `dbt-ol` script [@mobuchowski](https://github.com/mobuchowski) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_2_2.md b/versioned_docs/version-1.26.0/releases/0_2_2.md new file mode 100644 index 0000000..9d19514 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_2_2.md @@ -0,0 +1,14 @@ +--- +title: 0.2.2 +sidebar_position: 9997 +--- + +# 0.2.2 - 2021-09-08 + +### Added +* Implement OpenLineageValidationAction for Great Expectations [@collado-mike](https://github.com/collado-mike) +* facet: add expectations assertions facet [@mobuchowski](https://github.com/mobuchowski) + +### Fixed +* airflow: pendulum formatting fix, add tests [@mobuchowski](https://github.com/mobuchowski) +* dbt: do not emit events if run_result file was not updated [@mobuchowski](https://github.com/mobuchowski) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_2_3.md b/versioned_docs/version-1.26.0/releases/0_2_3.md new file mode 100644 index 0000000..d108a62 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_2_3.md @@ -0,0 +1,10 @@ +--- +title: 0.2.3 +sidebar_position: 9996 +--- + +# 0.2.3 - 2021-10-07 + +### Fixed + +* dbt: add dbt `v3` manifest support [@mobuchowski](https://github.com/mobuchowski) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_30_1.md b/versioned_docs/version-1.26.0/releases/0_30_1.md new file mode 100644 index 0000000..fdbc491 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_30_1.md @@ -0,0 +1,40 @@ +--- +title: 0.30.1 +sidebar_position: 9958 +--- + +# 0.30.1 - 2023-07-25 + +### Added +* **Flink: support Iceberg sinks** [`#1960`](https://github.com/OpenLineage/OpenLineage/pull/1960) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Detects output datasets when using an Iceberg table as a sink.* +* **Spark: column-level lineage for `merge into` on Delta tables** [`#1958`](https://github.com/OpenLineage/OpenLineage/pull/1958) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Makes column-level lineage support `merge into` on Delta tables. 
Also refactors column-level lineage to deal with multiple Spark versions.* +* **Spark: column-level lineage for `merge into` on Iceberg tables** [`#1971`](https://github.com/OpenLineage/OpenLineage/pull/1971) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Makes column-level lineage support `merge into` on Iceberg tables.* +* **Spark: add support for Iceberg REST catalog** [`#1963`](https://github.com/OpenLineage/OpenLineage/pull/1963) [@juancappi](https://github.com/juancappi) + *Adds `rest` to the existing options of `hive` and `hadoop` in `IcebergHandler.getDatasetIdentifier()` to add support for Iceberg's `RestCatalog`.* +* **Airflow: add possibility to force direct-execution based on environment variable** [`#1934`](https://github.com/OpenLineage/OpenLineage/pull/1934) [@mobuchowski](https://github.com/mobuchowski) + *Adds the option to use the direct-execution method on the Airflow listener when the existence of a non-SQLAlchemy-based Airflow event mechanism is confirmed. This happens when using Airflow 2.6 or when the `OPENLINEAGE_AIRFLOW_ENABLE_DIRECT_EXECUTION` environment variable exists.* +* **SQL: add support for Apple Silicon to `openlineage-sql-java`** [`#1981`](https://github.com/OpenLineage/OpenLineage/pull/1981) [@davidjgoss](https://github.com/davidjgoss) + *Expands the OS/architecture checks when compiling to produce a specific file for Apple Silicon. Also expands the corresponding OS/architecture checks when loading the binary at runtime from Java code.* +* **Spec: add facet deletion** [`#1975`](https://github.com/OpenLineage/OpenLineage/pull/1975) [@julienledem](https://github.com/julienledem) + *In order to add a mechanism for deleting job and dataset facets, adds a `{ _deleted: true }` object that can take the place of any job or dataset facet (but not run or input/output facets, which are valid only for a specific run).* +* **Client: add a file transport** [`#1891`](https://github.com/OpenLineage/OpenLineage/pull/1891) [@Alexkuva](https://github.com/Alexkuva) + *Creates a `FileTransport` and its configuration classes supporting append mode or write-new-file mode, which is especially useful when an object store does not support append mode, e.g. in the case of Databricks DBFS FUSE.* + +### Changed +* **Airflow: do not run plugin if OpenLineage provider is installed** [`#1999`](https://github.com/OpenLineage/OpenLineage/pull/1999) [@JDarDagran](https://github.com/JDarDagran) + *Sets `OPENLINEAGE_DISABLED` to `true` if the provider is installed.* +* **Python: rename `config` to `config_class`** [`#1998`](https://github.com/OpenLineage/OpenLineage/pull/1998) [@mobuchowski](https://github.com/mobuchowski) + *Renames the `config` class variable to `config_class` to avoid potential conflict with the config instance.* + +### Fixed +* **Airflow: add workaround for airflow-sqlalchemy event mechanism bug** [`#1959`](https://github.com/OpenLineage/OpenLineage/pull/1959) [@mobuchowski](https://github.com/mobuchowski) + *Due to known issues with the fork and thread model in the Airflow-SQLAlchemy-based event-delivery mechanism, a Kafka producer left alone does not emit a `COMPLETE` event. 
This creates a producer for each event when we detect that we're under Airflow 2.3 - 2.5.* +* **Spark: fix custom environment variables facet** [`#1973`](https://github.com/OpenLineage/OpenLineage/pull/1973) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Enables sending the Spark environment variables facet in a non-deterministic way.* +* **Spark: filter unwanted Delta events** [`#1968`](https://github.com/OpenLineage/OpenLineage/pull/1968) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Clears events generated by logical plans having `Project` node as root.* +* **Python: allow modification of `openlineage.*` logging levels via environment variables** [`#1974`](https://github.com/OpenLineage/OpenLineage/pull/1974) [@JDarDagran](https://github.com/JDarDagran) + *Adds `OPENLINEAGE_{CLIENT/AIRFLOW/DBT}_LOGGING` environment variables that can be set according to module logging levels and cleans up some logging calls in `openlineage-airflow`.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_3_0.md b/versioned_docs/version-1.26.0/releases/0_3_0.md new file mode 100644 index 0000000..15b2c97 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_3_0.md @@ -0,0 +1,19 @@ +--- +title: 0.3.0 +sidebar_position: 9995 +--- + +# 0.3.0 - 2021-12-03 + +### Added +* Spark3 support [@OleksandrDvornik](https://github.com/OleksandrDvornik) / [@collado-mike](https://github.com/collado-mike) +* LineageBackend for Airflow 2 [@mobuchowski](https://github.com/mobuchowski) +* Adding custom spark version facet to spark integration [@OleksandrDvornik](https://github.com/OleksandrDvornik) +* Adding dbt version facet [@mobuchowski](https://github.com/mobuchowski) +* Added support for Redshift profile [@AlessandroLollo](https://github.com/AlessandroLollo) + +### Fixed + +* Sanitize JDBC URLs [@OleksandrDvornik](https://github.com/OleksandrDvornik) +* strip openlineage url in python client [@OleksandrDvornik](https://github.com/OleksandrDvornik) +* deploy spec if spec file changes [@mobuchowski](https://github.com/mobuchowski) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_3_1.md b/versioned_docs/version-1.26.0/releases/0_3_1.md new file mode 100644 index 0000000..b5c698d --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_3_1.md @@ -0,0 +1,9 @@ +--- +title: 0.3.1 +sidebar_position: 9994 +--- + +# 0.3.1 - 2021-12-03 + +### Fixed +* fix import in spark3 visitor [@mobuchowski](https://github.com/mobuchowski) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_4_0.md b/versioned_docs/version-1.26.0/releases/0_4_0.md new file mode 100644 index 0000000..25d2f99 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_4_0.md @@ -0,0 +1,24 @@ +--- +title: 0.4.0 +sidebar_position: 9993 +--- + +# 0.4.0 - 2021-12-13 + +### Added +* Spark output metrics [@OleksandrDvornik](https://github.com/OleksandrDvornik) +* Separated tests between Spark 2 & 3 [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Databricks install README and init scripts [@wjohnson](https://github.com/wjohnson) +* Iceberg integration with unit tests [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Kafka read and write support [@OleksandrDvornik](https://github.com/OleksandrDvornik) / [@collado-mike](https://github.com/collado-mike) +* Arbitrary parameters supported in HTTP URL construction [@wjohnson](https://github.com/wjohnson) +* Increased visitor coverage for Spark commands 
[@mobuchowski](https://github.com/mobuchowski) / [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + +### Fixed +* dbt: column descriptions are properly filled from metadata.json [@mobuchowski](https://github.com/mobuchowski) +* dbt: allow parsing artifacts with version higher than officially supported [@mobuchowski](https://github.com/mobuchowski) +* dbt: dbt build command is supported [@mobuchowski](https://github.com/mobuchowski) +* dbt: fix crash when build command is used with seeds in dbt 1.0.0rc3 [@mobuchowski](https://github.com/mobuchowski) +* spark: increase logical plan visitor coverage [@mobuchowski](https://github.com/mobuchowski) +* spark: fix logical serialization recursion issue [@OleksandrDvornik](https://github.com/OleksandrDvornik) +* Use URL#getFile to fix build on Windows [@mobuchowski](https://github.com/mobuchowski) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_5_1.md b/versioned_docs/version-1.26.0/releases/0_5_1.md new file mode 100644 index 0000000..5f981da --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_5_1.md @@ -0,0 +1,15 @@ +--- +title: 0.5.1 +sidebar_position: 9992 +--- + +# 0.5.1 - 2022-01-18 + +### Added +* Support for dbt-spark adapter [@mobuchowski](https://github.com/mobuchowski) +* **New** `backend` to proxy OpenLineage events to one or more event streams 🎉 [@mandy-chessell](https://github.com/mandy-chessell) [@wslulciuc](https://github.com/wslulciuc) +* Add Spark extensibility API with support for custom Dataset and custom facet builders [@collado-mike](https://github.com/collado-mike) + +### Fixed +* airflow: fix import failures when dependencies for bigquery, dbt, great_expectations extractors are missing [@lukaszlaszko](https://github.com/lukaszlaszko) +* Fixed openlineage-spark jar to correctly rename bundled dependencies [@collado-mike](https://github.com/collado-mike) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_5_2.md b/versioned_docs/version-1.26.0/releases/0_5_2.md new file mode 100644 index 0000000..17d785e --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_5_2.md @@ -0,0 +1,22 @@ +--- +title: 0.5.2 +sidebar_position: 9991 +--- + +# 0.5.2 - 2022-02-10 + +### Added + +* Proxy backend example using `Kafka` [@wslulciuc](https://github.com/wslulciuc) +* Support Databricks Delta Catalog naming convention with DatabricksDeltaHandler [@wjohnson](https://github.com/wjohnson) +* Add javadoc as part of build task [@mobuchowski](https://github.com/mobuchowski) +* Include TableStateChangeFacet in non V2 commands for Spark [@mr-yusupov](https://github.com/mr-yusupov) +* Support for SqlDWRelation on Databricks' Azure Synapse/SQL DW Connector [@wjohnson](https://github.com/wjohnson) +* Implement input visitors for v2 commands [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Enabled SparkListenerJobStart events to trigger OpenLineage events [@collado-mike](https://github.com/collado-mike) + +### Fixed + +* dbt: job namespaces for a given dbt run match each other [@mobuchowski](https://github.com/mobuchowski) +* Fix Breaking SnowflakeOperator Changes from OSS Airflow [@denimalpaca](https://github.com/denimalpaca) +* Made corrections to account for DeltaDataSource handling [@collado-mike](https://github.com/collado-mike) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_6_0.md b/versioned_docs/version-1.26.0/releases/0_6_0.md new file mode 100644 index 0000000..d100232 --- /dev/null +++ 
b/versioned_docs/version-1.26.0/releases/0_6_0.md @@ -0,0 +1,21 @@ +--- +title: 0.6.0 +sidebar_position: 9990 +--- + +# 0.6.0 - 2022-03-04 + +### Added +* Extract source code of PythonOperator code similar to SQL facet [@mobuchowski](https://github.com/mobuchowski) +* Add DatasetLifecycleStateDatasetFacet to spec [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Airflow: extract source code from BashOperator [@mobuchowski](https://github.com/mobuchowski) +* Add generic facet to collect environmental properties (EnvironmentFacet) [@harishsune](https://github.com/harishsune) +* OpenLineage sensor for OpenLineage-Dagster integration [@dalinkim](https://github.com/dalinkim) +* Java-client: make generator generate enums as well [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Added `UnknownOperatorAttributeRunFacet` to Airflow integration to record operators that don't produce lineage [@collado-mike](https://github.com/collado-mike) + +### Fixed +* Airflow: increase import timeout in tests, fix exit from integration [@mobuchowski](https://github.com/mobuchowski) +* Reduce logging level for import errors to info [@rossturk](https://github.com/rossturk) +* Remove AWS secret keys and extraneous Snowflake parameters from connection uri [@collado-mike](https://github.com/collado-mike) +* Convert to LifecycleStateChangeDatasetFacet [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) diff --git a/versioned_docs/version-1.26.0/releases/0_6_1.md b/versioned_docs/version-1.26.0/releases/0_6_1.md new file mode 100644 index 0000000..1f8ecda --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_6_1.md @@ -0,0 +1,10 @@ +--- +title: 0.6.1 +sidebar_position: 9989 +--- + +# 0.6.1 - 2022-03-07 + +### Fixed +* Catch possible failures when emitting events and log them [@mobuchowski](https://github.com/mobuchowski) +* dbt: jinja2 code using do extensions does not crash [@mobuchowski](https://github.com/mobuchowski) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_6_2.md b/versioned_docs/version-1.26.0/releases/0_6_2.md new file mode 100644 index 0000000..0ad1460 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_6_2.md @@ -0,0 +1,16 @@ +--- +title: 0.6.2 +sidebar_position: 9988 +--- + +# 0.6.2 - 2022-03-16 + +### Added +* CI: add integration tests for Airflow's SnowflakeOperator and dbt-snowflake [@mobuchowski](https://github.com/mobuchowski) +* Introduce DatasetVersion facet in spec [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Airflow: add external query id facet [@mobuchowski](https://github.com/mobuchowski) + +### Fixed +* Complete Fix of Snowflake Extractor get_hook() Bug [@denimalpaca](https://github.com/denimalpaca) +* Update artwork [@rossturk](https://github.com/rossturk) +* Airflow tasks in a DAG now report a common ParentRunFacet [@collado-mike](https://github.com/collado-mike) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_7_1.md b/versioned_docs/version-1.26.0/releases/0_7_1.md new file mode 100644 index 0000000..a7335c6 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_7_1.md @@ -0,0 +1,23 @@ +--- +title: 0.7.1 +sidebar_position: 9987 +--- + +# 0.7.1 - 2022-04-19 + +### Added +* Python implements Transport interface - HTTP and Kafka transports are available ([#530](https://github.com/OpenLineage/OpenLineage/pull/530)) [@mobuchowski](https://github.com/mobuchowski) +* Add UnknownOperatorAttributeRunFacet and support in lineage backend 
([#547](https://github.com/OpenLineage/OpenLineage/pull/547)) [@collado-mike](https://github.com/collado-mike) +* Support Spark 3.2.1 ([#607](https://github.com/OpenLineage/OpenLineage/pull/607)) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Add StorageDatasetFacet to spec ([#620](https://github.com/OpenLineage/OpenLineage/pull/620)) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Airflow: custom extractors lookup uses only get_operator_classnames method ([#656](https://github.com/OpenLineage/OpenLineage/pull/656)) [@mobuchowski](https://github.com/mobuchowski) +* README.md created at OpenLineage/integrations for compatibility matrix ([#663](https://github.com/OpenLineage/OpenLineage/pull/663)) [@howardyoo](https://github.com/howardyoo) + +### Fixed +* Dagster: handle updated PipelineRun in OpenLineage sensor unit test ([#624](https://github.com/OpenLineage/OpenLineage/pull/624)) [@dominiquetipton](https://github.com/dominiquetipton) +* Delta improvements ([#626](https://github.com/OpenLineage/OpenLineage/pull/626)) [@collado-mike](https://github.com/collado-mike) +* Fix SqlDwDatabricksVisitor for Spark2 ([#630](https://github.com/OpenLineage/OpenLineage/pull/630)) [@wjohnson](https://github.com/wjohnson) +* Airflow: remove redundant logging from GE import ([#657](https://github.com/OpenLineage/OpenLineage/pull/657)) [@mobuchowski](https://github.com/mobuchowski) +* Fix Shebang issue in Spark's wait-for-it.sh ([#658](https://github.com/OpenLineage/OpenLineage/pull/658)) [@mobuchowski](https://github.com/mobuchowski) +* Update parent_run_id to be a uuid from the dag name and run_id ([#664](https://github.com/OpenLineage/OpenLineage/pull/664)) [@collado-mike](https://github.com/collado-mike) +* Spark: fix time zone inconsistency in testSerializeRunEvent ([#681](https://github.com/OpenLineage/OpenLineage/pull/681)) [@sekikn](https://github.com/sekikn) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_8_1.md b/versioned_docs/version-1.26.0/releases/0_8_1.md new file mode 100644 index 0000000..a96db47 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_8_1.md @@ -0,0 +1,15 @@ +--- +title: 0.8.1 +sidebar_position: 9986 +--- + +# 0.8.1 - 2022-04-29 + +### Added +* Airflow integration uses [new TaskInstance listener API](https://github.com/apache/airflow/blob/main/docs/apache-airflow/listeners.rst) for Airflow 2.3+ ([#508](https://github.com/OpenLineage/OpenLineage/pull/508)) [@mobuchowski](https://github.com/mobuchowski) +* Support for HiveTableRelation as input source in Spark integration ([#683](https://github.com/OpenLineage/OpenLineage/pull/683)) [@collado-mike](https://github.com/collado-mike) +* Add HTTP and Kafka Client to `openlineage-java` lib ([#480](https://github.com/OpenLineage/OpenLineage/pull/480)) [@wslulciuc](https://github.com/wslulciuc), [@mobuchowski](https://github.com/mobuchowski) +* New SQL parser, used by Postgres, Snowflake, Great Expectations integrations ([#644](https://github.com/OpenLineage/OpenLineage/pull/644)) [@mobuchowski](https://github.com/mobuchowski) + +### Fixed +* GreatExpectations: Fixed bug when invoking GreatExpectations using v3 API ([#683](https://github.com/OpenLineage/OpenLineage/pull/689)) [@collado-mike](https://github.com/collado-mike) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/0_8_2.md b/versioned_docs/version-1.26.0/releases/0_8_2.md new file mode 100644 index 0000000..f981473 --- /dev/null +++ 
b/versioned_docs/version-1.26.0/releases/0_8_2.md @@ -0,0 +1,16 @@ +--- +title: 0.8.2 +sidebar_position: 9985 +--- + +# 0.8.2 - 2022-05-19 + +### Added +* `openlineage-airflow` now supports getting credentials from [Airflow's secrets backend](https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/secrets-backend/index.html) ([#723](https://github.com/OpenLineage/OpenLineage/pull/723)) [@mobuchowski](https://github.com/mobuchowski) +* `openlineage-spark` now supports [Azure Databricks Credential Passthrough](https://docs.microsoft.com/en-us/azure/databricks/security/credential-passthrough) ([#595](https://github.com/OpenLineage/OpenLineage/pull/595)) [@wjohnson](https://github.com/wjohnson) +* `openlineage-spark` detects datasets wrapped by `ExternalRDD`s ([#746](https://github.com/OpenLineage/OpenLineage/pull/746)) [@collado-mike](https://github.com/collado-mike) + +### Fixed +* `PostgresOperator` fails to retrieve host and conn during extraction ([#705](https://github.com/OpenLineage/OpenLineage/pull/705)) [@sekikn](https://github.com/sekikn) +* SQL parser accepts lists of SQL statements ([#734](https://github.com/OpenLineage/OpenLineage/issues/734)) [@mobuchowski](https://github.com/mobuchowski) +* Missing schema when writing to Delta tables in Databricks ([#748](https://github.com/OpenLineage/OpenLineage/pull/748)) [@collado-mike](https://github.com/collado-mike) diff --git a/versioned_docs/version-1.26.0/releases/0_9_0.md new file mode 100644 index 0000000..89d887d --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/0_9_0.md @@ -0,0 +1,29 @@ +--- +title: 0.9.0 +sidebar_position: 9984 +--- + +# 0.9.0 - 2022-06-03 + +### Added + +* Add static code analysis tool [mypy](http://mypy-lang.org) to run in CI against all Python modules ([`#802`](https://github.com/openlineage/openlineage/issues/802)) [@howardyoo](https://github.com/howardyoo) +* Extend `SaveIntoDataSourceCommandVisitor` to extract schema from `LocalRelation` and `LogicalRdd` in Spark integration ([`#794`](https://github.com/OpenLineage/OpenLineage/pull/794)) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Add `InMemoryRelationInputDatasetBuilder` for `InMemory` datasets to Spark integration ([`#818`](https://github.com/OpenLineage/OpenLineage/pull/818)) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Add copyright to source files [`#755`](https://github.com/OpenLineage/OpenLineage/pull/755) [@merobi-hub](https://github.com/merobi-hub) +* Add `SnowflakeOperatorAsync` extractor support to Airflow integration [`#869`](https://github.com/OpenLineage/OpenLineage/pull/869) [@merobi-hub](https://github.com/merobi-hub) +* Add PMD analysis to proxy project ([`#889`](https://github.com/OpenLineage/OpenLineage/pull/889)) [@howardyoo](https://github.com/howardyoo) + +### Changed + +* Skip `FunctionRegistry.class` serialization in Spark integration ([`#828`](https://github.com/OpenLineage/OpenLineage/pull/828)) [@mobuchowski](https://github.com/mobuchowski) +* Install new `rust`-based SQL parser by default in Airflow integration ([`#835`](https://github.com/OpenLineage/OpenLineage/pull/835)) [@mobuchowski](https://github.com/mobuchowski) +* Improve overall `pytest` and integration tests for Airflow integration ([`#851`](https://github.com/OpenLineage/OpenLineage/pull/851),[`#858`](https://github.com/OpenLineage/OpenLineage/pull/858)) [@denimalpaca](https://github.com/denimalpaca) +* Reduce OL event payload size by excluding local 
data and including output node in start events ([`#881`](https://github.com/OpenLineage/OpenLineage/pull/881)) [@collado-mike](https://github.com/collado-mike) +* Split spark integration into submodules ([`#834`](https://github.com/OpenLineage/OpenLineage/pull/834), [`#890`](https://github.com/OpenLineage/OpenLineage/pull/890)) [@tnazarew](https://github.com/tnazarew) [@mobuchowski](https://github.com/mobuchowski) + +### Fixed + +* Conditionally import `sqlalchemy` lib for Great Expectations integration ([`#826`](https://github.com/OpenLineage/OpenLineage/pull/826)) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Add check for missing **class** `org.apache.spark.sql.catalyst.plans.logical.CreateV2Table` in Spark integration ([`#866`](https://github.com/OpenLineage/OpenLineage/pull/866)) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) +* Fix static code analysis issues ([`#867`](https://github.com/OpenLineage/OpenLineage/pull/867),[`#874`](https://github.com/OpenLineage/OpenLineage/pull/874)) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/1_0_0.md b/versioned_docs/version-1.26.0/releases/1_0_0.md new file mode 100644 index 0000000..120f4ee --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_0_0.md @@ -0,0 +1,26 @@ +--- +title: 1.0.0 +sidebar_position: 9957 +--- + +# 1.0.0 - 2023-08-01 + +### Added +* **Airflow: convert lineage from legacy `File` definition** [`#2006`](https://github.com/OpenLineage/OpenLineage/pull/2006) [@mobuchowski](https://github.com/mobuchowski) + *Adds coverage for `File` entity definition to enhance backwards compatibility.* + +### Removed +* **Spec: remove facet ref from core** [`#1997`](https://github.com/OpenLineage/OpenLineage/pull/1997) [@JDarDagran](https://github.com/JDarDagran) + *Removes references to facets from the core spec that broke compatibility with JSON schema specification.* + +### Changed +* **Airflow: change log level to `DEBUG` when extractor isn't found** [`#2012`](https://github.com/OpenLineage/OpenLineage/pull/2012) [@kaxil](https://github.com/kaxil) + *Changes log level from `WARNING` to `DEBUG` when an extractor is not available.* +* **Airflow: make sure we cannot fail in thread despite direct execution** [`#2010`](https://github.com/OpenLineage/OpenLineage/pull/2010) [@mobuchowski](https://github.com/mobuchowski) + *Ensures the listener is not failing tasks, even in unlikely scenarios.* + +### Fixed +* **Airflow: stop using reusable session by default, do not send full event on Snowflake complete** [`#2025`](https://github.com/OpenLineage/OpenLineage/pull/2025) [@mobuchowski](https://github.com/mobuchowski) + *Fixes the issue of the Snowflake connector clashing with `HttpTransport` by disabling automatic `requests` session reuse and not running `SnowflakeExtractor` again on job completion.* +* **Client: fix error message to avoid confusion** [`#2001`](https://github.com/OpenLineage/OpenLineage/pull/2001) [@mars-lan](https://github.com/mars-lan) + *Fixes the error message in `HttpTransport` in the case of a null URL.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/1_10_2.md b/versioned_docs/version-1.26.0/releases/1_10_2.md new file mode 100644 index 0000000..93ca821 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_10_2.md @@ -0,0 +1,46 @@ +--- +title: 1.10.2 +sidebar_position: 9947 +--- + +# 1.10.2 - 2024-03-15 + +### Added +* **Dagster: add new provider for version 
1.6.10** [`#2518`](https://github.com/OpenLineage/OpenLineage/pull/2518) [@JDarDagran](https://github.com/JDarDagran) + *Adds the new provider required by the latest version of Dagster.* +* **Flink: support lineage for a hybrid source** [`#2491`](https://github.com/OpenLineage/OpenLineage/pull/2491) [@HuangZhenQiu](https://github.com/HuangZhenQiu) + *Adds support for hybrid source lineage for users of Kafka and Iceberg sources in backfill usecases.* +* **Flink: improve Cassandra lineage metadata** [`#2479`](https://github.com/OpenLineage/OpenLineage/pull/2479) [@HuangZhenQiu](https://github.com/HuangZhenQiu) + *Cassandra cluster info to be used as the dataset namespace, and the keyspace to be combined with the table name as the dataset name.* +* **Flink: bump Flink JDBC connector version** [`#2472`](https://github.com/OpenLineage/OpenLineage/pull/2472) [@HuangZhenQiu](https://github.com/HuangZhenQiu) + *Bumps the Flink JDBC connector version to 3.1.2-1.18 for Flink 1.18.* +* **Java: add a `OpenLineageClientUtils#loadOpenLineageJson(InputStream)` and change `OpenLineageClientUtils#loadOpenLineageYaml(InputStream)` methods** [`#2490`](https://github.com/OpenLineage/OpenLineage/pull/2490) [@d-m-h](https://github.com/d-m-h) + *This improves the explicitness of the methods. Previously, `loadOpenLineageYaml(InputStream)` wanted the `InputStream` to contain bytes that represented JSON.* +* **Java: add info from the HTTP response to the client exception** [`#2486`](https://github.com/OpenLineage/OpenLineage/pull/2486) [@davidjgoss](https://github.com/davidjgoss) + *Adds the status code and body as properties on the thrown exception when a non-success response is encountered in the HTTP transport.* +* **Python: add support for MSK IAM authentication with a new transport** [`#2478`](https://github.com/OpenLineage/OpenLineage/pull/2478) [@mattiabertorello](https://github.com/mattiabertorello) + *Eases publication of events to MSK with IAM authentication.* + +### Removed +* **Airflow: remove redundant information from facets** [`#2524`](https://github.com/OpenLineage/OpenLineage/pull/2524) [@kacpermuda](https://github.com/kacpermuda) + *Refines the operator's attribute inclusion logic in facets to include only those known to be important or compact, ensuring that custom operator attributes with substantial data do not inflate the event size.* + +### Fixed +* **Airflow: proceed without rendering templates if `task_instance` copy fails** [`#2492`](https://github.com/OpenLineage/OpenLineage/pull/2492) [@kacpermuda](https://github.com/kacpermuda) + *Airflow will now proceed without rendering templates if `task_instance` copy fails in `listener.on_task_instance_running`.* +* **Spark: fix the `HttpTransport` timeout** [`#2475`](https://github.com/OpenLineage/OpenLineage/pull/2475) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *The existing `timeout` config parameter is ambiguous: implementation treats the value as double in seconds, although the documentation claims it's milliseconds. A new config param `timeoutInMillis` has been added. 
The existing `timeout` has been removed from docs and will be deprecated in 1.13.* +* **Spark: prevent NPE if the context is null** [`#2515`](https://github.com/OpenLineage/OpenLineage/pull/2515) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds a check for a null context before executing `end(jobEnd)`.* +* **Flink: fix class not found issue for Cassandra** [`#2507`](https://github.com/OpenLineage/OpenLineage/pull/2507) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Fixes the class not found issue when checking for Cassandra classes. Also fixes the Maven POM dependency on subprojects.* +* **Flink: refine the JDBC table name** [`#2512`](https://github.com/OpenLineage/OpenLineage/pull/2512) [@HuangZhenQiu](https://github.com/HuangZhenQiu) + *Enables the JDBC table name with a schema prefix.* +* **Flink: fix JDBC dataset naming** [`#2508`](https://github.com/OpenLineage/OpenLineage/pull/2508) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *For JDBC, the Flink integration is not adjusted to the OpenLineage naming convention. There is code that extracts the dataset namespace/name from the JDBC connection URL, but it's in the Spark integration. As a solution, this code has to be extracted into the Java client and reused by the Spark and Flink integrations.* +* **Flink: fix failure due to missing Cassandra classes** [`#2507`](https://github.com/OpenLineage/OpenLineage/pull/2507) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Flink is failing when no Cassandra classes are present on the class path. This is happening because of `CassandraUtils` class which has a static `hasClasses` method, but it imports Cassandra-related classes in the header. Also, the Flink subproject contains an unnecessary `maven-publish` plugin.* +* **Flink: fix release runtime dependencies** [`#2504`](https://github.com/OpenLineage/OpenLineage/pull/2504) [@HuangZhenQiu](https://github.com/HuangZhenQiu) + *The shadow jar of Flink is not minimized, so some internal jars are listed as runtime dependencies. This removes them from the final pom.xml file in the Flink module.* +* **Spec: improve Cassandra lineage metadata** [`#2479`](https://github.com/OpenLineage/OpenLineage/pull/2479) [@HuangZhenQiu](https://github.com/HuangZhenQiu) + *Following the namespace definition, we should use `cassandra://host:port`.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/1_11_3.md b/versioned_docs/version-1.26.0/releases/1_11_3.md new file mode 100644 index 0000000..f3c1707 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_11_3.md @@ -0,0 +1,50 @@ +--- +title: 1.11.3 +sidebar_position: 9946 +--- + +# 1.11.3 - 2024-04-04 + +### Added +* **Common: add support for `SCRIPT`-type jobs in BigQuery** [`#2564`](https://github.com/OpenLineage/OpenLineage/pull/2564) [@kacpermuda](https://github.com/kacpermuda) + In the case of `SCRIPT`-type jobs in BigQuery, no lineage was being extracted because the `SCRIPT` job had no lineage information - it only spawned child jobs that had that information. With this change, the integration extracts lineage information from child jobs when dealing with `SCRIPT`-type jobs. +* **Spark: support for built-in lineage extraction** [`#2272`](https://github.com/OpenLineage/OpenLineage/pull/2272) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *This PR adds a `spark-interfaces-scala` package that allows lineage extraction to be implemented within Spark extensions (Iceberg, Delta, GCS, etc.). 
The OpenLineage integration, when traversing the query plan, verifies if nodes implement defined interfaces. If so, interface methods are used to extract lineage. Refer to the [README](https://github.com/OpenLineage/OpenLineage/tree/spark/built-in-lineage/integration/spark-interfaces-scala#readme) for more details.* +* **Spark/Java: add support for Micrometer metrics** [`#2496`](https://github.com/OpenLineage/OpenLineage/pull/2496) [@mobuchowski](https://github.com/mobuchowski) + *Adds a mechanism for forwarding metrics to any [Micrometer-compatible implementation](https://docs.micrometer.io/micrometer/reference/implementations.html). Included: `MeterRegistryFactory`, `MicrometerProvider`, `StatsDMetricsBuilder`, metrics config in OpenLineage config, and a Java client implementation.* +* **Spark: add support for telemetry mechanism** [`#2528`](https://github.com/OpenLineage/OpenLineage/pull/2528) [@mobuchowski](https://github.com/mobuchowski) + *Adds timers, counters and additional instrumentation in order to implement Micrometer metrics collection.* +* **Spark: support query option on table read** [`#2556`](https://github.com/OpenLineage/OpenLineage/pull/2556) [@mobuchowski](https://github.com/mobuchowski) + *Adds support for the Spark-BigQuery connector's query input option, which executes a query directly on BigQuery, storing the result in an intermediate dataset, bypassing Spark's computation layer. Due to this, the lineage is retrieved using the SQL parser, similarly to `JDBCRelation`.* +* **Spark: change `SparkPropertyFacetBuilder` to support recording Spark runtime** [`#2523`](https://github.com/OpenLineage/OpenLineage/pull/2523) [@Ruihua98](https://github.com/Ruihua98) + *Modifies `SparkPropertyFacetBuilder` to capture the `RuntimeConfig` of the Spark session because the existing `SparkPropertyFacet` can only capture the static config of the Spark context. This facet will be added in both RDD-related and SQL-related runs.* +* **Spec: add `fileCount` to dataset stat facets** [`#2562`](https://github.com/OpenLineage/OpenLineage/pull/2562) [@dolfinus](https://github.com/dolfinus) + *Adds a `fileCount` field to `DataQualityMetricsInputDatasetFacet` and `OutputStatisticsOutputDatasetFacet` specification.* + +### Fixed +* **dbt: `dbt-ol` should transparently exit with the same exit code as the child `dbt` process** [`#2560`](https://github.com/OpenLineage/OpenLineage/pull/2560) [@blacklight](https://github.com/blacklight) + *Makes `dbt-ol` transparently exit with the same exit code as the child `dbt` process.* +* **Flink: disable module metadata generation** [`#2531`](https://github.com/OpenLineage/OpenLineage/pull/2531) [@HuangZhenQiu](https://github.com/HuangZhenQiu) + *Disables the module metadata generation for Flink to fix the problem of having Gradle dependencies to submodules within `openlineage-flink.jar`.* +* **Flink: fixes to version 1.19** [`#2507`](https://github.com/OpenLineage/OpenLineage/pull/2507) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Fixes the class not found issue when checking for Cassandra classes. 
Also fixes the Maven POM dependency on subprojects.* +* **Python: small improvements to `.emit()` method logging & annotations** [`#2539`](https://github.com/OpenLineage/OpenLineage/pull/2539) [@dolfinus](https://github.com/dolfinus) + *Updates OpenLineage.emit debug messages and annotations.* +* **SQL: show error message when OpenLineageSql cannot find native library** [`#2547`](https://github.com/OpenLineage/OpenLineage/pull/2547) [@dolfinus](https://github.com/dolfinus) + *When the `OpenLineageSql` class could not load a native library, it returned `None` for all operations. But because the error message was suppressed, the user could not determine the reason.* +* **SQL: update code to conform to upstream sqlparser-rs changes** [`#2510`](https://github.com/OpenLineage/OpenLineage/pull/2510) [@mobuchowski](https://github.com/mobuchowski) + *Includes tests and cosmetic improvements.* +* **Spark: fix access to active Spark session** [`#2535`](https://github.com/OpenLineage/OpenLineage/pull/2535) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Changes behavior so `IllegalStateException` is always caught when accessing `SparkSession`.* +* **Spark: fix Databricks environment** [`#2537`](https://github.com/OpenLineage/OpenLineage/pull/2537) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Fixes the `ClassNotFoundError` occurring on Databricks runtime and extends the integration test to verify `DatabricksEnvironmentFacet`.* +* **Spark: fixed memory leak in JobMetricsHolder** [`#2565`](https://github.com/OpenLineage/OpenLineage/pull/2565) [@d-m-h](https://github.com/d-m-h) + *The `JobMetricsHolder#cleanUp(int)` method now correctly purges unneeded state from both maps.* +* **Spark: fixed memory leak in `UnknownEntryFacetListener`** [`#2557`](https://github.com/OpenLineage/OpenLineage/pull/2557) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Prevents storing the state when a facet is disabled, purging the state after populating run facets.* +* **Spark: fix parsing `JDBCOptions(table=...)` containing subquery** [`#2546`](https://github.com/OpenLineage/OpenLineage/pull/2546) [@dolfinus](https://github.com/dolfinus) + *Prevents `openlineage-spark` from producing datasets with names like `database.(select * from table)` for JDBC sources.* +* **Spark/Snowflake: support query option via SQL parser** [`#2563`](https://github.com/OpenLineage/OpenLineage/pull/2563) [@mobuchowski](https://github.com/mobuchowski) + *When a Snowflake job is bypassing Spark's computation layer, now the SQL parser will be used to get the lineage.* +* **Spark: always catch `IllegalStateException` when accessing `SparkSession`** [`#2535`](https://github.com/OpenLineage/OpenLineage/pull/2535) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *`IllegalStateException` was not being caught.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/1_12_0.md b/versioned_docs/version-1.26.0/releases/1_12_0.md new file mode 100644 index 0000000..afe49b9 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_12_0.md @@ -0,0 +1,29 @@ +--- +title: 1.12.0 +sidebar_position: 9945 +--- + +# 1.12.0 - 2024-04-09 + +### Added +* **Airflow: add `lineage_job_namespace` and `lineage_job_name` macros** [`#2582`](https://github.com/OpenLineage/OpenLineage/pull/2582) [@dolfinus](https://github.com/dolfinus) + *Adds new Airflow macros `lineage_job_namespace()`, `lineage_job_name(task)` that return an Airflow namespace and Airflow job name, respectively.* +* 
**Spec: Allow nested struct fields in `SchemaDatasetFacet`** [`#2548`](https://github.com/OpenLineage/OpenLineage/pull/2548) [@dolfinus](https://github.com/dolfinus) + *Allows nested fields support to `SchemaDatasetFacet`.* + +### Fixed +* **Spark: fix PMD for test** [`#2588`](https://github.com/OpenLineage/OpenLineage/pull/2588) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Clears `pmdTestScala212` from warnings that clutter the logs.* +* **Dbt: propagate the dbt return code also when no OpenLineage events are emitted** [`#2591`](https://github.com/OpenLineage/OpenLineage/pull/2591) [@blacklight](https://github.com/blacklight) + *`dbt-ol` now propagates the exit code of the underlying dbt process even if no lineage events are emitted.* +* **Java: make sure string isn't empty to prevent going out of bounds** [`#2585`](https://github.com/OpenLineage/OpenLineage/pull/2585) [@harels](https://github.com/harels) + *String lookup was not accounting for empty strings and causing a `java.lang.StringIndexOutOfBoundsException`.* +* **Spark: use `HashSet` in column-level lineage instead of iterating through `LinkedList`** [`#2584`](https://github.com/OpenLineage/OpenLineage/pull/2584) [@mobuchowski](https://github.com/mobuchowski) + *Takes advantage of performance gains available from using `HashSet` for collection.* +* **Python: fix missing pkg_resources module on Python 3.12** [`#2572`](https://github.com/OpenLineage/OpenLineage/pull/2572) [@dolfinus](https://github.com/dolfinus) + *Removes `pkg_resources` dependency and replaces it with the [packaging](https://packaging.pypa.io/en/latest/version.html) lib.* +* **Airflow: fix format returned by `airflow.macros.lineage_parent_id`** [`#2578`](https://github.com/OpenLineage/OpenLineage/pull/2578) [@blacklight](https://github.com/blacklight) + *Fixes the run format returned by the `lineage_parent_id` Airflow macro and simplifies the format of the `lineage_parent_id` and `lineage_run_id` macros.* +* **Dagster: limit Dagster version to 1.6.9** [`#2579`](https://github.com/OpenLineage/OpenLineage/pull/2579) [@JDarDagran](https://github.com/JDarDagran) + *Adds an upper limit on supported versions of Dagster as the integration is no longer actively maintained and recent releases introduce breaking changes.* + \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/1_13_1.md b/versioned_docs/version-1.26.0/releases/1_13_1.md new file mode 100644 index 0000000..b3e664e --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_13_1.md @@ -0,0 +1,38 @@ +--- +title: 1.13.1 +sidebar_position: 9944 +--- + +# 1.13.1 - 2024-04-26 + +### Added +* **Java: allow timeout for circuit breakers** [`#2609`](https://github.com/OpenLineage/OpenLineage/pull/2609) @pawel-big-lebowski + *Extends the circuit breaker mechanism to contain a global timeout that stops running OpenLineage integration code when a specified amount of time has elapsed.* +* **Java: handle `DataSetEvent` and `JobEvent` in `Transport.emit`** [`#2611`](https://github.com/OpenLineage/OpenLineage/pull/2611) @dolfinus + *Adds overloads `Transport.emit(OpenLineage.DatasetEvent)` and `Transport.emit(OpenLineage.JobEvent)`, reusing the implementation of `Transport.emit(OpenLineage.RunEvent)`. 
**Please note**: `Transport.emit(String)` is now deprecated and will be removed in 1.16.0.* +* **Java/Python: add `GZIP` compression to `HttpTransport`** [`#2603`](https://github.com/OpenLineage/OpenLineage/pull/2603) [`#2604`](https://github.com/OpenLineage/OpenLineage/pull/2604) @dolfinus + *Adds a `compression` option to `HttpTransport` config in the Java and Python clients, with `gzip` implementation.* +* **Java/Python/Proxy: properly set Kafka message key** [`#2571`](https://github.com/OpenLineage/OpenLineage/pull/2571) [`#2597`](https://github.com/OpenLineage/OpenLineage/pull/2597) [`#2598`](https://github.com/OpenLineage/OpenLineage/pull/2598) @dolfinus + *Adds a new `messageKey` option to `KafkaTransport` config in the Python and Java clients, as well as the Proxy. This option replaces the `localServerId` option, which is now deprecated. Default value is generated using the run id (for `RunEvent`), job name (for `JobEvent`) or dataset name (for `DatasetEvent`). This value is used by the Kafka producer to distribute messages along topic partitions, instead of sending all the events to the same partition. This allows for full utilization of Kafka performance advantages.* +* **Flink: add support for Micrometer metrics** [`#2633`](https://github.com/OpenLineage/OpenLineage/pull/2633) @mobuchowski + *Adds a mechanism for forwarding metrics to any [Micrometer-compatible implementation](https://docs.micrometer.io/micrometer/reference/implementations.html) for Flink as has been implemented for Spark. Included: `MeterRegistry`, `CompositeMeterRegistry`, `SimpleMeterRegistry`, and `MicrometerProvider`.* +* **Python: generate Python facets from JSON schemas** [`#2520`](https://github.com/OpenLineage/OpenLineage/pull/2520) @JDarDagran + *Objects specified with JSON Schema needed to be manually developed and checked in Python, leading to many discrepancies, including wrong schema URLs. This adds a `datamodel-code-generator` for parsing JSON Schema and generating Pydantic or dataclasses classes, etc. In order to use `attrs` (a more modern version of dataclasses) and overcome some limitations of the tool, a number of steps have been added in order to customize code to meet OpenLineage requirements. Included: updated references to the latest base JSON Schema spec for all child facets. **Please note**: newly generated code creates a v2 interface that will be implemented in existing integrations in a future release. The v2 interface introduces some breaking changes: facets are put into separate modules per JSON Schema spec file, some names are changed, and several classes are now `kw_only`.* +* **Spark/Flink/Java: support YAML config files together with SparkConf/FlinkConf** [`#2583`](https://github.com/OpenLineage/OpenLineage/pull/2583) @pawel-big-lebowski + *Creates a `SparkOpenlineageConfig` and `FlinkOpenlineageConfig` for a more uniform configuration experience for the user. Renames `OpenLineageYaml` to `OpenLineageConfig` and modifies the code to use only `OpenLineageConfig` classes. 
Includes a doc update to mention that both ways can be used interchangeably and final documentation will merge all values provided.* +* **Spark: add custom token provider support** [`#2613`](https://github.com/OpenLineage/OpenLineage/pull/2613) @tnazarew + *Adds a `TokenProviderTypeIdResolver` to handle both `FQCN` and (for backward compatibility) `api_key` types in `spark.openlineage.transport.auth.type`.* +* **Spark/Flink: job ownership facet** [`#2533`](https://github.com/OpenLineage/OpenLineage/pull/2533) @pawel-big-lebowski + *Enables configuration entries specifying ownership of the job that will result in an `OwnershipJobFacet` being attached to job facets.* + +### Changed
* **Java: sync Kinesis `partitionKey` format with Kafka implementation** [`#2620`](https://github.com/OpenLineage/OpenLineage/pull/2620) @dolfinus + *Changes the format of Kinesis `partitionKey` from `{jobNamespace}:{jobName}` to `run:{jobNamespace}/{jobName}` to match the Kafka transport implementation.* + +### Fixed +* **Python: make `load_config` return an empty dict instead of `None` when file empty** [`#2596`](https://github.com/OpenLineage/OpenLineage/pull/2596) @kacpermuda + *`utils.load_config()` now returns an empty dict instead of `None` in the case of an empty file to prevent an `OpenLineageClient` crash.* +* **Java: render lombok-generated methods in javadoc** [`#2614`](https://github.com/OpenLineage/OpenLineage/pull/2614) @dolfinus + *Fixes rendering of javadoc for methods generated by `lombok` annotations by adding a `delombok` step.* +* **Spark/Snowflake: parse NPE when query option is used and table is empty** [`#2599`](https://github.com/OpenLineage/OpenLineage/pull/2599) @mobuchowski + *Fixes NPE when using query option when reading from Snowflake.* diff --git a/versioned_docs/version-1.26.0/releases/1_14_0.md b/versioned_docs/version-1.26.0/releases/1_14_0.md new file mode 100644 index 0000000..5e4c7f6 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_14_0.md @@ -0,0 +1,38 @@ +--- +title: 1.14.0 +sidebar_position: 9943 +--- + +# 1.14.0 - 2024-05-09 + +### Added +* **Common/dbt: add DREMIO to supported dbt profile types** [`#2674`](https://github.com/OpenLineage/OpenLineage/pull/2674) [@surisimran](https://github.com/surisimran) + *Adds support for dbt-dremio, resolving [`#2668`](https://github.com/OpenLineage/OpenLineage/issues/2668).* +* **Flink: support Protobuf format for sources and sinks** [`#2482`](https://github.com/OpenLineage/OpenLineage/pull/2482) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds schema extraction from Protobuf classes. 
Includes support for nested object types, `array` type, `map` type, `oneOf` and `any`.* +* **Java: add facet conversion test** [`#2663`](https://github.com/OpenLineage/OpenLineage/pull/2663) [@julienledem](https://github.com/julienledem) + *Adds a simple test that shows how to deserialize a facet in the server model.* +* **Spark: job type facet to distinguish RDD jobs from Spark SQL jobs** [`#2652`](https://github.com/OpenLineage/OpenLineage/pull/2652) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Sets the `jobType` property of `JobTypeJobFacet` to either `SQL_JOB` or `RDD_JOB`.* +* **Spark: add Glue symlink if reading from Glue catalog table** [`#2646`](https://github.com/OpenLineage/OpenLineage/pull/2646) [@mobuchowski](https://github.com/mobuchowski) + *The dataset symlink now points to the Glue catalog table name if the Glue catalog table is used.* +* **Spark: add spark_jobDetails facet** [`#2662`](https://github.com/OpenLineage/OpenLineage/pull/2662) [@dolfinus](https://github.com/dolfinus) + *Adds a `SparkJobDetailsFacet`, capturing information about Spark application jobs -- e.g. `jobId`, `jobDescription`, `jobGroup`, `jobCallSite`. This allows for tracking an OpenLineage `RunEvent` with a specific Spark job in SparkUI.* + +### Removed +* **Airflow: drop old ParentRunFacet key** [`#2660`](https://github.com/OpenLineage/OpenLineage/pull/2660) [@dolfinus](https://github.com/dolfinus) + *Changes the integration to use the `parent` key for `ParentFacet`, dropping the outdated `parentRun`.* +* **Spark: drop SparkVersionFacet** [`#2659`](https://github.com/OpenLineage/OpenLineage/pull/2659) [@dolfinus](https://github.com/dolfinus) + *Drops the `SparkVersion` facet, deprecated since 1.2.0 and planned for removal since 1.4.0.* +* **Python: allow relative paths in URI formats for Python facets** [`#2679`](https://github.com/OpenLineage/OpenLineage/pull/2679) [@JDarDagran](https://github.com/JDarDagran) + *Removes a URI validator that checked if scheme and netloc were present, allowing relative paths in URI formats for Python facets.* + +### Changed +* **GreatExpectations: rename `ParentRunFacet` key** [`#2661`](https://github.com/OpenLineage/OpenLineage/pull/2661) [@dolfinus](https://github.com/dolfinus) + *The OpenLineage spec defined the `ParentRunFacet` with the property name parent but the Great Expectations integration created a lineage event with `parentRun`. This renames `ParentRunFacet` key from `parentRun` to `parent`. For backwards compatibility, keep the old name.* + +### Fixed +* **dbt: support a less ambiguous logic to generate job names** [`#2658`](https://github.com/OpenLineage/OpenLineage/pull/2658) [@blacklight](https://github.com/blacklight) + *Includes profile and models in the dbt job name to make it more unique.* +* **Spark: update to use org.apache.commons.lang3 instead of org.apache.commons.lang** [`#2676`](https://github.com/OpenLineage/OpenLineage/pull/2676) [@harels](https://github.com/harels) + *Updates Apache Commons Lang to the latest version. 
We were mixing two versions, and the old one was not present in many places.* diff --git a/versioned_docs/version-1.26.0/releases/1_15_0.md b/versioned_docs/version-1.26.0/releases/1_15_0.md new file mode 100644 index 0000000..e0683db --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_15_0.md @@ -0,0 +1,41 @@ +--- +title: 1.15.0 +sidebar_position: 9942 +--- + +# 1.15.0 - 2024-05-24 + +### Added +* **Flink: handle Iceberg tables with nested and complex field types** [`#2706`](https://github.com/OpenLineage/OpenLineage/pull/2706) [@dolfinus](https://github.com/dolfinus) + *Creates `SchemaDatasetFacet` with nested fields for Iceberg tables with list, map and struct columns.* +* **Flink: handle Avro schema with nested and complex field types** [`#2711`](https://github.com/OpenLineage/OpenLineage/pull/2711) [@dolfinus](https://github.com/dolfinus) + *Creates `SchemaDatasetFacet` with nested fields for Avro schemas with complex types (union, record, map, array, fixed).* +* **Spark: add facets to Spark application events** [`#2677`](https://github.com/OpenLineage/OpenLineage/pull/2677) [@dolfinus](https://github.com/dolfinus) + *Adds support for Spark application start and stop events in the `ExecutionContext` interface.* +* **Spark: add nested fields to `SchemaDatasetFieldsFacet`** [`#2689`](https://github.com/OpenLineage/OpenLineage/pull/2689) [@dolfinus](https://github.com/dolfinus) + *Adds nested Spark Dataframe fields support to `SchemaDatasetFieldsFacet`. Also include field comment as `description`.* +* **Spark: add `SparkApplicationDetailsFacet`** [`#2688`](https://github.com/OpenLineage/OpenLineage/pull/2688) [@dolfinus](https://github.com/dolfinus) + *Adds `SparkApplicationDetailsFacet` to `runEvent`s emitted on Spark application start.* + +### Removed +* **Airflow: remove Airflow < 2.3.0 support** [`#2710`](https://github.com/OpenLineage/OpenLineage/pull/2710) [@kacpermuda](https://github.com/kacpermuda) + *Removes Airflow < 2.3.0 support.* +* **Integration: use v2 Python facets** [`#2693`](https://github.com/OpenLineage/OpenLineage/pull/2693) [@JDarDagran](https://github.com/JDarDagran) + *Migrates integrations from removed v1 facets to v2 Python facets.* + +### Fixed +* **Spark: improve job suffix assigning mechanism** [`#2665`](https://github.com/OpenLineage/OpenLineage/pull/2665) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *For some catalog handlers, the mechanism was creating different dataset identifiers on START and COMPLETE depending on whether a dataset was created or not. This improves the mechanism to assign a deterministic job suffix based on the output dataset at the moment of a start event. 
**Note**: this may change job names in some scenarios.* +* **Airflow: fix empty dataset name for `AthenaExtractor`** [`#2700`](https://github.com/OpenLineage/OpenLineage/pull/2700) [@kacpermuda](https://github.com/kacpermuda) + *The dataset name should not be empty when passing only a bucket as S3 output in Athena.* +* **Flink: fix `SchemaDatasetFacet` for Protobuf repeated primitive types** [`#2685`](https://github.com/OpenLineage/OpenLineage/pull/2685) [@dolfinus](https://github.com/dolfinus) + *Fixes issues with the Protobuf schema converter.* +* **Python: clean up Python client code, add logging.** [`#2653`](https://github.com/OpenLineage/OpenLineage/pull/2653) [@kacpermuda](https://github.com/kacpermuda) + *Cleans up client code, refactors logging in all Python modules.* +* **SQL: catch `TokenizerError`s, `PanicException`** [`#2703`](https://github.com/OpenLineage/OpenLineage/pull/2703) [@mobuchowski](https://github.com/mobuchowski) + *The SQL parser now catches and handles these errors.* +* **Python: suppress warning on importing v1 module in __init__.py.** [`#2713`](https://github.com/OpenLineage/OpenLineage/pull/2713) [@JDarDagran](https://github.com/JDarDagran) + *Suppresses the deprecation warning when v1 facets are used.* +* **Integration/Java/Python: use UUIDv7 instead of UUIDv4** [`#2686`](https://github.com/OpenLineage/OpenLineage/pull/2686) [`#2687`](https://github.com/OpenLineage/OpenLineage/pull/2687) [@dolfinus](https://github.com/dolfinus) + *Uses UUIDv7 instead of UUIDv4 for `runEvent`s. The new UUID version produces monotonically increasing values, which leads to more performant queries on the OL consumer side. **Note**: UUID version is an implementation detail and can be changed in the future.* + \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/1_16_0.md b/versioned_docs/version-1.26.0/releases/1_16_0.md new file mode 100644 index 0000000..4c1269b --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_16_0.md @@ -0,0 +1,16 @@ +--- +title: 1.16.0 +sidebar_position: 9941 +--- + +# 1.16.0 - 2024-05-28 + +### Added +* **Spark: add `jobType` facet to Spark application events** [`#2719`](https://github.com/OpenLineage/OpenLineage/pull/2719) [@dolfinus](https://github.com/dolfinus) + *Add `jobType` facet to `runEvent`s emitted by `SparkListenerApplicationStart`.* + +### Fixed +* **dbt: fix swapped namespace and name in dbt integration** [`#2735`](https://github.com/OpenLineage/OpenLineage/pull/2735) [@JDarDagran](https://github.com/JDarDagran) + *Fixes variable names.* +* **Python: override debug level** [`#2727`](https://github.com/OpenLineage/OpenLineage/pull/2735) [@mobuchowski](https://github.com/mobuchowski) + *Removes debug-level logging of HTTP requests.* diff --git a/versioned_docs/version-1.26.0/releases/1_17_1.md b/versioned_docs/version-1.26.0/releases/1_17_1.md new file mode 100644 index 0000000..5cbd0fd --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_17_1.md @@ -0,0 +1,56 @@ +--- +title: 1.17.1 +sidebar_position: 9940 +--- + +# 1.17.1 - 2024-06-21 + +### Added +* **Java: dataset namespace resolver feature** [`#2720`](https://github.com/OpenLineage/OpenLineage/pull/2720) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds a dataset namespace resolving mechanism that resolves dataset namespaces based on the resolvers configured. 
The core mechanism is implemented in openlineage-java and can be used within the Flink and Spark integrations.* +* **Spark: add transformation extraction** [`#2758`](https://github.com/OpenLineage/OpenLineage/pull/2758) [@tnazarew](https://github.com/tnazarew) + *Adds a transformation type extraction mechanism.* +* **Spark: add GCP run and job facets** [`#2643`](https://github.com/OpenLineage/OpenLineage/pull/2643) [@codelixir](https://github.com/codelixir) + *Adds `GCPRunFacetBuilder` and `GCPJobFacetBuilder` to report additional facets when running on Google Cloud Platform.* +* **Spark: improve namespace format for SQLServer** [`#2773`](https://github.com/OpenLineage/OpenLineage/pull/2773) [@dolfinus](https://github.com/dolfinus) + *Improves the namespace format for SQLServer.* +* **Spark: verify jar content after build** [`#2698`](https://github.com/OpenLineage/OpenLineage/pull/2698) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds a tool to verify `shadowJar` content and prevent reported issues. These are hard to prevent currently and require manual verification of manually unpacked jar content.* +* **Spec: add transformation type info** [`#2756`](https://github.com/OpenLineage/OpenLineage/pull/2756) [@tnazarew](https://github.com/tnazarew) + *Adds information about the transformation type in `ColumnLineageDatasetFacet`. `transformationType` and `transformationDescription` are marked as deprecated.* +* **Spec: implementing facet registry (following #2161)** [`#2729`](https://github.com/OpenLineage/OpenLineage/pull/2729) [@harels](https://github.com/harels) + *Introduces the foundations of the new facet Registry into the repo.* +* **Spec: register GCP common job facet** [`#2740`](https://github.com/OpenLineage/OpenLineage/pull/2740) [@ngorchakova](https://github.com/ngorchakova) + *Registers the GCP job facet that contains common attributes that will improve the way lineage is parsed and displayed by the GCP platform. Based on the [proposal](https://github.com/OpenLineage/OpenLineage/pull/2228/files), GCP Lineage would like to define facets that are expected from integrations. The list of supported facets is not final and will be extended further by the next PR.* + +### Removed +* **Java: remove deprecated `localServerId` option from Kafka config** [`#2738`](https://github.com/OpenLineage/OpenLineage/pull/2738) [@dolfinus](https://github.com/dolfinus) + *Removes `localServerId` from Kafka config, deprecated since 1.13.0.* +* **Java: remove deprecated `Transport.emit(String)`** [`#2737`](https://github.com/OpenLineage/OpenLineage/pull/2737) [@dolfinus](https://github.com/dolfinus) + *Removes `Transport.emit(String)` support, deprecated since 1.13.0.* +* **Spark: remove `spark-interfaces-scala` module** [`#2781`](https://github.com/OpenLineage/OpenLineage/pull/2781) [@ddebowczyk92](https://github.com/ddebowczyk92) + *Replaces the existing `spark-interfaces-scala` interfaces with new ones decoupled from the Scala binary version. 
Allows for improved integration in environments where one cannot guarantee the same version of `openlineage-java`.* + +### Changed +* **Spark: add log info when emitting lineage from Spark (following #2650)** [`#2769`](https://github.com/OpenLineage/OpenLineage/pull/2769) [@algorithmy1](https://github.com/algorithmy1) + *Enhances logging.* + +### Fixed +* **Flink: use `namespace.name` as Avro complex field type** [`#2763`](https://github.com/OpenLineage/OpenLineage/pull/2763) [@dolfinus](https://github.com/dolfinus) + *`namespace.name` is now used as Avro `"type"` of complex fields (record, enum, fixed).* +* **Java: repair empty dataset name** [`#2776`](https://github.com/OpenLineage/OpenLineage/pull/2776) [@kacpermuda](https://github.com/kacpermuda) + *The dataset name should not be empty.* +* **Spark: fix events emitted for `drop table` for Spark 3.4 and above** [`#2745`](https://github.com/OpenLineage/OpenLineage/pull/2745) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski)[@savannavalgi](https://github.com/savannavalgi) + *Includes dataset being dropped within the event, as it used to be prior to Spark 3.4.* +* **Spark, Flink: fix S3 dataset names** [`#2782`](https://github.com/OpenLineage/OpenLineage/pull/2782) [@dolfinus](https://github.com/dolfinus) + *Drops the leading slash from the object storage dataset name. Converts `s3a://` and `s3n://` schemes to `s3://`.* +* **Spark: fix Hive metastore namespace** [`#2761`](https://github.com/OpenLineage/OpenLineage/pull/2761) [@dolfinus](https://github.com/dolfinus) + *Fixes the dataset namespace for cases when the Hive metastore URL is set using `$SPARK_CONF_DIR/hive-site.xml`.* +* **Spark: fix NPE in column-level lineage** [`#2749`](https://github.com/OpenLineage/OpenLineage/pull/2749) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *The Spark agent now checks to determine if `cur.getDependencies()` is not null before adding dependencies.* +* **Spark: refactor `OpenLineageRunEventBuilder`** [`#2754`](https://github.com/OpenLineage/OpenLineage/pull/2754) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds a separate class containing all the input arguments to call `OpenLineageRunEventBuilder::buildRun`.* +* **Spark: fix `historyUrl` format** [`#2741`](https://github.com/OpenLineage/OpenLineage/pull/2741) [@dolfinus](https://github.com/dolfinus) + *Fixes the `historyUrl` format in `spark_applicationDetails`.* +* **SQL: allow self-recursive aliases** [`#2753`](https://github.com/OpenLineage/OpenLineage/pull/2753) [@mobuchowski](https://github.com/mobuchowski) + *Expressions like `select * from test_orders as test_orders` are now parsed properly.* diff --git a/versioned_docs/version-1.26.0/releases/1_18_0.md b/versioned_docs/version-1.26.0/releases/1_18_0.md new file mode 100644 index 0000000..4a069df --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_18_0.md @@ -0,0 +1,52 @@ +--- +title: 1.18.0 +sidebar_position: 9939 +--- + +# 1.18.0 - 2024-07-11 + +### Added +* **Spark: configurable integration test** [`#2755`](https://github.com/OpenLineage/OpenLineage/pull/2755) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Provides command line tool capable of running Spark integration tests that can be created without Java.* +* **Spark: OpenLineage Spark extension interfaces without runtime dependency hell** [`#2809`](https://github.com/OpenLineage/OpenLineage/pull/2809) [`#2837`](https://github.com/OpenLineage/OpenLineage/pull/2837) [@ddebowczyk92](https://github.com/ddebowczyk92) 
+ *New Spark extension interfaces without runtime dependency hell. Includes a test to verify the integration is working properly.* +* **Spark: support latest versions 3.4.3 and 3.5.1.** [`#2743`](https://github.com/OpenLineage/OpenLineage/pull/2743) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Upgrades CI workflows to run tests against latest Spark versions: 3.4.2 -> 3.4.3 and 3.5.0 -> 3.5.1.* +* **Spark: add extraction of the masking property in column-level lineage** [`#2789`](https://github.com/OpenLineage/OpenLineage/pull/2789) [@tnazarew](https://github.com/tnazarew) + *Adds extraction of the masking property during collection of dependencies for `ColumnLineageDatasetFacet` creation.* +* **Spark: collect table name from `InsertIntoHadoopFsRelationCommand`** [`#2794`](https://github.com/OpenLineage/OpenLineage/pull/2794) [@dolfinus](https://github.com/dolfinus) + *Collects a table name for `INSERT INTO` command for tables created with `USING $fileFormat` syntax, like `USING orc`.* +* **Spark, Flink: add `PostgresJdbcExtractor`** [`#2806`](https://github.com/OpenLineage/OpenLineage/pull/2806) [@dolfinus](https://github.com/dolfinus) + *Adds the default `5432` port to Postgres namespaces.* +* **Spark, Flink: add `TeradataJdbcExtractor`** [`#2826`](https://github.com/OpenLineage/OpenLineage/pull/2826) [@dolfinus](https://github.com/dolfinus) + *Converts JDBC URLs like `jdbc:teradata/host/DBS_PORT=1024,DATABASE=somedb` to datasets with namespace `teradata://host:1024` and name `somedb.table`.* +* **Spark, Flink: add `MySqlJdbcExtractor`** [`#2825`](https://github.com/OpenLineage/OpenLineage/pull/2825) [@dolfinus](https://github.com/dolfinus) + *Handles different formats of MySQL JDBC URL, and produces datasets with consistent namespaces, like `mysql://host:port`.* +* **Spark, Flink: add `OracleJdbcExtractor`** [`#2824`](https://github.com/OpenLineage/OpenLineage/pull/2824) [@dolfinus](https://github.com/dolfinus) + *Handles simple Oracle JDBC URLs, like `oracle:thin:@//host:port/serviceName` and `oracle:thin@host:port:sid`, and converts each to a dataset with namespace `oracle://host:port` and name `sid.schema.table` or `serviceName.schema.table`.* +* **Spark: configurable test with Docker image provided** [`#2822`](https://github.com/OpenLineage/OpenLineage/pull/2822) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Extends the configurable integration test feature to enable getting the Docker image name as a name.* +* **Spark: Support Iceberg 1.4 on Spark 3.5.1.** [`#2838`](https://github.com/OpenLineage/OpenLineage/pull/2838) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Include Iceberg support for Spark 3.5. 
Fix column level lineage facet for `UNION` queries.* +* **Spec: add example for change in `#2756`** [`#2801`](https://github.com/OpenLineage/OpenLineage/pull/2801) [@Sheeri](https://github.com/Sheeri) + *Updates the `customLineage` facet test for the new syntax created in `#2756`.* + +### Changed +* **Spark: fallback to `spark.sql.warehouse.dir` as table namespace** [`#2767`](https://github.com/OpenLineage/OpenLineage/pull/2767) [@dolfinus](https://github.com/dolfinus) + *In cases when a metastore is not used, falls back to `spark.sql.warehouse.dir` or `hive.metastore.warehouse.dir` as table namespace, instead of duplicating the table's location.* + +### Fixed +* **Java: handle dashes in hostname for `JdbcExtractors`** [`#2830`](https://github.com/OpenLineage/OpenLineage/pull/2830) [@dolfinus](https://github.com/dolfinus) + *Proper handling of dashes in JDBC URL hosts.* +* **Spark: fix Glue symlinks formatting bug** [`#2807`](https://github.com/OpenLineage/OpenLineage/pull/2807) [@Akash2351](https://github.com/Akash2351) + *Fixes Glue symlinks with config parsing for Glue `catalogid`.* +* **Spark, Flink: fix DBFS namespace format** [`#2800`](https://github.com/OpenLineage/OpenLineage/pull/2800) [@dolfinus](https://github.com/dolfinus) + *Fixes the DBFS namespace format.* +* **Spark: fix Glue naming format** [`#2766`](https://github.com/OpenLineage/OpenLineage/pull/2766) [@dolfinus](https://github.com/dolfinus) + *Changes the AWS Glue namespace to match Glue ARN documentation.* +* **Spark: fix Iceberg dataset location** [`#2797`](https://github.com/OpenLineage/OpenLineage/pull/2797) [@dolfinus](https://github.com/dolfinus) + *Fixes Iceberg dataset namespace: instead of `file:/some/path/database.table` uses `file:/some/path/database/table`. For dataset TABLE symlink, uses warehouse location instead of database location.* +* **Spark: fix NPE and incorrect comment** [`#2827`](https://github.com/OpenLineage/OpenLineage/pull/2827) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Fixes an error caused by a recent upgrade of Spark versions that did not break existing tests.* +* **Spark: convert scheme and authority to lowercase in `JdbcLocation`** [`#2831`](https://github.com/OpenLineage/OpenLineage/pull/2831) [@dolfinus](https://github.com/dolfinus) + *Converts valid JDBC URL scheme and authority to lowercase, leaving intact instance/database name, as different databases have different default case and case-sensitivity rules.* diff --git a/versioned_docs/version-1.26.0/releases/1_19_0.md b/versioned_docs/version-1.26.0/releases/1_19_0.md new file mode 100644 index 0000000..a123273 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_19_0.md @@ -0,0 +1,31 @@ +--- +title: 1.19.0 +sidebar_position: 9938 +--- + +# 1.19.0 - 2024-07-22 + +### Added +* **Airflow: add `log_url` to `AirflowRunFacet`** [`#2852`](https://github.com/OpenLineage/OpenLineage/pull/2852) [@dolfinus](https://github.com/dolfinus) + *Adds taskinstance's `log_url` field to `AirflowRunFacet`.* +* **Spark: add handling for `Generate`** [`#2856`](https://github.com/OpenLineage/OpenLineage/pull/2856) [@tnazarew](https://github.com/tnazarew) + *Adds handling for `Generate`-type nodes of a logical plan (e.g., explode operations).* +* **Java: add `DerbyJdbcExtractor`** [`#2869`](https://github.com/OpenLineage/OpenLineage/pull/2869) [@dolfinus](https://github.com/dolfinus) + *Adds `JdbcExtractor` implementation for Derby database. 
As this is a file-based DBMS, its Dataset namespace is `file` and name is an absolute path to a database file.* +* **Spark: verify bytecode version of the built jar.** [`#2859`](https://github.com/OpenLineage/OpenLineage/pull/2859) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Extends the `JarVerifier` plugin to ensure all compiled classes have a bytecode version of Java 8 or lower.* +* **Spark: add Kafka streaming source support** [`#2851`](https://github.com/OpenLineage/OpenLineage/pull/2851) [@d-m-h](https://github.com/d-m-h) [@imbruced](https://github.com/Imbruced) + *Adds support for Kafka streaming sources to Kafka streaming sinks. Inputs and outputs are now included in lineage events.* + +### Fixed +* **Airflow: replace datetime.now with airflow.utils.timezone.utcnow** [`#2865`](https://github.com/OpenLineage/OpenLineage/pull/2865) [@kacpermuda](https://github.com/kacpermuda) + *Fixes missing timezone information in task FAIL events.* +* **Spark: remove shaded dependency in `ColumnLevelLineageBuilder`** [`#2850`](https://github.com/OpenLineage/OpenLineage/pull/2850) [@tnazarew](https://github.com/tnazarew) + *Removes the shaded `Streams` dependency in `ColumnLevelLineageBuilder` causing a `ClassNotFoundException`.* +* **Spark: make Delta dataset symlink consistent with non-Delta tables** [`#2863`](https://github.com/OpenLineage/OpenLineage/pull/2863) [@dolfinus](https://github.com/dolfinus) + *Makes dataset symlinks for Delta and non-Delta tables consistent.* +* **Spark: use Table's properties during column-level lineage construction** [`#2855`](https://github.com/OpenLineage/OpenLineage/pull/2855) [@ddebowczyk92](https://github.com/ddebowczyk92) + *Fixes `PlanUtils3` so Dataset identifier information based on a Table's properties is also retrieved during the construction of column-level lineage.* +* **Spark: extract job name creation to providers** [`#2861`](https://github.com/OpenLineage/OpenLineage/pull/2861) [@arturowczarek](https://github.com/arturowczarek) + *The integration now detects if the `spark.app.name` was autogenerated by Glue and uses the Glue job name in such cases. Also, each job name provisioning strategy is now extracted to a separate provider.* + \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/1_1_0.md b/versioned_docs/version-1.26.0/releases/1_1_0.md new file mode 100644 index 0000000..338e3e7 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_1_0.md @@ -0,0 +1,31 @@ +--- +title: 1.1.0 +sidebar_position: 9956 +--- + +# 1.1.0 - 2023-08-23 + +### Added +* **Flink: create Openlineage configuration based on Flink configuration** [`#2033`](https://github.com/OpenLineage/OpenLineage/pull/2033) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Flink configuration entries starting with* `openlineage.*` *are passed to the Openlineage client.* +* **Java: add Javadocs to the Java client** [`#2004`](https://github.com/OpenLineage/OpenLineage/pull/2004) [@julienledem](https://github.com/julienledem) + *The client was missing some Javadocs.* +* **Spark: append output dataset name to a job name** [`#2036`](https://github.com/OpenLineage/OpenLineage/pull/2036) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Solves the problem of multiple jobs writing to different datasets while having the same job name. The feature is enabled by default and results in different job names. 
It can be disabled by setting `spark.openlineage.jobName.appendDatasetName` to `false`.* + *Unifies job names generated on the Databricks platform (using a dot job part separator instead of an underscore). The default behaviour can be altered with `spark.openlineage.jobName.replaceDotWithUnderscore`.* +* **Spark: support Spark 3.4.1** [`#2057`](https://github.com/OpenLineage/OpenLineage/pull/2057) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Bumps the latest Spark version to be covered in integration tests.* + +### Fixed +* **Airflow: do not use database as fallback when no schema parsed** [`#2023`](https://github.com/OpenLineage/OpenLineage/pull/2023) [@mobuchowski](https://github.com/mobuchowski) + *Sets the schema to `None` in `TablesHierarchy` to skip filtering on the schema level in the information schema query.* +* **Flink: fix a bug when getting schema for `KafkaSink`** [`#2042`](https://github.com/OpenLineage/OpenLineage/pull/2042) [@pentium3](https://github.com/pentium3) + *Fixes the incomplete schema from `KafkaSinkVisitor` by changing the `KafkaSinkWrapper` to catch schemas of type `AvroSerializationSchema`.* +* **Spark: filter `CreateView` events** [`#1968`](https://github.com/OpenLineage/OpenLineage/pull/1968)[`#1987`](https://github.com/OpenLineage/OpenLineage/pull/1987) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Clears events generated by logical plans having `CreateView` nodes as root.* +* **Spark: fix `MERGE INTO` for delta tables identified by physical locations** [`#2026`](https://github.com/OpenLineage/OpenLineage/pull/2026) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Delta tables identified by physical locations were not properly recognized.* +* **Spark: fix incorrect naming of JDBC datasets** [`#2035`](https://github.com/OpenLineage/OpenLineage/pull/2035) [@mobuchowski](https://github.com/mobuchowski) + *Makes the namespace generated by the JDBC/Spark connector conform to the naming schema in the spec.* +* **Spark: fix ignored event `adaptive_spark_plan` in Databricks** [`#2061`](https://github.com/OpenLineage/OpenLineage/pull/2061) [@algorithmy1](https://github.com/algoithmy1) + *Removes `adaptive_spark_plan` from the `excludedNodes` in `DatabricksEventFilter`.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/1_20_5.md b/versioned_docs/version-1.26.0/releases/1_20_5.md new file mode 100644 index 0000000..57bf53f --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_20_5.md @@ -0,0 +1,50 @@ +--- +title: 1.20.5 +sidebar_position: 9937 +--- + +# 1.20.5 - 2024-08-23 + +### Added +* **Python: add `CompositeTransport`** [`#2925`](https://github.com/OpenLineage/OpenLineage/pull/2925) [@JDarDagran](https://github.com/JDarDagran) + *Adds a `CompositeTransport` that can accept other transport configs to instantiate transports and use them to emit events.* +* **Spark: compile & test Spark integration on Java 17** [`#2828`](https://github.com/OpenLineage/OpenLineage/pull/2828) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *The Spark integration is always compiled with Java 17, while tests are running on both Java 8 and Java 17 according to the configuration.* +* **Spark: support preview release of Spark 4.0** [`#2854`](https://github.com/OpenLineage/OpenLineage/pull/2854) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Includes the Spark 4.0 preview release in the integration tests.* +* **Spark: add handling for `Window`** 
[`#2901`](https://github.com/OpenLineage/OpenLineage/pull/2901) [@tnazarew](https://github.com/tnazarew) + *Adds handling for `Window`-type nodes of a logical plan.* +* **Spark: extract and send events with raw SQL from Spark** [`#2913`](https://github.com/OpenLineage/OpenLineage/pull/2913) [@Imbruced](https://github.com/Imbruced) + *Adds a parser that traverses `QueryExecution` to get the SQL query used from the SQL field with a BFS algorithm.* +* **Spark: support Mongostream source** [`#2887`](https://github.com/OpenLineage/OpenLineage/pull/2887) [@Imbruced](https://github.com/Imbruced) + *Adds a Mongo streaming visitor and tests.* +* **Spark: new mechanism for disabling facets** [`#2912`](https://github.com/OpenLineage/OpenLineage/pull/2912) [@arturowczarek](https://github.com/arturowczarek) + *The mechanism makes `FacetConfig` accept the disabled flag for any facet instead of passing them as a list.* +* **Spark: support Kinesis source** [`#2906`](https://github.com/OpenLineage/OpenLineage/pull/2906) [@Imbruced](https://github.com/Imbruced) + *Adds a Kinesis class handler in the streaming source builder.* +* **Spark: extract `DatasetIdentifier` from extension `LineageNode`** [`#2900`](https://github.com/OpenLineage/OpenLineage/pull/2900) [@ddebowczyk92](https://github.com/ddebowczyk92) + *Adds support for cases in which `LogicalRelation` has a grandChild node that implements the `LineageRelation` interface.* +* **Spark: extract Dataset from underlying `BaseRelation`** [`#2893`](https://github.com/OpenLineage/OpenLineage/pull/2893) [@ddebowczyk92](https://github.com/ddebowczyk92) + *`DatasetIdentifier` is now extracted from the underlying node of `LogicalRelation`.* +* **Spark: add descriptions and Marquez UI to Docker Compose file** [`#2889`](https://github.com/OpenLineage/OpenLineage/pull/2889) [@jonathanlbt1](https://github.com/jonathanlbt1) + *Adds the `marquez-web` service to docker-compose.yml.* + +### Fixed +* **Proxy: bug fixed on error messages descriptions** [`#2880`](https://github.com/OpenLineage/OpenLineage/pull/2880) [@jonathanlbt1](https://github.com/jonathanlbt1) + *Improves error logging.* +* **Proxy: update Docker image for Fluentd 1.17** [`#2877`](https://github.com/OpenLineage/OpenLineage/pull/2877) [@jonathanlbt1](https://github.com/jonathanlbt1) + *Upgrades the Fluentd version.* +* **Spark: fix issue with Kafka source when saving with `for each` batch method** [`#2868`](https://github.com/OpenLineage/OpenLineage/pull/2868) [@imbruced](https://github.com/Imbruced) + *Fixes an issue when Spark is in streaming mode and input for Kafka was not present in the event.* +* **Spark: properly set ARN in namespace for Iceberg Glue symlinks** [`#2943`](https://github.com/OpenLineage/OpenLineage/pull/2943) [@arturowczarek](https://github.com/arturowczarek) + *Makes `IcebergHandler` support Glue catalog tables and create the symlink using the code from `PathUtils`.* +* **Spark: accept any provider for AWS Glue storage format** [`#2917`](https://github.com/OpenLineage/OpenLineage/pull/2917) [@arturowczarek](https://github.com/arturowczarek) + *Makes the AWS Glue ARN generating method accept every format (including Parquet), not only Hive SerDe.* +* **Spark: return valid JSON for failed logical plan serialization** [`#2892`](https://github.com/OpenLineage/OpenLineage/pull/2892) [@arturowczarek](https://github.com/arturowczarek) + *The `LogicalPlanSerializer` now returns `` for failed serialization instead of an empty string.* +* **Spark: extract legacy column lineage visitors 
loader** [`#2883`](https://github.com/OpenLineage/OpenLineage/pull/2883) [@arturowczarek](https://github.com/arturowczarek) + *Refactors `CustomCollectorsUtils` for improved readability.* +* **Spark: add Kafka input source when writing in `foreach` batch mode** [`#2868`](https://github.com/OpenLineage/OpenLineage/pull/2868) [@Imbruced](https://github.com/Imbruced) + *Fixes a bug keeping Kafka input sources from being produced.* +* **Spark: extract `DatasetIdentifier` from `SaveIntoDataSourceCommandVisitor` options** [`#2934`](https://github.com/OpenLineage/OpenLineage/pull/2934) [@ddebowczyk92](https://github.com/ddebowczyk92) + *Extracts `DatasetIdentifier` from command's options instead of relying on `p.createRelation(sqlContext, command.options())`, which is a heavy operation for `JdbcRelationProvider`.* diff --git a/versioned_docs/version-1.26.0/releases/1_21_1.md b/versioned_docs/version-1.26.0/releases/1_21_1.md new file mode 100644 index 0000000..326d45f --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_21_1.md @@ -0,0 +1,24 @@ +--- +title: 1.21.1 +sidebar_position: 9936 +--- + +# 1.21.1 - 2024-08-29 + +### Added +* **Spec: add GCP Dataproc facet** [`#2987`](https://github.com/OpenLineage/OpenLineage/pull/2987) [@tnazarew](https://github.com/tnazarew) + *Registers the Google Cloud Platform Dataproc run facet.* + +### Fixed +* **Airflow: update SQL integration code to work with latest sqlparser-rs main** [`#2983`](https://github.com/OpenLineage/OpenLineage/pull/2983) [@kacpermuda](https://github.com/kacpermuda) + *Adjusts the SQL integration after our sqlparser-rs fork has been updated to the latest main.* +* **Spark: fix AWS Glue jobs naming for SQL events** [`#3001`](https://github.com/OpenLineage/OpenLineage/pull/3001) [@arturowczarek](https://github.com/arturowczarek) + *SQL events now properly use the names of the jobs retrieved from AWS Glue.* +* **Spark: fix issue with column lineage when using delta merge into command** [`#2986`](https://github.com/OpenLineage/OpenLineage/pull/2986) [@Imbruced](https://github.com/Imbruced) + *A view instance of a node is now included when gathering data sources for input columns.* +* **Spark: minor Spark filters refactor** [`#2990`](https://github.com/OpenLineage/OpenLineage/pull/2990) [@arturowczarek](https://github.com/arturowczarek) + *Fixes a number of minor issues.* +* **Spark: Iceberg tables in AWS Glue have slashes instead of dots in symlinks** [`#2984`](https://github.com/OpenLineage/OpenLineage/pull/2984) [@arturowczarek](https://github.com/arturowczarek) + *They should use slashes and the prefix `table/`.* +* **Spark: lineage for Iceberg datasets that are present outside of Spark's catalog is now present** [`#2937`](https://github.com/OpenLineage/OpenLineage/pull/2937) [@d-m-h](https://github.com/d-m-h) + *Previously, reading Iceberg datasets outside the configured Spark catalog prevented the datasets from being present in the `inputs` property of the `RunEvent`.* diff --git a/versioned_docs/version-1.26.0/releases/1_22_0.md b/versioned_docs/version-1.26.0/releases/1_22_0.md new file mode 100644 index 0000000..74bff79 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_22_0.md @@ -0,0 +1,24 @@ +--- +title: 1.22.0 +sidebar_position: 9935 +--- + +# 1.22.0 - 2024-09-05 + +### Added +* **SQL: add support for `USE` statement with different syntaxes** [`#2944`](https://github.com/OpenLineage/OpenLineage/pull/2944) [@kacpermuda](https://github.com/kacpermuda) + *Adjusts our Context so that it can use the new support for 
this statement in the parser and pass it to a number of queries.* +* **Spark: add script to build Spark dependencies** [`#3044`](https://github.com/OpenLineage/OpenLineage/pull/3044) [@arturowczarek](https://github.com/arturowczarek) + *Adds a script to rebuild dependencies automatically following releases.* +* **Website: versionable docs** [`#3007`](https://github.com/OpenLineage/OpenLineage/pull/3007) [`#3023`](https://github.com/OpenLineage/OpenLineage/pull/3023) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds a GitHub action that creates a new Docusaurus version on a tag push, verifiable using the openlineage-site repo. Implements a monorepo approach in a new `website` directory.* + +### Fixed +* **SQL: add support for `SingleQuotedString` in `Identifier()`** [`#3035`](https://github.com/OpenLineage/OpenLineage/pull/3035) [@kacpermuda](https://github.com/kacpermuda) + *Single quoted strings were being treated differently than strings with no quotes, double quotes, or backticks.* +* **SQL: support `IDENTIFIER` function instead of treating it like table name** [`#2999`](https://github.com/OpenLineage/OpenLineage/pull/2999) [@kacpermuda](https://github.com/kacpermuda) + *Adds support for this identifier in SELECT, MERGE, UPDATE, and DELETE statements. For now, only static identifiers are supported. When a variable is used, this table is removed from lineage to avoid emitting incorrect lineage.* +* **Spark: fix issue with only one table in inputs from SQL query while reading from JDBC** [`#2918`](https://github.com/OpenLineage/OpenLineage/pull/2918) [@Imbruced](https://github.com/Imbruced) + *Events created did not contain the correct input table when the query contained multiple tables.* +* **Spark: fix AWS Glue jobs naming for RDD events** [`#3020`](https://github.com/OpenLineage/OpenLineage/pull/3020) [@arturowczarek](https://github.com/arturowczarek) + *The naming for RDD jobs now uses the same code as SQL and Application events.* diff --git a/versioned_docs/version-1.26.0/releases/1_2_2.md b/versioned_docs/version-1.26.0/releases/1_2_2.md new file mode 100644 index 0000000..64d6336 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_2_2.md @@ -0,0 +1,35 @@ +--- +title: 1.2.2 +sidebar_position: 9955 +--- + +# 1.2.2 - 2023-09-20 + +### Added +* **Spark: publish the `ProcessingEngineRunFacet` as part of the normal operation of the `OpenLineageSparkEventListener`** [`#2089`](https://github.com/OpenLineage/OpenLineage/pull/2089) [@d-m-h](https://github.com/d-m-h) + *Publishes the spec-defined `ProcessEngineRunFacet` alongside the custom `SparkVersionFacet` (for now).* + *The `SparkVersionFacet` is deprecated and will be removed in a future release.* +* **Spark: capture and emit `spark.databricks.clusterUsageTags.clusterAllTags` variable from databricks environment** [`#2099`](https://github.com/OpenLineage/OpenLineage/pull/2099) [@Anirudh181001](https://github.com/Anirudh181001) + *Adds `spark.databricks.clusterUsageTags.clusterAllTags` to the list of environment variables captured from databricks.* + +### Fixed +* **Common: support parsing dbt_project.yml without target-path** [`#2106`](https://github.com/OpenLineage/OpenLineage/pull/2106) [@tatiana](https://github.com/tatiana) + *As of dbt v1.5, usage of target-path in the dbt_project.yml file has been deprecated, now preferring a CLI flag or env var. It will be removed in a future version. 
This allows users to run `DbtLocalArtifactProcessor` in dbt projects that do not declare target-path.* +* **Proxy: fix Proxy chart** [`#2091`](https://github.com/OpenLineage/OpenLineage/pull/2091) [@harels](https://github.com/harels) + *Includes the proper image to deploy in the helm chart.* +* **Python: fix serde filtering** [`#2044`](https://github.com/OpenLineage/OpenLineage/pull/2044) [@xli-1026](https://github.com/xli-1026) + *Fixes the bug causing values in list objects to be filtered accidentally.* +* **Python: use non-deprecated `apiKey` if loading it from env variables** [`@2029`](https://github.com/OpenLineage/OpenLineage/pull/2029) [@mobuchowski](https://github.com/mobuchowski) + *Changes `api_key` to `apiKey` in `create_token_provider`.* +* **Spark: Improve RDDs on S3 integration.** [`#2039`](https://github.com/OpenLineage/OpenLineage/pull/2039) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Prepares integration test to access S3, fixes input dataset duplicates and includes other minor fixes.* +* **Flink: prevent sending `running` events after job completes** [`#2075`](https://github.com/OpenLineage/OpenLineage/pull/2075) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Flink checkpoint tracking thread was not getting stopped properly on job complete.* +* **Spark & Flink: Unify dataset naming from URI objects** [`#2083`](https://github.com/OpenLineage/OpenLineage/pull/2083) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Makes sure Spark and Flink generate same dataset identifiers for the same datasets by having a single implementation to generate dataset namespace and name.* +* **Spark: Databricks improvements** [`#2076`](https://github.com/OpenLineage/OpenLineage/pull/2076) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Filters unwanted events on databricks and adds an integration test to verify this. Adds integration tests to verify dataset naming on databricks runtime is correct when table location is specified. 
Adds integration test for wide transformation on delta tables.* + +### Removed +* **SQL: remove sqlparser dependency from iface-java and iface-py** [`#2090`](https://github.com/OpenLineage/OpenLineage/pull/2090) [@JDarDagran](https://github.com/JDarDagran) + *Removes the dependency due to a breaking change in the latest release of the parser.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/1_3_1.md b/versioned_docs/version-1.26.0/releases/1_3_1.md new file mode 100644 index 0000000..60cd99b --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_3_1.md @@ -0,0 +1,24 @@ +--- +title: 1.3.1 +sidebar_position: 9954 +--- + +# 1.3.1 - 2023-10-03 + +### Added +* **Airflow: add some basic stats to the Airflow integration** [`#1845`](https://github.com/OpenLineage/OpenLineage/pull/1845) [@harels](https://github.com/harels) + *Uses the statsd component that already exists in the Airflow codebase and wraps the section that emits to event with a timer, as well as emitting a counter for exceptions in sending the event.* +* **Airflow: add columns as schema facet for `airflow.lineage.Table` (if defined)** [`#2138`](https://github.com/OpenLineage/OpenLineage/pull/2138) [@erikalfthan](https://github.com/erikalfthan) + *Adds columns (if set) from `airflow.lineage.Table` inlets/outlets to the OpenLineage Dataset.* +* **DBT: add SQLSERVER to supported dbt profile types** [`#2136`](https://github.com/OpenLineage/OpenLineage/pull/2136) [@erikalfthan](https://github.com/erikalfthan) + *Adds support for dbt-sqlserver, solving [#2129](https://github.com/OpenLineage/OpenLineage/issues/2129).* +* **Spark: support for latest 3.5** [`#2118`](https://github.com/OpenLineage/OpenLineage/pull/2118) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Integration tests are now run on Spark 3.5. Also upgrades 3.3 branch to 3.3.3. Please note that `delta` and `iceberg` are not supported for Spark `3.5` at this time.* + +### Fixed +* **Airflow: fix find-links path in tox** [`#2139`](https://github.com/OpenLineage/OpenLineage/pull/2139) [@JDarDagran](https://github.com/JDarDagran) + *Fixes a broken link.* +* **Airflow: add more graceful logging when no OpenLineage provider installed** [`#2141`](https://github.com/OpenLineage/OpenLineage/pull/2141) [@JDarDagran](https://github.com/JDarDagran) + *Recognizes a failed import of `airflow.providers.openlineage` and adds more graceful logging to fix a corner case.* +* **Spark: fix bug in PathUtils' prepareDatasetIdentifierFromDefaultTablePath(CatalogTable) to correctly preserve scheme from CatalogTable's location** [`#2142`](https://github.com/OpenLineage/OpenLineage/pull/2142) [@d-m-h](https://github.com/d-m-h) + *Previously, the `prepareDatasetIdentifierFromDefaultTablePath` method would override the scheme with the value of "file" when constructing a dataset identifier. It now uses the scheme of the `CatalogTable`'s URI for this. 
Thank you @pawel-big-lebowski for the quick triage and suggested fix.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/1_4_1.md b/versioned_docs/version-1.26.0/releases/1_4_1.md new file mode 100644 index 0000000..92bdc9d --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_4_1.md @@ -0,0 +1,16 @@ +--- +title: 1.4.1 +sidebar_position: 9953 +--- + +# 1.4.1 - 2023-10-09 + +### Added +* **Client: allow setting client's endpoint via environment variable** [`#2151`](https://github.com/OpenLineage/OpenLineage/pull/2151) [@mars-lan](https://github.com/mars-lan) + *Enables setting this endpoint via environment variable because creating the client manually in Airflow is not possible.* +* **Flink: expand Iceberg source types** [`#2149`](https://github.com/OpenLineage/OpenLineage/pull/2149) [@HuangZhenQiu](https://github.com/HuangZhenQiu) + *Adds support for `FlinkIcebergSource` and `FlinkIcebergTableSource` for Flink Iceberg lineage.* +* **Spark: add debug facet** [`#2147`](https://github.com/OpenLineage/OpenLineage/pull/2147) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *An extra run facet containing some system details (e.g., OS, Java, Scala version), classpath (e.g., package versions, jars included in the Spark job), SparkConf (like openlineage entries except auth, specified extensions, etc.) and LogicalPlan details (execution tree nodes' names) are added to events emitted. SparkConf setting `spark.openlineage.debugFacet=enabled` needs to be set to include the facet. By default, the debug facet is disabled.* +* **Spark: enable Nessie REST catalog** [`#2165`](https://github.com/OpenLineage/OpenLineage/pull/2165) [@julwin](https://github.com/julwin) + *Adds support for Nessie catalog in Spark.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/1_5_0.md b/versioned_docs/version-1.26.0/releases/1_5_0.md new file mode 100644 index 0000000..6bd18e1 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_5_0.md @@ -0,0 +1,30 @@ +--- +title: 1.5.0 +sidebar_position: 9952 +--- + +# 1.5.0 - 2023-11-02 + +### Added +* **Flink: add Flink lineage for Cassandra Connectors** [`#2175`](https://github.com/OpenLineage/OpenLineage/pull/2175) [@HuangZhenQiu](https://github.com/HuangZhenQiu) + *Adds Flink Cassandra source and sink visitors and Flink Cassandra Integration test.* +* **Spark: support `rdd` and `toDF` operations available in Spark Scala API** [`#2188`](https://github.com/OpenLineage/OpenLineage/pull/2188) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Includes the first Scala integration test, fixes `ExternalRddVisitor` and adds support for extracting inputs from `MapPartitionsRDD` and `ParallelCollectionRDD` plan nodes.* +* **Spark: support Databricks Runtime 13.3** [`#2185`](https://github.com/OpenLineage/OpenLineage/pull/2185) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Modifies the Spark integration to support the latest Databricks Runtime version.* + +### Changed +* **Airflow: loosen attrs and requests versions** [`#2107`](https://github.com/OpenLineage/OpenLineage/pull/2107) [@JDarDagran](https://github.com/JDarDagran) + *Lowers the version requirements for attrs and requests and removes an unnecessary dependency.* +* **dbt: render yaml configs lazily** [`#2221`](https://github.com/OpenLineage/OpenLineage/pull/2221) [@JDarDagran](https://github.com/JDarDagran) + *Don't render each entry in yaml files at start.* + +### Fixed +* **Airflow/Athena: change dataset name to its 
location** [`#2167`](https://github.com/OpenLineage/OpenLineage/pull/2167) [@sophiely](https://github.com/sophiely) + *Replaces the dataset and namespace with the data's physical location for more complete lineage across integrations.* +* **Python client: skip redaction in column lineage facet** [`#2177`](https://github.com/OpenLineage/OpenLineage/pull/2177) [@JDarDagran](https://github.com/JDarDagran) + *Redacted fields in `ColumnLineageDatasetFacetFieldsAdditionalInputFields` are now skipped.* +* **Spark: unify dataset naming for RDD jobs and Spark SQL** [`#2181`](https://github.com/OpenLineage/OpenLineage/pull/2181) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Use the same mechanism for RDD jobs to extract dataset identifier as used for Spark SQL.* +* **Spark: ensure a single `START` and a single `COMPLETE` event are sent** [`#2103`](https://github.com/OpenLineage/OpenLineage/pull/2103) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *For Spark SQL at least four events are sent triggered by different SparkListener methods. Each of them is required and used to collect facets unavailable elsewhere. However, there should be only one `START` and `COMPLETE` events emitted. Other events should be sent as `RUNNING`. Please keep in mind that Spark integration remains stateless to limit the memory footprint, and it is the backend responsibility to merge several Openlineage events into a meaningful snapshot of metadata changes.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/1_6_2.md b/versioned_docs/version-1.26.0/releases/1_6_2.md new file mode 100644 index 0000000..4bfac6b --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_6_2.md @@ -0,0 +1,28 @@ +--- +title: 1.6.2 +sidebar_position: 9951 +--- + +# 1.6.2 - 2023-12-07 + +### Added +* **Dagster: support Dagster 1.5.x** [`#2220`](https://github.com/OpenLineage/OpenLineage/pull/2220) [@tsungchih](https://github.com/tsungchih) + *Gets event records for each target Dagster event type to support Dagster version 0.15.0+.* +* **Dbt: add a new command `dbt-ol send-events` to send metadata of the last run without running the job** [`#2285`](https://github.com/OpenLineage/OpenLineage/pull/2285) [@sophiely](https://github.com/sophiely) + *Adds a new command to send events to OpenLineage according to the latest metadata generated without running any dbt command.* +* **Flink: add option for Flink job listener to read from Flink conf** [`#2229`](https://github.com/OpenLineage/OpenLineage/pull/2229) [@ensctom](https://github.com/ensctom) + *Adds option for the Flink job listener to read jobnames and namespaces from Flink conf.* +* **Spark: get column-level lineage from JDBC dbtable option** [`#2284`](https://github.com/OpenLineage/OpenLineage/pull/2284) [@mobuchowski](https://github.com/mobuchowski) + *Adds support for dbtable, enables lineage in the case of single input columns, and improves dataset naming.* +* **Spec: introduce `JobTypeJobFacet` to contain additional job related information**[`#2241`](https://github.com/OpenLineage/OpenLineage/pull/2241) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *New `JobTypeJobFacet` contains the processing type such as `BATCH|STREAMING`, integration via `SPARK|FLINK|...` and job type in `QUERY|COMMAND|DAG|...`.* +* **SQL: add quote information from sqlparser-rs** [`#2259`](https://github.com/OpenLineage/OpenLineage/pull/2259) [@JDarDagran](https://github.com/JDarDagran) + *Adds quote information from sqlparser-rs.* + +### 
Fixed +* **Spark: update Jackson dependency to resolve `CVE-2022-1471`** [`#2185`](https://github.com/OpenLineage/OpenLineage/pull/2185) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Updates Gradle for Spark and Flink to 8.1.1. Upgrade Jackson `2.15.3`.* +* **Flink: avoid relying on Guava which can be missing during production runtime** [`#2296`](https://github.com/OpenLineage/OpenLineage/pull/2296) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Removes usage of Guava ImmutableList.* +* **Spark: exclude `commons-logging` transitive dependency from published jar** [`#2297`](https://github.com/OpenLineage/OpenLineage/pull/2297) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Ensures `commons-logging` is not shipped as this can lead to a version mismatch on the user's side.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/1_7_0.md b/versioned_docs/version-1.26.0/releases/1_7_0.md new file mode 100644 index 0000000..567bf51 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_7_0.md @@ -0,0 +1,35 @@ +--- +title: 1.7.0 +sidebar_position: 9950 +--- + +# 1.7.0 - 2023-12-21 + +_COMPATIBILITY NOTICE_ +Starting in 1.7.0, the Airflow integration will no longer support Airflow versions `>=2.8.0`. +Please use the [OpenLineage Airflow Provider](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html) instead. + +### Added +* **Airflow: add parent run facet to `COMPLETE` and `FAIL` events in Airflow integration** [`#2320`](https://github.com/OpenLineage/OpenLineage/pull/2320) [@kacpermuda](https://github.com/kacpermuda) + *Adds a parent run facet to all events in the Airflow integration.* + +### Fixed +* **Airflow: repair up.sh for MacOS** [`#2316`](https://github.com/OpenLineage/OpenLineage/pull/2316) [`#2318`](https://github.com/OpenLineage/OpenLineage/pull/2318) [@kacpermuda](https://github.com/kacpermuda) + *Some scripts were not working well on MacOS. 
This adjusts them.* +* **Airflow: repair `run_id` for `FAIL` event in Airflow 2.6+** [`#2305`](https://github.com/OpenLineage/OpenLineage/pull/2305) [@kacpermuda](https://github.com/kacpermuda) + *The `Run_id` in a `FAIL` event was different than in the `START` event for Airflow 2.6+.* +* **Flink: open Iceberg `TableLoader` before loading a table** [`#2314`](https://github.com/OpenLineage/OpenLineage/pull/2314) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Fixes a potential `NullPointerException` in 1.17 when dealing with Iceberg sinks.* +* **Flink: name Kafka datasets according to the naming convention** [`#2321`](https://github.com/OpenLineage/OpenLineage/pull/2321) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds a `kafka://` prefix to Kafka topic datasets' namespaces.* +* **Flink: fix properties within `JobTypeJobFacet`** [`#2325`](https://github.com/OpenLineage/OpenLineage/pull/2325) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Fixes properties assignment in the Flink visitor.* +* **Spark: fix `commons-logging` relocate in target jar** [`#2319`](https://github.com/OpenLineage/OpenLineage/pull/2319) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Avoids relocating a dependency that was getting excluded from the jar.* +* **Spec: fix inconsistency with Redshift authority format** [`#2315`](https://github.com/OpenLineage/OpenLineage/pull/2315) [@davidjgoss](https://github.com/davidjgoss) + *Amends the `Authority` format for consistency with other references in the same section.* + +### Removed +* **Airflow: remove Airflow 2.8+ support** [`#2330`](https://github.com/OpenLineage/OpenLineage/pull/2330) [@kacpermuda](https://github.com/kacpermuda) + *If the Airflow version is `>=2.8.0`, the Airflow integration's plugin does not import the integration's listener, disabling the external integration.* + *Please use the [OpenLineage Airflow Provider](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html) instead.* diff --git a/versioned_docs/version-1.26.0/releases/1_8_0.md b/versioned_docs/version-1.26.0/releases/1_8_0.md new file mode 100644 index 0000000..826ac95 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_8_0.md @@ -0,0 +1,42 @@ +--- +title: 1.8.0 +sidebar_position: 9949 +--- + +# 1.8.0 - 2024-01-22 + +### Added +* **Flink: support Flink 1.18** [`#2366`](https://github.com/OpenLineage/OpenLineage/pull/2366) [@HuangZhenQiu](https://github.com/HuangZhenQiu) + *Adds support for the latest Flink version with 1.17 used for Iceberg Flink runtime and Cassandra Connector as these do not yet support 1.18.* +* **Spark: add Gradle plugins to simplify the build process to support Scala 2.13** [`#2376`](https://github.com/OpenLineage/OpenLineage/pull/2376) [@d-m-h](https://github.com/d-m-h) + *Defines a set of Gradle plugins to configure the modules and reduce duplication. +* **Spark: support multiple Scala versions `LogicalPlan` implementation** [`#2361`](https://github.com/OpenLineage/OpenLineage/pull/2361) [@mattiabertorello](https://github.com/mattiabertorello) + *In the LogicalPlanSerializerTest class, the implementation of the LogicalPlan interface is different between Scala 2.12 and Scala 2.13. In detail, the IndexedSeq changes package from the scala.collection to scala.collection.immutable. 
This implements both of the methods necessary in the two versions.* +* **Spark: Use ScalaConversionUtils to convert Scala and Java collections** [`#2357`](https://github.com/OpenLineage/OpenLineage/pull/2357) [@mattiabertorello](https://github.com/mattiabertorello) + *This initial step is to start supporting compilation for Scala 2.13 in the 3.2+ Spark versions. Scala 2.13 changed the default collection to immutable, the methods to create an empty collection, and the conversion between Java and Scala. This causes the code to not compile between 2.12 and 2.13. This replaces the usage of direct Scala collection methods (like creating an empty object) and conversions utils with `ScalaConversionUtils` methods that will support cross-compilation.* +* **Spark: support `MERGE INTO` queries on Databricks** [`#2348`](https://github.com/OpenLineage/OpenLineage/pull/2348) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Supports custom plan nodes used when running `MERGE INTO` queries on Databricks runtime.* +* **Spark: Support Glue catalog in iceberg** [`#2283`](https://github.com/OpenLineage/OpenLineage/pull/2283) [@nataliezeller1](https://github.com/nataliezeller1) + *Adds support for the Glue catalog based on the 'catalog-impl' property (in this case we will not have a 'type' property).* + +### Changed +* **Spark: Move Spark 3.1 code from the spark3 project** [`#2365`](https://github.com/OpenLineage/OpenLineage/pull/2365) [@mattiabertorello](https://github.com/mattiabertorello) + *Moves the Spark 3.1-related code to a specific project, spark31, so the spark3 project can be compiled with any Spark 3.x version.* + +### Fixed +* **Airflow: add database information to SnowflakeExtractor** [`#2364`](https://github.com/OpenLineage/OpenLineage/pull/2364) [@kacpermuda](https://github.com/kacpermuda) + *Fixes missing database information in SnowflakeExtractor.* +* **Airflow: add dag_id to task_run_id to avoid duplicates** [`#2358`](https://github.com/OpenLineage/OpenLineage/pull/2358) [@kacpermuda](https://github.com/kacpermuda) + *The lack of dag_id in task_run_id can cause duplicates in run_id across different dags.* +* **Airflow: Add tests for column lineage facet and sql parser** [`#2373`](https://github.com/OpenLineage/OpenLineage/pull/2373) [@kacpermuda](https://github.com/kacpermuda) + *Improves naming (database.schema.table) in SQLExtractor's column lineage facet and adds some unit tests.* +* **Spark: fix removePathPattern behaviour** [`#2350`](https://github.com/OpenLineage/OpenLineage/pull/2350) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *The removepath pattern feature is not applied all the time. The method is called when constructing DatasetIdentifier through PathUtils which is not the case all the time. This moves removePattern to another place in the codebase that is always run.* +* **Spark: fix a type incompatibility in RddExecutionContext between Scala 2.12 and 2.13** [`#2360`](https://github.com/OpenLineage/OpenLineage/pull/2360) [@mattiabertorello](https://github.com/mattiabertorello) + *The function from the ResultStage.func() object change type in Spark between Scala 2.12 and 2.13 makes the compilation fail. This avoids getting the function with an explicit type; instead, it gets it every time it is needed from the ResultStage object. 
This PR is part of the effort to support Scala 2.13 in the Spark integration.* +* **Spark: Fix `removePathPattern` feature** [`#2350`](https://github.com/OpenLineage/OpenLineage/pull/2350) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Refactors code to make sure that all datasets sent are processed through `removePathPattern` if configured to do so.* +* **Spark: Clean up the individual build.gradle files in preparation for Scala 2.13 support** [`#2377`](https://github.com/OpenLineage/OpenLineage/pull/2377) [@d-m-h](https://github.com/d-m-h) + *Cleans up the build.gradle files, consolidating the custom plugin and removing unused and unnecessary configuration.* +* **Spark: refactor the Gradle plugins to make it easier to define Scala variants per module** [`#2383`](https://github.com/OpenLineage/OpenLineage/pull/2383) [@d-m-h](https://github.com/d-m-h) + *The third of several PRs to support producing Scala 2.12 and Scala 2.13 variants of the OpenLineage Spark integration. This PR refactors the custom Gradle plugins in order to make supporting multiple variants per module easier. This is necessary because the shared module fails its tests when consuming the Scala 2.13 variants of Apache Spark.* \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/releases/1_9_1.md b/versioned_docs/version-1.26.0/releases/1_9_1.md new file mode 100644 index 0000000..73fdfc5 --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/1_9_1.md @@ -0,0 +1,69 @@ +--- +title: 1.9.1 +sidebar_position: 9948 +--- + +# 1.9.1 - 2024-02-26 + +:::important +This version adds the capability to publish **Scala 2.12** and **2.13** variants of **Apache Spark**, +which necessitates a change in the artifact identifier for `io.openlineage:openlineage-spark`. +From this version onwards, please use:
+`io.openlineage:openlineage-spark_${SCALA_BINARY_VERSION}:${OPENLINEAGE_SPARK_VERSION}`. +::: + +### Added +* **Airflow: add support for `JobTypeJobFacet` properties** [`#2412`](https://github.com/OpenLineage/OpenLineage/pull/2412) [@mattiabertorello](https://github.com/mattiabertorello) + *Adds support for Job type properties within the Airflow Job facet.* +* **dbt: add support for `JobTypeJobFacet` properties** [`#2411`](https://github.com/OpenLineage/OpenLineage/pull/2411) [@mattiabertorello](https://github.com/mattiabertorello) + *Support Job type properties within the DBT Job facet.* +* **Flink: support Flink Kafka dynamic source and sink** [`#2417`](https://github.com/OpenLineage/OpenLineage/pull/2417) [@HuangZhenQiu](https://github.com/HuangZhenQiu) + *Adds support for Flink Kafka Table Connector use cases for topic and schema extraction.* +* **Flink: support multi-topic Kafka Sink** [`#2372`](https://github.com/OpenLineage/OpenLineage/pull/2372) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Adds support for multi-topic Kafka sinks. Limitations: `recordSerializer` needs to implement `KafkaTopicsDescriptor`. Please refer to the [limitations](https://openlineage.io/docs/integrations/flink/#limitations) sections in documentation.* +* **Flink: support lineage for JDBC connector** [`#2436`](https://github.com/OpenLineage/OpenLineage/pull/2436) [@HuangZhenQiu](https://github.com/HuangZhenQiu) + *Adds support for use cases that employ this connector.* +* **Flink: add common config gradle plugin** [`#2461`](https://github.com/OpenLineage/OpenLineage/pull/2461) [@HuangZhenQiu](https://github.com/HuangZhenQiu) + *Add common config gradle plugin to simplify gradle files of Flink submodules.* +* **Java: extend circuit breaker loaded with `ServiceLoader`** [`#2435`](https://github.com/OpenLineage/OpenLineage/pull/2435) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Loads the circuit breaker builder with `ServiceLoader` as an addition to a list of implemented builders available within the existing package.* +* **Spark: integration now emits intermediate, application level events wrapping entire job execution** [`#2371`](https://github.com/OpenLineage/OpenLineage/pull/2471) [@mobuchowski](https://github.com/mobuchowski) + *Previously, the Spark event model described only single actions, potentially linked only to some parent run. Closes [`#1672`](https://github.com/OpenLineage/OpenLineage/issues/1672).* +* **Spark: support built-in lineage within `DataSourceV2Relation`** [`#2394`](https://github.com/OpenLineage/OpenLineage/pull/2394) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Enables built-in lineage extraction within from `DataSourceV2Relation` lineage nodes.* +* **Spark: add support for `JobTypeJobFacet` properties** [`#2410`](https://github.com/OpenLineage/OpenLineage/pull/2410) [@mattiabertorello](https://github.com/mattiabertorello) + *Adds support for Job type properties within the Spark Job facet.* +* **Spark: stop sending `spark.LogicalPlan` facet by default** [`#2433`](https://github.com/OpenLineage/OpenLineage/pull/2433) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *`spark.LogicalPlan` has been added to default value of `spark.openlineage.facets.disabled`.* +* **Spark/Flink/Java: circuit breaker** [`#2407`](https://github.com/OpenLineage/OpenLineage/issues/2407) [@pawel-big-lebowski](https://github.com/pawel-big-lebowski) + *Introduces a circuit breaker mechanism to prevent effects of over-instrumentation. 
Implemented within Java client, it serves both the Flink and Spark integration. Read the Java client README for more details.* +* **Spark: add the capability to publish Scala 2.12 and 2.13 variants of `openlineage-spark`** [`#2446`](https://github.com/OpenLineage/OpenLineage/pull/2446) [@d-m-h](https://github.com/d-m-h) + *Adds the capability to publish Scala 2.12 and 2.13 variants of `openlineage-spark`* + +### Changed +* **Spark: enable the `app` module to be compiled with Scala 2.12 and Scala 2.13 variants of Apache Spark** [`#2432`](https://github.com/OpenLineage/OpenLineage/pull/2432) [@d-m-h](https://github.com/d-m-h) + *The `spark.binary.version` and `spark.version` properties control which variant to build.* +* **Spark: enable Scala 2.13 support in the `app` module** [`#2432`](https://github.com/OpenLineage/OpenLineage/pull/2432) [@d-m-h](https://github.com/d-m-h) + *Enables the `app` module to be built using both Scala 2.12 and Scala 2.13 variants of various Apache Spark versions, and enables the CI/CD pipeline to build and test them.* +* **Spark: don't fail on exception of `UnknownEntryFacet` creation** [`#2431`](https://github.com/OpenLineage/OpenLineage/pull/2431) [@mobuchowski](https://github.com/mobuchowski) + *Failure to generate `UnknownEntryFacet` was resulting in the event not being sent.* +* **Spark: move Snowflake code into the vendor projects folders** [`#2405`](https://github.com/OpenLineage/OpenLineage/pull/2405) [@mattiabertorello](https://github.com/mattiabertorello) + *Creates a `vendor` folder to isolate Snowflake-specific code from the main Spark integration, enhancing organization and flexibility.* + +### Fixed +* **Flink: resolve PMD rule violation warnings** [`#2403`](https://github.com/OpenLineage/OpenLineage/pull/2403) [@HuangZhenQiu](https://github.com/HuangZhenQiu) + *Resolves the PMD rule violation warnings in the Flink integration module.* +* **Flink: Added the 'isReleaseVersion' property back to the build, enabling the Flink integration to be release** [`#2468`](https://github.com/OpenLineage/OpenLineage/pull/2468) [@d-m-h](https://github.com/d-m-h) + *The 'isReleaseVersion' property was removed from the build, preventing the Flink integration from being released.* +* **Python: fix issue with file config creating additional file** [`#2447`](https://github.com/OpenLineage/OpenLineage/pull/2447) [@kacpermuda](https://github.com/kacpermuda) + *`FileConfig` was creating an additional file when not in append mode. Closes [`#2439`](https://github.com/OpenLineage/OpenLineage/issues/2439).* +* **Python: fix issue with append option in file config** [`#2441`](https://github.com/OpenLineage/OpenLineage/pull/2441) [@kacpermuda](https://github.com/kacpermuda) + *`FileConfig` was ignoring the append key in YAML config. 
Closes [`#2440`](https://github.com/OpenLineage/OpenLineage/issues/2440)* +* **Spark: fix integration catalog symlink without warehouse** [`#2379`](https://github.com/OpenLineage/OpenLineage/pull/2379) [@algorithmy1](https://github.com/algorithmy1) + *In the case of symlinked Glue Catalog Tables, the parsing method was producing dataset names identical to the namespace.* +* **Flink: fix `IcebergSourceWrapper` for Iceberg connector 1.17** [`#2409`](https://github.com/OpenLineage/OpenLineage/pull/2409) [@ensctom](https://github.com/ensctom) + *In Flink 1.17, the Iceberg `catalogloader` was loading the catalog in the open function, causing the `loadTable` method to throw a `NullPointerException` error.* +* **Spark: migrate `spark35`, `spark3`, `shared` modules to produce Scala 2.12 and Scala 2.13 variants** [`#2390`](https://github.com/OpenLineage/OpenLineage/pull/2390) [`#2385`](https://github.com/OpenLineage/OpenLineage/pull/2385)[`#2384`](https://github.com/OpenLineage/OpenLineage/pull/2384) [@d-m-h](https://github.com/d-m-h) + *Migrates the three modules to use the refactored Gradle plugins. Also splits some tests into Scala 2.12- and Scala 2.13-specific versions.* +* **Spark: conform the `spark2` module to the new build process** [`#2391`](https://github.com/OpenLineage/OpenLineage/pull/2391) [@d-m-h](https://github.com/d-m-h) + *Due to a change in the Scala Collections API in Scala 2.13, `NoSuchMethodErrors` were being thrown when running the openlineage-spack connector in an Apache Spark runtime compiled using Scala 2.13.* diff --git a/versioned_docs/version-1.26.0/releases/_category_.json b/versioned_docs/version-1.26.0/releases/_category_.json new file mode 100644 index 0000000..ffa523b --- /dev/null +++ b/versioned_docs/version-1.26.0/releases/_category_.json @@ -0,0 +1,5 @@ +{ + "label": "Releases", + "position": 9 + } + \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/scope.svg b/versioned_docs/version-1.26.0/scope.svg new file mode 100644 index 0000000..041badf --- /dev/null +++ b/versioned_docs/version-1.26.0/scope.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/_category_.json b/versioned_docs/version-1.26.0/spec/_category_.json new file mode 100644 index 0000000..e895446 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Core Specification", + "position": 3 +} diff --git a/versioned_docs/version-1.26.0/spec/facets/_category_.json b/versioned_docs/version-1.26.0/spec/facets/_category_.json new file mode 100644 index 0000000..5984410 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Facets & Extensibility", + "position": 5 +} diff --git a/versioned_docs/version-1.26.0/spec/facets/custom-facets.md b/versioned_docs/version-1.26.0/spec/facets/custom-facets.md new file mode 100644 index 0000000..5303cab --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/custom-facets.md @@ -0,0 +1,541 @@ +--- +title: Custom Facets +sidebar_position: 4 +--- + +# Custom Facets + +In addition to the existing facets mentioned in this documentation, users can extend the base facets and provide their own facet definition as part of the payload in OpenLineage event. 
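+
+Structurally, a custom facet is just one more key inside a `facets` map (of a run, job, or dataset) whose value carries the two mandatory base-facet fields, `_producer` and `_schemaURL`, plus any additional properties you need. A minimal sketch of the relevant part of an event is shown below; the `myCompany_teamInfo` key, its fields, and the producer URL are purely illustrative and not defined anywhere in the spec:
+
+```json
+{
+  "run": {
+    "runId": "753b0c7c-e424-4e10-a5ab-062ae5be43ee",
+    "facets": {
+      "myCompany_teamInfo": {
+        "_producer": "https://example.com/my-pipeline/v1.0.0",
+        "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet",
+        "team": "data-platform",
+        "contact": "data-platform@example.com"
+      }
+    }
+  }
+}
+```
+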
For example, when OpenLineage event is emitted from the Apache Airflow using OpenLineage's Airflow integration, the following facets can be observed: + +```json +{ + "eventTime": "2022-10-03T00:07:56.891667Z", + "eventType": "START", + "inputs": [], + "job": { + "facets": {}, + "name": "inlet_outlet_demo.test-operator", + "namespace": "uninhabited-magnify-7821" + }, + "outputs": [], + "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.13.0/integration/airflow", + "run": { + "facets": { + "airflow_runArgs": { + "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.13.0/integration/airflow", + "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/BaseFacet", + "externalTrigger": true + }, + "airflow_version": { + "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.13.0/integration/airflow", + "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/BaseFacet", + "airflowVersion": "2.3.4+astro.1", + "openlineageAirflowVersion": "0.13.0", + "operator": "airflow.operators.python.PythonOperator", + "taskInfo": { + "_BaseOperator__from_mapped": false, + "_BaseOperator__init_kwargs": { + "depends_on_past": false, + "email": [], + "email_on_failure": false, + "email_on_retry": false, + "op_kwargs": { + "x": "Apache Airflow" + }, + "owner": "demo", + "python_callable": "", + "start_date": "2022-10-02T00:00:00+00:00", + "task_id": "test-operator" + }, + "_BaseOperator__instantiated": true, + "_dag": { + "dag_id": "inlet_outlet_demo", + "tags": [] + }, + "_inlets": [], + "_log": "", + "_outlets": [], + "depends_on_past": false, + "do_xcom_push": true, + "downstream_task_ids": "{'end'}", + "email": [], + "email_on_failure": false, + "email_on_retry": false, + "executor_config": {}, + "ignore_first_depends_on_past": true, + "inlets": [], + "op_args": [], + "op_kwargs": { + "x": "Apache Airflow" + }, + "outlets": [], + "owner": "demo", + "params": "{}", + "pool": "default_pool", + "pool_slots": 1, + "priority_weight": 1, + "python_callable": "", + "queue": "default", + "retries": 0, + "retry_delay": "0:05:00", + "retry_exponential_backoff": false, + "show_return_value_in_logs": true, + "start_date": "2022-10-02T00:00:00+00:00", + "task_group": "", + "task_id": "test-operator", + "trigger_rule": "all_success", + "upstream_task_ids": "{'begin'}", + "wait_for_downstream": false, + "weight_rule": "downstream" + } + }, + "parentRun": { + "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.13.0/integration/airflow", + "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/ParentRunFacet", + "job": { + "name": "inlet_outlet_demo", + "namespace": "uninhabited-magnify-7821" + }, + "run": { + "runId": "4da6f6d2-8902-3b6c-be7e-9269610a8c8f" + } + } + }, + "runId": "753b0c7c-e424-4e10-a5ab-062ae5be43ee" + } +} +``` +Both `airflow_runArgs` and `airflow_version` are not part of the default OpenLineage facets found [here](https://openlineage.io/apidocs/openapi). However, as long as they follow the [BaseFacet](https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet) to contain the two mandatory element `_producer` and `_schemaURL`, it will be accepted and stored as part of the OpenLineage event, and will be able to be retrieved when you query those events. + +Custom facets are not part of the default facets. 
Therefore, it will be treated as a payload data as-is, but applications retrieving those, if they have the capability to understand its structure and use them, should be able to do so without any problems. + +## Example of creating your first custom facet + +Let's look at this sample OpenLineage client code written in python, that defines and uses a custom facet called `my-facet`. + +```python +#!/usr/bin/env python3 +from openlineage.client.run import ( + RunEvent, + RunState, + Run, + Job, + Dataset, + OutputDataset, + InputDataset, +) +from openlineage.client.client import OpenLineageClient, OpenLineageClientOptions +from openlineage.client.facet import ( + BaseFacet, + SqlJobFacet, + SchemaDatasetFacet, + SchemaField, + SourceCodeLocationJobFacet, + NominalTimeRunFacet, +) +from openlineage.client.uuid import generate_new_uuid +from datetime import datetime, timezone, timedelta +from typing import List +import attr +from random import random + +import logging, os +logging.basicConfig(level=logging.DEBUG) + +PRODUCER = f"https://github.com/openlineage-user" +namespace = "python_client" + +url = "http://localhost:5000" +api_key = "1234567890ckcu028rzu5l" + +client = OpenLineageClient( + url=url, + # optional api key in case the backend requires it + options=OpenLineageClientOptions(api_key=api_key), +) + +# generates job facet +def job(job_name, sql, location): + facets = { + "sql": SqlJobFacet(sql) + } + if location != None: + facets.update( + {"sourceCodeLocation": SourceCodeLocationJobFacet("git", location)} + ) + return Job(namespace=namespace, name=job_name, facets=facets) + +@attr.s +class MyFacet(BaseFacet): + name: str = attr.ib() + age: str = attr.ib() + email: str = attr.ib() + _additional_skip_redact: List[str] = ['name', 'age', 'email'] + def __init__(self, name, age, email): + super().__init__() + self.name = name + self.age = age + self.email = email + +# geneartes run racet +def run(run_id, hour, name, age, email): + return Run( + runId=run_id, + facets={ + "nominalTime": NominalTimeRunFacet( + nominalStartTime=f"2022-04-14T{twoDigits(hour)}:12:00Z" + ), + "my_facet": MyFacet(name, age, email) + }, + ) + +# generates dataset +def dataset(name, schema=None, ns=namespace): + if schema == None: + facets = {} + else: + facets = {"schema": schema} + return Dataset(namespace, name, facets) + + +# generates output dataset +def outputDataset(dataset, stats): + output_facets = {"stats": stats, "outputStatistics": stats} + return OutputDataset(dataset.namespace, dataset.name, dataset.facets, output_facets) + + +# generates input dataset +def inputDataset(dataset, dq): + input_facets = { + "dataQuality": dq, + } + return InputDataset(dataset.namespace, dataset.name, dataset.facets, input_facets) + + +def twoDigits(n): + if n < 10: + result = f"0{n}" + elif n < 100: + result = f"{n}" + else: + raise f"error: {n}" + return result + + +now = datetime.now(timezone.utc) + + +# generates run Event +def runEvents(job_name, sql, inputs, outputs, hour, min, location, duration): + run_id = str(generate_new_uuid()) + myjob = job(job_name, sql, location) + myrun = run(run_id, hour, 'user_1', 25, 'user_1@email.com') + st = now + timedelta(hours=hour, minutes=min, seconds=20 + round(random() * 10)) + end = st + timedelta(minutes=duration, seconds=20 + round(random() * 10)) + started_at = st.isoformat() + ended_at = end.isoformat() + return ( + RunEvent( + eventType=RunState.START, + eventTime=started_at, + run=myrun, + job=myjob, + producer=PRODUCER, + inputs=inputs, + outputs=outputs, + ), + 
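+        # The second element of the returned tuple is the COMPLETE event: it mirrors
+        # the START event above but uses the computed end time (ended_at).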
RunEvent( + eventType=RunState.COMPLETE, + eventTime=ended_at, + run=myrun, + job=myjob, + producer=PRODUCER, + inputs=inputs, + outputs=outputs, + ), + ) + + +# add run event to the events list +def addRunEvents( + events, job_name, sql, inputs, outputs, hour, minutes, location=None, duration=2 +): + (start, complete) = runEvents( + job_name, sql, inputs, outputs, hour, minutes, location, duration + ) + events.append(start) + events.append(complete) + +events = [] + +# create dataset data +for i in range(0, 5): + + user_counts = dataset("tmp_demo.user_counts") + user_history = dataset( + "temp_demo.user_history", + SchemaDatasetFacet( + fields=[ + SchemaField(name="id", type="BIGINT", description="the user id"), + SchemaField( + name="email_domain", type="VARCHAR", description="the user id" + ), + SchemaField(name="status", type="BIGINT", description="the user id"), + SchemaField( + name="created_at", + type="DATETIME", + description="date and time of creation of the user", + ), + SchemaField( + name="updated_at", + type="DATETIME", + description="the last time this row was updated", + ), + SchemaField( + name="fetch_time_utc", + type="DATETIME", + description="the time the data was fetched", + ), + SchemaField( + name="load_filename", + type="VARCHAR", + description="the original file this data was ingested from", + ), + SchemaField( + name="load_filerow", + type="INT", + description="the row number in the original file", + ), + SchemaField( + name="load_timestamp", + type="DATETIME", + description="the time the data was ingested", + ), + ] + ), + "snowflake://", + ) + + create_user_counts_sql = """CREATE OR REPLACE TABLE TMP_DEMO.USER_COUNTS AS ( + SELECT DATE_TRUNC(DAY, created_at) date, COUNT(id) as user_count + FROM TMP_DEMO.USER_HISTORY + GROUP BY date + )""" + + # location of the source code + location = "https://github.com/some/airflow/dags/example/user_trends.py" + + # run simulating Airflow DAG with snowflake operator + addRunEvents( + events, + "create_user_counts", + create_user_counts_sql, + [user_history], + [user_counts], + i, + 11, + location, + ) + + +for event in events: + from openlineage.client.serde import Serde + # print(Serde.to_json(event)) + # time.sleep(1) + client.emit(event) + +``` + +As you can see in the source code, there is a class called `MyFacet` which extends from the `BaseFacet` of OpenLineage, having three attributes of `name`, `age`, and `email`. + +```python +@attr.s +class MyFacet(BaseFacet): + name: str = attr.ib() + age: str = attr.ib() + email: str = attr.ib() + _additional_skip_redact: List[str] = ['name', 'age', 'email'] + def __init__(self, name, age, email): + super().__init__() + self.name = name + self.age = age + self.email = email +``` + +And, when the application is generating a Run data, you can see the instantiation of `MyFacet`, having the name `my_facet`. + +```python +def run(run_id, hour, name, age, email): + return Run( + runId=run_id, + facets={ + "nominalTime": NominalTimeRunFacet( + nominalStartTime=f"2022-04-14T{twoDigits(hour)}:12:00Z" + ), + "my_facet": MyFacet(name, age, email) + }, + ) +``` + +When you run this application with python (and please make sure you have installed `openlineage-python` using pip before running it), you will see a series of JSON output that represents the OpenLineage events being submitted. Here is one example. 
+ +```json +{ + "eventTime": "2022-12-09T09:17:28.239394+00:00", + "eventType": "COMPLETE", + "inputs": [ + { + "facets": { + "schema": { + "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", + "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SchemaDatasetFacet", + "fields": [ + { + "description": "the user id", + "name": "id", + "type": "BIGINT" + }, + { + "description": "the user id", + "name": "email_domain", + "type": "VARCHAR" + }, + { + "description": "the user id", + "name": "status", + "type": "BIGINT" + }, + { + "description": "date and time of creation of the user", + "name": "created_at", + "type": "DATETIME" + }, + { + "description": "the last time this row was updated", + "name": "updated_at", + "type": "DATETIME" + }, + { + "description": "the time the data was fetched", + "name": "fetch_time_utc", + "type": "DATETIME" + }, + { + "description": "the original file this data was ingested from", + "name": "load_filename", + "type": "VARCHAR" + }, + { + "description": "the row number in the original file", + "name": "load_filerow", + "type": "INT" + }, + { + "description": "the time the data was ingested", + "name": "load_timestamp", + "type": "DATETIME" + } + ] + } + }, + "name": "temp_demo.user_history", + "namespace": "python_client" + } + ], + "job": { + "facets": { + "sourceCodeLocation": { + "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", + "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SourceCodeLocationJobFacet", + "type": "git", + "url": "https://github.com/some/airflow/dags/example/user_trends.py" + }, + "sql": { + "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", + "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SqlJobFacet", + "query": "CREATE OR REPLACE TABLE TMP_DEMO.USER_COUNTS AS (\n\t\t\tSELECT DATE_TRUNC(DAY, created_at) date, COUNT(id) as user_count\n\t\t\tFROM TMP_DEMO.USER_HISTORY\n\t\t\tGROUP BY date\n\t\t\t)" + } + }, + "name": "create_user_counts", + "namespace": "python_client" + }, + "outputs": [ + { + "facets": {}, + "name": "tmp_demo.user_counts", + "namespace": "python_client" + } + ], + "producer": "https://github.com/openlineage-user", + "run": { + "facets": { + "my_facet": { + "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", + "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/BaseFacet", + "age": 25, + "email": "user_1@email.com", + "name": "user_1" + }, + "nominalTime": { + "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", + "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/NominalTimeRunFacet", + "nominalStartTime": "2022-04-14T04:12:00Z" + } + }, + "runId": "7886a902-8fec-422f-9ee4-818489e59f5f" + } +} +``` + +Notice the facet information `my_facet` that has is now part of the OpenLineage event. +```json + ... + "run": { + "facets": { + "my_facet": { + "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python", + "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/BaseFacet", + "age": 25, + "email": "user_1@email.com", + "name": "user_1" + }, + ... 
```

The OpenLineage backend should be able to store this information when submitted, and later, when you access the lineage, you should be able to view the facet information you submitted, including your custom facet. Below is a screenshot from one OpenLineage backend, [Marquez](https://marquezproject.ai/), showing the custom facet that the application submitted.

![image](./custom-facets.png)

You might have noticed that the schema URL is actually that of `BaseFacet`. By default, if the facet class does not specify its own schema URL, the value falls back to that of `BaseFacet`. From the point of view of the OpenLineage specification, this is legal. However, if you have your own JSON spec defined and have it publicly accessible, you can specify it by overriding the `_get_schema` function:

```python
@attr.s
class MyFacet(BaseFacet):
    name: str = attr.ib()
    age: str = attr.ib()
    email: str = attr.ib()
    _additional_skip_redact: List[str] = ['name', 'age', 'email']
    def __init__(self, name, age, email):
        super().__init__()
        self.name = name
        self.age = age
        self.email = email

    @staticmethod
    def _get_schema() -> str:
        return "https://somewhere/schemas/myfacet.json#/definitions/MyFacet"
```

And the `_schemaURL` of the OpenLineage event would now reflect the change:

```json
    "run": {
      "facets": {
        "my_facet": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/client/python",
          "_schemaURL": "https://somewhere/schemas/myfacet.json#/definitions/MyFacet",
          "age": 25,
          "email": "user_1@email.com",
          "name": "user_1"
        },
```
diff --git a/versioned_docs/version-1.26.0/spec/facets/custom-facets.png b/versioned_docs/version-1.26.0/spec/facets/custom-facets.png
new file mode 100644
index 0000000..9af5a42
Binary files /dev/null and b/versioned_docs/version-1.26.0/spec/facets/custom-facets.png differ
diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/_category_.json b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/_category_.json
new file mode 100644
index 0000000..d358b6a
--- /dev/null
+++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/_category_.json
@@ -0,0 +1,4 @@
{
  "label": "Dataset Facets",
  "position": 3
}
\ No newline at end of file
diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/column_lineage_facet.md b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/column_lineage_facet.md
new file mode 100644
index 0000000..1d36d33
--- /dev/null
+++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/column_lineage_facet.md
@@ -0,0 +1,262 @@
---
sidebar_position: 1
---

# Column Level Lineage Dataset Facet

Column-level lineage provides fine-grained information on datasets' dependencies.
Not only do we know that a dependency exists, we can also understand
which input columns are used to produce which output columns, and in what way.
+This allows answering questions like *Which root input columns are used to construct column x?* + +For example, a Job might executes the following query: + +```sql +INSERT INTO top_delivery_times ( + order_id, + order_placed_on, + order_delivered_on, + order_delivery_time +) +SELECT + order_id, + order_placed_on, + order_delivered_on, + DATEDIFF(minute, order_placed_on, order_delivered_on) AS order_delivery_time, +FROM delivery_7_days +ORDER BY order_delivery_time DESC +LIMIT 1; +``` + +This would establish the following relationships between the `delivery_7_days` and `top_delivery_times` tables: + +![image](./column_lineage_facet.svg) + +An OpenLinage run state update that represent this query using column-level lineage facets might look like: + +```json +{ + "eventType": "START", + "eventTime": "2020-02-22T22:42:42.000Z", + "run": ..., + "job": ..., + "inputs": [ + { + "namespace": "food_delivery", + "name": "public.delivery_7_days" + } + ], + "outputs": [ + { + "namespace": "food_delivery", + "name": "public.top_delivery_times", + "facets": { + "columnLineage": { + "_producer": "https://github.com/MarquezProject/marquez/blob/main/docker/metadata.json", + "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json", + "fields": { + "order_id": { + "inputFields": [ + { + "namespace": "food_delivery", + "name": "public.delivery_7_days", + "field": "order_id", + "transformations": [ + { + "type": "DIRECT", + "subtype": "IDENTITY", + "description": "", + "masking": false + } + ] + }, + { + "namespace": "food_delivery", + "name": "public.delivery_7_days", + "field": "order_placed_on", + "transformations": [ + { + "type": "INDIRECT", + "subtype": "SORT", + "description": "", + "masking": false + } + ] + }, + { + "namespace": "food_delivery", + "name": "public.delivery_7_days", + "field": "order_delivered_on", + "transformations": [ + { + "type": "INDIRECT", + "subtype": "SORT", + "description": "", + "masking": false + } + ] + } + ] + }, + "order_placed_on": { + "inputFields": [ + { + "namespace": "food_delivery", + "name": "public.delivery_7_days", + "field": "order_placed_on", + "transformations": [ + { + "type": "DIRECT", + "subtype": "IDENTITY", + "description": "", + "masking": false + }, + { + "type": "INDIRECT", + "subtype": "SORT", + "description": "", + "masking": false + } + ] + }, + { + "namespace": "food_delivery", + "name": "public.delivery_7_days", + "field": "order_delivered_on", + "transformations": [ + { + "type": "INDIRECT", + "subtype": "SORT", + "description": "", + "masking": false + } + ] + } + ] + }, + "order_delivered_on": { + "inputFields": [ + { + "namespace": "food_delivery", + "name": "public.delivery_7_days", + "field": "order_delivered_on", + "transformations": [ + { + "type": "DIRECT", + "subtype": "IDENTITY", + "description": "", + "masking": false + }, + { + "type": "INDIRECT", + "subtype": "SORT", + "description": "", + "masking": false + } + ] + }, + { + "namespace": "food_delivery", + "name": "public.delivery_7_days", + "field": "order_placed_on", + "transformations": [ + { + "type": "INDIRECT", + "subtype": "SORT", + "description": "", + "masking": false + } + ] + } + ] + }, + "order_delivery_time": { + "inputFields": [ + { + "namespace": "food_delivery", + "name": "public.delivery_7_days", + "field": "order_placed_on", + "transformations": [ + { + "type": "DIRECT", + "subtype": "TRANSFORMATION", + "description": "", + "masking": false + }, + { + "type": "INDIRECT", + "subtype": "SORT", + "description": "", + "masking": false + } 
+ ] + }, + { + "namespace": "food_delivery", + "name": "public.delivery_7_days", + "field": "order_delivered_on", + "transformations": [ + { + "type": "DIRECT", + "subtype": "TRANSFORMATION", + "description": "", + "masking": false + }, + { + "type": "INDIRECT", + "subtype": "SORT", + "description": "", + "masking": false + } + ] + } + ] + } + } + } + } + } + ], + ... +} +``` + +The facet specification can be found [here](https://openlineage.io/spec/facets/1-1-0/ColumnLineageDatasetFacet.json). + +## Transformation Type + +To provide the best information about each field lineage, each `inputField` of an output can contain +the `transformations` field. This field describes what is the nature of relation between the input and the output columns. +Each transformation is described by 4 fields: `type`, `subtype`, `description` and `masking`. + +#### Type +Indicates how direct is the relationship e.g. in query +```roomsql +SELECT + source AS result +FROM TAB +WHERE pred = true; +``` +1. `DIRECT` - output column value was somehow derived from `inputField` value. In example `result` value is derived from `source` +2. `INDIRECT` - output column value is impacted by the value of `inputField` column, but it's not derived from it. In example no part `result` value is derived from `pred` but `pred` has impact on the values of `result` in the output dataset + +#### Subtype +Contains more specific information about the transformation + +Direct +- Identity - output value is taken as is from the input +- Transformation - output value is transformed source value from input row +- Aggregation - output value is aggregation of source values from multiple input rows + +Indirect +- Join - input used in join condition +- GroupBy - output is aggregated based on input (e.g. `GROUP BY` clause) +- Filter - input used as a filtering condition (e.g. `WHERE` clause) +- Order - output is sorted based on input field +- Window - output is windowed based on input field +- Conditional - input value is used in `IF` of `CASE WHEN` statements + +#### Masking +Boolean value indicating if the input value was obfuscated during the transformation. +The examples are: `hash` for Transformation and `count` for Aggregation. +List of available methods that are considered masking is dependent on the source system. diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/column_lineage_facet.svg b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/column_lineage_facet.svg new file mode 100644 index 0000000..86fcdce --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/column_lineage_facet.svg @@ -0,0 +1,59 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/data_quality_assertions.md b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/data_quality_assertions.md new file mode 100644 index 0000000..41eeb0b --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/data_quality_assertions.md @@ -0,0 +1,35 @@ +--- +sidebar_position: 3 +--- + +# Data Quality Assertions Facet + +Example: + +```json +{ + ... 
  "inputs": {
    "facets": {
      "dataQualityAssertions": {
        "_producer": "https://some.producer.com/version/1.0",
        "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DataQualityAssertionsDatasetFacet.json",
        "assertions": [
          {
            "assertion": "not_null",
            "success": true,
            "column": "user_name"
          },
          {
            "assertion": "is_string",
            "success": true,
            "column": "user_name"
          }
        ]
      }
    }
  }
  ...
}
```
The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/DataQualityAssertionsDatasetFacet.json).
\ No newline at end of file
diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/data_source.md b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/data_source.md
new file mode 100644
index 0000000..5d176c1
--- /dev/null
+++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/data_source.md
@@ -0,0 +1,25 @@
---
sidebar_position: 2
---

# Datasource Facet

Example:

```json
{
  ...
  "inputs": {
    "facets": {
      "dataSource": {
        "_producer": "https://some.producer.com/version/1.0",
        "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json",
        "name": "datasource_one",
        "url": "https://some.location.com/datasource/one"
      }
    }
  }
  ...
}
```
The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json).
\ No newline at end of file
diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/dataset-facets.md b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/dataset-facets.md
new file mode 100644
index 0000000..2bf51ea
--- /dev/null
+++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/dataset-facets.md
@@ -0,0 +1,36 @@
---
sidebar_position: 1
---

# Dataset Facets

Dataset Facets generally consist of common facets that are used in both the `inputs` and `outputs` of an OpenLineage event. There are also facets that exist specifically for input or output datasets.

```json
{
  ...
  "inputs": [{
    "namespace": "postgres://workshop-db:None",
    "name": "workshop.public.taxes-in",
    "facets": {
      # This is where the common dataset facets are located
    },
    "inputFacets": {
      # This is where the input dataset facets are located
    }
  }],
  "outputs": [{
    "namespace": "postgres://workshop-db:None",
    "name": "workshop.public.taxes-out",
    "facets": {
      # This is where the common dataset facets are located
    },
    "outputFacets": {
      # This is where the output dataset facets are located
    }
  }],
  ...
}
```

In the example above, notice the distinction between facets that are common to both input and output datasets and facets that are specific to inputs or outputs. The common facets all reside under the `facets` property, while input- or output-specific facets live under the `inputFacets` or `outputFacets` property, respectively.
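For illustration, here is a minimal, hypothetical sketch of the same distinction using the `openlineage-python` client: common facets (such as `schema`) are passed through a dataset's `facets`, while input- or output-specific facets go into `inputFacets` or `outputFacets`. The dataset names and statistics are placeholders, and the exact facet constructors may vary between client versions.

```python
from openlineage.client.facet import (
    OutputStatisticsOutputDatasetFacet,
    SchemaDatasetFacet,
    SchemaField,
)
from openlineage.client.run import InputDataset, OutputDataset

# A common dataset facet: valid under "facets" for both inputs and outputs.
schema = SchemaDatasetFacet(
    fields=[SchemaField(name="id", type="int", description="Customer's identifier")]
)

taxes_in = InputDataset(
    namespace="postgres://workshop-db:None",
    name="workshop.public.taxes-in",
    facets={"schema": schema},  # common dataset facets
    inputFacets={},             # input-specific facets (e.g. dataQualityMetrics) go here
)

taxes_out = OutputDataset(
    namespace="postgres://workshop-db:None",
    name="workshop.public.taxes-out",
    facets={"schema": schema},  # common dataset facets
    outputFacets={              # output-specific facets
        "outputStatistics": OutputStatisticsOutputDatasetFacet(rowCount=100, size=35602),
    },
)
```

Datasets built this way can then be passed as the `inputs` and `outputs` of a `RunEvent`, as in the custom facet example earlier in this document.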
diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/input-dataset-facets/_category_.json b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/input-dataset-facets/_category_.json new file mode 100644 index 0000000..092b82b --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/input-dataset-facets/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Input Dataset Facets", + "position": 100 +} \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/input-dataset-facets/data_quality_metrics.md b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/input-dataset-facets/data_quality_metrics.md new file mode 100644 index 0000000..ba82b85 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/input-dataset-facets/data_quality_metrics.md @@ -0,0 +1,70 @@ +--- +sidebar_position: 1 +--- + +# Data Quality Metrics Facet + +Example: + +```json +{ + ... + "inputs": { + "inputFacets": { + "dataQualityMetrics": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://openlineage.io/spec/facets/1-0-2/DataQualityMetricsInputDatasetFacet.json", + "rowCount": 123, + "fileCount": 5, + "bytes": 35602, + "columnMetrics": { + "column_one": { + "nullCount": 132, + "distincCount": 11, + "sum": 500, + "count": 234, + "min": 111, + "max": 3234, + "quantiles": { + "0.1": 12, + "0.5": 22, + "1": 123, + "2": 11 + } + }, + "column_two": { + "nullCount": 132, + "distinctCount": 11, + "sum": 500, + "count": 234, + "min": 111, + "max": 3234, + "quantiles": { + "0.1": 12, + "0.5": 22, + "1": 123, + "2": 11 + } + }, + "column_three": { + "nullCount": 132, + "distincCount": 11, + "sum": 500, + "count": 234, + "min": 111, + "max": 3234, + "quantiles": { + "0.1": 12, + "0.5": 22, + "1": 123, + "2": 11 + } + } + } + } + } + } + ... +} +``` +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-2/DataQualityMetricsInputDatasetFacet.json). \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/input-dataset-facets/input_statistics.md b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/input-dataset-facets/input_statistics.md new file mode 100644 index 0000000..760539d --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/input-dataset-facets/input_statistics.md @@ -0,0 +1,26 @@ +--- +sidebar_position: 1 +--- + +# Input Statistics Facet + +Example: + +```json +{ + ... + "inputs": { + "inputFacets": { + "inputStatistics": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/InputStatisticsInputDatasetFacet.json", + "rowCount": 123, + "fileCount": 5, + "size": 35602 + } + } + } + ... +} +``` +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/InputStatisticsInputDatasetFacet.json). diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/lifecycle_state_change.md b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/lifecycle_state_change.md new file mode 100644 index 0000000..0e87fc6 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/lifecycle_state_change.md @@ -0,0 +1,44 @@ +--- +sidebar_position: 4 +--- + +# Lifecycle State Change Facet + +Example: + +```json +{ + ... 
+ "outputs": { + "facets": { + "lifecycleStateChange": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json", + "lifecycleStateChange": "CREATE" + } + } + } + ... +} +``` + +```json +{ + ... + "outputs": { + "facets": { + "lifecycleStateChange": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json", + "lifecycleStateChange": "RENAME", + "previousIdentifier": { + "namespace": "example_namespace", + "name": "example_table_1" + } + } + } + } + ... +} +``` +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json). \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/output-dataset-facets/_category_.json b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/output-dataset-facets/_category_.json new file mode 100644 index 0000000..21117f1 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/output-dataset-facets/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Output Dataset Facets", + "position": 101 +} \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/output-dataset-facets/output_statistics.md b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/output-dataset-facets/output_statistics.md new file mode 100644 index 0000000..7523782 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/output-dataset-facets/output_statistics.md @@ -0,0 +1,26 @@ +--- +sidebar_position: 1 +--- + +# Output Statistics Facet + +Example: + +```json +{ + ... + "outputs": { + "outputFacets": { + "outputStatistics": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://openlineage.io/spec/facets/1-0-2/OutputStatisticsOutputDatasetFacet.json", + "rowCount": 123, + "fileCount": 5, + "size": 35602 + } + } + } + ... +} +``` +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-2/OutputStatisticsOutputDatasetFacet.json). diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/ownership.md b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/ownership.md new file mode 100644 index 0000000..f84dadd --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/ownership.md @@ -0,0 +1,30 @@ +--- +sidebar_position: 5 +--- + +# Ownership Dataset Facet + +Example: + +```json +{ + ... + "inputs": { + "facets": { + "ownership": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/OwnershipDatasetFacet.json", + "owners": [ + { + "name": "maintainer_one", + "type": "MAINTAINER" + } + ] + } + } + } + ... +} +``` + +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/OwnershipDatasetFacet.json). \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/schema.md b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/schema.md new file mode 100644 index 0000000..1d7b7b2 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/schema.md @@ -0,0 +1,115 @@ +--- +sidebar_position: 6 +--- + +# Schema Dataset Facet + +The schema dataset facet contains the schema of a particular dataset. +Besides a name, it provides an optional type and description of each field. 
+ +Nested fields are supported as well. + + +Example: + +```json +{ + ... + "inputs": { + "facets": { + "schema": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://openlineage.io/spec/facets/1-1-1/SchemaDatasetFacet.json", + "fields": [ + { + "name": "id", + "type": "int", + "description": "Customer's identifier" + }, + { + "name": "name", + "type": "string", + "description": "Customer's name" + }, + { + "name": "is_active", + "type": "boolean", + "description": "Has customer completed activation process" + }, + { + "name": "phones", + "type": "array", + "description": "List of phone numbers", + "fields": [ + { + "name": "_element", + "type": "string", + "description": "Phone number" + } + ] + }, + { + "name": "address", + "type": "struct", + "description": "Customer address", + "fields": [ + { + "name": "type", + "type": "string", + "description": "Address type, g.e. home, work, etc." + }, + { + "name": "country", + "type": "string", + "description": "Country name" + }, + { + "name": "zip", + "type": "string", + "description": "Zip code" + }, + { + "name": "state", + "type": "string", + "description": "State name" + }, + { + "name": "street", + "type": "string", + "description": "Street name" + } + ] + }, + { + "name": "custom_properties", + "type": "map", + "fields": [ + { + "name": "key", + "type": "string" + }, + { + "name": "value", + "type": "union", + "fields": [ + { + "name": "_0", + "type": "string" + }, + { + "name": "_1", + "type": "int64" + } + ] + } + ] + } + ] + } + } + } + ... +} +``` + +The facet specification can be found [here](https://openlineage.io/spec/facets/1-1-1/SchemaDatasetFacet.json). \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/storage.md b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/storage.md new file mode 100644 index 0000000..0a79aea --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/storage.md @@ -0,0 +1,25 @@ +--- +sidebar_position: 7 +--- + +# Storage Facet + +Example: + +```json +{ + ... + "inputs": { + "facets": { + "storage": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/StorageDatasetFacet.json", + "storageLayer": "iceberg", + "fileFormat": "csv" + } + } + } + ... +} +``` +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/StorageDatasetFacet.json). \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/symlinks.md b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/symlinks.md new file mode 100644 index 0000000..7180823 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/symlinks.md @@ -0,0 +1,28 @@ +--- +sidebar_position: 8 +--- + +# Symlinks Facet + +Example: + +```json +{ + ... + "inputs": { + "facets": { + "symlinks": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json", + "identifiers": [ + "namespace": "example_namespace", + "name": "example_dataset_1", + "type": "table" + ] + } + } + } + ... +} +``` +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json). 
\ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/dataset-facets/version_facet.md b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/version_facet.md new file mode 100644 index 0000000..b7c9906 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/dataset-facets/version_facet.md @@ -0,0 +1,24 @@ +--- +sidebar_position: 9 +--- + +# Version Facet + +Example: + +```json +{ + ... + "inputs": { + "facets": { + "version": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json", + "datasetVersion": "1" + } + } + } + ... +} +``` +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json). \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/facets.md b/versioned_docs/version-1.26.0/spec/facets/facets.md new file mode 100644 index 0000000..cf3c6c5 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/facets.md @@ -0,0 +1,96 @@ +--- +sidebar_position: 4 +--- + +# Facets & Extensibility + +Facets provide context to the OpenLineage events. Generally, an OpenLineage event contains the type of the event, who created it, and when the event happened. In addition to the basic information related to the event, it provides `facets` for more details in four general categories: + +- job: What kind of activity ran +- run: How it ran +- inputs: What was used during its run +- outputs: What was the outcome of the run + +Here is an example of the four facets in action. Notice the element `facets` under each of the four categories of the OpenLineage event: + +```json +{ + "eventType": "START", + "eventTime": "2020-12-28T19:52:00.001+10:00", + "run": { + "runId": "d46e465b-d358-4d32-83d4-df660ff614dd", + "facets": { + "parent": { + "job": { + "name": "dbt-execution-parent-job", + "namespace": "dbt-namespace" + }, + "run": { + "runId": "f99310b4-3c3c-1a1a-2b2b-c1b95c24ff11" + } + } + } + }, + "job": { + "namespace": "workshop", + "name": "process_taxes", + "facets": { + "sql": { + "query": "insert into taxes_out select id, name, is_active from taxes_in" + } + } + }, + "inputs": [{ + "namespace": "postgres://workshop-db:None", + "name": "workshop.public.taxes-in", + "facets": { + "schema": { + "fields": [ + { + "name": "id", + "type": "int", + "description": "Customer's identifier" + }, + { + "name": "name", + "type": "string", + "description": "Customer's name" + }, + { + "name": "is_active", + "type": "boolean", + "description": "Has customer completed activation process" + } + ] + } + } + }], + "outputs": [{ + "namespace": "postgres://workshop-db:None", + "name": "workshop.public.taxes-out", + "facets": { + "schema": { + "fields": [ + { + "name": "id", + "type": "int", + "description": "Customer's identifier" + }, + { + "name": "name", + "type": "string", + "description": "Customer's name" + }, + { + "name": "is_active", + "type": "boolean", + "description": "Has customer completed activation process" + } + ] + } + } + }], + "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client" +} +``` +For more information of what kind of facets are available as part of OpenLineage spec, please refer to the sub sections `Run Facets`, `Job Facets`, and `Dataset Facets` of this document. 
diff --git a/versioned_docs/version-1.26.0/spec/facets/job-facets/_category_.json b/versioned_docs/version-1.26.0/spec/facets/job-facets/_category_.json new file mode 100644 index 0000000..dbfbefb --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/job-facets/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Job Facets", + "position": 2 +} \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/job-facets/documentation.md b/versioned_docs/version-1.26.0/spec/facets/job-facets/documentation.md new file mode 100644 index 0000000..fa3e49a --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/job-facets/documentation.md @@ -0,0 +1,28 @@ +--- +sidebar_position: 1 +--- + +# Documentation Facet + +Contains the documentation or description of the job. + +Example: + +```json +{ + ... + "job": { + "facets": { + "documentation": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DocumentationJobFacet.json", + "description": "This is the documentation of something." + } + } + } + ... +} +``` + + +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/DocumentationJobFacet.json) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/job-facets/job-facets.md b/versioned_docs/version-1.26.0/spec/facets/job-facets/job-facets.md new file mode 100644 index 0000000..63229f5 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/job-facets/job-facets.md @@ -0,0 +1,7 @@ +--- +sidebar_position: 1 +--- + +# Job Facets + +Job Facets apply to a distinct instance of a job: an abstract `process` that consumes, executes, and produces datasets (defined as its inputs and outputs). It is identified by a `unique name` within a `namespace`. The *Job* evolves over time and this change is captured during the job runs. diff --git a/versioned_docs/version-1.26.0/spec/facets/job-facets/job-type.md b/versioned_docs/version-1.26.0/spec/facets/job-facets/job-type.md new file mode 100644 index 0000000..6005b48 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/job-facets/job-type.md @@ -0,0 +1,44 @@ +--- +sidebar_position: 6 +--- + +# Job type Job Facet + +Facet to contain job properties like: + * `processingType` which can be `STREAMING` or `BATCH`, + * `integration` which can be `SPARK|DBT|AIRFLOW|FLINK`, + * `jobType` which can be `QUERY|COMMAND|DAG|TASK|JOB|MODEL`. + +Example: + +```json +{ + ... + "job": { + "facets": { + "jobType": { + "processingType": "BATCH", + "integration": "SPARK", + "jobType": "QUERY", + "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client", + "_schemaURL": "https://openlineage.io/spec/facets/2-0-2/JobTypeJobFacet.json" + } + } + ... 
+} +``` + +The examples for specific integrations: + + * Integration: `SPARK` + * Processing type: `STREAM`|`BATCH` + * Job type: `JOB`|`COMMAND` + * Integration: `AIRFLOW` + * Processing type: `BATCH` + * Job type: `DAG`|`TASK` + * Integration: `DBT` + * ProcessingType: `BATCH` + * JobType: `PROJECT`|`MODEL` + * Integration: `FLINK` + * Processing type: `STREAMING`|`BATCH` + * Job type: `JOB` diff --git a/versioned_docs/version-1.26.0/spec/facets/job-facets/ownership.md b/versioned_docs/version-1.26.0/spec/facets/job-facets/ownership.md new file mode 100644 index 0000000..b27560c --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/job-facets/ownership.md @@ -0,0 +1,34 @@ +--- +sidebar_position: 2 +--- + +# Ownership Job Facet + + +The facet that contains the information regarding users or group who owns this particular job. + +Example: + +```json +{ + ... + "job": { + "facets": { + "ownership": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/OwnershipJobFacet.json", + "owners": [ + { + "name": "maintainer_one", + "type": "MAINTAINER" + } + ] + } + } + } + ... +} +``` + + +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/OwnershipJobFacet.json) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/job-facets/source-code-location.md b/versioned_docs/version-1.26.0/spec/facets/job-facets/source-code-location.md new file mode 100644 index 0000000..7116352 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/job-facets/source-code-location.md @@ -0,0 +1,34 @@ +--- +sidebar_position: 4 +--- + +# Source Code Location Facet + +The facet that indicates where the source code is located. + +Example: + +```json +{ + ... + "job": { + "facets": { + "sourceCodeLocation": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/SourceCodeLocationJobFacet.json", + "type": "git|svn", + "url": "https://github.com/MarquezProject/marquez-airflow-quickstart/blob/693e35482bc2e526ced2b5f9f76ef83dec6ec691/dags/hello.py", + "repoUrl": "git@github.com:{org}/{repo}.git or https://github.com/{org}/{repo}.git|svn:///", + "path": "path/to/my/dags", + "version": "git: the git sha | Svn: the revision number", + "tag": "example", + "branch": "main" + } + } + } + ... +} +``` + + +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/SourceCodeLocationJobFacet.json) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/job-facets/source-code.md b/versioned_docs/version-1.26.0/spec/facets/job-facets/source-code.md new file mode 100644 index 0000000..abfa374 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/job-facets/source-code.md @@ -0,0 +1,29 @@ +--- +sidebar_position: 3 +--- + +# Source Code Facet + +The source code of a particular job (e.g. Python script) + +Example: + +```json +{ + ... + "job": { + "facets": { + "sourceCode": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/SourceCodeJobFacet.json", + "language": "python", + "sourceCode": "print('hello, OpenLineage!')" + } + } + } + ... 
+} +``` + + +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/SourceCodeJobFacet.json) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/job-facets/sql.md b/versioned_docs/version-1.26.0/spec/facets/job-facets/sql.md new file mode 100644 index 0000000..320de5f --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/job-facets/sql.md @@ -0,0 +1,29 @@ +--- +sidebar_position: 5 +--- + + +# SQL Job Facet + +The SQL Job Facet contains a SQL query that was used in a particular job. + +Example: + +```json +{ + ... + "job": { + "facets": { + "sql": { + "_producer": "https://some.producer.com/version/1.0", + "_schemaURL": "https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/SQLJobFacet.json", + "query": "select id, name from schema.table where id = 1" + } + } + } + ... +} +``` + + +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/SQLJobFacet.json) \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/run-facets/_category_.json b/versioned_docs/version-1.26.0/spec/facets/run-facets/_category_.json new file mode 100644 index 0000000..1b30c5a --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/run-facets/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Run Facets", + "position": 1 +} \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/run-facets/environment_variables.md b/versioned_docs/version-1.26.0/spec/facets/run-facets/environment_variables.md new file mode 100644 index 0000000..b142986 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/run-facets/environment_variables.md @@ -0,0 +1,13 @@ +--- +sidebar_position: 6 +--- + +# Environment Variables Run Facet +The Environment Variables Run Facet provides detailed information about the environment variables that were set during the execution of a job. This facet is useful for capturing the runtime environment configuration, which can be used for categorizing and filtering jobs based on their environment settings. + +| Property | Description | Type | Example | Required | +|-----------------------|-----------------------------------------------------------------------------|--------|---------------------------|----------| +| name | The name of the environment variable. This helps in identifying the specific environment variable used during the job run. | string | "JAVA_HOME" | Yes | +| value | The value of the environment variable. This captures the actual value set for the environment variable during the job run. | string | "/usr/lib/jvm/java-11" | Yes | + +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/EnvironmentVariablesRunFacet.json). \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/run-facets/error_message.md b/versioned_docs/version-1.26.0/spec/facets/run-facets/error_message.md new file mode 100644 index 0000000..219ab0e --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/run-facets/error_message.md @@ -0,0 +1,31 @@ +--- +sidebar_position: 1 +--- + + +# Error Message Facet + +The facet to contain information about the failures during the run of the job. A typical payload would be the message, stack trace, etc. + +Example: + +```json +{ + ... 
  "run": {
    "facets": {
      "errorMessage": {
        "_producer": "https://some.producer.com/version/1.0",
        "_schemaURL": "https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/ErrorMessageRunFacet.json",
        "message": "org.apache.spark.sql.AnalysisException: Table or view not found: wrong_table_name; line 1 pos 14",
        "programmingLanguage": "JAVA",
        "stackTrace": "Exception in thread \"main\" java.lang.RuntimeException: A test exception\nat io.openlineage.SomeClass.method(SomeClass.java:13)\nat io.openlineage.SomeClass.anotherMethod(SomeClass.java:9)"
      }
    }
  }
  ...
}
```

The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/ErrorMessageRunFacet.json)
\ No newline at end of file
diff --git a/versioned_docs/version-1.26.0/spec/facets/run-facets/external_query.md b/versioned_docs/version-1.26.0/spec/facets/run-facets/external_query.md
new file mode 100644
index 0000000..d6c4f5c
--- /dev/null
+++ b/versioned_docs/version-1.26.0/spec/facets/run-facets/external_query.md
@@ -0,0 +1,30 @@
---
sidebar_position: 2
---

# External Query Facet

The facet that identifies the query the run is related to when that query was executed by an external system. Even though the query itself is not included, this facet should allow the user to access the query and its details.

Example:

```json
{
  ...
  "run": {
    "facets": {
      "externalQuery": {
        "_producer": "https://some.producer.com/version/1.0",
        "_schemaURL": "https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/ExternalQueryRunFacet.json",
        "externalQueryId": "my-project-1234:US.bquijob_123x456_123y123z123c",
        "source": "bigquery"
      }
    }
  }
  ...
}
```

The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/ExternalQueryRunFacet.json)
\ No newline at end of file
diff --git a/versioned_docs/version-1.26.0/spec/facets/run-facets/nominal_time.md b/versioned_docs/version-1.26.0/spec/facets/run-facets/nominal_time.md
new file mode 100644
index 0000000..aa9f3c9
--- /dev/null
+++ b/versioned_docs/version-1.26.0/spec/facets/run-facets/nominal_time.md
@@ -0,0 +1,30 @@
---
sidebar_position: 3
---

# Nominal Time Facet

The facet that describes the nominal start and end time of the run. The nominal time is usually the time the job run was expected to run (such as a scheduled time); the actual time can be different.

Example:

```json
{
  ...
  "run": {
    "facets": {
      "nominalTime": {
        "_producer": "https://some.producer.com/version/1.0",
        "_schemaURL": "https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/NominalTimeRunFacet.json",
        "nominalStartTime": "2020-12-17T03:00:00.000Z",
        "nominalEndTime": "2020-12-17T03:05:00.000Z"
      }
    }
  }
  ...
}
```

The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/NominalTimeRunFacet.json)
\ No newline at end of file
diff --git a/versioned_docs/version-1.26.0/spec/facets/run-facets/parent_run.md b/versioned_docs/version-1.26.0/spec/facets/run-facets/parent_run.md
new file mode 100644
index 0000000..8c4f6ec
--- /dev/null
+++ b/versioned_docs/version-1.26.0/spec/facets/run-facets/parent_run.md
@@ -0,0 +1,34 @@
---
sidebar_position: 4
---

# Parent Run Facet

Commonly, scheduler systems like Apache Airflow will trigger processes on remote systems, such as Apache Spark or Apache Beam jobs.
Those systems might have their own OpenLineage integration and report their own job runs and dataset inputs/outputs.
+The ParentRunFacet allows those downstream jobs to report which jobs spawned them to preserve job hierarchy. +To do that, the scheduler system should have a way to pass its own job and run id to the child job. + +Example: + +```json +{ + ... + "run": { + "facets": { + "parent": { + "job": { + "name": "the-execution-parent-job", + "namespace": "the-namespace" + }, + "run": { + "runId": "f99310b4-3c3c-1a1a-2b2b-c1b95c24ff11" + } + } + } + } + ... +} +``` + +The facet specification can be found [here](https://openlineage.io/spec/facets/1-0-0/ParentRunFacet.json). \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/run-facets/processing_engine.md b/versioned_docs/version-1.26.0/spec/facets/run-facets/processing_engine.md new file mode 100644 index 0000000..86e857e --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/run-facets/processing_engine.md @@ -0,0 +1,16 @@ +--- +sidebar_position: 5 +--- + +# Processing Engine Run Facet +The Processing Engine Run Facet provides detailed information about the processing engine that executed the job. This facet is commonly used to track and document the specific engine and its version, ensuring reproducibility and aiding in debugging processes. + +| Property | Description | Type | Example | Required | +|---------------------------|-----------------------------------------------------------------------------|--------|-----------|----------| +| version | The version of the processing engine, such as Airflow or Spark. This helps in identifying the exact environment in which the job was run. | string | "2.5.0" | Yes | +| name | The name of the processing engine, for example, Airflow or Spark. This is useful for categorizing and filtering jobs based on the engine used. | string | "Airflow" | Yes | +| openlineageAdapterVersion | The version of the OpenLineage adapter package used, such as the OpenLineage Airflow integration package version. This can be helpful for troubleshooting and ensuring compatibility. | string | "0.19.0" | No | + +Example use case: When a data pipeline job fails, the Processing Engine Run Facet can be used to quickly identify the version and type of processing engine that was used, making it easier to replicate the issue and find a solution. + +The facet specification can be found [here](https://openlineage.io/spec/facets/1-1-1/ProcessingEngineRunFacet.json#/$defs/ProcessingEngineRunFacet). \ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/facets/run-facets/run-facets.md b/versioned_docs/version-1.26.0/spec/facets/run-facets/run-facets.md new file mode 100644 index 0000000..ffdc9c0 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/facets/run-facets/run-facets.md @@ -0,0 +1,7 @@ +--- +sidebar_position: 1 +--- + +# Run Facets + +Run Facets apply to a specific `instance` of a particular running _job_. Every run will have a uniquely identifiable `run ID` that is usually a [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier), that can later be tracked. It is recommended to use [UUIDv7](https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/) version of the format. 
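As a short, hypothetical sketch with the Python client, the run id can be generated once with `generate_new_uuid` (which is intended to produce a UUIDv7 in recent client versions) and then reused, together with any run facets, on every event emitted for that run; the nominal time below is a placeholder:

```python
from openlineage.client.facet import NominalTimeRunFacet
from openlineage.client.run import Run
from openlineage.client.uuid import generate_new_uuid

# Generate the run id once; every state update for this run must reuse the same value.
run = Run(
    runId=str(generate_new_uuid()),
    facets={
        "nominalTime": NominalTimeRunFacet(nominalStartTime="2020-12-17T03:00:00.000Z"),
    },
)
```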
diff --git a/versioned_docs/version-1.26.0/spec/job-hierarchy-events.png b/versioned_docs/version-1.26.0/spec/job-hierarchy-events.png new file mode 100644 index 0000000..55cd3d1 Binary files /dev/null and b/versioned_docs/version-1.26.0/spec/job-hierarchy-events.png differ diff --git a/versioned_docs/version-1.26.0/spec/job-hierarchy-jobs.png b/versioned_docs/version-1.26.0/spec/job-hierarchy-jobs.png new file mode 100644 index 0000000..ca55b3b Binary files /dev/null and b/versioned_docs/version-1.26.0/spec/job-hierarchy-jobs.png differ diff --git a/versioned_docs/version-1.26.0/spec/job-hierarchy.md b/versioned_docs/version-1.26.0/spec/job-hierarchy.md new file mode 100644 index 0000000..10acc43 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/job-hierarchy.md @@ -0,0 +1,49 @@ +--- +sidebar_position: 8 +--- + +# Job Hierarchy + +:::info +This feature is available in OpenLineage versions >= 1.9.0. +::: + +In a complex environment, where there are thousands of processing jobs daily, there can be a lot of chaos. +Understanding not only which jobs produced what dataset, but also answering questions like: +- why did the job ran? +- when it ran? +- who scheduled the job? +- why did the job ran after other one finished? +can be often muddy. + +Fortunately, OpenLineage gives us not only the ability to understand the dataset-to-dataset lineage, but also +includes a description of the job hierarchy in its model. + +The tool OpenLineage provides for that is the ParentRunFacet. For a given run, it describes what other run spawned it. + +```json +"parent": { + "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.0.1/integration/dbt", + "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/ParentRunFacet.json", + "run": { + "runId": "f99310b4-3c3c-1a1a-2b2b-c1b95c24ff11" + }, + "job": { + "namespace": "dbt", + "name": "dbt-job-name" + } +} +``` + +Data processing systems often integrate built-in hierarchies. Schedulers, for instance, use large, schedulable units like Airflow DAGs, which in turn comprise smaller, executable units like Airflow Tasks. OpenLineage seamlessly reflects this natural organization by mirroring the job hierarchy within its model. + +## Complex Job Hierarchy + +The simple mechanism on which OpenLineage bases it's job hierarchy model also allows us to describe more complex environments. +In this case, we have an Airflow DAG that has two tasks; one of which spawns a Spark job with two actions. The parent structure is shown in following diagram: + +![image](./job-hierarchy-jobs.png) + +Following diagram shows order in which events from those jobs are coming: + +![image](./job-hierarchy-events.png) diff --git a/versioned_docs/version-1.26.0/spec/naming-correlations.svg b/versioned_docs/version-1.26.0/spec/naming-correlations.svg new file mode 100644 index 0000000..5673bd6 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/naming-correlations.svg @@ -0,0 +1,85 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/versioned_docs/version-1.26.0/spec/naming.md b/versioned_docs/version-1.26.0/spec/naming.md new file mode 100644 index 0000000..a3344a1 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/naming.md @@ -0,0 +1,80 @@ +--- +sidebar_position: 3 +--- + +# Naming Conventions + +Employing a unique naming strategy per resource ensures that the spec is followed uniformly regardless of metadata +producer. 
+ +Jobs and Datasets have their own namespaces, job namespaces being derived from schedulers and dataset namespaces from +datasources. + +## Dataset Naming + +A dataset, or `table`, is organized according to a producer, namespace, database and (optionally) schema. + +| Data Store | Type | Namespace | Name | +|:------------------------------|:-------------------------------------|:---------------------------------------------------------------|:-----------------------------------------------------------------------------| +| Athena | Warehouse | `awsathena://athena.{region_name}.amazonaws.com` | `{catalog}.{database}.{table}` | +| AWS Glue | Data catalog | `arn:aws:glue:{region}:{account id}` | `table/{database name}/{table name}` | +| Azure Cosmos DB | Warehouse | `azurecosmos://{host}/dbs/{database}` | `colls/{table}` | +| Azure Data Explorer | Warehouse | `azurekusto://{host}.kusto.windows.net` | `{database}/{table}` | +| Azure Synapse | Warehouse | `sqlserver://{host}:{port}` | `{schema}.{table}` | +| BigQuery | Warehouse | `bigquery://` | `{project id}.{dataset name}.{table name}` | +| Cassandra | Warehouse | `cassandra://{host}:{port}` | `{keyspace}.{table}` | +| MySQL | Warehouse | `mysql://{host}:{port}` | `{database}.{table}` | +| Oracle | Warehouse | `oracle://{host}:{port}` | `{serviceName}.{schema}.{table} or {sid}.{schema}.{table}` | +| Postgres | Warehouse | `postgres://{host}:{port}` | `{database}.{schema}.{table}` | +| Teradata | Warehouse | `teradata://{host}:{port}` | `{database}.{table}` | +| Redshift | Warehouse | `redshift://{cluster_identifier}.{region_name}:{port}` | `{database}.{schema}.{table}` | +| Snowflake | Warehouse | `snowflake://{organization name}-{account name}` | `{database}.{schema}.{table}` | +| Trino | Warehouse | `trino://{host}:{port}` | `{catalog}.{schema}.{table}` | +| ABFSS (Azure Data Lake Gen2) | Data lake | `abfss://{container name}@{service name}.dfs.core.windows.net` | `{path}` | +| DBFS (Databricks File System) | Distributed file system | `dbfs://{workspace name}` | `{path}` | +| GCS | Blob storage | `gs://{bucket name}` | `{object key}` | +| HDFS | Distributed file system | `hdfs://{namenode host}:{namenode port}` | `{path}` | +| Kafka | Distributed event streaming platform | `kafka://{bootstrap server host}:{port}` | `{topic}` | +| Local file system | File system | `file` | `{path}` | +| Remote file system | File system | `file://{host}` | `{path}` | +| S3 | Blob Storage | `s3://{bucket name}` | `{object key}` | +| WASBS (Azure Blob Storage) | Blob Storage | `wasbs://{container name}@{service name}.dfs.core.windows.net` | `{object key}` | +| PubSub | Distributed event streaming platform | `pubsub` | `topic:{projectId}:{topicId}` or `subscription:{projectId}:{subscriptionId}` | + +## Job Naming + +A `Job` is a recurring data transformation with inputs and outputs. Each execution is captured as a `Run` with +corresponding metadata. +A `Run` event identifies the `Job` it instances by providing the job’s unique identifier. +The `Job` identifier is composed of a `Namespace` and `Name`. The job namespace is usually set in OpenLineage client +config. The job name is unique within its namespace. 
+ +| Job type | Name | Example | +|:-------------|:------------------------------|:-------------------------------------------------------------| +| Airflow task | `{dag_id}.{task_id}` | `orders_etl.count_orders` | +| Spark job | `{appName}.{command}.{table}` | `my_awesome_app.execute_insert_into_hive_table.mydb_mytable` | +| SQL | `{schema}.{table}` | `gx.validate_datasets` | + +## Run Naming + +Runs are named using client-generated UUIDs. The OpenLineage client is responsible for generating them and maintaining +them throughout the duration of the runcycle. + +```python +from openlineage.client.run import Run +from openlineage.client.uuid import generate_new_uuid + +run = Run(str(generate_new_uuid())) +``` + +## Why Naming Matters + +Naming enables focused insight into data flows, even when datasets and workflows are distributed across an organization. +This focus enabled by naming is key to the production of useful lineage. + +![image](./naming-correlations.svg) + +## Additional Resources + +* [The OpenLineage Naming Spec](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md) +* [What's in a Namespace Blog Post](https://openlineage.io/blog/whats-in-a-namespace/) diff --git a/versioned_docs/version-1.26.0/spec/object-model.md b/versioned_docs/version-1.26.0/spec/object-model.md new file mode 100644 index 0000000..c327df1 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/object-model.md @@ -0,0 +1,94 @@ +--- +sidebar_position: 1 +--- + +# Object Model + +OpenLineage was designed to enable large-scale observation of datasets as they move through a complex pipeline. + +Because of this, it integrates with various tools with the aim of emitting real-time lineage events as datasets are created and transformed. The object model is flexible, with abstract definitions for Dataset and Job that support a variety of underlying data architectures. OpenLineage cares how Datasets come into being, not just that relationships exist between them. Accordingly, its object model contains both Jobs *and* Datasets. + +Logically, an OpenLineage backend learns about Datasets by receiving information about Jobs that run. Most Jobs have at least one input or output Dataset, and a lineage graph can be created by weaving together observations of many Jobs across multiple platforms. + +This information is in the form of **Run State Updates**, which contain information about Jobs, Datasets, and Runs. + +## Run State Update +A Run State Update is prepared and sent when something important occurs within your pipeline, and each one can be thought of as a distinct observation. This commonly happens when a Job starts or finishes. + +The run state itself refers to a stage within the [run cycle](./run-cycle.md) of the current run. Usually, the first Run State for a Job would be `START` and the last would be `COMPLETE`. A run cycle is likely to have at least two Run State Updates, and perhaps more. Each one will also have timestamp of when this particular state change happened. + +![OpenLineage Object Model](object-model.svg) + +Each Run State Update can include detail about the Job, the Run, and the input and output Datasets involved in the run. Subsequent updates are additive: input Datasets, for example, can be specified along with `START`, along with `COMPLETE`, or both. This accommodates situations where information is only available at certain times. + +Each of these three core entities can also be extended through the use of facets, some of which are documented in the relevant sections below. 
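To make this concrete, here is a minimal, hypothetical sketch of two Run State Updates emitted with the Python client for the same run. The input Dataset is reported at `START`, while the output Dataset only becomes known, and is therefore only reported, at `COMPLETE`; the backend merges both observations. The namespaces, names, and producer URI are placeholders, and the transport is assumed to come from the client configuration.

```python
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState
from openlineage.client.uuid import generate_new_uuid

client = OpenLineageClient()  # transport resolved from openlineage.yml or environment variables
producer = "https://github.com/my-org/my-pipelines"  # placeholder producer URI

job = Job(namespace="my-scheduler", name="daily_orders_rollup")  # hypothetical job
run = Run(runId=str(generate_new_uuid()))  # the same runId is reused for every update

# START: only the input is known at this point.
client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=run, job=job, producer=producer,
        inputs=[Dataset(namespace="postgres://db:5432", name="public.orders")],
    )
)

# COMPLETE: the output is only reported once the job has written it.
client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=run, job=job, producer=producer,
        outputs=[Dataset(namespace="postgres://db:5432", name="public.orders_daily")],
    )
)
```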
+ +## Job +A Job is a process that consumes or produces Datasets. + +This is abstract, and can map to different things in different operational contexts. For example, a job could be a task in a workflow orchestration system. It could also be a model, a query, or a checkpoint. Depending on the system under observation, a Job can represent a small or large amount of work. + +A Job is the part of the object model that represents a discrete bit of defined work. If, for example, you have cron running a Python script that executes a `CREATE TABLE x AS SELECT * FROM y` query every day, the Python script is the Job. + +Jobs are identified by a unique name within a `namespace`. They are expected to evolve over time and their changes can be captured through Run State Updates. + +### Job Facets +Facets that can be used to augment the metadata of a Job include: + +- **sourceCodeLocation**: Captures the source code location and version (e.g., the git SHA) of the job. + +- **sourceCode**: Captures the language (e.g. python) and complete source code of the job. Using this source code, users can gain useful information about what the job does. + +For more details, please refer to the [Job Facets](./facets/job-facets). + +## Run +A Run is an instance of a Job that represents one of its occurrences in time. + +Each run will have a uniquely identifiable `runId` that is generated by the client as [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier). The client is responsible for maintaining the `runId` between different Run State Updates in the same Run. It is recommended to use [UUIDv7](https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/) format. + +Runs can be used to observe changes in Jobs between their instances. If, for example, you have cron running a Python script that repeats a query every day, this should result in a separate Run for each day. + +### Run Facets + +Facets that can be used to augment the metadata of a Run include: + +- **nominalTime**: Captures the time this run is scheduled for. This is typically used for scheduled jobs. The job has a nominally scheduled time that will be different from the actual time it ran. + +- **parent**: Captures the parent Job and Run, for instances where this Run was spawned from a parent Run. For example in the case of [Airflow](https://airflow.apache.org/), there's a Run that represents the DAG itself that is the parent of the individual Runs that represent the tasks it spawns. Similarly when a SparkOperator starts a Spark job, this creates a separate run that refers to the task run as its parent. + +- **errorMessage**: Captures potential error messages - and optionally stack traces - with which the run failed. + +- **sql**: Captures the SQL query, if this job runs one. + +For more details, please refer to the [Run Facets](./facets/run-facets). + +## Dataset +A Dataset is an abstract representation of data. This can refer to a small amount or large amount of data, as long as it's discrete. For databases, this should be a table. For cloud storage, this is often an object in a bucket. This can represent a directory of a filesystem. + +It has a unique name within a namespace derived from its physical location (i.e., db.host.database.schema.table). The combined namespace and name for a Dataset should be enough to uniquely identify it within a data ecosystem. + +Typically, a *Dataset* changes when a job writing to it completes. 
Similarly to the *Job* and *Run* distinction, metadata that is more static from Run to Run is captured in a DatasetFacet - for example, the schema that does not change every run). What changes every *Run* is captured as an *InputFacet* or an *OutputFacet* - for example, a time partition indicating the subset of the data set that was read or written). + +A Dataset is the part of the object model that represents a discrete collection of data. If, for example, you have cron running a Python script that executes a `CREATE TABLE x AS SELECT * FROM y` query every day, the `x` and `y` tables are Datasets. + +### Dataset Facets + +Facets that can be used to augment the metadata of a Dataset include: + +- **schema**: Captures the schema of the dataset + +- **dataSource**: Captures the database instance containing this Dataset (e.g., database schema, object store bucket) + +- **lifecycleStateChange**: Captures the lifecycle states of the Dataset (e.g., alter, create, drop, overwrite, rename, truncate) + +- **version**: Captures the dataset version when versioning is defined by the data store (e.g.. Iceberg snapshot ID) + +Input Datasets have the following facets: +- **dataQualityMetrics**: Captures dataset-level and column-level data quality metrics (row count, byte size, null count, distinct count, average, min, max, quantiles) + +- **dataQualityAssertions**: Captures the result of running data tests on dataset or its columns + +Output Datasets have the following facets: +- **outputStatistics**: Captures the size of the output written to a dataset (e.g., row count and byte size) + +For more details, please refer to the [Dataset Facets](./facets/dataset-facets). diff --git a/versioned_docs/version-1.26.0/spec/object-model.svg b/versioned_docs/version-1.26.0/spec/object-model.svg new file mode 100644 index 0000000..46c074f --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/object-model.svg @@ -0,0 +1,60 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/versioned_docs/version-1.26.0/spec/producers.md b/versioned_docs/version-1.26.0/spec/producers.md new file mode 100644 index 0000000..9e9f64f --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/producers.md @@ -0,0 +1,13 @@ +--- +sidebar_position: 6 +--- + +# Producers + +:::info +This page could use some extra detail! You're welcome to contribute using the Edit link at the bottom. +::: + +The `_producer` value is included in an OpenLineage request as a way to know how the metadata was generated. It is a URI that links to a source code SHA or the location where a package can be found. + +For example, this field is populated by many of the common integrations. For example, the dbt integration will set this value to `https://github.com/OpenLineage/OpenLineage/tree/{{PREPROCESSOR:OPENLINEAGE_VERSION}}/integration/dbt` and the Python client will set it to `https://github.com/OpenLineage/OpenLineage/tree/{{PREPROCESSOR:OPENLINEAGE_VERSION}}/client/python`. 
\ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/run-cycle-batch.svg b/versioned_docs/version-1.26.0/spec/run-cycle-batch.svg new file mode 100644 index 0000000..7a4e1ee --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/run-cycle-batch.svg @@ -0,0 +1,48 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/versioned_docs/version-1.26.0/spec/run-cycle-stream.svg b/versioned_docs/version-1.26.0/spec/run-cycle-stream.svg new file mode 100644 index 0000000..f01c664 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/run-cycle-stream.svg @@ -0,0 +1,53 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/versioned_docs/version-1.26.0/spec/run-cycle.md b/versioned_docs/version-1.26.0/spec/run-cycle.md new file mode 100644 index 0000000..93c82bc --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/run-cycle.md @@ -0,0 +1,48 @@ +--- +sidebar_position: 4 +--- + +# The Run Cycle + +The OpenLineage [object model](object-model.md) is event-based and updates provide an OpenLineage backend with details about the activities of a Job. + +The OpenLineage Run Cycle has several defined states that correspond to changes in the state of a pipeline task. When a task transitions between these - e.g. it is initiated, finishes, or fails - a Run State Update is sent that describes what happened. + +Each Run State Update contains the run state (i.e., `START`) along with metadata about the Job, its current Run, and its input and output Datasets. It is common to add additional metadata throughout the lifecycle of the run as it becomes available. + +## Run States + +There are six run states currently defined in the OpenLineage [spec](https://openlineage.io/apidocs/openapi/): + +* `START` to indicate the beginning of a Job + +* `RUNNING` to provide additional information about a running Job + +* `COMPLETE` to signify that execution of the Job has concluded + +* `ABORT` to signify that the Job has been stopped abnormally + +* `FAIL` to signify that the Job has failed + +* `OTHER` to send additional metadata outside standard run cycle + +We assume events describing a single run are **accumulative** and +`COMPLETE`, `ABORT` and `FAIL` are terminal events. Sending any of terminal events +means no other events related to this run will be emitted. + +Additionally, we allow `OTHER` to be sent anytime before the terminal states, +also before `START`. The purpose of this is the agility to send additional +metadata outside standard run cycle - e.g., on a run that hasn't yet started +but is already awaiting the resources. + +![image](./run-life-cycle.svg) + +## Typical Scenarios + +A batch Job - e.g., an Airflow task or a dbt model - will typically be represented as a `START` event followed by a `COMPLETE` event. Occasionally, an `ABORT` or `FAIL` event will be sent when a job does not complete successfully. + +![image](./run-cycle-batch.svg) + +A long-running Job - e.g., a microservice or a stream - will typically be represented by a `START` event followed by a series of `RUNNING` events that report changes in the run or emit performance metrics. Occasionally, a `COMPLETE`, `ABORT`, or `FAIL` event will occur, often followed by a `START` event as the job is reinitiated. 
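As a concrete sketch of the accumulative model, two consecutive events for such a long-running job would share the same `runId` and differ mainly in `eventType`, `eventTime`, and whatever facets have accumulated since the last update. The job name, `runId`, and URIs below are made up, and the two events are wrapped in a single JSON array purely for side-by-side comparison - in practice each is emitted separately:

```json
[
  {
    "eventType": "START",
    "eventTime": "2024-01-01T00:00:00.000Z",
    "producer": "https://example.com/my-streaming-wrapper",
    "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json",
    "job": { "namespace": "streaming", "name": "orders_enrichment" },
    "run": { "runId": "01902b7e-4d20-7e91-b6a3-5c8e1f2a9d44" }
  },
  {
    "eventType": "RUNNING",
    "eventTime": "2024-01-01T01:00:00.000Z",
    "producer": "https://example.com/my-streaming-wrapper",
    "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json",
    "job": { "namespace": "streaming", "name": "orders_enrichment" },
    "run": { "runId": "01902b7e-4d20-7e91-b6a3-5c8e1f2a9d44" }
  }
]
```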
+ +![image](./run-cycle-stream.svg) diff --git a/versioned_docs/version-1.26.0/spec/run-life-cycle.svg b/versioned_docs/version-1.26.0/spec/run-life-cycle.svg new file mode 100644 index 0000000..18c13bd --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/run-life-cycle.svg @@ -0,0 +1,4 @@ + + + +
COMPLETE
COMPLETE
START
START
ABORT
ABORT
FAIL
FAIL
RUNNING
RUNNING
OTHER
OTHER
Text is not SVG - cannot display
\ No newline at end of file diff --git a/versioned_docs/version-1.26.0/spec/schemas.md b/versioned_docs/version-1.26.0/spec/schemas.md new file mode 100644 index 0000000..68c8c08 --- /dev/null +++ b/versioned_docs/version-1.26.0/spec/schemas.md @@ -0,0 +1,47 @@ +--- +sidebar_position: 7 +---

# Working with Schemas

OpenLineage is a rapidly growing open source project, and its spec will therefore continue to change. The spec files are based on the [JSON Schema specification](https://json-schema.org/) and define how OpenLineage event messages are structured. More details on what the spec defines can be found in the [object model](./object-model.md).

When you work on the OpenLineage project and decide to introduce a new facet or change an existing one, you need to know what has to be done and how the general build and test process works, so that the OpenLineage specs stay well maintained and nothing breaks.

The following guidelines should help you introduce changes correctly.

## Create a new issue with label `spec`
Before you make any changes, it is best to first label your issue with `spec`. This indicates that the issue relates to changes in the current OpenLineage spec.

## Make changes to the spec's version
Whenever you change an existing spec file (JSON), you need to bump the version of the current spec so that the change can go through code generation and the Gradle build. Consider the following spec file, where the URL in `$id` shows the file's current spec version.

```
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json",
  "$defs": {
```

In this example, the version should be bumped from 1-0-1 to 1-0-2.

```
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://openlineage.io/spec/facets/1-0-2/ColumnLineageDatasetFacet.json",
  "$defs": {
```

> If you do not bump the version to a higher number, code generation for the Java client will fail.

## Python client code needs to be updated manually
The Java client's build process includes code generation that automatically produces OpenLineage classes from the spec files, so no manual coding is needed there. The Python client, however, is not generated from the spec files, so you have to apply the corresponding changes to the Python code yourself. Facets are implemented [here](https://github.com/OpenLineage/OpenLineage/blob/main/client/python/openlineage/client/facet.py), so that is generally where your changes go. The general structure of OpenLineage run events can be found [here](https://github.com/OpenLineage/OpenLineage/blob/main/client/python/openlineage/client/run.py).

## Add test cases
Make sure to update the unit tests for [python](https://github.com/OpenLineage/OpenLineage/tree/main/client/python/tests) and [java](https://github.com/OpenLineage/OpenLineage/tree/main/client/java/src/test/java/io/openlineage/client) so that they cover your new spec changes. Refer to the existing tests when adding yours.
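For instance, after bumping `ColumnLineageDatasetFacet` to 1-0-2 as shown above, test fixtures and emitted events that use the facet would be expected to reference the new version through `_schemaURL`. A rough, illustrative sketch only - the dataset name, producer URI, and column mapping below are made up:

```json
{
  "namespace": "postgres://db.host:5432",
  "name": "database.schema.x",
  "facets": {
    "columnLineage": {
      "_producer": "https://github.com/OpenLineage/OpenLineage/tree/{{PREPROCESSOR:OPENLINEAGE_VERSION}}/client/python",
      "_schemaURL": "https://openlineage.io/spec/facets/1-0-2/ColumnLineageDatasetFacet.json",
      "fields": {
        "amount": {
          "inputFields": [
            { "namespace": "postgres://db.host:5432", "name": "database.schema.y", "field": "amount" }
          ]
        }
      }
    }
  }
}
```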
+ +## Test the SPEC change using code generation and integration tests +When you have modified the SPEC file(s), always make sure to perform code generation and unit tests by going into `client/java` and running `./gradlew generateCode` and `./gradlew test`. As for python, cd into `client/python` and run `pytest`. + +> Note: Some of the tests may fail due to the fact that they require external systems like kafka. You can ignore those errors. + diff --git a/versioned_docs/version-1.26.0/where-ol-fits.svg b/versioned_docs/version-1.26.0/where-ol-fits.svg new file mode 100644 index 0000000..2461553 --- /dev/null +++ b/versioned_docs/version-1.26.0/where-ol-fits.svg @@ -0,0 +1,146 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/versioned_docs/version-1.26.0/with-ol.svg b/versioned_docs/version-1.26.0/with-ol.svg new file mode 100644 index 0000000..15945a7 --- /dev/null +++ b/versioned_docs/version-1.26.0/with-ol.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/versioned_sidebars/version-1.26.0-sidebars.json b/versioned_sidebars/version-1.26.0-sidebars.json new file mode 100644 index 0000000..caea0c0 --- /dev/null +++ b/versioned_sidebars/version-1.26.0-sidebars.json @@ -0,0 +1,8 @@ +{ + "tutorialSidebar": [ + { + "type": "autogenerated", + "dirName": "." + } + ] +} diff --git a/versions.json b/versions.json index fd80798..21c2f39 100644 --- a/versions.json +++ b/versions.json @@ -1,4 +1,5 @@ [ + "1.26.0", "1.25.0", "1.24.2", "1.23.0",