From 2ca2d8d43459cb47672e4bbb8d0e13a4aaa2f41b Mon Sep 17 00:00:00 2001 From: jochen Date: Mon, 13 May 2024 23:03:58 +0200 Subject: [PATCH 01/31] Added quality checks --- CHANGELOG.md | 3 + README.md | 200 ++-- versions/0.9.3/README.md | 1082 ++++++++++++++++++++ versions/0.9.3/datacontract.init.yaml | 109 ++ versions/0.9.3/datacontract.schema.json | 1215 +++++++++++++++++++++++ versions/0.9.3/definition.schema.json | 81 ++ 6 files changed, 2634 insertions(+), 56 deletions(-) create mode 100644 versions/0.9.3/README.md create mode 100644 versions/0.9.3/datacontract.init.yaml create mode 100644 versions/0.9.3/datacontract.schema.json create mode 100644 versions/0.9.3/definition.schema.json diff --git a/CHANGELOG.md b/CHANGELOG.md index 860f802..f815d6e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,7 +10,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 Please note, while the major version is zero (0.y.z), Anything MAY change at any time. The public API SHOULD NOT be considered stable. +## [0.9.4] - 2024-05-13 + ### Added +- Data quality attributes on model and field level - AWS Glue Catalog server support - sftp server support - info.status field diff --git a/README.md b/README.md index e0721f3..7d97c22 100644 --- a/README.md +++ b/README.md @@ -35,7 +35,7 @@ Data usage agreements have a defined lifecycle, start/end date, and help the dat Version --- -0.9.3 ([Changelog](CHANGELOG.md)) +0.9.4([Changelog](CHANGELOG.md)) Example --- @@ -98,16 +98,28 @@ models: minLength: 10 maxLength: 20 customer_email_address: - description: The email address, as entered by the customer. The email address was not verified. + description: The email address, as entered by the customer. type: text format: email required: true pii: true classification: sensitive + quality: + - type: business-rule + name: The email address was verified by the system processed_timestamp: description: The timestamp when the record was processed by the data platform. 
type: timestamp required: true + quality: + - type: row_count + must_be_greater_than: 5 + - type: sql + description: The maximum duration between two orders should be less than 3600 seconds + query: | + SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration + FROM orders + must_be_less_than: 3600 line_items: description: A single article that is part of an order. type: table @@ -208,15 +220,6 @@ servicelevels: cron: 0 0 * * 0 recoveryTime: 24 hours recoveryPoint: 1 week -quality: - type: SodaCL # data quality check format: SodaCL, montecarlo, custom - specification: # expressed as string or inline yaml or via "$ref: checks.yaml" - checks for orders: - - row_count >= 5 - - duplicate_count(order_id) = 0 - checks for line_items: - - values in (order_id) must exist in orders (order_id) - - row_count >= 5 ``` Data Contract CLI @@ -517,6 +520,8 @@ The name of the data model (table name) is defined by the key that refers to thi | description | `string` | An optional string describing the data model. | | title | `string` | An optional string for the title of the data model. Especially useful if the name of the model is cryptic or contains abbreviations. | | fields | Map[`string`, [Field Object](#field-object)] | The fields (e.g. columns) of the data model. | +| quality | Array of [Quality Object](#quality-object) | Specifies the quality attributes on model level. | + @@ -551,6 +556,8 @@ The Field Objects describes one field (column, property, nested field) of a data | $ref | `string` | A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition. | | fields | Map[`string`, [Field Object](#field-object)] | The nested fields (e.g. columns) of the object, record, or struct. Use only when type is object, record, or struct. | | items | [Field Object](#field-object) | The type of the elements in the array. Use only when type is array. 
| +| quality | Array of [Quality Object](#quality-object) | Specifies the quality attributes on field level. | + ### Definition Object @@ -928,80 +935,138 @@ Backup specifies details about data backup procedures. ### Quality Object -The quality object contains quality attributes and checks. +The quality object defines a quality attribute. -| Field | Type | Description | -|---------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------| -| type | `string` | REQUIRED. The type of the schema.
Typical values are: `SodaCL`, `montecarlo`, `great-expectations`, `custom` | -| specification | [SodaCL Quality Object](#sodacl-quality-object) \|
[Monte Carlo Schema Object](#monte-carlo-quality-object) \|
`string` | REQUIRED. The specification of the quality attributes. The quality specification can be encoded as a string or as inline YAML. | +Quality attributes are checks that can be applied to the data to ensure its quality. Data can be verified by executing these checks through a data quality engine. +A quality object can be specified on field level, or on model level. The top-level quality object is deprecated. -#### SodaCL Quality Object +The fields of the quality object depend on the quality type. -Quality attributes in [Soda Checks Language](https://docs.soda.io/soda-cl/soda-cl-overview.html). +#### Plain-text -The `specification` represents the content of a `checks.yml` file. +A human-readable text that describes the quality of the data. These can later be translated into a technical check (such as SQL), or checked through an AI engine. + +| Field | Type | Description | +|-------------|----------|--------------------------------------------------| +| type | `string` | `plain-text` | +| name | `string` | Optional. A human-readable name for this check | +| description | `string` | A plain text describing the quality of the data. | -Example (inline): +Example: ```yaml -quality: - type: SodaCL # data quality check format: SodaCL, montecarlo, dbt-tests, custom - specification: # expressed as string or inline yaml or via "$ref: checks.yaml" - checks for orders: - - row_count > 0 - - duplicate_count(order_id) = 0 - checks for line_items: - - row_count > 0 +- type: plain-text + description: The email address was verified by the system ``` -Example (string): +#### SQL + +An individual SQL query that returns a single number or boolean value that can be compared. The SQL query must be in the SQL dialect of the provided server. 
+ +| Field | Type | Description | +|----------------------------------|------------------------|------------------------------------------------------------| +| type | `string` | `sql` | +| query | `string` | A SQL query that returns a single number or boolean value. | +| must_be_equal_to | `integer` or `boolean` | The threshold to check the return value of the query | +| must_be_greater_than | `integer` | The threshold to check the return value of the query | +| must_be_greater_than_or_equal_to | `integer` | The threshold to check the return value of the query | +| must_be_less_than | `integer` | The threshold to check the return value of the query | +| must_be_less_than_or_equal_to | `integer` | The threshold to check the return value of the query | +| name | `string` | Optional. A human-readable name for this check | +| description | `string` | A plain text describing the quality of the data. | ```yaml -quality: - type: SodaCL - specification: |- - checks for search_queries: - - freshness(search_timestamp) < 1d - - row_count > 100000 - - missing_count(search_query) = 0 +- type: sql + description: The maximum duration between two orders should be less than 3600 seconds + query: | + SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration + FROM orders + must_be_less_than: 3600 ``` -#### Monte Carlo Quality Object -Quality attributes defined as Monte Carlos [Monitors as Code](https://docs.getmontecarlo.com/docs/monitors-as-code). +#### Row Count -The `specification` represents the content of a `montecarlo.yml` file. +Counts the number of rows in a model. 
+ -Example (string): +| Field | Type | Description | +|----------------------------------|-----------|------------------------------------------------------| +| type | `string` | `row_count` | +| must_be_equal_to | `number` | The threshold to check the return value of the query | +| must_not_be_equal_to | `number` | The threshold to check the return value of the query | +| must_be_greater_than | `number` | The threshold to check the return value of the query | +| must_be_greater_than_or_equal_to | `number` | The threshold to check the return value of the query | +| must_be_less_than | `number` | The threshold to check the return value of the query | +| must_be_less_than_or_equal_to | `number` | The threshold to check the return value of the query | +| name | `string` | Optional. A human-readable name for this check | +| description | `string` | A plain text describing the quality of the data. | + + +```yaml +- type: row_count + must_be_greater_than: 500000 +``` + + +#### Unique + +A uniqueness check for multiple fields. + +| Field | Type | Description | +|----------------------------------|-------------------|------------------------------------------------------------------------| +| type | `string` | `unique` | +| fields | Array of `string` | An ordered list of fields whose values must be unique in combination | +| name | `string` | Optional. A human-readable name for this check | +| description | `string` | A plain text describing the quality of the data. | + + +```yaml +- type: unique + fields: + - country + - order_id +``` + + +#### Freshness +TBD + + +#### SodaCL + +Quality attributes in [Soda Checks Language](https://docs.soda.io/soda-cl/soda-cl-overview.html). + +The `specification` represents the content of a `checks.yml` file. 
+ +Example: ```yaml quality: - type: montecarlo - specification: |- - montecarlo: - field_health: - - table: project:dataset.table_name - timestamp_field: created - dimension_tracking: - - table: project:dataset.table_name - timestamp_field: created - field: order_status + - type: SodaCL + specification: | + checks for orders: + - row_count > 0 + - duplicate_count(order_id) = 0 + checks for line_items: + - row_count > 0 ``` -#### Great Expectations Quality Object + +#### Great Expectations Quality attributes defined as Great Expectations [Expectations](https://greatexpectations.io/expectations/). -The `specification` represents a list of expectations on a specific model. +The `specification` represents an expectation suite as a JSON string. -Example (string): +New with 0.9.4: This quality type is only applicable on model level. + +Example: ```yaml quality: - type: great-expectations - specification: - orders: |- + - type: great-expectations + specification: | [ { "expectation_type": "expect_table_row_count_to_be_between", @@ -1015,6 +1080,29 @@ quality: ] ``` + +#### Monte Carlo + +Quality attributes defined as Monte Carlo [Monitors as Code](https://docs.getmontecarlo.com/docs/monitors-as-code). + +The `specification` represents the content of a `montecarlo.yml` file. 
+ +Example: + +```yaml +quality: + - type: montecarlo + specification: | + montecarlo: + field_health: + - table: project:dataset.table_name + timestamp_field: created + dimension_tracking: + - table: project:dataset.table_name + timestamp_field: created + field: order_status +``` + ### Data Types The following data types are supported for model fields and definitions: diff --git a/versions/0.9.3/README.md b/versions/0.9.3/README.md new file mode 100644 index 0000000..e0721f3 --- /dev/null +++ b/versions/0.9.3/README.md @@ -0,0 +1,1082 @@ +# Data Contract Specification + + + Stars +Slack Status + +![datacontract.png](images/datacontract.png) + +Data contracts bring data providers and data consumers together. + +A _data contract_ is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers. +A data contract is implemented by a data product's output port or other data technologies. +Data contracts can also be used for the input port to specify the expectations of data dependencies and verify given guarantees. + +The _data contract specification_ defines a YAML format to describe attributes of provided data sets. +It is data platform neutral and can be used with any data platform, such as AWS S3, Google BigQuery, Azure, Databricks, and Snowflake. +The data contract specification is an open initiative to define a common data contract format. +It follows [OpenAPI](https://www.openapis.org/) and [AsyncAPI](https://www.asyncapi.com/) conventions. + +Data contracts come into play when data is exchanged between different teams or organizational units, such as in a [data mesh architecture](https://www.datamesh-architecture.com/). +First, and foremost, data contracts are a communication tool to express a common understanding of how data should be structured and interpreted. +They make semantic and quality expectations explicit. 
+They are often created collaboratively in [workshops](./workshop.md) together with data providers and data consumers. +Later in development and production, they also serve as the basis for code generation, testing, schema validations, quality checks, monitoring, access control, and computational governance policies. + +The specification comes along with the [Data Contract CLI](https://github.com/datacontract/datacontract-cli), an open-source tool to develop, validate, and enforce data contracts. + +IntelliJ, VS Code and other common IDEs allow you to use autocompletions without additional configuration. + +_Note: The term "data contract" refers to a specification that is usually owned by the data provider and thus does not align with a "contract" in a legal sense as a mutual agreement between two parties. +The term "contract" may be somewhat misleading, but it is how it is used in practice. +The mutual agreement between one data provider and one data consumer is the "data usage agreement" that refers to a data contract. +Data usage agreements have a defined lifecycle, start/end date, and help the data provider to track who accesses their data and for which purposes._ + +Version +--- + +0.9.3 ([Changelog](CHANGELOG.md)) + +Example +--- + +[![Data Contract Catalog](https://img.shields.io/badge/Data%20Contract-Catalog-blue)](https://datacontract.com/examples/index.html) + +```yaml +dataContractSpecification: 0.9.3 +id: urn:datacontract:checkout:orders-latest +info: + title: Orders Latest + version: 1.0.0 + description: | + Successful customer orders in the webshop. + All orders since 2020-01-01. + Orders with their line items are in their current state (no history included). 
+ owner: Checkout Team + contact: + name: John Doe (Data Product Owner) + url: https://teams.microsoft.com/l/channel/example/checkout +servers: + production: + type: s3 + location: s3://datacontract-example-orders-latest/data/{model}/*.json + format: json + delimiter: new_line +terms: + usage: | + Data can be used for reports, analytics and machine learning use cases. + Order may be linked and joined by other tables + limitations: | + Not suitable for real-time use cases. + Data may not be used to identify individual customers. + Max data processing per day: 10 TiB + billing: 5000 USD per month + noticePeriod: P3M +models: + orders: + description: One record per order. Includes cancelled and deleted orders. + type: table + fields: + order_id: + $ref: '#/definitions/order_id' + required: true + unique: true + primary: true + order_timestamp: + description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. + type: timestamp + required: true + example: "2024-09-09T08:30:00Z" + order_total: + description: Total amount the smallest monetary unit (e.g., cents). + type: long + required: true + example: "9999" + customer_id: + description: Unique identifier for the customer. + type: text + minLength: 10 + maxLength: 20 + customer_email_address: + description: The email address, as entered by the customer. The email address was not verified. + type: text + format: email + required: true + pii: true + classification: sensitive + processed_timestamp: + description: The timestamp when the record was processed by the data platform. + type: timestamp + required: true + line_items: + description: A single article that is part of an order. 
+ type: table + fields: + lines_item_id: + type: text + description: Primary key of the lines_item_id table + required: true + unique: true + primary: true + order_id: + $ref: '#/definitions/order_id' + references: orders.order_id + sku: + description: The purchased article number + $ref: '#/definitions/sku' +definitions: + order_id: + domain: checkout + name: order_id + title: Order ID + type: text + format: uuid + description: An internal ID that identifies an order in the online shop. + example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2 + pii: true + classification: restricted + sku: + domain: inventory + name: sku + title: Stock Keeping Unit + type: text + pattern: ^[A-Za-z0-9]{8,14}$ + example: "96385074" + description: | + A Stock Keeping Unit (SKU) is an internal unique identifier for an article. + It is typically associated with an article's barcode, such as the EAN/GTIN. +examples: + - type: csv # csv, json, yaml, custom + model: orders + description: An example list of order records. 
+ data: | # expressed as string or inline yaml or via "$ref: data.csv" + order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp + "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z" + "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z" + "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z" + "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" + "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" + "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z" + "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z" + "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z" + "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z" + "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z" + - type: csv + model: line_items + description: An example list of line items. 
+ data: | + lines_item_id,order_id,sku + "LI-1","1001","5901234123457" + "LI-2","1001","4001234567890" + "LI-3","1002","5901234123457" + "LI-4","1002","2001234567893" + "LI-5","1003","4001234567890" + "LI-6","1003","5001234567892" + "LI-7","1004","5901234123457" + "LI-8","1005","2001234567893" + "LI-9","1005","5001234567892" + "LI-10","1005","6001234567891" +servicelevels: + availability: + description: The server is available during support hours + percentage: 99.9% + retention: + description: Data is retained for one year + period: P1Y + unlimited: false + latency: + description: Data is available within 25 hours after the order was placed + threshold: 25h + sourceTimestampField: orders.order_timestamp + processedTimestampField: orders.processed_timestamp + freshness: + description: The age of the youngest row in a table. + threshold: 25h + timestampField: orders.order_timestamp + frequency: + description: Data is delivered once a day + type: batch # or streaming + interval: daily # for batch, either or cron + cron: 0 0 * * * # for batch, either or interval + support: + description: The data is available during typical business hours at headquarters + time: 9am to 5pm in EST on business days + responseTime: 1h + backup: + description: Data is backed up once a week, every Sunday at 0:00 UTC. + interval: weekly + cron: 0 0 * * 0 + recoveryTime: 24 hours + recoveryPoint: 1 week +quality: + type: SodaCL # data quality check format: SodaCL, montecarlo, custom + specification: # expressed as string or inline yaml or via "$ref: checks.yaml" + checks for orders: + - row_count >= 5 + - duplicate_count(order_id) = 0 + checks for line_items: + - values in (order_id) must exist in orders (order_id) + - row_count >= 5 +``` + +Data Contract CLI +--- + +The [Data Contract CLI](https://cli.datacontract.com) is a command line tool and Python library to lint, test, import and export data contracts. 
+ +Here is short example how to verify that your actual dataset matches the data contract: + +```bash +pip3 install datacontract-cli +datacontract test https://datacontract.com/examples/orders-latest/datacontract.yaml +``` + +or, if you prefer Docker: +```bash +docker run datacontract/cli test https://datacontract.com/examples/orders-latest/datacontract.yaml +``` + +The Data Contract contains all required information to verify data: + +- The _servers_ block has the connection details to the actual data set. +- The _models_ define the syntax, formats, and constraints. +- The _quality_ defined further quality checks. + +The Data Contract CLI chooses the appropriate engine, formulates test cases, connects to the server, and executes the tests, based on the server type. + +More information and configuration options on [cli.datacontract.com](https://cli.datacontract.com). + +IDE Integration +--- +IntelliJ comes with a built-in YAML plugin which will show you autocompletions. For VS Code we recommend to install the [YAML](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) plugin. No additional configuration is required. 
Autocompletion is then enabled for files following these patterns: + +``` +datacontract.yaml +datacontract.yml +*-datacontract.yaml +*-datacontract.yml +*.datacontract.yaml +*.datacontract.yml +datacontract-*.yaml +datacontract-*.yml +**/datacontract/*.yml +**/datacontract/*.yaml +**/datacontracts/*.yml +**/datacontracts/*.yaml +``` + +Specification +--- + +![The eight major categories in the data contract specification](images/categories.png) + +- [Data Contract Object](#data-contract-object) +- [Info Object](#info-object) +- [Contact Object](#contact-object) +- [Server Object](#server-object) +- [Terms Object](#terms-object) +- [Model Object](#model-object) +- [Field Object](#field-object) +- [Definition Object](#definition-object) +- [Schema Object](#schema-object) +- [Example Object](#example-object) +- [Service Level Object](#service-levels-object) +- [Quality Object](#quality-object) +- [Data Types](#data-types) +- [Specification Extensions](#specification-extensions) + + +[JSON Schema](https://github.com/datacontract/datacontract-specification/blob/main/datacontract.schema.json) of the Data Contract Specification. + +### Data Contract Object + +This is the root document. + +It is _RECOMMENDED_ that the root document be named: `datacontract.yaml`. + +| Field | Type | Description | +|---------------------------|------------------------------------------------------|----------------------------------------------------------------------------------------------------------| +| dataContractSpecification | `string` | REQUIRED. Specifies the Data Contract Specification being used. | +| id | `string` | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number | +| info | [Info Object](#info-object) | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. | +| servers | Map[string, [Server Object](#server-object)] | Specifies the servers of the data contract. 
| +| terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | +| models | Map[string, [Model Object](#model-object)] | Specifies the logical data model. | +| definitions | Map[string, [Definition Object](#definition-object)] | Specifies definitions. | +| schema | [Schema Object](#schema-object) | Specifies the physical schema. The specification supports different schema format. | +| examples | Array of [Example Objects](#example-object) | Specifies example data sets for the data model. The specification supports different example types. | +| servicelevels | [Service Levels Object](#service-levels-object) | Specifies the service level of the provided data | +| quality | [Quality Object](#quality-object) | Specifies the quality attributes and checks. The specification supports different quality check DSLs. | + +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). + + + + +### Info Object + +Metadata and life cycle information about the data contract. + + +| Field | Type | Description | +|-------------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| title | `string` | REQUIRED. The title of the data contract. | +| version | `string` | REQUIRED. The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version). | +| status | `string` | The status of the data contract. Can be proposed, in development, active, retired. | +| description | `string` | A description of the data contract. | +| owner | `string` | The owner or team responsible for managing the data contract and providing the data. | +| contact | [Contact Object](#contact-object) | Contact information for the data contract. 
| + +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). + + +### Contact Object + +Contact information for the data contract. + +| Field | Type | Description | +|-------|----------|-------------------------------------------------------------------------------------------------------| +| name | `string` | The identifying name of the contact person/organization. | +| url | `string` | The URL pointing to the contact information. This _MUST_ be in the form of a URL. | +| email | `string` | The email address of the contact person/organization. This _MUST_ be in the form of an email address. | + +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). + +### Server Object + +The fields are dependent on the defined type. + +| Field | Type | Description | +|-------------|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| type | `string` | REQUIRED. The type of the data product technology that implements the data contract. Well-known server types are: `bigquery`, `s3`, `glue`, `redshift`, `azure`, `snowflake`, `databricks`, `postgres`, `oracle`, `kafka`, `pubsub`, `sftp`, `local` | +| description | `string` | An optional string describing the server. | + +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). + +#### BigQuery Server Object + +| Field | Type | Description | +|---------|----------|-----------------------| +| type | `string` | `bigquery` | +| project | `string` | The GCP project name. 
| +| dataset | `string` | | + +#### S3 Server Object + +| Field | Type | Description | +|-------------|----------|------------------------------------------------------------------------------------------------------------------| +| type | `string` | `s3` | +| location | `string` | S3 URL, starting with `s3://` | +| endpointUrl | `string` | The server endpoint for S3-compatible servers, such as `https://minio.example.com` | +| format | `string` | Format of files, such as `parquet`, `delta`, `json`, `csv` | +| delimiter | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array` | + +Example: + +```yaml +servers: + production: + type: s3 + location: s3://acme-orders-prod/orders/ +``` + +#### AWS Glue Server Object + +| Field | Type | Description | +|----------|----------|------------------------------------------------------------| +| type | `string` | `glue` | +| account | `string` | REQUIRED. The AWS account, e.g., `1234-5678-9012` | +| database | `string` | REQUIRED. 
The AWS Glue Catalog database | +| location | `string` | S3 path, starting with `s3://` | +| format | `string` | Format of files, such as `parquet`, `delta`, `json`, `csv` | + +Example: + +```yaml +servers: + production: + type: glue + account: "1234-5678-9012" + database: acme-orders + location: s3://acme-orders-prod/orders/ + format: parquet +``` + + +#### Redshift Server Object + +| Field | Type | Description | +|----------|----------|-------------| +| type | `string` | `redshift` | +| account | `string` | | +| database | `string` | | +| schema | `string` | | + +#### Azure Server Object + +| Field | Type | Description | +|-----------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| type | `string` | `azure` | +| location | `string` | Fully qualified path to Azure Blob Storage or Azure Data Lake Storage (ADLS), supports globs. Starting with `az://` or `abfss`
Examples: `az://my_storage_account_name.blob.core.windows.net/my_container/path/*.parquet` or `abfss://my_storage_account_name.dfs.core.windows.net/my_container_name/path/*.parquet` | +| format | `string` | Format of files, such as `parquet`, `json`, `csv` | +| delimiter | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array` | + + +#### Snowflake Server Object + +| Field | Type | Description | +|----------|----------|-------------| +| type | `string` | `snowflake` | +| account | `string` | | +| database | `string` | | +| schema | `string` | | + +#### Databricks Server Object + +| Field | Type | Description | +|---------|----------|---------------------------------------------------------------------| +| type | `string` | `databricks` | +| host | `string` | The Databricks host, e.g., `dbc-abcdefgh-1234.cloud.databricks.com` | +| catalog | `string` | The name of the Hive or Unity catalog | +| schema | `string` | The schema name in the catalog | + +#### Postgres Server Object + +| Field | Type | Description | +|----------|-----------|---------------------------------------------------------| +| type | `string` | `postgres` | +| host | `string` | The host to the database server | +| port | `integer` | The port to the database server | +| database | `string` | The name of the database, e.g., `postgres`. | +| schema | `string` | The name of the schema in the database, e.g., `public`. 
| + +#### Oracle Server Object + +| Field | Type | Description | +|-------------|-----------|---------------------------------| +| type | `string` | `oracle` | +| host | `string` | The host to the oracle server | +| port | `integer` | The port to the oracle server | +| serviceName | `string` | The name of the service | + +#### Kafka Server Object + +| Field | Type | Description | +|--------|----------|---------------------------------------------------------------------------| +| type | `string` | `kafka` | +| host | `string` | The bootstrap server of the kafka cluster. | +| topic | `string` | The topic name. | +| format | `string` | The format of the message. Examples: json, avro, protobuf. Default: json. | + +#### Pub/Sub Server Object + +| Field | Type | Description | +|---------|----------|-----------------------| +| type | `string` | `pubsub` | +| project | `string` | The GCP project name. | +| topic | `string` | The topic name. | + +#### sftp Server Object + +| Field | Type | Description | +|-----------|----------|------------------------------------------------------------------------------------------------------------------| +| type | `string` | `sftp` | +| location | `string` | S3 URL, starting with `sftp://` | +| format | `string` | Format of files, such as `parquet`, `delta`, `json`, `csv` | +| delimiter | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array` | + +#### Local Server Object + +| Field | Type | Description | +|--------|----------|-------------------------------------------------------------------------------------| +| type | `string` | `local` | +| path | `string` | The relative or absolute path to the data file(s), such as `./folder/data.parquet`. | +| format | `string` | The format of the file(s), such as `parquet`, `delta`, `csv`, or `json`. | + +### Terms Object + +The terms and conditions of the data contract. 
+ +| Field | Type | Description | +|--------------|----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| usage | `string` | The usage describes the way the data is expected to be used. Can contain business and technical information. | +| limitations | `string` | The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for. | +| billing | `string` | The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use. | +| noticePeriod | `string` | The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., `P3M` for a period of three months. | + + +### Model Object + +The Model Object describes the structure and semantics of a data model, such as tables, views, or structured files. + +The name of the data model (table name) is defined by the key that refers to this Model Object. + +| Field | Type | Description | +|-------------|----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------| +| type | `string` | The type of the model. Examples: `table`, `view`, `object`. Default: `table`. | +| description | `string` | An optional string describing the data model. | +| title | `string` | An optional string for the title of the data model. Especially useful if the name of the model is cryptic or contains abbreviations. | +| fields | Map[`string`, [Field Object](#field-object)] | The fields (e.g. columns) of the data model. | + + + +### Field Object + +The Field Object describes one field (column, property, nested field) of a data model. 
+ +| Field | Type | Description | +|------------------|----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| description | `string` | An optional string describing the semantic of the data in this field. | +| type | [Data Type](#data-types) | The logical data type of the field. | +| title | `string` | An optional string providing a human readable name for the field. Especially useful if the field name is cryptic or contains abbreviations. | +| enum | array of `string` | A value must be equal to one of the elements in this array value. Only evaluated if the value is not null. | +| required | `boolean` | An indication, if this field must contain a value and may not be null. Default: `false` | +| primary | `boolean` | If this field is a primary key. Default: `false` | +| references | `string` | The reference to a field in another model. E.g. use 'orders.order_id' to reference the order_id field of the model orders. Think of defining a foreign key relationship. | +| unique | `boolean` | An indication, if the value must be unique within the model. Default: `false` | +| format | `string` | `email`: A value must be complaint to [RFC 5321, section 4.1.2](https://www.rfc-editor.org/info/rfc5321).
`uri`: A value must be compliant to [RFC 3986](https://www.rfc-editor.org/info/rfc3986).
`uuid`: A value must be compliant to [RFC 4122](https://www.rfc-editor.org/info/rfc4122). Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). | +| precision | `number` | The maximum number of digits in a number. Only applies to numeric values. Defaults to 38. | +| scale | `number` | The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0. | +| minLength | `number` | A value must be greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). | +| maxLength | `number` | A value must be less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). | +| pattern | `string` | A value must be valid according to the [ECMA-262](https://262.ecma-international.org/5.1/) regular expression dialect. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). | +| minimum | `number` | A value of a number must be greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | +| exclusiveMinimum | `number` | A value of a number must be greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | +| maximum | `number` | A value of a number must be less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | +| exclusiveMaximum | `number` | A value of a number must be less than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | +| example | `string` | An example value. | +| pii | `boolean` | An indication, if this field contains Personally Identifiable Information (PII). 
| +| classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. Examples may be: `sensitive`, `restricted`, `internal`, `public`. | +| tags | Array of `string` | Custom metadata to provide additional context. | +| $ref | `string` | A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition. | +| fields | Map[`string`, [Field Object](#field-object)] | The nested fields (e.g. columns) of the object, record, or struct. Use only when type is object, record, or struct. | +| items | [Field Object](#field-object) | The type of the elements in the array. Use only when type is array. | + +### Definition Object + +The Definition Object includes a clear and concise explanations of syntax, semantic, and classification of a business object in a given domain. +It serves as a reference for a common understanding of terminology, ensure consistent usage and to identify join-able fields. +Models fields can refer to definitions using the `$ref` field to link to existing definitions and avoid duplicate documentations. + +| Field | Type | Description | +|------------------|--------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| name | `string` | REQUIRED. The technical name of this definition. | +| type | [Data Type](#data-types) | REQUIRED. The logical data type | +| domain | `string` | The domain in which this definition is valid. Default: `global`. | +| title | `string` | The business name of this definition. 
| +| description | `string` | Clear and concise explanations related to the domain | +| enum | array of `string` | A value must be equal to one of the elements in this array value. Only evaluated if the value is not null. | +| format | `string` | `email`: A value must be compliant to [RFC 5321, section 4.1.2](https://www.rfc-editor.org/info/rfc5321).
`uri`: A value must be compliant to [RFC 3986](https://www.rfc-editor.org/info/rfc3986).
`uuid`: A value must be compliant to [RFC 4122](https://www.rfc-editor.org/info/rfc4122). Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). | +| precision | `number` | The maximum number of digits in a number. Only applies to numeric values. Defaults to 38. | +| scale | `number` | The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0. | +| minLength | `number` | A value must be greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). | +| maxLength | `number` | A value must be less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). | +| pattern | `string` | A value must be valid according to the [ECMA-262](https://262.ecma-international.org/5.1/) regular expression dialect. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). | +| minimum | `number` | A value of a number must be greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | +| exclusiveMinimum | `number` | A value of a number must be greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | +| maximum | `number` | A value of a number must be less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | +| exclusiveMaximum | `number` | A value of a number must be less than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | +| example | `string` | An example value. | +| pii | `boolean` | An indication, if this field contains Personally Identifiable Information (PII). 
| +| classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. | +| tags | Array of `string` | Custom metadata to provide additional context. | + + +### Schema Object + +The schema of the data contract describes the physical schema. +The type of the schema depends on the data platform. + +| Field | Type | Description | +|---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------| +| type | `string` | REQUIRED. The type of the schema.
Typical values are: `dbt`, `bigquery`, `json-schema`, `sql-ddl`, `avro`, `protobuf`, `custom` | +| specification | [dbt Schema Object](#dbt-schema-object) \|
[BigQuery Schema Object](#bigquery-schema-object) \|
[JSON Schema Schema Object](#json-schema-schema-object) \|
[SQL DDL Schema Object](#sql-ddl-schema-object) \|
`string` | REQUIRED. The specification of the schema. The schema specification can be encoded as a string or as inline YAML. | + + +#### dbt Schema Object + +https://docs.getdbt.com/reference/model-properties + +Example (inline YAML): + +```yaml +schema: + type: dbt + specification: + version: 2 + models: + - name: "My Table" + description: "My description" + columns: + - name: "My column" + data_type: text + description: "My description" +``` + +Example (string): + +```yaml +schema: + type: dbt + specification: |- + version: 2 + models: + - name: "My Table" + description: "My description" + columns: + - name: "My column" + data_type: text + description: "My description" +``` + +#### BigQuery Schema Object + +The schema structure is defined by the [Google BigQuery Table](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource:-table) object. You can extract such a Table object via the [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) endpoint. + +Instead of providing a single Table object, you can also provide an array of such objects. Be aware that [tables.list](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list) only returns a subset of the full Table object. You need to call every Table object via [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) to get the full Table object, including the actual schema. 
+ +Learn more: [Google BigQuery REST Reference v2](https://cloud.google.com/bigquery/docs/reference/rest) + + + +Example: + +```yaml +schema: + type: bigquery + specification: |- + { + "tableReference": { + "projectId": "my-project", + "datasetId": "my_dataset", + "tableId": "my_table" + }, + "description": "This is a description", + "type": "TABLE", + "schema": { + "fields": [ + { + "name": "name", + "type": "STRING", + "mode": "NULLABLE", + "description": "This is a description" + } + ] + } + } +``` + +#### JSON Schema Schema Object + +JSON Schema can be defined as JSON or rendered as YAML, following the [OpenAPI Schema Object dialect](https://spec.openapis.org/oas/v3.1.0#properties) + +Example (inline YAML): + +```yaml +schema: + type: json-schema + specification: + orders: + description: One record per order. Includes cancelled and deleted orders. + type: object + properties: + order_id: + type: string + description: Primary key of the orders table + order_timestamp: + type: string + format: date-time + description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. + order_total: + type: integer + description: Total amount of the order in the smallest monetary unit (e.g., cents). + line_items: + type: object + properties: + lines_item_id: + type: string + description: Primary key of the lines_item_id table + order_id: + type: string + description: Foreign key to the orders table + sku: + type: string + description: The purchased article number +``` + +Example (string): + +```yaml +schema: + type: json-schema + specification: |- + { + "$schema": "http://json-schema.org/draft-07/schema#", + "type": "object", + "properties": { + "orders": { + "type": "object", + "description": "One record per order. 
Includes cancelled and deleted orders.", + "properties": { + "order_id": { + "type": "string", + "description": "Primary key of the orders table" + }, + "order_timestamp": { + "type": "string", + "format": "date-time", + "description": "The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful." + }, + "order_total": { + "type": "integer", + "description": "Total amount of the order in the smallest monetary unit (e.g., cents)." + } + }, + "required": ["order_id", "order_timestamp", "order_total"] + }, + "line_items": { + "type": "object", + "properties": { + "lines_item_id": { + "type": "string", + "description": "Primary key of the lines_item_id table" + }, + "order_id": { + "type": "string", + "description": "Foreign key to the orders table" + }, + "sku": { + "type": "string", + "description": "The purchased article number" + } + }, + "required": ["lines_item_id", "order_id", "sku"] + } + }, + "required": ["orders", "line_items"] + } +``` + +#### SQL DDL Schema Object + +Classical SQL DDLs can be used to describe the structure. + + +Example (string): + +```yaml +schema: + type: sql-ddl + specification: |- + -- One record per order. Includes cancelled and deleted orders. + CREATE TABLE orders ( + order_id TEXT PRIMARY KEY, -- Primary key of the orders table + order_timestamp TIMESTAMPTZ NOT NULL, -- The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. 
+ order_total INTEGER NOT NULL -- Total amount of the order in the smallest monetary unit (e.g., cents) + ); + + -- The items that are part of an order + CREATE TABLE line_items ( + lines_item_id TEXT PRIMARY KEY, -- Primary key of the lines_item_id table + order_id TEXT REFERENCES orders(order_id), -- Foreign key to the orders table + sku TEXT NOT NULL -- The purchased article number + ); + +``` + +### Example Object + +| Field | Type | Description | +|-------------|----------|-----------------------------------------------------------------------------------------------------------------------------------------| +| type | `string` | The type of the data product technology that implements the data contract. Well-known server types are: `csv`, `json`, `yaml`, `custom` | +| description | `string` | An optional string describing the example. | +| model | `string` | The reference to the model in the schema, e.g. a table name. | +| data | `string` | Example data for this model. | + +Example: + +```yaml +examples: +- type: csv + model: orders + data: |- + order_id,order_timestamp,order_total + "1001","2023-09-09T08:30:00Z",2500 + "1002","2023-09-08T15:45:00Z",1800 + "1003","2023-09-07T12:15:00Z",3200 + "1004","2023-09-06T19:20:00Z",1500 + "1005","2023-09-05T10:10:00Z",4200 + "1006","2023-09-04T14:55:00Z",2800 + "1007","2023-09-03T21:05:00Z",1900 + "1008","2023-09-02T17:40:00Z",3600 + "1009","2023-09-01T09:25:00Z",3100 + "1010","2023-08-31T22:50:00Z",2700 +``` + +### Service Levels Object + +A service level is defined as an agreed-upon, measurable level of performance for the provided data. +Data Contract Specification defines well-known service levels. +This list can be extended with custom service levels. + +One can either describe each service level informally using the `description` field, or make use of the predefined fields for automation support, e.g., via the [Data Contract CLI](https://cli.datacontract.com). 
+ +| Field | Type | Description | +|--------------|-----------------------------------------------|-------------------------------------------------------------------------| +| availability | [Availability Object](#availability-object) | The promised uptime of the system that provides the data | +| retention | [Retention Object](#retention-object) | The period how long data will be available. | +| latency | [Latency Object](#latency-object) | The maximum amount of time from the source to its destination. | +| freshness | [Freshness Object](#freshness-object) | The maximum age of the youngest entry. | +| frequency | [Frequency Object](#frequency-object) | The update frequency. | +| support | [Support Object](#support-object) | The times when support is provided. | +| backup | [Backup Object](#backup-object) | The details about data backup procedures. | + +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). + +#### Availability Object + +Availability refers to the promise or guarantee by the service provider about the uptime of the system that provides the data. + +| Field | Type | Description | +|-------------|----------|--------------------------------------------------------------------------------| +| description | `string` | An optional string describing the availability service level. | +| percentage | `string` | An optional string describing the guaranteed uptime in percent (e.g., `99.9%`) | + +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). + +#### Retention Object + +Retention covers the period how long data will be available. + +| Field | Type | Description | +|----------------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------| +| description | `string` | An optional string describing the retention service level. 
| +| period | `string` | An optional period of time, how long data is available. Supported formats: Simple duration (e.g., `1 year`, `30d`) and ISO 8601 duration (e.g, `P1Y`). | +| unlimited | `boolean` | An optional indicator that data is kept forever. | +| timestampField | `string` | An optional reference to the field that contains the timestamp that the period refers to. | + +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). + +#### Latency Object + +Latency refers to the maximum amount of time from the source to its destination. + +Examples are the maximum duration it takes after an order has been recorded in the ecommerce shop until it is available in the orders table in the data analytics platform. This includes the waiting times until the next batch run is started and the processing time of the pipeline. + +| Field | Type | Description | +|-------------------------|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| description | `string` | An optional string describing the latency service level. | +| threshold | `string` | An optional maximum duration between the source timestamp and the processed timestamp. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g, `PT24H`). | +| sourceTimestampField | `string` | An optional reference to the field that contains the timestamp when the data was provided at the source. | +| processedTimestampField | `string` | An optional reference to the field that contains the processing timestamp, which denotes when the data is made available to consumers of this data contract. | + +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). + +#### Freshness Object + +Freshness refers to the maximum age of the youngest entry. 
+ +| Field | Type | Description | +|-------------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------| +| description | `string` | An optional string describing the freshness service level. | +| threshold | `string` | An optional maximum age of the youngest entry. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g, `PT24H`). | +| timestampField | `string` | An optional reference to the field that contains the timestamp that the threshold refers to. | + +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). + +#### Frequency Object + +Frequency describes how often data is updated. + +| Field | Type | Description | +|-------------|----------|-----------------------------------------------------------------------------------------------------------| +| description | `string` | An optional string describing the frequency service level. | +| type | `string` | An optional type of data processing. Typical values are `batch`, `micro-batching`, `streaming`, `manual`. | +| interval | `string` | Optional. Only for batch: How often the pipeline is triggered, e.g., `daily`. | +| cron | `string` | Optional. Only for batch: A cron expression when the pipelines is triggered. E.g., `0 0 * * *`. | + +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). + + +#### Support Object + +Support describes the times when support will be available for contact. + +| Field | Type | Description | +|--------------|----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| description | `string` | An optional string describing the support service level. 
| +| time | `string` | An optional string describing the times when support will be available for contact such as `24/7` or `business hours only`. | +| responseTime | `string` | An optional string describing the time it takes for the support team to acknowledge a request. This does not mean the issue will be resolved immediately, but it assures users that their request has been received and will be dealt with. | + + +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). + + +#### Backup Object + +Backup specifies details about data backup procedures. + +| Field | Type | Description | +|---------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| description | `string` | An optional string describing the backup service level. | +| interval | `string` | An optional interval that defines how often data will be backed up, e.g., `daily`. | +| cron | `string` | An optional cron expression when data will be backed up, e.g., `0 0 * * *`. | +| recoveryTime | `string` | An optional Recovery Time Objective (RTO) specifies the maximum amount of time allowed to restore data from a backup after a failure or loss event (e.g., 4 hours, 24 hours). | +| recoveryPoint | `string` | An optional Recovery Point Objective (RPO) defines the maximum acceptable age of files that must be recovered from backup storage for normal operations to resume after a disaster or data loss event. This essentially measures how much data you can afford to lose, measured in time (e.g., 4 hours, 24 hours). | + + + +### Quality Object + +The quality object contains quality attributes and checks. 
+ +| Field | Type | Description | +|---------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------| +| type | `string` | REQUIRED. The type of the quality check format.
Typical values are: `SodaCL`, `montecarlo`, `great-expectations`, `custom` | +| specification | [SodaCL Quality Object](#sodacl-quality-object) \|
[Monte Carlo Quality Object](#monte-carlo-quality-object) \|
`string` | REQUIRED. The specification of the quality attributes. The quality specification can be encoded as a string or as inline YAML. | + + +#### SodaCL Quality Object + +Quality attributes in [Soda Checks Language](https://docs.soda.io/soda-cl/soda-cl-overview.html). + +The `specification` represents the content of a `checks.yml` file. + +Example (inline): + +```yaml +quality: + type: SodaCL # data quality check format: SodaCL, montecarlo, dbt-tests, custom + specification: # expressed as string or inline yaml or via "$ref: checks.yaml" + checks for orders: + - row_count > 0 + - duplicate_count(order_id) = 0 + checks for line_items: + - row_count > 0 +``` + +Example (string): + +```yaml +quality: + type: SodaCL + specification: |- + checks for search_queries: + - freshness(search_timestamp) < 1d + - row_count > 100000 + - missing_count(search_query) = 0 +``` + +#### Monte Carlo Quality Object + +Quality attributes defined as Monte Carlo's [Monitors as Code](https://docs.getmontecarlo.com/docs/monitors-as-code). + +The `specification` represents the content of a `montecarlo.yml` file. + +Example (string): + +```yaml +quality: + type: montecarlo + specification: |- + montecarlo: + field_health: + - table: project:dataset.table_name + timestamp_field: created + dimension_tracking: + - table: project:dataset.table_name + timestamp_field: created + field: order_status +``` + +#### Great Expectations Quality Object + +Quality attributes defined as Great Expectations [Expectations](https://greatexpectations.io/expectations/). + +The `specification` represents a list of expectations on a specific model. 
+ +Example (string): + +```yaml +quality: + type: great-expectations + specification: + orders: |- + [ + { + "expectation_type": "expect_table_row_count_to_be_between", + "kwargs": { + "min_value": 10 + }, + "meta": { + + } + } + ] +``` + +### Data Types + +The following data types are supported for model fields and definitions: + +- Unicode character sequence: `string`, `text`, `varchar` +- Any numeric type, either integers or floating point numbers: `number`, `decimal`, `numeric` +- 32-bit signed integer: `int`, `integer` +- 64-bit signed integer: `long`, `bigint` +- Single precision (32-bit) IEEE 754 floating-point number: `float` +- Double precision (64-bit) IEEE 754 floating-point number: `double` +- Binary value: `boolean` +- Timestamp with timezone: `timestamp`, `timestamp_tz` +- Timestamp with no timezone: `timestamp_ntz` +- Date with no time information: `date` +- Array: `array` +- Sequence of 8-bit unsigned bytes: `bytes` +- Complex type: `object`, `record`, `struct` +- No value: `null` + +### Specification Extensions + +While the Data Contract Specification tries to accommodate most use cases, additional data can be added to extend the specification at certain points. + +Custom fields can be added with any name. The value can be null, a primitive, an array or an object. + +### Design Principles + +The Data Contract Specification follows these design principles: + +- A free, open, and open-sourced standard +- Follow OpenAPI and AsyncAPI conventions so that it feels immediately familiar +- Support contract-first approaches +- Support code-first approaches +- Support tooling by being machine-readable + +Tooling +--- +- [Data Contract CLI](https://github.com/datacontract/datacontract-cli) is a free CLI tool to help you create, develop, and maintain your data contracts. +- [Data Mesh Manager](https://www.datamesh-manager.com/) is a commercial tool to manage data products and data contracts. 
It supports the data contract specification and allows the user to import or export data contracts using this specification. + + +Other Data Contract Specifications +--- +- [AIDA User Group's Open Data Contract Standard](https://github.com/AIDAUserGroup/open-data-contract-standard) +- [PayPal's Data Contract Template](https://github.com/paypal/data-contract-template/blob/main/docs/README.md) + +Literature +--- +- [Driving Data Quality with Data Contracts](https://www.amazon.com/dp/B0C37FPH3D) by Andrew Jones + +Authors +--- +The Data Contract Specification was originally created by [Jochen Christ](https://www.linkedin.com/in/jochenchrist/) and [Dr. Simon Harrer](https://www.linkedin.com/in/simonharrer/), and is currently maintained by them. + + +Contributing +--- +Contributions are welcome! Please open an issue or a pull request. + +License +--- +[MIT License](LICENSE) + + + diff --git a/versions/0.9.3/datacontract.init.yaml b/versions/0.9.3/datacontract.init.yaml new file mode 100644 index 0000000..382ad8b --- /dev/null +++ b/versions/0.9.3/datacontract.init.yaml @@ -0,0 +1,109 @@ +dataContractSpecification: 0.9.3 +id: my-data-contract-id +info: + title: My Data Contract + version: 0.0.1 +# description: +# owner: +# contact: +# name: +# url: +# email: + + +### servers + +#servers: +# production: +# type: s3 +# location: s3:// +# format: parquet +# delimiter: new_line + +### terms + +#terms: +# usage: +# limitations: +# billing: +# noticePeriod: + + +### models + +# models: +# my_model: +# description: +# type: +# fields: +# my_field: +# type: +# description: + + +### definitions + +# definitions: +# my_field: +# domain: +# name: +# title: +# type: +# description: +# example: +# pii: +# classification: + + +### examples + +#examples: +# - type: csv +# model: my_model +# data: |- +# id,timestamp,amount +# "1001","2023-09-09T08:30:00Z",2500 +# "1002","2023-09-08T15:45:00Z",1800 + +### servicelevels + +#servicelevels: +# availability: +# description: The server is 
available during support hours +# percentage: 99.9% +# retention: +# description: Data is retained for one year because! +# period: P1Y +# unlimited: false +# latency: +# description: Data is available within 25 hours after the order was placed +# threshold: 25h +# sourceTimestampField: orders.order_timestamp +# processedTimestampField: orders.processed_timestamp +# freshness: +# description: The age of the youngest row in a table. +# threshold: 25h +# timestampField: orders.order_timestamp +# frequency: +# description: Data is delivered once a day +# type: batch # or streaming +# interval: daily # for batch, either or cron +# cron: 0 0 * * * # for batch, either or interval +# support: +# description: The data is available during typical business hours at headquarters +# time: 9am to 5pm in EST on business days +# responseTime: 1h +# backup: +# description: Data is backed up once a week, every Sunday at 0:00 UTC. +# interval: weekly +# cron: 0 0 * * 0 +# recoveryTime: 24 hours +# recoveryPoint: 1 week + +### quality + +#quality: +# type: SodaCL +# specification: +# checks for my_model: |- +# - duplicate_count(id) = 0 diff --git a/versions/0.9.3/datacontract.schema.json b/versions/0.9.3/datacontract.schema.json new file mode 100644 index 0000000..9c65c5d --- /dev/null +++ b/versions/0.9.3/datacontract.schema.json @@ -0,0 +1,1215 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "type": "object", + "title": "DataContractSpecification", + "properties": { + "dataContractSpecification": { + "type": "string", + "title": "DataContractSpecificationVersion", + "enum": [ + "0.9.3", + "0.9.2", + "0.9.1", + "0.9.0" + ], + "description": "Specifies the Data Contract Specification being used." + }, + "id": { + "type": "string", + "description": "Specifies the identifier of the data contract." + }, + "info": { + "type": "object", + "properties": { + "title": { + "type": "string", + "description": "The title of the data contract." 
+ }, + "version": { + "type": "string", + "description": "The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version)." + }, + "status": { + "type": "string", + "description": "The status of the data contract. Can be proposed, in development, active, retired.", + "x-extensible-enum": [ + "proposed", + "in development", + "active", + "retired" + ] + }, + "description": { + "type": "string", + "description": "A description of the data contract." + }, + "owner": { + "type": "string", + "description": "The owner or team responsible for managing the data contract and providing the data." + }, + "contact": { + "type": "object", + "properties": { + "name": { + "type": "string", + "description": "The identifying name of the contact person/organization." + }, + "url": { + "type": "string", + "format": "uri", + "description": "The URL pointing to the contact information. This MUST be in the form of a URL." + }, + "email": { + "type": "string", + "format": "email", + "description": "The email address of the contact person/organization. This MUST be in the form of an email address." + } + }, + "description": "Contact information for the data contract.", + "additionalProperties": true + } + }, + "additionalProperties": true, + "required": [ + "title", + "version" + ], + "description": "Metadata and life cycle information about the data contract." + }, + "servers": { + "type": "object", + "additionalProperties": { + "oneOf": [ + { + "type": "object", + "title": "BigQueryServer", + "properties": { + "type": { + "type": "string", + "enum": [ + "bigquery", + "BigQuery" + ], + "description": "The type of the data product technology that implements the data contract." + }, + "project": { + "type": "string", + "description": "An optional string describing the server." + }, + "dataset": { + "type": "string", + "description": "An optional string describing the server." 
+ } + }, + "additionalProperties": true, + "required": [ + "type", + "project", + "dataset" + ] + }, + { + "type": "object", + "title": "S3Server", + "properties": { + "type": { + "type": "string", + "enum": [ + "s3" + ], + "description": "The type of the data product technology that implements the data contract." + }, + "location": { + "type": "string", + "format": "uri", + "description": "An optional string describing the server. Must be in the form of a URL.", + "examples": [ + "s3://datacontract-example-orders-latest/data/{model}/*.json" + ] + }, + "endpointUrl": { + "type": "string", + "format": "uri", + "description": "The server endpoint for S3-compatible servers.", + "examples": ["https://minio.example.com"] + }, + "format": { + "type": "string", + "enum": [ + "parquet", + "delta", + "json", + "csv" + ], + "description": "File format." + }, + "delimiter": { + "type": "string", + "enum": [ + "new_line", + "array" + ], + "description": "Only for format = json. How multiple json documents are delimited within one file" + } + }, + "additionalProperties": true, + "required": [ + "type", + "location" + ] + }, + { + "type": "object", + "title": "SftpServer", + "properties": { + "type": { + "type": "string", + "enum": [ + "sftp" + ], + "description": "The type of the data product technology that implements the data contract." + }, + "location": { + "type": "string", + "format": "uri", + "description": "An optional string describing the server. Must be in the form of a sftp URL.", + "examples": [ + "sftp://123.123.12.123/{model}/*.json" + ] + }, + "format": { + "type": "string", + "enum": [ + "parquet", + "delta", + "json", + "csv" + ], + "description": "File format." + }, + "delimiter": { + "type": "string", + "enum": [ + "new_line", + "array" + ], + "description": "Only for format = json. 
How multiple json documents are delimited within one file" + } + }, + "additionalProperties": true, + "required": [ + "type", + "location" + ] + }, + { + "type": "object", + "title": "RedshiftServer", + "properties": { + "type": { + "type": "string", + "enum": [ + "redshift" + ], + "description": "The type of the data product technology that implements the data contract." + }, + "account": { + "type": "string", + "description": "An optional string describing the server." + }, + "database": { + "type": "string", + "description": "An optional string describing the server." + }, + "schema": { + "type": "string", + "description": "An optional string describing the server." + } + }, + "additionalProperties": true, + "required": [ + "type", + "account", + "database", + "schema" + ] + }, + { + "type": "object", + "title": "AzureServer", + "properties": { + "type": { + "type": "string", + "enum": [ + "azure" + ], + "description": "The type of the data product technology that implements the data contract." + }, + "location": { + "type": "string", + "format": "uri", + "description": "Fully qualified path to Azure Blob Storage or Azure Data Lake Storage (ADLS), supports globs.", + "examples": [ + "az://my_storage_account_name.blob.core.windows.net/my_container/path/*.parquet", + "abfss://my_storage_account_name.dfs.core.windows.net/my_container_name/path/*.parquet" + ] + }, + "format": { + "type": "string", + "enum": [ + "parquet", + "delta", + "json", + "csv" + ], + "description": "File format." + }, + "delimiter": { + "type": "string", + "enum": [ + "new_line", + "array" + ], + "description": "Only for format = json. 
How multiple json documents are delimited within one file" + } + }, + "additionalProperties": true, + "required": [ + "type", + "location", + "format" + ] + }, + { + "type": "object", + "title": "SnowflakeServer", + "properties": { + "type": { + "type": "string", + "enum": [ + "snowflake" + ], + "description": "The type of the data product technology that implements the data contract." + }, + "account": { + "type": "string", + "description": "An optional string describing the server." + }, + "database": { + "type": "string", + "description": "An optional string describing the server." + }, + "schema": { + "type": "string", + "description": "An optional string describing the server." + } + }, + "additionalProperties": true, + "required": [ + "type", + "account", + "database", + "schema" + ] + }, + { + "type": "object", + "title": "DatabricksServer", + "properties": { + "type": { + "type": "string", + "const": "databricks", + "description": "The type of the data product technology that implements the data contract." + }, + "host": { + "type": "string", + "description": "The Databricks host", + "examples": [ + "dbc-abcdefgh-1234.cloud.databricks.com" + ] + }, + "catalog": { + "type": "string", + "description": "The name of the Hive or Unity catalog" + }, + "schema": { + "type": "string", + "description": "The schema name in the catalog" + } + }, + "additionalProperties": true, + "required": [ + "type", + "host", + "catalog", + "schema" + ] + }, + { + "type": "object", + "title": "GlueServer", + "properties": { + "type": { + "type": "string", + "const": "glue", + "description": "The type of the data product technology that implements the data contract." 
+ }, + "account": { + "type": "string", + "description": "The AWS Glue account", + "examples": [ + "1234-5678-9012" + ] + }, + "database": { + "type": "string", + "description": "The AWS Glue database name", + "examples": [ + "my_database" + ] + }, + "location": { + "type": "string", + "format": "uri", + "description": "The AWS S3 path. Must be in the form of a URL.", + "examples": [ + "s3://datacontract-example-orders-latest/data/{model}" + ] + }, + "format": { + "type": "string", + "description": "The format of the files", + "examples": [ + "parquet", + "csv", + "json", + "delta" + ] + } + }, + "additionalProperties": true, + "required": [ + "type", + "account", + "database" + ] + }, + { + "type": "object", + "title": "PostgresServer", + "properties": { + "type": { + "type": "string", + "const": "postgres", + "description": "The type of the data product technology that implements the data contract." + }, + "host": { + "type": "string", + "description": "The host to the database server", + "examples": [ + "localhost" + ] + }, + "port": { + "type": "integer", + "description": "The port to the database server." + }, + "database": { + "type": "string", + "description": "The name of the database.", + "examples": [ + "postgres" + ] + }, + "schema": { + "type": "string", + "description": "The name of the schema in the database.", + "examples": [ + "public" + ] + } + }, + "additionalProperties": true, + "required": [ + "type", + "host", + "port", + "database", + "schema" + ] + }, + { + "type": "object", + "title": "OracleServer", + "properties": { + "type": { + "type": "string", + "const": "oracle", + "description": "The type of the data product technology that implements the data contract." 
+ }, + "host": { + "type": "string", + "description": "The host to the oracle server", + "examples": [ + "localhost" + ] + }, + "port": { + "type": "integer", + "description": "The port to the oracle server.", + "examples": [ + 1523 + ] + }, + "serviceName": { + "type": "string", + "description": "The name of the service.", + "examples": [ + "service" + ] + } + }, + "additionalProperties": true, + "required": [ + "type", + "host", + "port", + "serviceName" + ] + }, + { + "type": "object", + "title": "KafkaServer", + "description": "Kafka Server", + "properties": { + "type": { + "type": "string", + "enum": [ + "kafka" + ], + "description": "The type of the data product technology that implements the data contract." + }, + "host": { + "type": "string", + "description": "The bootstrap server of the kafka cluster." + }, + "topic": { + "type": "string", + "description": "The topic name." + }, + "format": { + "type": "string", + "description": "The format of the message. Examples: json, avro, protobuf. Default: json.", + "default": "json" + } + }, + "additionalProperties": true, + "required": [ + "type", + "host", + "topic" + ] + }, + { + "type": "object", + "title": "PubSubServer", + "properties": { + "type": { + "type": "string", + "enum": [ + "pubsub" + ], + "description": "The type of the data product technology that implements the data contract." + }, + "project": { + "type": "string", + "description": "The GCP project name." + }, + "topic": { + "type": "string", + "description": "The topic name." + } + }, + "additionalProperties": true, + "required": [ + "type", + "project", + "topic" + ] + }, + { + "type": "object", + "title": "LocalServer", + "properties": { + "type": { + "type": "string", + "enum": [ + "local" + ], + "description": "The type of the data product technology that implements the data contract." 
+ }, + "path": { + "type": "string", + "description": "The relative or absolute path to the data file(s).", + "examples": [ + "./folder/data.parquet", + "./folder/*.parquet" + ] + }, + "format": { + "type": "string", + "description": "The format of the file(s)", + "examples": [ + "json", + "parquet", + "delta", + "csv" + ] + } + }, + "additionalProperties": true, + "required": [ + "type", + "path", + "format" + ] + } + ] + }, + "description": "Information about the servers." + }, + "terms": { + "type": "object", + "description": "The terms and conditions of the data contract.", + "properties": { + "usage": { + "type": "string", + "description": "The usage describes the way the data is expected to be used. Can contain business and technical information." + }, + "limitations": { + "type": "string", + "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." + }, + "billing": { + "type": "string", + "description": "The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use." + }, + "noticePeriod": { + "type": "string", + "description": "The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., 'P3M' for a period of three months." + } + }, + "additionalProperties": true + }, + "models": { + "description": "Specifies the logical data model. Use the models name (e.g., the table name) as the key.", + "type": "object", + "minProperties": 1, + "propertyNames": { + "pattern": "^[a-zA-Z0-9_-]+$" + }, + "additionalProperties": { + "type": "object", + "title": "Model", + "properties": { + "description": { + "type": "string" + }, + "type": { + "description": "The type of the model. Examples: table, view, object. 
Default: table.", + "type": "string", + "title": "ModelType", + "default": "table", + "enum": [ + "table", + "view", + "object" + ] + }, + "title": { + "type": "string", + "description": "An optional string providing a human readable name for the model. Especially useful if the model name is cryptic or contains abbreviations.", + "examples": ["Purchase Orders", "Air Shipments"] + }, + "fields": { + "description": "Specifies a field in the data model. Use the field name (e.g., the column name) as the key.", + "type": "object", + "additionalProperties": { + "type": "object", + "title": "Field", + "properties": { + "description": { + "type": "string", + "description": "An optional string describing the semantic of the data in this field." + }, + "title": { + "type": "string", + "description": "An optional string providing a human readable name for the field. Especially useful if the field name is cryptic or contains abbreviations." + }, + "type": { + "$ref": "#/$defs/FieldType" + }, + "required": { + "type": "boolean", + "default": false, + "description": "An indication, if this field must contain a value and may not be null." + }, + "fields": { + "description": "The nested fields (e.g. columns) of the object, record, or struct.", + "type": "object", + "additionalProperties": { + "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" + } + }, + "items": { + "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" + }, + "primary": { + "type": "boolean", + "default": false, + "description": "If this field is a primary key." + }, + "references": { + "type": "string", + "description": "The reference to a field in another model. E.g. use 'orders.order_id' to reference the order_id field of the model orders. 
Think of defining a foreign key relationship.", + "examples": [ + "orders.order_id", + "model.nested_field.field" + ] + }, + "unique": { + "type": "boolean", + "default": false, + "description": "Indicates whether the value must be unique within the model." + }, + "enum": { + "type": "array", + "items": { + "type": "string" + }, + "uniqueItems": true, + "description": "A value must be equal to one of the elements in this array. Only evaluated if the value is not null." + }, + "minLength": { + "type": "integer", + "description": "The length of a value must be greater than or equal to this value. Only applies to string types." + }, + "maxLength": { + "type": "integer", + "description": "The length of a value must be less than or equal to this value. Only applies to string types." + }, + "format": { + "type": "string", + "description": "A specific format the value must comply with (e.g., 'email', 'uri', 'uuid').", + "examples": [ + "email", + "uri", + "uuid" + ] + }, + "precision": { + "type": "number", + "examples": [ + 38 + ], + "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." + }, + "scale": { + "type": "number", + "examples": [ + 0 + ], + "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." + }, + "pattern": { + "type": "string", + "description": "A regular expression the value must match. Only applies to string types.", + "examples": [ + "^[a-zA-Z0-9_-]+$" + ] + }, + "minimum": { + "type": "number", + "description": "A numeric value must be greater than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values." + }, + "exclusiveMinimum": { + "type": "number", + "description": "A numeric value must be greater than this value. Only evaluated if the value is not null. Only applies to numeric values."
+ }, + "maximum": { + "type": "number", + "description": "A numeric value must be less than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values." + }, + "exclusiveMaximum": { + "type": "number", + "description": "A numeric value must be less than this value. Only evaluated if the value is not null. Only applies to numeric values." + }, + "example": { + "type": "string", + "description": "An example value for this field." + }, + "pii": { + "type": "boolean", + "description": "Indicates whether this field contains Personally Identifiable Information (PII)." + }, + "classification": { + "type": "string", + "description": "The data class defining the sensitivity level for this field, according to the organization's classification scheme.", + "examples": [ + "sensitive", + "restricted", + "internal", + "public" + ] + }, + "tags": { + "type": "array", + "items": { + "type": "string" + }, + "description": "Custom metadata to provide additional context." + }, + "$ref": { + "type": "string", + "description": "A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition." + } + } + } + } + } + } + }, + "definitions": { + "description": "Clear and concise explanations of syntax, semantics, and classification of business objects in a given domain.", + "type": "object", + "propertyNames": { + "pattern": "^[a-zA-Z0-9_-]+$" + }, + "additionalProperties": { + "type": "object", + "title": "Definition", + "properties": { + "domain": { + "type": "string", + "description": "The domain in which this definition is valid.", + "default": "global" + }, + "name": { + "type": "string", + "description": "The technical name of this definition." + }, + "title": { + "type": "string", + "description": "The business name of this definition." + }, + "description": { + "type": "string", + "description": "Clear and concise explanations related to the domain."
+ }, + "type": { + "$ref": "#/$defs/FieldType" + }, + "minLength": { + "type": "integer", + "description": "The length of a value must be greater than or equal to this value. Applies only to string types." + }, + "maxLength": { + "type": "integer", + "description": "The length of a value must be less than or equal to this value. Applies only to string types." + }, + "format": { + "type": "string", + "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." + }, + "precision": { + "type": "integer", + "examples": [ + 38 + ], + "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." + }, + "scale": { + "type": "integer", + "examples": [ + 0 + ], + "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." + }, + "pattern": { + "type": "string", + "description": "A regular expression pattern the value must match. Applies only to string types." + }, + "minimum": { + "type": "number", + "description": "A numeric value must be greater than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values." + }, + "exclusiveMinimum": { + "type": "number", + "description": "A numeric value must be greater than this value. Only evaluated if the value is not null. Only applies to numeric values." + }, + "maximum": { + "type": "number", + "description": "A numeric value must be less than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values." + }, + "exclusiveMaximum": { + "type": "number", + "description": "A numeric value must be less than this value. Only evaluated if the value is not null. Only applies to numeric values." + }, + "example": { + "type": "string", + "description": "An example value." + }, + "pii": { + "type": "boolean", + "description": "Indicates if the field contains Personally Identifiable Information (PII)."
+ }, + "classification": { + "type": "string", + "description": "The data class defining the sensitivity level for this field." + }, + "tags": { + "type": "array", + "items": { + "type": "string" + }, + "description": "Custom metadata to provide additional context." + } + }, + "required": [ + "name", + "type" + ] + } + }, + "schema": { + "type": "object", + "properties": { + "type": { + "type": "string", + "title": "SchemaType", + "enum": [ + "dbt", + "bigquery", + "json-schema", + "sql-ddl", + "avro", + "protobuf", + "custom" + ], + "description": "The type of the schema. Typical values are dbt, bigquery, json-schema, sql-ddl, avro, protobuf, custom." + }, + "specification": { + "oneOf": [ + { + "type": "string", + "description": "The specification of the schema as a string." + }, + { + "type": "object", + "description": "The specification of the schema as an object." + } + ] + } + }, + "required": [ + "type", + "specification" + ], + "description": "The schema of the data contract describes the syntax and semantics of provided data sets. It supports different schema types." + }, + "examples": { + "type": "array", + "items": { + "type": "object", + "properties": { + "type": { + "type": "string", + "title": "ExampleType", + "enum": [ + "csv", + "json", + "yaml", + "custom" + ], + "description": "The type of the example data. Well-known types are csv, json, yaml, custom." + }, + "description": { + "type": "string", + "description": "An optional string describing the example." + }, + "model": { + "type": "string", + "description": "The reference to the model in the schema, e.g., a table name." + }, + "data": { + "oneOf": [ + { + "type": "string", + "description": "Example data for this model." + }, + { + "type": "array", + "description": "Example data for this model in a structured format. Use this for type json or yaml." + } + ] + } + }, + "required": [ + "type", + "data" + ] + }, + "description": "The Examples Object is an array of Example Objects." 
+ }, + "servicelevels": { + "type": "object", + "description": "Specifies the service level agreements for the provided data, including availability, data retention policies, latency requirements, data freshness, update frequency, support availability, and backup policies.", + "properties": { + "availability": { + "type": "object", + "description": "Availability refers to the promise or guarantee by the service provider about the uptime of the system that provides the data.", + "properties": { + "description": { + "type": "string", + "description": "An optional string describing the availability service level.", + "example": "The server is available during support hours" + }, + "percentage": { + "type": "string", + "description": "An optional string describing the guaranteed uptime in percent (e.g., `99.9%`)", + "pattern": "^\\d+(\\.\\d+)?%$", + "example": "99.9%" + } + } + }, + "retention": { + "type": "object", + "description": "Retention covers the period how long data will be available.", + "properties": { + "description": { + "type": "string", + "description": "An optional string describing the retention service level.", + "example": "Data is retained for one year." + }, + "period": { + "type": "string", + "description": "An optional period of time, how long data is available. 
Supported formats: Simple duration (e.g., `1 year`, `30d`) and ISO 8601 duration (e.g, `P1Y`).", + "example": "P1Y" + }, + "unlimited": { + "type": "boolean", + "description": "An optional indicator that data is kept forever.", + "example": false + }, + "timestampField": { + "type": "string", + "description": "An optional reference to the field that contains the timestamp that the period refers to.", + "example": "orders.order_timestamp" + } + } + }, + "latency": { + "type": "object", + "description": "Latency refers to the maximum amount of time from the source to its destination.", + "properties": { + "description": { + "type": "string", + "description": "An optional string describing the latency service level.", + "example": "Data is available within 25 hours after the order was placed." + }, + "threshold": { + "type": "string", + "description": "An optional maximum duration between the source timestamp and the processed timestamp. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g, `PT24H`).", + "example": "25h" + }, + "sourceTimestampField": { + "type": "string", + "description": "An optional reference to the field that contains the timestamp when the data was provided at the source.", + "example": "orders.order_timestamp" + }, + "processedTimestampField": { + "type": "string", + "description": "An optional reference to the field that contains the processing timestamp, which denotes when the data is made available to consumers of this data contract.", + "example": "orders.processed_timestamp" + } + } + }, + "freshness": { + "type": "object", + "description": "The maximum age of the youngest row in a table.", + "properties": { + "description": { + "type": "string", + "description": "An optional string describing the freshness service level.", + "example": "The age of the youngest row in a table is within 25 hours." + }, + "threshold": { + "type": "string", + "description": "An optional maximum age of the youngest entry. 
Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g, `PT24H`).", + "example": "25h" + }, + "timestampField": { + "type": "string", + "description": "An optional reference to the field that contains the timestamp that the threshold refers to.", + "example": "orders.order_timestamp" + } + } + }, + "frequency": { + "type": "object", + "description": "Frequency describes how often data is updated.", + "properties": { + "description": { + "type": "string", + "description": "An optional string describing the frequency service level.", + "example": "Data is delivered once a day." + }, + "type": { + "type": "string", + "enum": [ + "batch", + "micro-batching", + "streaming", + "manual" + ], + "description": "The method of data processing.", + "example": "batch" + }, + "interval": { + "type": "string", + "description": "Optional. Only for batch: How often the pipeline is triggered, e.g., `daily`.", + "example": "daily" + }, + "cron": { + "type": "string", + "description": "Optional. Only for batch: A cron expression when the pipelines is triggered. E.g., `0 0 * * *`.", + "example": "0 0 * * *" + } + } + }, + "support": { + "type": "object", + "description": "Support describes the times when support will be available for contact.", + "properties": { + "description": { + "type": "string", + "description": "An optional string describing the support service level.", + "example": "The data is available during typical business hours at headquarters." + }, + "time": { + "type": "string", + "description": "An optional string describing the times when support will be available for contact such as `24/7` or `business hours only`.", + "example": "9am to 5pm in EST on business days" + }, + "responseTime": { + "type": "string", + "description": "An optional string describing the time it takes for the support team to acknowledge a request. 
This does not mean the issue will be resolved immediately, but it assures users that their request has been received and will be dealt with.", + "example": "24 hours" + } + } + }, + "backup": { + "type": "object", + "description": "Backup specifies details about data backup procedures.", + "properties": { + "description": { + "type": "string", + "description": "An optional string describing the backup service level.", + "example": "Data is backed up once a week, every Sunday at 0:00 UTC." + }, + "interval": { + "type": "string", + "description": "An optional interval that defines how often data will be backed up, e.g., `daily`.", + "example": "weekly" + }, + "cron": { + "type": "string", + "description": "An optional cron expression when data will be backed up, e.g., `0 0 * * *`.", + "example": "0 0 * * 0" + }, + "recoveryTime": { + "type": "string", + "description": "An optional Recovery Time Objective (RTO) specifies the maximum amount of time allowed to restore data from a backup after a failure or loss event (e.g., 4 hours, 24 hours).", + "example": "24 hours" + }, + "recoveryPoint": { + "type": "string", + "description": "An optional Recovery Point Objective (RPO) defines the maximum acceptable age of files that must be recovered from backup storage for normal operations to resume after a disaster or data loss event. This essentially measures how much data you can afford to lose, measured in time (e.g., 4 hours, 24 hours).", + "example": "1 week" + } + } + } + } + }, + "quality": { + "type": "object", + "properties": { + "type": { + "type": "string", + "title": "QualityType", + "enum": [ + "SodaCL", + "montecarlo", + "great-expectations", + "custom" + ], + "description": "The type of the quality check. Typical values are SodaCL, montecarlo, great-expectations, custom." + }, + "specification": { + "oneOf": [ + { + "type": "string", + "description": "The specification of the quality attributes as a string." 
+ }, + { + "type": "object", + "description": "The specification of the quality attributes as an object." + } + ] + } + }, + "required": [ + "type", + "specification" + ], + "description": "The quality object contains quality attributes and checks." + } + }, + "required": [ + "dataContractSpecification", + "id", + "info" + ], + "$defs": { + "FieldType": { + "type": "string", + "title": "FieldType", + "description": "The logical data type of the field.", + "enum": [ + "number", + "decimal", + "numeric", + "int", + "integer", + "long", + "bigint", + "float", + "double", + "string", + "text", + "varchar", + "boolean", + "timestamp", + "timestamp_tz", + "timestamp_ntz", + "date", + "array", + "object", + "record", + "struct", + "bytes", + "null" + ] + } + } +} diff --git a/versions/0.9.3/definition.schema.json b/versions/0.9.3/definition.schema.json new file mode 100644 index 0000000..1cd561d --- /dev/null +++ b/versions/0.9.3/definition.schema.json @@ -0,0 +1,81 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "type": "object", + "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", + "properties": { + "domain": { + "type": "string", + "description": "The domain in which this definition is valid.", + "default": "global" + }, + "name": { + "type": "string", + "description": "The technical name of this definition." + }, + "title": { + "type": "string", + "description": "The business name of this definition." + }, + "description": { + "type": "string", + "description": "Clear and concise explanations related to the domain." + }, + "type": { + "type": "string", + "description": "The logical data type." + }, + "minLength": { + "type": "integer", + "description": "A value must be greater than or equal to this value. Applies only to string types." + }, + "maxLength": { + "type": "integer", + "description": "A value must be less than or equal to this value. Applies only to string types." 
+ }, + "format": { + "type": "string", + "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." + }, + "precision": { + "type": "integer", + "examples": [ + 38 + ], + "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." + }, + "scale": { + "type": "integer", + "examples": [ + 0 + ], + "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." + }, + "pattern": { + "type": "string", + "description": "A regular expression pattern the value must match. Applies only to string types." + }, + "example": { + "type": "string", + "description": "An example value." + }, + "pii": { + "type": "boolean", + "description": "Indicates if the field contains Personal Identifiable Information (PII)." + }, + "classification": { + "type": "string", + "description": "The data class defining the sensitivity level for this field." + }, + "tags": { + "type": "array", + "items": { + "type": "string" + }, + "description": "Custom metadata to provide additional context." + } + }, + "required": [ + "name", + "type" + ] +} From ad4e36f9835ea5cfe0ab696a505de2cf681d45b5 Mon Sep 17 00:00:00 2001 From: jochen Date: Tue, 14 May 2024 08:37:01 +0200 Subject: [PATCH 02/31] Update quality checks --- README.md | 224 ++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 158 insertions(+), 66 deletions(-) diff --git a/README.md b/README.md index 7d97c22..d919558 100644 --- a/README.md +++ b/README.md @@ -105,7 +105,7 @@ models: pii: true classification: sensitive quality: - - type: business-rule + - type: text name: The email address was verified by the system processed_timestamp: description: The timestamp when the record was processed by the data platform. @@ -297,19 +297,19 @@ This is the root document. It is _RECOMMENDED_ that the root document be named: `datacontract.yaml`. 
-| Field | Type | Description | -|---------------------------|------------------------------------------------------|----------------------------------------------------------------------------------------------------------| -| dataContractSpecification | `string` | REQUIRED. Specifies the Data Contract Specification being used. | -| id | `string` | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number | -| info | [Info Object](#info-object) | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. | -| servers | Map[string, [Server Object](#server-object)] | Specifies the servers of the data contract. | -| terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | -| models | Map[string, [Model Object](#model-object)] | Specifies the logical data model. | -| definitions | Map[string, [Definition Object](#definition-object)] | Specifies definitions. | -| schema | [Schema Object](#schema-object) | Specifies the physical schema. The specification supports different schema format. | -| examples | Array of [Example Objects](#example-object) | Specifies example data sets for the data model. The specification supports different example types. | -| servicelevels | [Service Levels Object](#service-levels-object) | Specifies the service level of the provided data | -| quality | [Quality Object](#quality-object) | Specifies the quality attributes and checks. The specification supports different quality check DSLs. | +| Field | Type | Description | +|---------------------------|------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------| +| dataContractSpecification | `string` | REQUIRED. Specifies the Data Contract Specification being used. | +| id | `string` | REQUIRED. 
An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number | +| info | [Info Object](#info-object) | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. | +| servers | Map[string, [Server Object](#server-object)] | Specifies the servers of the data contract. | +| terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | +| models | Map[string, [Model Object](#model-object)] | Specifies the logical data model. | +| definitions | Map[string, [Definition Object](#definition-object)] | Specifies definitions. | +| schema | [Schema Object](#schema-object) | Specifies the physical schema. The specification supports different schema format. | +| examples | Array of [Example Objects](#example-object) | Specifies example data sets for the data model. The specification supports different example types. | +| servicelevels | [Service Levels Object](#service-levels-object) | Specifies the service level of the provided data | +| quality | [Quality Object](#quality-object) | Deprecated on top-level. Use model-level and field-field level quality. Specifies the quality attributes and checks. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). @@ -939,29 +939,44 @@ The quality object defined a quality attribute. Quality attributes are checks that can be applied to the data to ensure its quality. Data can be verified by executing these checks through a data quality engine. -A quality object can be specified on field level, or on model level. The top-level quality object is deprecated. +Quality attributes can be: +- Text: A human-readable text that describes the quality of the data. +- SQL: An individual SQL query that returns a single value that can be compared. 
+- Predefined Types: Some commonly-used predefined quality attributes such as `row_count`, `unique`, `freshness` +- Vendor-specific: Quality attributes that are specific to a vendor, such as Great Expectations, SodaCL or Montecarlo. -The fields of the quality object depends on the quality type. +A quality object can be specified on field level, or on model level. The top-level quality object are deprecated. -#### Plain-text +The fields of the quality object depends on the quality `type`. -A human-readable text that describe the quality of the data. These can later be translated into a technical check (such as SQL), or checked through an AI engine. +#### Text -| Field | Type | Description | -|-------------|----------|--------------------------------------------------| -| type | `string` | `plain-text` | -| name | `string` | Optional. A human-readable name for this check | -| description | `string` | A plain text describing the quality of the data. | + Applicable on: [ ] top-level, [x] model, [x] field + +A human-readable text that describe the quality of the data. Later in the development process, these might be translated into an executable check (such as `sql`), or checked through an AI engine. + +| Field | Type | Description | +|-------------|----------|--------------------------------------------------------------------| +| type | `string` | `text` | +| name | `string` | Optional. A human-readable name for this check | +| description | `string` | A plain text describing the quality attribute in natural language. | Example: ```yaml -- type: plain-text - description: The email address was verified by the system +models: + my_table: + fields: + email: + quality: + - type: text + description: The email address was verified by the system ``` #### SQL +Applicable on: [ ] top-level, [x] model, [x] field + An individual SQL query that returns a single number or boolean value that can be compared. The SQL query must be in the SQL dialect of the provided server. 
| Field | Type | Description | @@ -977,17 +992,23 @@ An individual SQL query that returns a single number or boolean value that can b | description | `string` | A plain text describing the quality of the data. | ```yaml -- type: sql - description: The maximum duration between two orders should be less that 3600 seconds - query: | - SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration - FROM orders - must_be_less_than: 3600 +models: + my_table: + quality: + - type: sql + description: The maximum duration between two orders should be less that 3600 seconds + query: | + SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration + FROM orders + must_be_less_than: 3600 ``` #### Row Count +Applicable on: [ ] top-level, [x] model, [ ] field + + Counts the number of rows in a model. | Field | Type | Description | @@ -1004,14 +1025,19 @@ Counts the number of rows in a model. ```yaml -- type: row_count - must_be_greater_than: 500000 +models: + my_table: + quality: + - type: row_count + must_be_greater_than: 500000 ``` #### Unique -A uniqueness check for multiple fields. +Applicable on: [ ] top-level, [x] model, [ ] field + +A uniqueness check for multiple fields. (For a single field uniqueness check, use the `unique` field attribute.) | Field | Type | Description | |----------------------------------|-------------------|------------------------------------------------------------------------| @@ -1020,69 +1046,135 @@ A uniqueness check for multiple fields. | name | `string` | Optional. A human-readable name for this check | | description | `string` | A plain text describing the quality of the data. 
| +Example: ```yaml -- type: unique - fields: - - country - - order_id +models: + my_table: + fields: + order_id: + type: string + country: + type: string + quality: + - type: unique + fields: + - country + - order_id ``` #### Freshness -TBD -#### SodaCL +Applicable on: [ ] top-level, [ ] model, [x] field -Quality attributes in [Soda Checks Language](https://docs.soda.io/soda-cl/soda-cl-overview.html). +At least one element in the model must have a timestamp value that is less than a certain threshold. -The `specification` represents the content of a `checks.yml` file. -Example: +| Field | Type | Description | +|---------------------------|----------|--------------------------------------------------| +| type | `string` | `freshness` | +| must_be_less_than_seconds | `number` | The threshold in seconds to compare | +| name | `string` | Optional. A human-readable name for this check | +| description | `string` | A plain text describing the quality of the data. | + + +Example: ```yaml -quality: - - type: SodaCL - specification: | - checks for orders: - - row_count > 0 - - duplicate_count(order_id) = 0 - checks for line_items: - - row_count > 0 +models: + my_table: + fields: + some_timestamp: + type: timestamp + quality: + - type: freshness + must_be_less_than_seconds: 3600 + description: At least one element in the model must have a timestamp value that is less than 1 hour ``` #### Great Expectations +Applicable on: [ ] top-level, [x] model, [x] field + +Quality attributes defined as an Great Expectations [Expectation](https://greatexpectations.io/expectations/). + + +Example: + +```yaml +models: + my_table: + quality: + - type: great-expectations + expectation_type: expect_table_row_count_to_be_between + kwargs: + min_value: 10000 + max_value: 50000 +``` + + + +#### Great Expectations (Expectation Suite) + +Applicable on: [ ] top-level, [x] model, [ ] field + Quality attributes defined as Great Expectations [Expectations](https://greatexpectations.io/expectations/). 
-The `specification` represents a expectation suite as JSON string. +The `specification` represents an expectation suite as JSON string. -New with 0.9.4: This quality type is only applicable on model level. +New with v0.9.4: This quality type is only applicable on model level. Example: ```yaml -quality: - - type: great-expectations - specification: | - [ - { - "expectation_type": "expect_table_row_count_to_be_between", - "kwargs": { - "min_value": 10 - }, - "meta": { - +models: + my_table: + quality: + - type: great-expectations + specification: | + [ + { + "expectation_type": "expect_table_row_count_to_be_between", + "kwargs": { + "min_value": 10000, + "max_value": 50000, + }, + "meta": { + + } } - } - ] + ] ``` +#### SodaCL + +Applicable on: [x] top-level, [x] model, [ ] field + +Quality attributes in [Soda Checks Language](https://docs.soda.io/soda-cl/soda-cl-overview.html). + +The `specification` represents the content of a `checks.yml` file. + +Example: + +```yaml +quality: + - type: SodaCL + specification: | + checks for orders: + - row_count > 0 + - duplicate_count(order_id) = 0 + checks for line_items: + - row_count > 0 +``` + #### Monte Carlo +Applicable on: [x] top-level, [x] model, [ ] field + Quality attributes defined as Monte Carlos [Monitors as Code](https://docs.getmontecarlo.com/docs/monitors-as-code). The `specification` represents the content of a `montecarlo.yml` file. 
From b9a976e0ef7edf14a53a8cb16f2409faaa779f0d Mon Sep 17 00:00:00 2001 From: jochen Date: Sat, 20 Jul 2024 16:58:52 +0200 Subject: [PATCH 03/31] Update quality --- CHANGELOG.md | 9 +- README.md | 516 +++++------------------ datacontract.init.yaml | 10 +- examples/orders-latest/datacontract.yaml | 28 +- versions/0.9.3/README.md | 292 ++++++++----- versions/0.9.3/datacontract.init.yaml | 207 +++++---- versions/0.9.3/datacontract.schema.json | 322 +++++++++++++- versions/0.9.3/definition.schema.json | 19 +- 8 files changed, 756 insertions(+), 647 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index f025bc9..7f22319 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,8 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] -Please note, while the major version is zero (0.y.z), Anything MAY change at any time. -The public API SHOULD NOT be considered stable. +## [1.0.1] - 2024-07-20 ### Added - Data quality attributes on model and field level @@ -25,6 +24,12 @@ The public API SHOULD NOT be considered stable. 
- Field `type: map` support with properties `keys` and `values` - Definitions: `fields`, for type `object`, `record`, and `struct` +### Removed + +- `quality` on top-level removed (now considered a specification extension) +- `schema` removed (now considered a specification extension) + + ## [0.9.3] - 2024-03-06 ### Added diff --git a/README.md b/README.md index cf22f20..2d34560 100644 --- a/README.md +++ b/README.md @@ -34,7 +34,7 @@ The specification comes along with the [Data Contract CLI](https://github.com/da Version --- -0.9.4([Changelog](CHANGELOG.md)) +1.0.1 ([Changelog](CHANGELOG.md)) Example --- @@ -42,7 +42,7 @@ Example [![Data Contract Catalog](https://img.shields.io/badge/Data%20Contract-Catalog-blue)](https://datacontract.com/examples/index.html) ```yaml -dataContractSpecification: 0.9.3 +dataContractSpecification: 1.0.1 id: urn:datacontract:checkout:orders-latest info: title: Orders Latest @@ -114,7 +114,7 @@ models: classification: sensitive quality: - type: text - name: The email address was verified by the system + name: The email address was verified by a user processed_timestamp: description: The timestamp when the record was processed by the data platform. type: timestamp @@ -123,14 +123,15 @@ models: jsonType: string jsonFormat: date-time quality: - - type: row_count - must_be_greater_than: 5 - type: sql description: The maximum duration between two orders should be less that 3600 seconds query: | SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration FROM orders must_be_less_than: 3600 + - type: row_count + engine: soda + must_be_greater_than: 5 line_items: description: A single article that is part of an order. type: table @@ -296,21 +297,19 @@ This is the root document. It is _RECOMMENDED_ that the root document be named: `datacontract.yaml`. 
-| Field | Type | Description | -|---------------------------|------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------| -| dataContractSpecification | `string` | REQUIRED. Specifies the Data Contract Specification being used. | -| id | `string` | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number | -| info | [Info Object](#info-object) | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. | +| Field | Type | Description | +|---------------------------|--------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------| +| dataContractSpecification | `string` | REQUIRED. Specifies the Data Contract Specification being used. | +| id | `string` | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number | +| info | [Info Object](#info-object) | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. | | servers | Map[`string`, [Server Object](#server-object)] | Specifies the servers of the data contract. | -| terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | +| terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | | models | Map[`string`, [Model Object](#model-object)] | Specifies the logical data model. | | definitions | Map[`string`, [Definition Object](#definition-object)] | Specifies definitions. | -| schema | [Schema Object](#schema-object) | Specifies the physical schema. The specification supports different schema format. | -| examples | Array of [Example Objects](#example-object) | Specifies example data sets for the data model. The specification supports different example types. 
| -| servicelevels | [Service Levels Object](#service-levels-object) | Specifies the service level of the provided data | -| quality | [Quality Object](#quality-object) | Deprecated on top-level. Use model-level and field-field level quality. Specifies the quality attributes and checks. | -| links | Map[`string`, `string`] | Additional external documentation links. | -| tags | Array of `string` | Custom metadata to provide additional context. | +| examples | Array of [Example Objects](#example-object) | Specifies example data sets for the data model. The specification supports different example types. | +| servicelevels | [Service Levels Object](#service-levels-object) | Specifies the service level of the provided data | +| links | Map[`string`, `string`] | Additional external documentation links. | +| tags | Array of `string` | Custom metadata to provide additional context. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). @@ -326,7 +325,7 @@ Metadata and life cycle information about the data contract. |-------------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------| | title | `string` | REQUIRED. The title of the data contract. | | version | `string` | REQUIRED. The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version). | -| status | `string` | The status of the data contract. Can be `proposed`, `in development`, `active`, `deprecated`, `retired`. | +| status | `string` | The status of the data contract. Can be `proposed`, `in development`, `active`, `deprecated`, `retired`. | | description | `string` | A description of the data contract. | | owner | `string` | The owner or team responsible for managing the data contract and providing the data. 
| | contact | [Contact Object](#contact-object) | Contact information for the data contract. | @@ -429,13 +428,13 @@ servers: #### SQL-Server Server Object -| Field | Type | Description | -|----------|-----------|------------------------------------------------------| -| type | `string` | `sqlserver` | -| host | `string` | The host to the database server | -| port | `integer` | The port to the database server, default: `1433` | -| database | `string` | The name of the database, e.g., `database`. | -| schema | `string` | The name of the schema in the database, e.g., `dbo`. | +| Field | Type | Description | +|----------|-----------|--------------------------------------------------------------------------| +| type | `string` | `sqlserver` | +| host | `string` | The host to the database server | +| port | `integer` | The port to the database server, default: `1433` | +| database | `string` | The name of the database, e.g., `database`. | +| schema | `string` | The name of the schema in the database, e.g., `dbo`. | | driver | `string` | The name of the supported driver, e.g., `ODBC Driver 18 for SQL Server`. | @@ -635,208 +634,6 @@ Models fields can refer to definitions using the `$ref` field to link to existin -### Schema Object - -The schema of the data contract describes the physical schema. -The type of the schema depends on the data platform. - -| Field | Type | Description | -|---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------| -| type | `string` | REQUIRED. The type of the schema.
Typical values are: `dbt`, `bigquery`, `json-schema`, `sql-ddl`, `avro`, `protobuf`, `custom` | -| specification | [dbt Schema Object](#dbt-schema-object) \|
[BigQuery Schema Object](#bigquery-schema-object) \|
[JSON Schema Schema Object](#json-schema-schema-object) \|
[SQL DDL Schema Object](#sql-ddl-schema-object) \|
`string` | REQUIRED. The specification of the schema. The schema specification can be encoded as a string or as inline YAML. | - - -#### dbt Schema Object - -https://docs.getdbt.com/reference/model-properties - -Example (inline YAML): - -```yaml -schema: - type: dbt - specification: - version: 2 - models: - - name: "My Table" - description: "My description" - columns: - - name: "My column" - data_type: text - description: "My description" -``` - -Example (string): - -```yaml -schema: - type: dbt - specification: |- - version: 2 - models: - - name: "My Table" - description: "My description" - columns: - - name: "My column" - data_type: text - description: "My description" -``` - -#### BigQuery Schema Object - -The schema structure is defined by the [Google BigQuery Table](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource:-table) object. You can extract such a Table object via the [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) endpoint. - -Instead of providing a single Table object, you can also provide an array of such objects. Be aware that [tables.list](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list) only returns a subset of the full Table object. You need to call every Table object via [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) to get the full Table object, including the actual schema. 
- -Learn more: [Google BigQuery REST Reference v2](https://cloud.google.com/bigquery/docs/reference/rest) - - - -Example: - -```yaml -schema: - type: bigquery - specification: |- - { - "tableReference": { - "projectId": "my-project", - "datasetId": "my_dataset", - "tableId": "my_table" - }, - "description": "This is a description", - "type": "TABLE", - "schema": { - "fields": [ - { - "name": "name", - "type": "STRING", - "mode": "NULLABLE", - "description": "This is a description" - } - ] - } - } -``` - -#### JSON Schema Schema Object - -JSON Schema can be defined as JSON or rendered as YAML, following the [OpenAPI Schema Object dialect](https://spec.openapis.org/oas/v3.1.0#properties) - -Example (inline YAML): - -```yaml -schema: - type: json-schema - specification: - orders: - description: One record per order. Includes cancelled and deleted orders. - type: object - properties: - order_id: - type: string - description: Primary key of the orders table - order_timestamp: - type: string - format: date-time - description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. - order_total: - type: integer - description: Total amount of the order in the smallest monetary unit (e.g., cents). - line_items: - type: object - properties: - lines_item_id: - type: string - description: Primary key of the lines_item_id table - order_id: - type: string - description: Foreign key to the orders table - sku: - type: string - description: The purchased article number -``` - -Example (string): - -```yaml -schema: - type: json-schema - specification: |- - { - "$schema": "http://json-schema.org/draft-07/schema#", - "type": "object", - "properties": { - "orders": { - "type": "object", - "description": "One record per order. 
Includes cancelled and deleted orders.", - "properties": { - "order_id": { - "type": "string", - "description": "Primary key of the orders table" - }, - "order_timestamp": { - "type": "string", - "format": "date-time", - "description": "The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful." - }, - "order_total": { - "type": "integer", - "description": "Total amount of the order in the smallest monetary unit (e.g., cents)." - } - }, - "required": ["order_id", "order_timestamp", "order_total"] - }, - "line_items": { - "type": "object", - "properties": { - "lines_item_id": { - "type": "string", - "description": "Primary key of the lines_item_id table" - }, - "order_id": { - "type": "string", - "description": "Foreign key to the orders table" - }, - "sku": { - "type": "string", - "description": "The purchased article number" - } - }, - "required": ["lines_item_id", "order_id", "sku"] - } - }, - "required": ["orders", "line_items"] - } -``` - -#### SQL DDL Schema Object - -Classical SQL DDLs can be used to describe the structure. - - -Example (string): - -```yaml -schema: - type: sql-ddl - specification: |- - -- One record per order. Includes cancelled and deleted orders. - CREATE TABLE orders ( - order_id TEXT PRIMARY KEY, -- Primary key of the orders table - order_timestamp TIMESTAMPTZ NOT NULL, -- The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. 
- order_total INTEGER NOT NULL -- Total amount of the order in the smallest monetary unit (e.g., cents) - ); - - -- The items that are part of an order - CREATE TABLE line_items ( - lines_item_id TEXT PRIMARY KEY, -- Primary key of the lines_item_id table - order_id TEXT REFERENCES orders(order_id), -- Foreign key to the orders table - sku TEXT NOT NULL -- The purchased article number - ); - -``` - ### Example Object | Field | Type | Description | @@ -988,18 +785,17 @@ Quality attributes are checks that can be applied to the data to ensure its qual Quality attributes can be: - Text: A human-readable text that describes the quality of the data. - SQL: An individual SQL query that returns a single value that can be compared. -- Predefined Types: Some commonly-used predefined quality attributes such as `row_count`, `unique`, `freshness` -- Vendor-specific: Quality attributes that are specific to a vendor, such as Great Expectations, SodaCL or Montecarlo. +- Engine-specific Types: Currently, the engines `soda` and `great-expectations` are supported. -A quality object can be specified on field level, or on model level. The top-level quality object are deprecated. - -The fields of the quality object depends on the quality `type`. +A quality object can be specified on field level, or on model level. +The top-level quality object is deprecated. #### Text - Applicable on: [ ] top-level, [x] model, [x] field +Applicable on: [x] model, [x] field -A human-readable text that describe the quality of the data. Later in the development process, these might be translated into an executable check (such as `sql`), or checked through an AI engine. +A human-readable text that describes the quality of the data. +Later in the development process, these might be translated into an executable check (such as `sql`), or checked through an AI engine. 
| Field | Type | Description | |-------------|----------|--------------------------------------------------------------------| @@ -1013,29 +809,34 @@ Example: models: my_table: fields: - email: + iban: quality: - type: text - description: The email address was verified by the system + description: Must be a valid IBAN. ``` + #### SQL -Applicable on: [ ] top-level, [x] model, [x] field +Applicable on: [x] model, [x] field An individual SQL query that returns a single number or boolean value that can be compared. The SQL query must be in the SQL dialect of the provided server. -| Field | Type | Description | -|----------------------------------|------------------------|------------------------------------------------------------| -| type | `string` | `sql` | -| query | `string` | A SQL query that returns a single number or boolean value. | -| must_be_equal_to | `integer` or `boolean` | The threshold to check the return value of the query | -| must_be_greater_than | `integer` | The threshold to check the return value of the query | -| must_be_greater_than_or_equal_to | `integer` | The threshold to check the return value of the query | -| must_be_less_than | `integer` | The threshold to check the return value of the query | -| must_be_less_than_or_equal_to | `integer` | The threshold to check the return value of the query | -| name | `string` | Optional. A human-readable name for this check | -| description | `string` | A plain text describing the quality of the data. | +| Field | Type | Description | +|----------------------------------|-----------------------|---------------------------------------------------------------------------------| +| type | `string` | `sql` | +| name | `string` | Optional. A human-readable name for this check | +| description | `string` | A plain text describing the quality of the data. | +| query | `string` | A SQL query that returns a single number or a boolean value. 
| +| must_be | `integer` | The threshold to check the return value of the query | +| must_not_be | `integer` | The threshold to check the return value of the query | +| must_be_greater_than | `integer` | The threshold to check the return value of the query | +| must_be_greater_than_or_equal_to | `integer` | The threshold to check the return value of the query | +| must_be_less_than | `integer` | The threshold to check the return value of the query | +| must_be_less_than_or_equal_to | `integer` | The threshold to check the return value of the query | +| must_be_between | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. | +| must_not_be_between | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. | + ```yaml models: @@ -1050,47 +851,18 @@ models: ``` -#### Row Count - -Applicable on: [ ] top-level, [x] model, [ ] field +#### Soda Data Contract Checks +Applicable on: [x] model, [x] field -Counts the number of rows in a model. -| Field | Type | Description | -|----------------------------------|-----------|------------------------------------------------------| -| type | `string` | `row_count` | -| must_be_equal_to | `number` | The threshold to check the return value of the query | -| must_not_be_equal_to | `number` | The threshold to check the return value of the query | -| must_be_greater_than | `number` | The threshold to check the return value of the query | -| must_be_greater_than_or_equal_to | `number` | The threshold to check the return value of the query | -| must_be_less_than | `number` | The threshold to check the return value of the query | -| must_be_less_than_or_equal_to | `number` | The threshold to check the return value of the query | -| name | `string` | Optional. A human-readable name for this check | -| description | `string` | A plain text describing the quality of the data. 
| +Quality attributes can be defined with the engine `soda`, following the Soda [data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html). +##### Duplicate -```yaml -models: - my_table: - quality: - - type: row_count - must_be_greater_than: 500000 -``` - - -#### Unique - -Applicable on: [ ] top-level, [x] model, [ ] field - -A uniqueness check for multiple fields. (For a single field uniqueness check, use the `unique` field attribute.) - -| Field | Type | Description | -|----------------------------------|-------------------|------------------------------------------------------------------------| -| type | `string` | `unique` | -| fields | Array of `string` | An ordered list of fields that values need to be unique in combination | -| name | `string` | Optional. A human-readable name for this check | -| description | `string` | A plain text describing the quality of the data. | +- `no_duplicate_values` (equal to the property `unique: true`, but also supports multiple fields) +- `duplicate_count` +- `duplicate_percent` Example: @@ -1100,52 +872,77 @@ models: fields: order_id: type: string + quality: + - engine: soda + type: no_duplicate_values country: + type: carrier + shipment_number: type: string quality: - - type: unique - fields: - - country - - order_id + - engine: soda + type: duplicate_percent + must_be_less_than: 1.0 + name: A shipment number is unique for one carrier + columns: + - carrier + - shipment_number ``` +Freshness +- `freshness_in_days` +- `freshness_in_hours` +- `freshness_in_minutes` -#### Freshness +Missing +- `no_missing_values` (equal to the property `required: true`) +- `missing_count` +- `missing_percent` +Row count +- `rows_exist` (default) +- `row_count` -Applicable on: [ ] top-level, [ ] model, [x] field +Example: +```yaml +models: + my_table: + quality: + - type: row_count + must_be_greater_than: 500000 +``` -At least one element in the model must have a timestamp value that is less than a certain threshold. 
+SQL aggregation +- `avg` +- `sum` -| Field | Type | Description | -|---------------------------|----------|--------------------------------------------------| -| type | `string` | `freshness` | -| must_be_less_than_seconds | `number` | The threshold in seconds to compare | -| name | `string` | Optional. A human-readable name for this check | -| description | `string` | A plain text describing the quality of the data. | +SQL metric query +- `metric_expression` +Validity +- `no_invalid_values` +- `invalid_count` +- `invalid_percent` Example: - ```yaml models: my_table: fields: - some_timestamp: - type: timestamp + warehouse_id: + type: string quality: - - type: freshness - must_be_less_than_seconds: 3600 - description: At least one element in the model must have a timestamp value that is less than 1 hour + - engine: soda + type: no_invalid_values + valid_sql_regex: '^[A-Z]{2}[0-9]{3}$' ``` - #### Great Expectations -Applicable on: [ ] top-level, [x] model, [x] field +Applicable on: [x] model, [ ] field -Quality attributes defined as an Great Expectations [Expectation](https://greatexpectations.io/expectations/). +Quality attributes defined as Great Expectations [Expectation](https://greatexpectations.io/expectations/). Example: @@ -1154,7 +951,7 @@ Example: models: my_table: quality: - - type: great-expectations + - engine: great-expectations expectation_type: expect_table_row_count_to_be_between kwargs: min_value: 10000 @@ -1162,111 +959,6 @@ models: ``` - -#### Great Expectations (Expectation Suite) - -Applicable on: [ ] top-level, [x] model, [ ] field - -Quality attributes defined as Great Expectations [Expectations](https://greatexpectations.io/expectations/). - -The `specification` represents an expectation suite as JSON string. - -New with v0.9.4: This quality type is only applicable on model level. 
- -Example: - -```yaml -models: - my_table: - quality: - - type: great-expectations - specification: | - [ - { - "expectation_type": "expect_table_row_count_to_be_between", - "kwargs": { - "min_value": 10000, - "max_value": 50000, - }, - "meta": { - - } - } - ] -``` - - -#### SodaCL - -Applicable on: [x] top-level, [x] model, [ ] field - -Quality attributes in [Soda Checks Language](https://docs.soda.io/soda-cl/soda-cl-overview.html). - -The `specification` represents the content of a `checks.yml` file. - -Example: - -```yaml -quality: - - type: SodaCL - specification: | - checks for orders: - - row_count > 0 - - duplicate_count(order_id) = 0 - checks for line_items: - - row_count > 0 -``` - -#### Monte Carlo - -Applicable on: [x] top-level, [x] model, [ ] field - -Quality attributes defined as Monte Carlos [Monitors as Code](https://docs.getmontecarlo.com/docs/monitors-as-code). - -The `specification` represents the content of a `montecarlo.yml` file. - -Example: - -```yaml -quality: - - type: montecarlo - specification: | - montecarlo: - field_health: - - table: project:dataset.table_name - timestamp_field: created - dimension_tracking: - - table: project:dataset.table_name - timestamp_field: created - field: order_status -``` - -#### Great Expectations Quality Object - -Quality attributes defined as Great Expectations [Expectations](https://greatexpectations.io/expectations/). - -The `specification` represents a list of expectations on a specific model. - -Example (string): - -```yaml -quality: - type: great-expectations - specification: - orders: |- - [ - { - "expectation_type": "expect_table_row_count_to_be_between", - "kwargs": { - "min_value": 10 - }, - "meta": { - - } - } - ] -``` - ### Config Object The config field can be used to set additional metadata that may be used by tools, e.g. to define a namespace for code generation, specify physical data types, toggle tests, etc. 
diff --git a/datacontract.init.yaml b/datacontract.init.yaml index 382ad8b..4c6ed27 100644 --- a/datacontract.init.yaml +++ b/datacontract.init.yaml @@ -1,4 +1,4 @@ -dataContractSpecification: 0.9.3 +dataContractSpecification: 1.0.1 id: my-data-contract-id info: title: My Data Contract @@ -99,11 +99,3 @@ info: # cron: 0 0 * * 0 # recoveryTime: 24 hours # recoveryPoint: 1 week - -### quality - -#quality: -# type: SodaCL -# specification: -# checks for my_model: |- -# - duplicate_count(id) = 0 diff --git a/examples/orders-latest/datacontract.yaml b/examples/orders-latest/datacontract.yaml index 11e16a5..1e27fce 100644 --- a/examples/orders-latest/datacontract.yaml +++ b/examples/orders-latest/datacontract.yaml @@ -1,4 +1,4 @@ -dataContractSpecification: 0.9.3 +dataContractSpecification: 1.0.1 id: urn:datacontract:checkout:orders-latest info: title: Orders Latest @@ -62,12 +62,15 @@ models: minLength: 10 maxLength: 20 customer_email_address: - description: The email address, as entered by the customer. The email address was not verified. + description: The email address, as entered by the customer. type: text format: email required: true pii: true classification: sensitive + quality: + - type: text + name: The email address was verified by a user processed_timestamp: description: The timestamp when the record was processed by the data platform. type: timestamp @@ -75,6 +78,16 @@ models: config: jsonType: string jsonFormat: date-time + quality: + - type: sql + description: The maximum duration between two orders should be less that 3600 seconds + query: | + SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration + FROM orders + must_be_less_than: 3600 + - type: row_count + engine: soda + must_be_greater_than: 5 line_items: description: A single article that is part of an order. 
type: table @@ -180,13 +193,4 @@ servicelevels: interval: weekly cron: 0 0 * * 0 recoveryTime: 24 hours - recoveryPoint: 1 week -quality: - type: SodaCL # data quality check format: SodaCL, montecarlo, custom - specification: # expressed as string or inline yaml or via "$ref: checks.yaml" - checks for orders: - - row_count >= 5 - - duplicate_count(order_id) = 0 - checks for line_items: - - values in (order_id) must exist in orders (order_id) - - row_count >= 5 + recoveryPoint: 1 week \ No newline at end of file diff --git a/versions/0.9.3/README.md b/versions/0.9.3/README.md index e0721f3..6463be2 100644 --- a/versions/0.9.3/README.md +++ b/versions/0.9.3/README.md @@ -1,4 +1,4 @@ -# Data Contract Specification +# Data Contract Specification Stars @@ -8,29 +8,28 @@ Data contracts bring data providers and data consumers together. -A _data contract_ is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers. -A data contract is implemented by a data product's output port or other data technologies. +A _data contract_ is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers. +Think of an API, but for data. +A data contract is implemented by a data product or other data technologies, even legacy data warehouses. Data contracts can also be used for the input port to specify the expectations of data dependencies and verify given guarantees. -The _data contract specification_ defines a YAML format to describe attributes of provided data sets. -It is data platform neutral and can be used with any data platform, such as AWS S3, Google BigQuery, Azure, Databricks, and Snowflake. -The data contract specification is an open initiative to define a common data contract format. +The _data contract specification_ defines a YAML format to describe attributes of provided data sets. 
+It is data platform neutral and can be used with any data platform, such as AWS S3, Google BigQuery, Azure, Databricks, and Snowflake. +The data contract specification is an open initiative to define a common data contract format. It follows [OpenAPI](https://www.openapis.org/) and [AsyncAPI](https://www.asyncapi.com/) conventions. -Data contracts come into play when data is exchanged between different teams or organizational units, such as in a [data mesh architecture](https://www.datamesh-architecture.com/). -First, and foremost, data contracts are a communication tool to express a common understanding of how data should be structured and interpreted. -They make semantic and quality expectations explicit. -They are often created collaboratively in [workshops](./workshop.md) together with data providers and data consumers. +Data contracts come into play when data is exchanged between different teams or organizational units, such as in a [data mesh architecture](https://www.datamesh-architecture.com/). +First, and foremost, data contracts are a communication tool to express a common understanding of how data should be structured and interpreted. +They make semantic and quality expectations explicit. +They are often created collaboratively in [workshops](./workshop.md) together with data providers and data consumers. Later in development and production, they also serve as the basis for code generation, testing, schema validations, quality checks, monitoring, access control, and computational governance policies. The specification comes along with the [Data Contract CLI](https://github.com/datacontract/datacontract-cli), an open-source tool to develop, validate, and enforce data contracts. -IntelliJ, VS Code and other common IDEs allow you to use autocompletions without additional configuration. 
- -_Note: The term "data contract" refers to a specification that is usually owned by the data provider and thus does not align with a "contract" in a legal sense as a mutual agreement between two parties. -The term "contract" may be somewhat misleading, but it is how it is used in practice. -The mutual agreement between one data provider and one data consumer is the "data usage agreement" that refers to a data contract. -Data usage agreements have a defined lifecycle, start/end date, and help the data provider to track who accesses their data and for which purposes._ +> _Note: The term "data contract" refers to a specification that is usually owned by the data provider and thus does not align with a "contract" in a legal sense as a mutual agreement between two parties. +> The term "contract" may be somewhat misleading, but it is how it is used by the industry. +> The mutual agreement between one data provider and one data consumer is the "data usage agreement" that refers to a data contract. +> Data usage agreements have a defined lifecycle, start/end date, and help the data provider to track who accesses their data and for which purposes._ Version --- @@ -53,15 +52,24 @@ info: All orders since 2020-01-01. Orders with their line items are in their current state (no history included). owner: Checkout Team + slackChannel: "#checkout" contact: name: John Doe (Data Product Owner) url: https://teams.microsoft.com/l/channel/example/checkout +tags: + - checkout + - orders + - s3 +links: + datacontractCli: https://cli.datacontract.com servers: production: type: s3 + environment: prod location: s3://datacontract-example-orders-latest/data/{model}/*.json format: json delimiter: new_line + description: "One folder per model. One file per day." terms: usage: | Data can be used for reports, analytics and machine learning use cases. @@ -108,6 +116,9 @@ models: description: The timestamp when the record was processed by the data platform. 
type: timestamp required: true + config: + jsonType: string + jsonFormat: date-time line_items: description: A single article that is part of an order. type: table @@ -135,6 +146,8 @@ definitions: example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2 pii: true classification: restricted + tags: + - orders sku: domain: inventory name: sku @@ -145,6 +158,10 @@ definitions: description: | A Stock Keeping Unit (SKU) is an internal unique identifier for an article. It is typically associated with an article's barcode, such as the EAN/GTIN. + links: + wikipedia: https://en.wikipedia.org/wiki/Stock_keeping_unit + tags: + - inventory examples: - type: csv # csv, json, yaml, custom model: orders @@ -224,7 +241,7 @@ Data Contract CLI The [Data Contract CLI](https://cli.datacontract.com) is a command line tool and Python library to lint, test, import and export data contracts. -Here is short example how to verify that your actual dataset matches the data contract: +Here is short example how to verify that your actual dataset matches the data contract: ```bash pip3 install datacontract-cli @@ -236,34 +253,16 @@ or, if you prefer Docker: docker run datacontract/cli test https://datacontract.com/examples/orders-latest/datacontract.yaml ``` -The Data Contract contains all required information to verify data: +The Data Contract contains all required information to verify data: - The _servers_ block has the connection details to the actual data set. -- The _models_ define the syntax, formats, and constraints. +- The _models_ define the syntax, formats, and constraints. - The _quality_ defined further quality checks. The Data Contract CLI chooses the appropriate engine, formulates test cases, connects to the server, and executes the tests, based on the server type. More information and configuration options on [cli.datacontract.com](https://cli.datacontract.com). -IDE Integration ---- -IntelliJ comes with a built-in YAML plugin which will show you autocompletions. 
For VS Code we recommend to install the [YAML](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) plugin. No additional configuration is required. Autocompletion is then enabled for files following these patterns: - -``` -datacontract.yaml -datacontract.yml -*-datacontract.yaml -*-datacontract.yml -*.datacontract.yaml -*.datacontract.yml -datacontract-*.yaml -datacontract-*.yml -**/datacontract/*.yml -**/datacontract/*.yaml -**/datacontracts/*.yml -**/datacontracts/*.yaml -``` Specification --- @@ -299,14 +298,16 @@ It is _RECOMMENDED_ that the root document be named: `datacontract.yaml`. | dataContractSpecification | `string` | REQUIRED. Specifies the Data Contract Specification being used. | | id | `string` | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number | | info | [Info Object](#info-object) | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. | -| servers | Map[string, [Server Object](#server-object)] | Specifies the servers of the data contract. | +| servers | Map[`string`, [Server Object](#server-object)] | Specifies the servers of the data contract. | | terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | -| models | Map[string, [Model Object](#model-object)] | Specifies the logical data model. | -| definitions | Map[string, [Definition Object](#definition-object)] | Specifies definitions. | +| models | Map[`string`, [Model Object](#model-object)] | Specifies the logical data model. | +| definitions | Map[`string`, [Definition Object](#definition-object)] | Specifies definitions. | | schema | [Schema Object](#schema-object) | Specifies the physical schema. The specification supports different schema format. | | examples | Array of [Example Objects](#example-object) | Specifies example data sets for the data model. The specification supports different example types. 
| | servicelevels | [Service Levels Object](#service-levels-object) | Specifies the service level of the provided data | | quality | [Quality Object](#quality-object) | Specifies the quality attributes and checks. The specification supports different quality check DSLs. | +| links | Map[`string`, `string`] | Additional external documentation links. | +| tags | Array of `string` | Custom metadata to provide additional context. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). @@ -322,7 +323,7 @@ Metadata and life cycle information about the data contract. |-------------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------| | title | `string` | REQUIRED. The title of the data contract. | | version | `string` | REQUIRED. The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version). | -| status | `string` | The status of the data contract. Can be proposed, in development, active, retired. | +| status | `string` | The status of the data contract. Can be `proposed`, `in development`, `active`, `deprecated`, `retired`. | | description | `string` | A description of the data contract. | | owner | `string` | The owner or team responsible for managing the data contract and providing the data. | | contact | [Contact Object](#contact-object) | Contact information for the data contract. | @@ -346,10 +347,11 @@ This object _MAY_ be extended with [Specification Extensions](#specification-ext The fields are dependent on the defined type. 
-| Field | Type | Description | -|-------------|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| type | `string` | REQUIRED. The type of the data product technology that implements the data contract. Well-known server types are: `bigquery`, `s3`, `glue`, `redshift`, `azure`, `snowflake`, `databricks`, `postgres`, `oracle`, `kafka`, `pubsub`, `sftp`, `local` | -| description | `string` | An optional string describing the server. | +| Field | Type | Description | +|-------------|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| type | `string` | REQUIRED. The type of the data product technology that implements the data contract. Well-known server types are: `bigquery`, `s3`, `glue`, `redshift`, `azure`, `sqlserver`, `snowflake`, `databricks`, `postgres`, `oracle`, `kafka`, `pubsub`, `sftp`, `kinesis`, `trino`, `local` | +| description | `string` | An optional string describing the server. | +| environment | `string` | An optional string describing the environment, e.g., prod, sit, stg. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). 
@@ -422,6 +424,18 @@ servers: | delimiter | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array` | +#### SQL-Server Server Object + +| Field | Type | Description | +|----------|-----------|------------------------------------------------------| +| type | `string` | `sqlserver` | +| host | `string` | The host to the database server | +| port | `integer` | The port to the database server, default: `1433` | +| database | `string` | The name of the database, e.g., `database`. | +| schema | `string` | The name of the schema in the database, e.g., `dbo`. | +| driver | `string` | The name of the supported driver, e.g., `ODBC Driver 18 for SQL Server`. | + + #### Snowflake Server Object | Field | Type | Description | @@ -485,6 +499,25 @@ servers: | format | `string` | Format of files, such as `parquet`, `delta`, `json`, `csv` | | delimiter | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array` | +#### AWS Kinesis Data Streams Server Object + +| Field | Type | Description | +|--------|----------|---------------------------------------------------------------------------| +| type | `string` | `kinesis` | +| stream | `string` | The name of the Kinesis data stream. | +| region | `string` | AWS region, e.g., `eu-west-1`. | +| format | `string` | The format of the records. Examples: json, avro, protobuf. | + +#### Trino Server Object + +| Field | Type | Description | +|----------|-----------|-----------------------------------------------------------| +| type | `string` | `trino` | +| host | `string` | The Trino host | +| port | `integer` | The Trino port | +| catalog | `string` | The name of the catalog, e.g., `my_catalog`. | +| schema | `string` | The name of the schema in the catalog, e.g., `my_schema`. 
| + #### Local Server Object | Field | Type | Description | @@ -517,6 +550,8 @@ The name of the data model (table name) is defined by the key that refers to thi | description | `string` | An optional string describing the data model. | | title | `string` | An optional string for the title of the data model. Especially useful if the name of the model is cryptic or contains abbreviations. | | fields | Map[`string`, [Field Object](#field-object)] | The fields (e.g. columns) of the data model. | +| config | [Config Object](#config-object) | Any additional key-value pairs that might be useful for further tooling. | + @@ -548,9 +583,14 @@ The Field Objects describes one field (column, property, nested field) of a data | pii | `boolean` | An indication, if this field contains Personal Identifiable Information (PII). | | classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. Examples may be: `sensitive`, `restricted`, `internal`, `public`. | | tags | Array of `string` | Custom metadata to provide additional context. | +| links | Map[`string`,`string`] | Additional external documentation links. | | $ref | `string` | A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition. | -| fields | Map[`string`, [Field Object](#field-object)] | The nested fields (e.g. columns) of the object, record, or struct. Use only when type is object, record, or struct. | -| items | [Field Object](#field-object) | The type of the elements in the array. Use only when type is array. | +| fields | Map[`string`, [Field Object](#field-object)] | The nested fields (e.g. columns) of the object, record, or struct. Use only when type is `object`, `record`, or `struct`. | +| items | [Field Object](#field-object) | The type of the elements in the array. Use only when type is `array`. 
| +| keys | [Field Object](#field-object) | Describes the key structure of a map. Defaults to `type: string` if a map is defined as type. Not all server types support different key types. Use only when type is `map`. | +| values | [Field Object](#field-object) | Describes the value structure of a map. Use only when type is `map`. | +| config | [Config Object](#config-object) | Any additional key-value pairs that might be useful for further tooling. | + ### Definition Object @@ -558,33 +598,39 @@ The Definition Object includes a clear and concise explanations of syntax, seman It serves as a reference for a common understanding of terminology, ensure consistent usage and to identify join-able fields. Models fields can refer to definitions using the `$ref` field to link to existing definitions and avoid duplicate documentations. -| Field | Type | Description | -|------------------|--------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| name | `string` | REQUIRED. The technical name of this definition. | -| type | [Data Type](#data-types) | REQUIRED. The logical data type | -| domain | `string` | The domain in which this definition is valid. Default: `global`. | -| title | `string` | The business name of this definition. | -| description | `string` | Clear and concise explanations related to the domain | -| enum | array of `string` | A value must be equal to one of the elements in this array value. Only evaluated if the value is not null. | -| format | `string` | `email`: A value must be complaint to [RFC 5321, section 4.1.2](https://www.rfc-editor.org/info/rfc5321).
`uri`: A value must be complaint to [RFC 3986](https://www.rfc-editor.org/info/rfc3986).
`uuid`: A value must be complaint to [RFC 4122](https://www.rfc-editor.org/info/rfc4122). Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | -| precision | `number` | The maximum number of digits in a number. Only applies to numeric values. Defaults to 38. | -| scale | `number` | The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0. | -| minLength | `number` | A value must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | -| maxLength | `number` | A value must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | -| pattern | `string` | A value must be valid according to the [ECMA-262](https://262.ecma-international.org/5.1/) regular expression dialect. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | -| minimum | `number` | A value of a number must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | -| exclusiveMinimum | `number` | A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | -| maximum | `number` | A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | -| exclusiveMaximum | `number` | A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | -| example | `string` | An example value. | -| pii | `boolean` | An indication, if this field contains Personal Identifiable Information (PII). 
| -| classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. | -| tags | Array of `string` | Custom metadata to provide additional context. | +| Field | Type | Description | +|------------------|----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| name | `string` | REQUIRED. The technical name of this definition. | +| type | [Data Type](#data-types) | REQUIRED. The logical data type | +| domain | `string` | The domain in which this definition is valid. Default: `global`. | +| title | `string` | The business name of this definition. | +| description | `string` | Clear and concise explanations related to the domain | +| enum | array of `string` | A value must be equal to one of the elements in this array value. Only evaluated if the value is not null. | +| format | `string` | `email`: A value must be complaint to [RFC 5321, section 4.1.2](https://www.rfc-editor.org/info/rfc5321).
`uri`: A value must be complaint to [RFC 3986](https://www.rfc-editor.org/info/rfc3986).
`uuid`: A value must be complaint to [RFC 4122](https://www.rfc-editor.org/info/rfc4122). Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | +| precision | `number` | The maximum number of digits in a number. Only applies to numeric values. Defaults to 38. | +| scale | `number` | The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0. | +| minLength | `number` | A value must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | +| maxLength | `number` | A value must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | +| pattern | `string` | A value must be valid according to the [ECMA-262](https://262.ecma-international.org/5.1/) regular expression dialect. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | +| minimum | `number` | A value of a number must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | +| exclusiveMinimum | `number` | A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | +| maximum | `number` | A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | +| exclusiveMaximum | `number` | A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | +| example | `string` | An example value. | +| pii | `boolean` | An indication, if this field contains Personal Identifiable Information (PII). 
| +| classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. | +| tags | Array of `string` | Custom metadata to provide additional context. | +| links | Map[`string`, `string`] | Additional external documentation links. | +| fields | Map[`string`, [Field Object](#field-object)] | The nested fields (e.g. columns) of the object, record, or struct. Use only when type is `object`, `record`, or `struct`. | +| items | [Field Object](#field-object) | The type of the elements in the array. Use only when type is `array`. | +| keys | [Field Object](#field-object) | Describes the key structure of a map. Defaults to `type: string` if a map is defined as type. Not all server types support different key types. Use only when type is `map`. | +| values | [Field Object](#field-object) | Describes the value structure of a map. Use only when type is `map`. | + ### Schema Object -The schema of the data contract describes the physical schema. +The schema of the data contract describes the physical schema. The type of the schema depends on the data platform. | Field | Type | Description | @@ -816,7 +862,7 @@ examples: ### Service Levels Object A service level is defined as an agreed-upon, measurable level of performance for provided the data. -Data Contract Specification defines well-known service levels. +Data Contract Specification defines well-known service levels. This list can be extended with custom service levels. One can either describe each service level informally using the `description` field, or make use of the predefined fields for automation support, e.g., via the [Data Contract CLI](https://cli.datacontract.com). 
@@ -825,7 +871,7 @@ One can either describe each service level informally using the `description` fi |--------------|-----------------------------------------------|-------------------------------------------------------------------------| | availability | [Availability Object](#availability-object) | The promised uptime of the system that provides the data | | retention | [Retention Object](#retention-object) | The period how long data will be available. | -| latency | [Latency Object](#latency-object) | The maximum amount of time from the from the source to its destination. | +| latency | [Latency Object](#latency-object) | The maximum amount of time from the the source to its destination. | | freshness | [Freshness Object](#freshness-object) | The maximum age of the youngest entry. | | frequency | [Frequency Object](#frequency-object) | The update frequency. | | support | [Support Object](#support-object) | The times when support is provided. | @@ -907,7 +953,7 @@ Support describes the times when support will be available for contact. | description | `string` | An optional string describing the support service level. | | time | `string` | An optional string describing the times when support will be available for contact such as `24/7` or `business hours only`. | | responseTime | `string` | An optional string describing the time it takes for the support team to acknowledge a request. This does not mean the issue will be resolved immediately, but it assures users that their request has been received and will be dealt with. | - + This object _MAY_ be extended with [Specification Extensions](#specification-extensions). @@ -930,10 +976,10 @@ Backup specifies details about data backup procedures. The quality object contains quality attributes and checks. 
-| Field | Type | Description | -|---------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------| -| type | `string` | REQUIRED. The type of the schema.
Typical values are: `SodaCL`, `montecarlo`, `great-expectations`, `custom` | -| specification | [SodaCL Quality Object](#sodacl-quality-object) \|
[Monte Carlo Schema Object](#monte-carlo-quality-object) \|
`string` | REQUIRED. The specification of the quality attributes. The quality specification can be encoded as a string or as inline YAML. | +| Field | Type | Description | +|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------| +| type | `string` | REQUIRED. The type of the schema.
Typical values are: `SodaCL`, `montecarlo`, `great-expectations`, `custom` | +| specification | [SodaCL Quality Object](#sodacl-quality-object) \|
 [Monte Carlo Quality Object](#monte-carlo-quality-object) \|
[Great Expectations Quality Object](#great-expectations-quality-object) \|
`string` | REQUIRED. The specification of the quality attributes. The quality specification can be encoded as a string or as inline YAML. | #### SodaCL Quality Object @@ -993,7 +1039,7 @@ quality: Quality attributes defined as Great Expectations [Expectations](https://greatexpectations.io/expectations/). -The `specification` represents a list of expectations on a specific model. +The `specification` represents a list of expectations on a specific model. Example (string): @@ -1015,6 +1061,47 @@ quality: ] ``` +### Config Object + +The config field can be used to set additional metadata that may be used by tools, e.g. to define a namespace for code generation, specify physical data types, toggle tests, etc. + +A config field can be added with any name. The value can be null, a primitive, an array or an object. + +For developer experience, a list of well-known field names is maintained here, as these fields are used in the Data Contract CLI: + + +| Field | Type | Description | +|-----------------|----------|----------------------------------------------------------------------------------------------------------------| +| avroNamespace | `string` | (Only on model level) The namespace to use when importing and exporting the data model from / to Apache Avro. | +| avroType | `string` | (Only on field level) Specify the field type to use when exporting the data model to Apache Avro. | +| avroLogicalType | `string` | (Only on field level) Specify the logical field type to use when exporting the data model to Apache Avro. 
 | +| bigqueryType | `string` | (Only on field level) Specify the physical column type that is used in a BigQuery table, e.g., `NUMERIC(5, 2)` | +| snowflakeType | `string` | (Only on field level) Specify the physical column type that is used in a Snowflake table, e.g., `TIMESTAMP_LTZ` | +| redshiftType | `string` | (Only on field level) Specify the physical column type that is used in a Redshift table, e.g., `SMALLINT` | +| sqlserverType | `string` | (Only on field level) Specify the physical column type that is used in a SQL Server table, e.g., `DATETIME2` | +| databricksType | `string` | (Only on field level) Specify the physical column type that is used in a Databricks table | +| glueType | `string` | (Only on field level) Specify the physical column type that is used in an AWS Glue Data Catalog table | + +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). + +Example: + +``` +models: + orders: + config: + avroNamespace: "my.namespace" + fields: + my_field_1: + description: Example for AVRO with Timestamp (millisecond precision) + type: timestamp + config: + avroType: long + avroLogicalType: timestamp-millis + snowflakeType: timestamp_tz +``` + + ### Data Types The following data types are supported for model fields and definitions: @@ -1030,6 +1117,7 @@ The following data types are supported for model fields and definitions: - Timestamp with no timezone: `timestamp_ntz` - Date with no time information: `date` - Array: `array` +- Map: `map` (may not be supported by some server types) - Sequence of 8-bit unsigned bytes: `bytes` - Complex type: `object`, `record`, `struct` - No value: `null` @@ -1038,32 +1126,40 @@ The following data types are supported for model fields and definitions: While the Data Contract Specification tries to accommodate most use cases, additional data can be added to extend the specification at certain points. -A custom fields can be added with any name. 
The value can be null, a primitive, an array or an object. +A custom field can be added with any name. The value can be null, a primitive, an array or an object. -### Design Principles - -The Data Contract Specification follows these design principles: - -- A free, open, and open-sourced standard -- Follow OpenAPI and AsyncAPI conventions so that it feels immediately familiar -- Support contract-first approaches -- Support code-first approaches -- Support tooling by being machine-readable Tooling --- -- [Data Contract CLI](https://github.com/datacontract/datacontract-cli) is a free CLI tool to help you create, develop, and maintain your data contracts. -- [Data Mesh Manager](https://www.datamesh-manager.com/) is a commercial tool to manage data products and data contracts. It supports the data contract specification and allows the user to import or export data contracts using this specification. - +- [Data Contract CLI](https://github.com/datacontract/datacontract-cli) is an open-source CLI tool to help you create, develop, and maintain your data contracts. +- [Data Contract Manager](https://www.datamesh-manager.com/) is a commercial tool to manage data contracts. It includes a data contract catalog, a Web-Editor, and a request and approval workflow to automate access to data products for a full enterprise data marketplace. +- [Data Contract GPT](https://gpt.datacontract.com) is a custom GPT that can help you write data contracts. +- [Data Contract Editor](https://editor.datacontract.com) is an open-source editor for Data Contracts, including a live html preview. 
-Other Data Contract Specifications +Code Completion --- -- [AIDA User Group's Open Data Contract Standard](https://github.com/AIDAUserGroup/open-data-contract-standard) -- [PayPal's Data Contract Template](https://github.com/paypal/data-contract-template/blob/main/docs/README.md) +The [JSON Schema](https://datacontract.com/datacontract.schema.json) of the current data contract specification is registered in [Schema Store](https://www.schemastore.org/), which brings code completion and syntax checks for all major IDEs. +IntelliJ comes with a built-in YAML plugin which will show you autocompletions. +For VS Code we recommend to install the [YAML](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) plugin. +No additional configuration is required. + +Autocompletion is then enabled for files following these patterns: + +``` +datacontract.yaml +datacontract.yml +*-datacontract.yaml +*-datacontract.yml +*.datacontract.yaml +*.datacontract.yml +datacontract-*.yaml +datacontract-*.yml +**/datacontract/*.yml +**/datacontract/*.yaml +**/datacontracts/*.yml +**/datacontracts/*.yaml +``` -Literature ---- -- [Driving Data Quality with Data Contracts](https://www.amazon.com/dp/B0C37FPH3D) by Andrew Jones Authors --- @@ -1079,4 +1175,4 @@ License [MIT License](LICENSE) - + \ No newline at end of file diff --git a/versions/0.9.3/datacontract.init.yaml b/versions/0.9.3/datacontract.init.yaml index 382ad8b..29dbe19 100644 --- a/versions/0.9.3/datacontract.init.yaml +++ b/versions/0.9.3/datacontract.init.yaml @@ -1,109 +1,98 @@ -dataContractSpecification: 0.9.3 -id: my-data-contract-id -info: - title: My Data Contract - version: 0.0.1 -# description: -# owner: -# contact: -# name: -# url: -# email: - - -### servers - -#servers: -# production: -# type: s3 -# location: s3:// -# format: parquet -# delimiter: new_line - -### terms - -#terms: -# usage: -# limitations: -# billing: -# noticePeriod: - - -### models - -# models: -# my_model: -# description: -# type: -# 
fields: -# my_field: -# type: -# description: - - -### definitions - -# definitions: -# my_field: -# domain: -# name: -# title: -# type: -# description: -# example: -# pii: -# classification: - - -### examples - -#examples: -# - type: csv -# model: my_model -# data: |- -# id,timestamp,amount -# "1001","2023-09-09T08:30:00Z",2500 -# "1002","2023-09-08T15:45:00Z",1800 - -### servicelevels - -#servicelevels: -# availability: -# description: The server is available during support hours -# percentage: 99.9% -# retention: -# description: Data is retained for one year because! -# period: P1Y -# unlimited: false -# latency: -# description: Data is available within 25 hours after the order was placed -# threshold: 25h -# sourceTimestampField: orders.order_timestamp -# processedTimestampField: orders.processed_timestamp -# freshness: -# description: The age of the youngest row in a table. -# threshold: 25h -# timestampField: orders.order_timestamp -# frequency: -# description: Data is delivered once a day -# type: batch # or streaming -# interval: daily # for batch, either or cron -# cron: 0 0 * * * # for batch, either or interval -# support: -# description: The data is available during typical business hours at headquarters -# time: 9am to 5pm in EST on business days -# responseTime: 1h -# backup: -# description: Data is backed up once a week, every Sunday at 0:00 UTC. 
-# interval: weekly -# cron: 0 0 * * 0 -# recoveryTime: 24 hours -# recoveryPoint: 1 week - -### quality - -#quality: -# type: SodaCL -# specification: -# checks for my_model: |- -# - duplicate_count(id) = 0 +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "type": "object", + "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", + "properties": { + "domain": { + "type": "string", + "description": "The domain in which this definition is valid.", + "default": "global" + }, + "name": { + "type": "string", + "description": "The technical name of this definition." + }, + "title": { + "type": "string", + "description": "The business name of this definition." + }, + "description": { + "type": "string", + "description": "Clear and concise explanations related to the domain." + }, + "type": { + "type": "string", + "description": "The logical data type." + }, + "minLength": { + "type": "integer", + "description": "A value must be greater than or equal to this value. Applies only to string types." + }, + "maxLength": { + "type": "integer", + "description": "A value must be less than or equal to this value. Applies only to string types." + }, + "format": { + "type": "string", + "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." + }, + "precision": { + "type": "integer", + "examples": [ + 38 + ], + "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." + }, + "scale": { + "type": "integer", + "examples": [ + 0 + ], + "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." + }, + "pattern": { + "type": "string", + "description": "A regular expression pattern the value must match. Applies only to string types." + }, + "example": { + "type": "string", + "description": "An example value." 
+ }, + "pii": { + "type": "boolean", + "description": "Indicates if the field contains Personal Identifiable Information (PII)." + }, + "classification": { + "type": "string", + "description": "The data class defining the sensitivity level for this field." + }, + "tags": { + "type": "array", + "items": { + "type": "string" + }, + "description": "Custom metadata to provide additional context." + }, + "links": { + "type": "object", + "description": "Links to external resources.", + "minProperties": 1, + "propertyNames": { + "pattern": "^[a-zA-Z0-9_-]+$" + }, + "additionalProperties": { + "type": "string", + "title": "Link", + "description": "A URL to an external resource.", + "format": "uri", + "examples": [ + "https://example.com" + ] + } + } + }, + "required": [ + "name", + "type" + ] +} \ No newline at end of file diff --git a/versions/0.9.3/datacontract.schema.json b/versions/0.9.3/datacontract.schema.json index 9c65c5d..a0904be 100644 --- a/versions/0.9.3/datacontract.schema.json +++ b/versions/0.9.3/datacontract.schema.json @@ -36,6 +36,7 @@ "proposed", "in development", "active", + "deprecated", "retired" ] }, @@ -78,6 +79,16 @@ }, "servers": { "type": "object", + "properties": { + "description": { + "type": "string", + "description": "An optional string describing the servers." + }, + "environment": { + "type": "string", + "description": "The environment in which the servers are running. Examples: prod, sit, stg." + } + }, "additionalProperties": { "oneOf": [ { @@ -280,6 +291,55 @@ "format" ] }, + { + "type": "object", + "title": "SqlserverServer", + "properties": { + "type": { + "type": "string", + "enum": [ + "sqlserver" + ], + "description": "The type of the data product technology that implements the data contract." 
+ }, + "host": { + "type": "string", + "description": "The host to the database server", + "examples": [ + "localhost" + ] + }, + "port": { + "type": "integer", + "description": "The port to the database server.", + "default": 1433, + "examples": [ + 1433 + ] + }, + "database": { + "type": "string", + "description": "The name of the database.", + "examples": [ + "database" + ] + }, + "schema": { + "type": "string", + "description": "The name of the schema in the database.", + "examples": [ + "dbo" + ] + } + }, + "additionalProperties": true, + "required": [ + "type", + "host", + "database", + "schema" + ] + }, { "type": "object", "title": "SnowflakeServer", @@ -340,11 +400,25 @@ "additionalProperties": true, "required": [ "type", - "host", "catalog", "schema" ] }, + { + "type": "object", + "title": "DataframeServer", + "properties": { + "type": { + "type": "string", + "const": "dataframe", + "description": "The type of the data product technology that implements the data contract." + } + }, + "additionalProperties": true, + "required": [ + "type" + ] + }, { "type": "object", "title": "GlueServer", @@ -537,6 +611,89 @@ "topic" ] }, + { + "type": "object", + "title": "KinesisDataStreamsServer", + "description": "Kinesis Data Streams Server", + "properties": { + "type": { + "type": "string", + "enum": [ + "kinesis" + ], + "description": "The type of the data product technology that implements the data contract." + }, + "stream": { + "type": "string", + "description": "The name of the Kinesis data stream." 
+ }, + "region": { + "type": "string", + "description": "AWS region.", + "examples": [ + "eu-west-1" + ] + }, + "format": { + "type": "string", + "description": "The format of the record", + "examples": [ + "json", + "avro", + "protobuf" + ] + } + }, + "additionalProperties": true, + "required": [ + "type", + "stream" + ] + }, + { + "type": "object", + "title": "TrinoServer", + "properties": { + "type": { + "type": "string", + "const": "trino", + "description": "The type of the data product technology that implements the data contract." + }, + "host": { + "type": "string", + "description": "The host to the database server", + "examples": [ + "localhost" + ] + }, + "port": { + "type": "integer", + "description": "The port to the database server." + }, + "catalog": { + "type": "string", + "description": "The name of the catalog.", + "examples": [ + "hive" + ] + }, + "schema": { + "type": "string", + "description": "The name of the schema in the database.", + "examples": [ + "my_schema" + ] + } + }, + "additionalProperties": true, + "required": [ + "type", + "host", + "port", + "catalog", + "schema" + ] + }, { "type": "object", "title": "LocalServer", @@ -664,6 +821,12 @@ "items": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, + "keys": { + "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" + }, + "values": { + "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" + }, "primary": { "type": "boolean", "default": false, @@ -725,7 +888,7 @@ "type": "string", "description": "A regular expression the value must match. Only applies to string types.", "examples": [ - "^[a-zA-Z0-9_-]+$" + "^[a-zA-Z0-9_-]+$" ] }, "minimum": { @@ -769,12 +932,99 @@ }, "description": "Custom metadata to provide additional context." 
}, + "links": { + "type": "object", + "description": "Links to external resources.", + "minProperties": 1, + "propertyNames": { + "pattern": "^[a-zA-Z0-9_-]+$" + }, + "additionalProperties": { + "type": "string", + "title": "Link", + "description": "A URL to an external resource.", + "format": "uri", + "examples": [ + "https://example.com" + ] + } + }, "$ref": { "type": "string", "description": "A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition." + }, + "config": { + "type": "object", + "description": "Additional metadata for field configuration.", + "additionalProperties": { + "type": [ + "string", + "number", + "boolean", + "object", + "array", + "null" + ] + }, + "properties": { + "avroType": { + "type": "string", + "description": "Specify the field type to use when exporting the data model to Apache Avro." + }, + "avroLogicalType": { + "type": "string", + "description": "Specify the logical field type to use when exporting the data model to Apache Avro." + }, + "bigqueryType": { + "type": "string", + "description": "Specify the physical column type that is used in a BigQuery table, e.g., `NUMERIC(5, 2)`." + }, + "snowflakeType": { + "type": "string", + "description": "Specify the physical column type that is used in a Snowflake table, e.g., `TIMESTAMP_LTZ`." + }, + "redshiftType": { + "type": "string", + "description": "Specify the physical column type that is used in a Redshift table, e.g., `SMALLINT`." + }, + "sqlserverType": { + "type": "string", + "description": "Specify the physical column type that is used in a SQL Server table, e.g., `DATETIME2`." + }, + "databricksType": { + "type": "string", + "description": "Specify the physical column type that is used in a Databricks Unity Catalog table." + }, + "glueType": { + "type": "string", + "description": "Specify the physical column type that is used in an AWS Glue Data Catalog table." 
+ } + } } } } + }, + "config": { + "type": "object", + "description": "Additional metadata for model configuration.", + "additionalProperties": { + "type": [ + "string", + "number", + "boolean", + "object", + "array", + "null" + + + ] + }, + "properties": { + "avroNamespace": { + "type": "string", + "description": "The namespace to use when importing and exporting the data model from / to Apache Avro." + } + } } } } @@ -809,6 +1059,22 @@ "type": { "$ref": "#/$defs/FieldType" }, + "fields": { + "description": "The nested fields (e.g. columns) of the object, record, or struct.", + "type": "object", + "additionalProperties": { + "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" + } + }, + "items": { + "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" + }, + "keys": { + "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" + }, + "values": { + "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" + }, "minLength": { "type": "integer", "description": "A value must be greater than or equal to this value. Applies only to string types." @@ -873,6 +1139,23 @@ "type": "string" }, "description": "Custom metadata to provide additional context." + }, + "links": { + "type": "object", + "description": "Links to external resources.", + "minProperties": 1, + "propertyNames": { + "pattern": "^[a-zA-Z0-9_-]+$" + }, + "additionalProperties": { + "type": "string", + "title": "Link", + "description": "A URL to an external resource.", + "format": "uri", + "examples": [ + "https://example.com" + ] + } } }, "required": [ @@ -1045,7 +1328,7 @@ }, "threshold": { "type": "string", - "description": "An optional maximum age of the youngest entry. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g, `PT24H`).", + "description": "An optional maximum age of the youngest entry. 
Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g., `PT24H`).", "example": "25h" }, "timestampField": { @@ -1173,6 +1456,36 @@ "specification" ], "description": "The quality object contains quality attributes and checks." + }, + "links": { + "type": "object", + "description": "Links to external resources.", + "minProperties": 1, + "propertyNames": { + "pattern": "^[a-zA-Z0-9_-]+$" + }, + "additionalProperties": { + "type": "string", + "title": "Link", + "description": "A URL to an external resource.", + "format": "uri", + "examples": [ + "https://example.com" + ] + } + }, + "tags": { + "type": "array", + "items": { + "type": "string", + "description": "Tags to facilitate searching and filtering.", + "examples": [ + "databricks", + "pii", + "sensitive" + ] + }, + "description": "Tags to facilitate searching and filtering." } }, "required": [ @@ -1204,6 +1517,7 @@ "timestamp_ntz", "date", "array", + "map", "object", "record", "struct", @@ -1212,4 +1526,4 @@ ] } } -} +} \ No newline at end of file diff --git a/versions/0.9.3/definition.schema.json b/versions/0.9.3/definition.schema.json index 1cd561d..29dbe19 100644 --- a/versions/0.9.3/definition.schema.json +++ b/versions/0.9.3/definition.schema.json @@ -72,10 +72,27 @@ "type": "string" }, "description": "Custom metadata to provide additional context." 
+ }, + "links": { + "type": "object", + "description": "Links to external resources.", + "minProperties": 1, + "propertyNames": { + "pattern": "^[a-zA-Z0-9_-]+$" + }, + "additionalProperties": { + "type": "string", + "title": "Link", + "description": "A URL to an external resource.", + "format": "uri", + "examples": [ + "https://example.com" + ] + } } }, "required": [ "name", "type" ] -} +} \ No newline at end of file From 4f5ebe217101a11c4841b2e8ebd9c6f18cd665d8 Mon Sep 17 00:00:00 2001 From: jochen Date: Sat, 20 Jul 2024 22:12:06 +0200 Subject: [PATCH 04/31] Update quality --- README.md | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 2d34560..5672d89 100644 --- a/README.md +++ b/README.md @@ -785,7 +785,7 @@ Quality attributes are checks that can be applied to the data to ensure its qual Quality attributes can be: - Text: A human-readable text that describes the quality of the data. - SQL: An individual SQL query that returns a single value that can be compared. -- Engine-specific Types: Currently engines `soda` and `great-expectations` are supported. +- Engine-specific types: Pre-defined quality checks, as defined by data quality libraries. Currently, the engines `soda` and `great-expectations` are supported. A quality object can be specified on field level, or on model level. The top-level quality object are deprecated. @@ -809,10 +809,11 @@ Example: models: my_table: fields: - iban: + account_iban: quality: - type: text - description: Must be a valid IBAN. + name: Valid IBAN + description: Must be a valid IBAN. Must not be empty. 
``` @@ -843,6 +844,7 @@ models: my_table: quality: - type: sql + name: Maximum duration between two orders description: The maximum duration between two orders should be less that 3600 seconds query: | SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration @@ -851,13 +853,16 @@ models: ``` -#### Soda Data Contract Checks +#### Engine: Soda Applicable on: [x] model, [x] field - Quality attributes can be defined with the engine `soda` as [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html). +Note: Soda Data contract check reference is experimental and may change in the future. + +Note: Currently only supported by types Postgres, Snowflake, and Spark (Databricks) + ##### Duplicate - `no_duplicate_values` (equal to the property `unique: true`, but supports also multiple fields) @@ -881,12 +886,12 @@ models: type: string quality: - engine: soda + name: A shipment number should be unique for one carrier type: duplicate_percent - must_be_less_than: 1.0 - name: A shipment number is unique for one carrier columns: - carrier - shipment_numer + must_be_less_than: 1.0 ``` Freshness @@ -938,7 +943,7 @@ models: valid_sql_regex: '^[A-Z]{2}[0-9]{3}$' ``` -#### Great Expectations +#### Engine: Great Expectations Applicable on: [x] model, [ ] field From 9cb1bcb34d74e26c90619f87a458605e240e87b9 Mon Sep 17 00:00:00 2001 From: jochen Date: Sun, 21 Jul 2024 07:57:23 +0200 Subject: [PATCH 05/31] Update quality --- README.md | 57 +++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 41 insertions(+), 16 deletions(-) diff --git a/README.md b/README.md index 5672d89..ab02201 100644 --- a/README.md +++ b/README.md @@ -790,16 +790,17 @@ Quality attributes can be: A quality object can be specified on field level, or on model level. The top-level quality object are deprecated. 
-#### Text +#### Description Text Applicable on: [x] model, [x] field -A human-readable text that describes the quality of the data. -Later in the development process, these might be translated into an executable check (such as `sql`), or checked through an AI engine. +A description in natural language that defines the expected quality of the data. +This is useful to express requirements or expectation when discussing the data contract with stakeholders. +Later in the development process, these might be translated into an executable check (such as `sql`). +It can also be used as a prompt to check the data with an AI engine. | Field | Type | Description | |-------------|----------|--------------------------------------------------------------------| -| type | `string` | `text` | | name | `string` | Optional. A human-readable name for this check | | description | `string` | A plain text describing the quality attribute in natural language. | @@ -811,8 +812,7 @@ models: fields: account_iban: quality: - - type: text - name: Valid IBAN + - name: Valid IBAN description: Must be a valid IBAN. Must not be empty. ``` @@ -825,10 +825,9 @@ An individual SQL query that returns a single number or boolean value that can b | Field | Type | Description | |----------------------------------|-----------------------|---------------------------------------------------------------------------------| -| type | `string` | `sql` | | name | `string` | Optional. A human-readable name for this check | | description | `string` | A plain text describing the quality of the data. | -| query | `string` | A SQL query that returns a single number or a boolean value. 
| +| sql | `string` | A SQL query that returns a single number to compare with the threshold | | must_be | `integer` | The threshold to check the return value of the query | | must_not_be | `integer` | The threshold to check the return value of the query | | must_be_greater_than | `integer` | The threshold to check the return value of the query | @@ -843,10 +842,9 @@ An individual SQL query that returns a single number or boolean value that can b models: my_table: quality: - - type: sql - name: Maximum duration between two orders + - name: Maximum duration between two orders description: The maximum duration between two orders should be less that 3600 seconds - query: | + sql: | SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration FROM orders must_be_less_than: 3600 @@ -863,6 +861,16 @@ Note: Soda Data contract check reference is experimental and may change in the f Note: Currently only supported by types Postgres, Snowflake, and Spark (Databricks) +| Field | Type | Description | +|-------------------------|----------|-----------------------------------------------------------------------------------------------------------------------------| +| name | `string` | Optional. A human-readable name for this check | +| description | `string` | Optional. A plain text describing the quality attribute in natural language. 
| +| engine | `string` | `soda` | +| type | `string` | A check type as defined in the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) | +| _additional properties_ | | As defined for this check type in the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) | + + + ##### Duplicate - `no_duplicate_values` (equal to the property `unique: true`, but supports also multiple fields) @@ -949,6 +957,14 @@ Applicable on: [x] model, [ ] field Quality attributes defined as Great Expectations [Expectation](https://greatexpectations.io/expectations/). +| Field | Type | Description | +|------------------|-------------------------|--------------------------------------------------------------------------------------------| +| name | `string` | Optional. A human-readable name for this check | +| description | `string` | Optional. A plain text describing the quality attribute in natural language. | +| engine | `string` | `soda` | +| expectation_type | `string` | An expectation type as listed in [Expectation](https://greatexpectations.io/expectations/) | +| kwargs | Map[`string`, `string`] | The keyword arguments for this expectation type. | +| meta | Map[`string`, `string`] | Optional. Additional meta information. 
| Example: @@ -956,11 +972,20 @@ Example: models: my_table: quality: - - engine: great-expectations - expectation_type: expect_table_row_count_to_be_between - kwargs: - min_value: 10000 - max_value: 50000 + - engine: great-expectations + expectation_type: expect_table_row_count_to_be_between + kwargs: + min_value: 10000 + max_value: 50000 + - engine: great-expectations + expectation_type: expect_column_values_to_be_between + kwargs: + column: "passenger_count" + max_value: 6 + min_value: 1 + mostly: 1.0 + strict_max: false + strict_min: false ``` From d9f1cd9449b2801857817be61c1a61a48769a030 Mon Sep 17 00:00:00 2001 From: jochen Date: Sun, 21 Jul 2024 07:57:48 +0200 Subject: [PATCH 06/31] Update quality --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index ab02201..b2dc85f 100644 --- a/README.md +++ b/README.md @@ -980,7 +980,7 @@ models: - engine: great-expectations expectation_type: expect_column_values_to_be_between kwargs: - column: "passenger_count" + column: passenger_count max_value: 6 min_value: 1 mostly: 1.0 From 4a3d001a88e5f83361b5ad01831ffa092d3cdd99 Mon Sep 17 00:00:00 2001 From: jochen Date: Sun, 21 Jul 2024 08:21:20 +0200 Subject: [PATCH 07/31] Update quality --- README.md | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index b2dc85f..5aba009 100644 --- a/README.md +++ b/README.md @@ -957,14 +957,14 @@ Applicable on: [x] model, [ ] field Quality attributes defined as Great Expectations [Expectation](https://greatexpectations.io/expectations/). -| Field | Type | Description | -|------------------|-------------------------|--------------------------------------------------------------------------------------------| -| name | `string` | Optional. A human-readable name for this check | -| description | `string` | Optional. A plain text describing the quality attribute in natural language. 
| -| engine | `string` | `soda` | -| expectation_type | `string` | An expectation type as listed in [Expectation](https://greatexpectations.io/expectations/) | -| kwargs | Map[`string`, `string`] | The keyword arguments for this expectation type. | -| meta | Map[`string`, `string`] | Optional. Additional meta information. | +| Field | Type | Description | +|------------------|----------|--------------------------------------------------------------------------------------------| +| name | `string` | Optional. A human-readable name for this check | +| description | `string` | Optional. A plain text describing the quality attribute in natural language. | +| engine | `string` | `soda` | +| expectation_type | `string` | An expectation type as listed in [Expectation](https://greatexpectations.io/expectations/) | +| kwargs | Map | The keyworded arguments for this expectation type. | +| meta | Map | Optional. Additional meta information. | Example: @@ -977,7 +977,10 @@ models: kwargs: min_value: 10000 max_value: 50000 + meta: + notes: "This expectation is crucial to avoid processing datasets that are too small or too large." - engine: great-expectations + description: "Check that passenger_count values are between 1 and 6." 
expectation_type: expect_column_values_to_be_between kwargs: column: passenger_count @@ -986,6 +989,10 @@ models: mostly: 1.0 strict_max: false strict_min: false + meta: + tags: + - business-critical + - range_check ``` From 69688b08ea22697526ac6ccdb362b7ddc6999022 Mon Sep 17 00:00:00 2001 From: jochen Date: Sun, 21 Jul 2024 10:53:22 +0200 Subject: [PATCH 08/31] Update quality --- README.md | 60 +++++++++++++++++++++++++++++++------------------------ 1 file changed, 34 insertions(+), 26 deletions(-) diff --git a/README.md b/README.md index 5aba009..d0ca7ac 100644 --- a/README.md +++ b/README.md @@ -113,8 +113,8 @@ models: pii: true classification: sensitive quality: - - type: text - name: The email address was verified by a user + - name: Verified email address + description: The email address was verified by a user with double opt-in. processed_timestamp: description: The timestamp when the record was processed by the data platform. type: timestamp @@ -123,14 +123,13 @@ models: jsonType: string jsonFormat: date-time quality: - - type: sql - description: The maximum duration between two orders should be less that 3600 seconds - query: | + - description: The maximum duration between two orders should be less that 3600 seconds + sql: | SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration FROM orders must_be_less_than: 3600 - - type: row_count - engine: soda + - engine: soda + type: row_count must_be_greater_than: 5 line_items: description: A single article that is part of an order. @@ -778,22 +777,21 @@ Backup specifies details about data backup procedures. ### Quality Object -The quality object defined a quality attribute. +The quality object defines quality attributes. -Quality attributes are checks that can be applied to the data to ensure its quality. Data can be verified by executing these checks through a data quality engine. 
+Quality attributes are checks that can be applied to the data to ensure its quality. +Data can be verified by executing these checks through a data quality engine. Quality attributes can be: -- Text: A human-readable text that describes the quality of the data. +- Text: A text in natural language that describes the quality of the data. - SQL: An individual SQL query that returns a single value that can be compared. - Engine-specific types: Pre-defined quality checks, as defined by data quality libraries. Currently, the engines `soda` and `great-expectations` are supported. -A quality object can be specified on field level, or on model level. -The top-level quality object are deprecated. +A quality object can be specified on field level and on model level. +The top-level quality object is deprecated. #### Description Text -Applicable on: [x] model, [x] field - A description in natural language that defines the expected quality of the data. This is useful to express requirements or expectation when discussing the data contract with stakeholders. Later in the development process, these might be translated into an executable check (such as `sql`). @@ -821,13 +819,16 @@ models: Applicable on: [x] model, [x] field -An individual SQL query that returns a single number or boolean value that can be compared. The SQL query must be in the SQL dialect of the provided server. +An individual SQL query that returns a single number that can be compared with a threshold. The SQL query must be in the SQL dialect of the provided server. + +> __Note:__ Establish a secure development process and use read-only connections, as the misuse of SQL queries can lead to SQL injection attacks. + | Field | Type | Description | |----------------------------------|-----------------------|---------------------------------------------------------------------------------| | name | `string` | Optional. 
A human-readable name for this check | | description | `string` | A plain text describing the quality of the data. | -| sql | `string` | A SQL query that returns a single number to compare with the threshold | +| sql | `string` | A SQL query that returns a single number to compare with the threshold. | | must_be | `integer` | The threshold to check the return value of the query | | must_not_be | `integer` | The threshold to check the return value of the query | | must_be_greater_than | `integer` | The threshold to check the return value of the query | @@ -837,6 +838,7 @@ An individual SQL query that returns a single number or boolean value that can b | must_be_between | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. | | must_not_be_between | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. | +Example: ```yaml models: @@ -850,16 +852,15 @@ models: must_be_less_than: 3600 ``` +SQL queries allow powerful checks. A SQL query should run not longer than 10 minutes. #### Engine: Soda -Applicable on: [x] model, [x] field +Soda has a number of predefined quality [checks](https://docs.soda.io/soda/data-contracts-checks.html) that can be referenced as quality attributes. -Quality attributes can be defined with the engine `soda` as [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html). +Soda checks can be applied on model and field level. -Note: Soda Data contract check reference is experimental and may change in the future. - -Note: Currently only supported by types Postgres, Snowflake, and Spark (Databricks) +> Note: Soda Data contract check reference is experimental and may change in the future. 
Currently only supported by Postgres, Snowflake, and Spark (Databricks) | Field | Type | Description | |-------------------------|----------|-----------------------------------------------------------------------------------------------------------------------------| @@ -870,6 +871,9 @@ Note: Currently only supported by types Postgres, Snowflake, and Spark (Databric | _additional properties_ | | As defined for this check type in the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) | +See the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) for all possible types and configuration values. + + ##### Duplicate @@ -886,15 +890,18 @@ models: order_id: type: string quality: - - engine: soda + - name: Order ID must be unique + description: This is a check on field level + engine: soda type: no_duplicate_values country: type: carrier shipment_numer: type: string quality: - - engine: soda - name: A shipment number should be unique for one carrier + - name: A shipment number should be unique for one carrier + description: This is a check on model level + engine: soda type: duplicate_percent columns: - carrier @@ -953,10 +960,11 @@ models: #### Engine: Great Expectations -Applicable on: [x] model, [ ] field - Quality attributes defined as Great Expectations [Expectation](https://greatexpectations.io/expectations/). +Expectations are applied on model level. + + | Field | Type | Description | |------------------|----------|--------------------------------------------------------------------------------------------| | name | `string` | Optional. 
A human-readable name for this check | From e6bb26c6b3a95fce1ca5e8fd90c51c99adf6eb8a Mon Sep 17 00:00:00 2001 From: jochen Date: Sun, 21 Jul 2024 11:08:18 +0200 Subject: [PATCH 09/31] Update quality --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index d0ca7ac..d5ae0fa 100644 --- a/README.md +++ b/README.md @@ -123,12 +123,14 @@ models: jsonType: string jsonFormat: date-time quality: - - description: The maximum duration between two orders should be less that 3600 seconds + - name: Completeness check + description: If there is a gap of orders longer than one hour, it clearly indicates a problem. sql: | SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration FROM orders must_be_less_than: 3600 - - engine: soda + - name: Number of rows + engine: soda type: row_count must_be_greater_than: 5 line_items: @@ -817,8 +819,6 @@ models: #### SQL -Applicable on: [x] model, [x] field - An individual SQL query that returns a single number that can be compared with a threshold. The SQL query must be in the SQL dialect of the provided server. > __Note:__ Establish a secure development process and use read-only connections, as the misuse of SQL queries can lead to SQL injection attacks. 
From 1e92013a3166991664c005da4d91d5bb5ca93715 Mon Sep 17 00:00:00 2001 From: jochen Date: Sun, 21 Jul 2024 11:12:54 +0200 Subject: [PATCH 10/31] Update quality --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index d5ae0fa..1afc0ae 100644 --- a/README.md +++ b/README.md @@ -894,8 +894,8 @@ models: description: This is a check on field level engine: soda type: no_duplicate_values - country: - type: carrier + carrier: + type: string shipment_numer: type: string quality: From 06945410a3aabf83e362c7c2c19a406536e78b21 Mon Sep 17 00:00:00 2001 From: jochen Date: Mon, 9 Sep 2024 12:02:06 +0200 Subject: [PATCH 11/31] Update quality --- README.md | 220 +++++++++++++++++++++++------------------------------- 1 file changed, 92 insertions(+), 128 deletions(-) diff --git a/README.md b/README.md index 32ebd96..982bbbf 100644 --- a/README.md +++ b/README.md @@ -127,7 +127,7 @@ models: description: If there is a gap of orders longer than one hour, it clearly indicates a problem. sql: | SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration - FROM orders + FROM {orders} must_be_less_than: 3600 - name: Number of rows engine: soda @@ -793,8 +793,8 @@ Quality attributes are checks that can be applied to the data to ensure its qual Data can be verified by executing these checks through a data quality engine. Quality attributes can be: -- Text: A text in natural language that describes the quality of the data. -- SQL: An individual SQL query that returns a single value that can be compared. +- A text in natural language that describes the quality of the data. +- An individual SQL query that returns a single value that can be compared. - Engine-specific types: Pre-defined quality checks, as defined by data quality libraries. Currently, the engines `soda` and `great-expectations` are supported. A quality object can be specified on field level and on model level. 
@@ -809,7 +809,7 @@ It can also be used as a prompt to check the data with an AI engine. | Field | Type | Description | |-------------|----------|--------------------------------------------------------------------| -| name | `string` | Optional. A human-readable name for this check | +| type | `string` | `text` | | description | `string` | A plain text describing the quality attribute in natural language. | Example: @@ -820,11 +820,10 @@ models: fields: account_iban: quality: - - name: Valid IBAN + - type: text description: Must be a valid IBAN. Must not be empty. ``` - #### SQL An individual SQL query that returns a single number that can be compared with a threshold. The SQL query must be in the SQL dialect of the provided server. @@ -832,37 +831,51 @@ An individual SQL query that returns a single number that can be compared with a > __Note:__ Establish a secure development process and use read-only connections, as the misuse of SQL queries can lead to SQL injection attacks. -| Field | Type | Description | -|----------------------------------|-----------------------|---------------------------------------------------------------------------------| -| name | `string` | Optional. A human-readable name for this check | -| description | `string` | A plain text describing the quality of the data. | -| sql | `string` | A SQL query that returns a single number to compare with the threshold. 
| -| must_be | `integer` | The threshold to check the return value of the query | -| must_not_be | `integer` | The threshold to check the return value of the query | -| must_be_greater_than | `integer` | The threshold to check the return value of the query | -| must_be_greater_than_or_equal_to | `integer` | The threshold to check the return value of the query | -| must_be_less_than | `integer` | The threshold to check the return value of the query | -| must_be_less_than_or_equal_to | `integer` | The threshold to check the return value of the query | -| must_be_between | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. | -| must_not_be_between | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. | +| Field | Type | Description | +|----------------------------|-----------------------|---------------------------------------------------------------------------------| +| type | `string` | `sql` | +| description | `string` | A plain text describing the quality of the data. | +| query | `string` | A SQL query that returns a single number to compare with the threshold. | +| mustBe | `integer` | The threshold to check the return value of the query | +| mustNotBe | `integer` | The threshold to check the return value of the query | +| mustBeGreaterThan | `integer` | The threshold to check the return value of the query | +| mustBeGreaterThanOrEqualTo | `integer` | The threshold to check the return value of the query | +| mustBeLessThan | `integer` | The threshold to check the return value of the query | +| mustBeLessThanOrEqualTo | `integer` | The threshold to check the return value of the query | +| mustBeBetween | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. | +| mustBeNotBetween | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. 
| + +In the query the following placeholders can be used: + +| Placeholder | Description | +|-------------|----------------------------------------------------------------------------------------| +| `{model}` | The name of the model that is checked. | +| `{table}` | Alias for `{model}`. | +| `{field}` | The name of the field that is checked (only if the quality is defined on field-level). | +| `{column}` | Alias for `{field}`. | Example: ```yaml models: - my_table: + orders: quality: - - name: Maximum duration between two orders - description: The maximum duration between two orders should be less that 3600 seconds - sql: | + - type: sql + description: The maximum duration between two orders must be less that 3600 seconds + query: | SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration - FROM orders - must_be_less_than: 3600 + FROM {model} + mustBeLessThan: 3600 ``` -SQL queries allow powerful checks. A SQL query should run not longer than 10 minutes. +SQL queries allow powerful checks for custom business logic. +A SQL query should run not longer than 10 minutes. -#### Engine: Soda +#### Custom + +You can define custom quality attributes that are specific to a data quality engine. + +#### Custom (Engine: Soda) Soda has a number of predefined quality [checks](https://docs.soda.io/soda/data-contracts-checks.html) that can be referenced as quality attributes. @@ -870,25 +883,17 @@ Soda checks can be applied on model and field level. > Note: Soda Data contract check reference is experimental and may change in the future. Currently only supported by Postgres, Snowflake, and Spark (Databricks) -| Field | Type | Description | -|-------------------------|----------|-----------------------------------------------------------------------------------------------------------------------------| -| name | `string` | Optional. A human-readable name for this check | -| description | `string` | Optional. 
A plain text describing the quality attribute in natural language. | -| engine | `string` | `soda` | -| type | `string` | A check type as defined in the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) | -| _additional properties_ | | As defined for this check type in the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) | +| Field | Type | Description | +|---------------|----------|-----------------------------------------------------------------------------------------------------------------------------| +| type | `string` | `custom` | +| description | `string` | Optional. A plain text describing the quality attribute in natural language. | +| engine | `string` | `soda` | +| specification | `object` | A check type as defined in the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) | See the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) for all possible types and configuration values. 
- -##### Duplicate - -- `no_duplicate_values` (equal to the property `unique: true`, but supports also multiple fields) -- `duplicate_count` -- `duplicate_percent` - Example: ```yaml @@ -898,89 +903,44 @@ models: order_id: type: string quality: - - name: Order ID must be unique + - type: custom description: This is a check on field level engine: soda - type: no_duplicate_values + specification: + type: no_duplicate_values carrier: type: string shipment_numer: type: string quality: - - name: A shipment number should be unique for one carrier + - type: custom description: This is a check on model level engine: soda - type: duplicate_percent - columns: - - carrier - - shipment_numer - must_be_less_than: 1.0 -``` - -Freshness -- `freshness_in_days` -- `freshness_in_hours` -- `freshness_in_minutes` - -Missing -- `no_missing_values` (equal to the property `required: true`) -- `missing_count` -- `missing_percent` - -Row count -- `rows_exist` (default) -- `row_count` - -Example: -```yaml -models: - my_table: - quality: - - type: row_count - must_be_greater_than: 500000 -``` - - -SQL aggregation -- `avg` -- `sum` - -SQL metric query -- `metric_expression` - -Validity -- `no_invalid_values` -- `invalid_count` -- `invalid_percent` - -Example: -```yaml -models: - my_table: - fields: - warehouse_id: - type: string - quality: - - engine: soda - type: no_invalid_values - valid_sql_regex: '^[A-Z]{2}[0-9]{3}$' + specification: + type: duplicate_percent + columns: + - carrier + - shipment_numer + must_be_less_than: 1.0 + - type: custom + description: This is a check on model level + engine: soda + specification: + type: row_count + must_be_greater_than: 500000 ``` -#### Engine: Great Expectations +#### Custom (Engine: Great Expectations) Quality attributes defined as Great Expectations [Expectation](https://greatexpectations.io/expectations/). Expectations are applied on model level. 
- -| Field | Type | Description | -|------------------|----------|--------------------------------------------------------------------------------------------| -| name | `string` | Optional. A human-readable name for this check | -| description | `string` | Optional. A plain text describing the quality attribute in natural language. | -| engine | `string` | `soda` | -| expectation_type | `string` | An expectation type as listed in [Expectation](https://greatexpectations.io/expectations/) | -| kwargs | Map | The keyworded arguments for this expectation type. | -| meta | Map | Optional. Additional meta information. | +| Field | Type | Description | +|---------------|----------|-----------------------------------------------------------------------------------------------------| +| description | `string` | Optional. A plain text describing the quality attribute in natural language. | +| engine | `string` | `great-expectations` | +| specification | `object` | An expectation type as listed in [Expectation](https://greatexpectations.io/expectations/) as YAML. | Example: @@ -988,27 +948,31 @@ Example: models: my_table: quality: - - engine: great-expectations - expectation_type: expect_table_row_count_to_be_between - kwargs: - min_value: 10000 - max_value: 50000 - meta: - notes: "This expectation is crucial to avoid processing datasets that are too small or too large." - - engine: great-expectations + - type: custom + engine: great-expectations + specification: + expectation_type: expect_table_row_count_to_be_between + kwargs: + min_value: 10000 + max_value: 50000 + meta: + notes: "This expectation is crucial to avoid processing datasets that are too small or too large." + - type: custom + engine: great-expectations description: "Check that passenger_count values are between 1 and 6." 
- expectation_type: expect_column_values_to_be_between - kwargs: - column: passenger_count - max_value: 6 - min_value: 1 - mostly: 1.0 - strict_max: false - strict_min: false - meta: - tags: - - business-critical - - range_check + specification: + expectation_type: expect_column_values_to_be_between + kwargs: + column: passenger_count + max_value: 6 + min_value: 1 + mostly: 1.0 + strict_max: false + strict_min: false + meta: + tags: + - business-critical + - range_check ``` From eb1a456a39a210b3df10d7ef5d6cf9394832e832 Mon Sep 17 00:00:00 2001 From: jochen Date: Mon, 9 Sep 2024 12:12:03 +0200 Subject: [PATCH 12/31] Update quality --- CHANGELOG.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 7f22319..7a74af8 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,10 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] -## [1.0.1] - 2024-07-20 +## [1.1.0] - 2024-09-09 ### Added -- Data quality attributes on model and field level +- Data quality on model and field level - Support for server-specific data types as config map ([#63](https://github.com/datacontract/datacontract-specification/issues/63)) - AWS Glue Catalog server support - sftp server support @@ -26,8 +26,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Removed -- `quality` on top-level removed (is now considered as specification extension) -- `schema` removed (is now considered as specification extension) +- `quality` on top-level removed +- `schema` removed ## [0.9.3] - 2024-03-06 From 5e7faa63500bbf546bb7a1853833c1c73ccb4d10 Mon Sep 17 00:00:00 2001 From: jochen Date: Mon, 9 Sep 2024 12:20:39 +0200 Subject: [PATCH 13/31] Remove schema --- README.md | 1 - datacontract.schema.json | 37 +------------------------------------ 2 files changed, 1 insertion(+), 37 deletions(-) diff --git a/README.md b/README.md index 982bbbf..b70e287 100644 --- a/README.md +++ 
b/README.md @@ -282,7 +282,6 @@ Specification - [Model Object](#model-object) - [Field Object](#field-object) - [Definition Object](#definition-object) -- [Schema Object](#schema-object) - [Example Object](#example-object) - [Service Level Object](#service-levels-object) - [Quality Object](#quality-object) diff --git a/datacontract.schema.json b/datacontract.schema.json index 844afc4..f78ff14 100644 --- a/datacontract.schema.json +++ b/datacontract.schema.json @@ -7,6 +7,7 @@ "type": "string", "title": "DataContractSpecificationVersion", "enum": [ + "1.1.0", "0.9.3", "0.9.2", "0.9.1", @@ -1208,42 +1209,6 @@ ] } }, - "schema": { - "type": "object", - "properties": { - "type": { - "type": "string", - "title": "SchemaType", - "enum": [ - "dbt", - "bigquery", - "json-schema", - "sql-ddl", - "avro", - "protobuf", - "custom" - ], - "description": "The type of the schema. Typical values are dbt, bigquery, json-schema, sql-ddl, avro, protobuf, custom." - }, - "specification": { - "oneOf": [ - { - "type": "string", - "description": "The specification of the schema as a string." - }, - { - "type": "object", - "description": "The specification of the schema as an object." - } - ] - } - }, - "required": [ - "type", - "specification" - ], - "description": "The schema of the data contract describes the syntax and semantics of provided data sets. It supports different schema types." 
- }, "examples": { "type": "array", "items": { From 05f1ccf8cee079e15d1825e3dafde40398c1ce58 Mon Sep 17 00:00:00 2001 From: jochen Date: Mon, 9 Sep 2024 12:21:59 +0200 Subject: [PATCH 14/31] Update changelog --- CHANGELOG.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 7a74af8..3c85d46 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,7 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [1.1.0] - 2024-09-09 ### Added -- Data quality on model and field level +- Data quality on model and field level ([#55](https://github.com/datacontract/datacontract-specification/issues/55)) - Support for server-specific data types as config map ([#63](https://github.com/datacontract/datacontract-specification/issues/63)) - AWS Glue Catalog server support - sftp server support @@ -27,7 +27,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Removed - `quality` on top-level removed -- `schema` removed +- `schema` removed in favor of encoding any physical schema configuration in the `model` using the `config` map at the field level and supporting import/export ([#21](https://github.com/datacontract/datacontract-specification/issues/21)). ## [0.9.3] - 2024-03-06 From aab791a1e66a80772ff9d8d405d5590fd8b81e6b Mon Sep 17 00:00:00 2001 From: jochen Date: Mon, 9 Sep 2024 12:36:09 +0200 Subject: [PATCH 15/31] Extract out different types of servers to defs, ensure server required fields are populated. 
Credits to pflooky --- datacontract.schema.json | 1376 +++++++++++++++++++++----------------- 1 file changed, 746 insertions(+), 630 deletions(-) diff --git a/datacontract.schema.json b/datacontract.schema.json index f78ff14..5c35fbf 100644 --- a/datacontract.schema.json +++ b/datacontract.schema.json @@ -80,705 +80,279 @@ }, "servers": { "type": "object", - "properties": { - "description": { - "type": "string", - "description": "An optional string describing the servers." - }, - "environment": { - "type": "string", - "description": "The environment in which the servers are running. Examples: prod, sit, stg." - } - }, + "description": "Information about the servers.", "additionalProperties": { - "oneOf": [ + "$ref": "#/$defs/BaseServer", + "allOf": [ { - "type": "object", - "title": "BigQueryServer", - "properties": { - "type": { - "type": "string", - "enum": [ - "bigquery", - "BigQuery" - ], - "description": "The type of the data product technology that implements the data contract." - }, - "project": { - "type": "string", - "description": "An optional string describing the server." - }, - "dataset": { - "type": "string", - "description": "An optional string describing the server." + "if": { + "properties": { + "type": { + "const": "bigquery" + } } }, - "additionalProperties": true, - "required": [ - "type", - "project", - "dataset" - ] + "then": { + "$ref": "#/$defs/BigQueryServer" + } }, { - "type": "object", - "title": "S3Server", - "properties": { - "type": { - "type": "string", - "enum": [ - "s3" - ], - "description": "The type of the data product technology that implements the data contract." - }, - "location": { - "type": "string", - "format": "uri", - "description": "An optional string describing the server. 
Must be in the form of a URL.", - "examples": [ - "s3://datacontract-example-orders-latest/data/{model}/*.json" - ] - }, - "endpointUrl": { - "type": "string", - "format": "uri", - "description": "The server endpoint for S3-compatible servers.", - "examples": ["https://minio.example.com"] - }, - "format": { - "type": "string", - "enum": [ - "parquet", - "delta", - "json", - "csv" - ], - "description": "File format." + "if": { + "properties": { + "type": { + "const": "postgres" + } }, - "delimiter": { - "type": "string", - "enum": [ - "new_line", - "array" - ], - "description": "Only for format = json. How multiple json documents are delimited within one file" - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "location" - ] + "then": { + "$ref": "#/$defs/PostgresServer" + } }, { - "type": "object", - "title": "GcsServer", - "properties": { - "type": { - "type": "string", - "enum": [ - "gcs" - ], - "description": "The type of the data product technology that implements the data contract." - }, - "location": { - "type": "string", - "format": "uri", - "description": "The GS/GCS url to the data.", - "examples": [ - "gs://example-storage/data/*/*.json" - ] - }, - "format": { - "type": "string", - "enum": [ - "parquet", - "delta", - "json", - "csv" - ], - "description": "File format." + "if": { + "properties": { + "type": { + "const": "s3" + } }, - "delimiter": { - "type": "string", - "enum": [ - "new_line", - "array" - ], - "description": "Only for format = json. How multiple json documents are delimited within one file" - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "location" - ] + "then": { + "$ref": "#/$defs/S3Server" + } }, { - "type": "object", - "title": "SftpServer", - "properties": { - "type": { - "type": "string", - "enum": [ - "sftp" - ], - "description": "The type of the data product technology that implements the data contract." 
- }, - "location": { - "type": "string", - "format": "uri", - "description": "An optional string describing the server. Must be in the form of a sftp URL.", - "examples": [ - "sftp://123.123.12.123/{model}/*.json" - ] - }, - "format": { - "type": "string", - "enum": [ - "parquet", - "delta", - "json", - "csv" - ], - "description": "File format." + "if": { + "properties": { + "type": { + "const": "sftp" + } }, - "delimiter": { - "type": "string", - "enum": [ - "new_line", - "array" - ], - "description": "Only for format = json. How multiple json documents are delimited within one file" - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "location" - ] + "then": { + "$ref": "#/$defs/SftpServer" + } }, { - "type": "object", - "title": "RedshiftServer", - "properties": { - "type": { - "type": "string", - "enum": [ - "redshift" - ], - "description": "The type of the data product technology that implements the data contract." - }, - "account": { - "type": "string", - "description": "An optional string describing the server." - }, - "database": { - "type": "string", - "description": "An optional string describing the server." + "if": { + "properties": { + "type": { + "const": "redshift" + } }, - "schema": { - "type": "string", - "description": "An optional string describing the server." - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "account", - "database", - "schema" - ] + "then": { + "$ref": "#/$defs/RedshiftServer" + } }, { - "type": "object", - "title": "AzureServer", - "properties": { - "type": { - "type": "string", - "enum": [ - "azure" - ], - "description": "The type of the data product technology that implements the data contract." 
- }, - "location": { - "type": "string", - "format": "uri", - "description": "Fully qualified path to Azure Blob Storage or Azure Data Lake Storage (ADLS), supports globs.", - "examples": [ - "az://my_storage_account_name.blob.core.windows.net/my_container/path/*.parquet", - "abfss://my_storage_account_name.dfs.core.windows.net/my_container_name/path/*.parquet" - ] - }, - "format": { - "type": "string", - "enum": [ - "parquet", - "delta", - "json", - "csv" - ], - "description": "File format." + "if": { + "properties": { + "type": { + "const": "azure" + } }, - "delimiter": { - "type": "string", - "enum": [ - "new_line", - "array" - ], - "description": "Only for format = json. How multiple json documents are delimited within one file" - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "location", - "format" - ] + "then": { + "$ref": "#/$defs/AzureServer" + } }, { - "type": "object", - "title": "SqlserverServer", - "properties": { - "type": { - "type": "string", - "enum": [ - "sqlserver" - ], - "description": "The type of the data product technology that implements the data contract." 
- }, - "host": { - "type": "string", - "description": "The host to the database server", - "examples": [ - "localhost" - ] - }, - "port": { - "type": "integer", - "description": "The port to the database server.", - "default": 1433, - "examples": [ - 1433 - ] - }, - "database": { - "type": "string", - "description": "The name of the database.", - "examples": [ - "database" - ] + "if": { + "properties": { + "type": { + "const": "sqlserver" + } }, - "schema": { - "type": "string", - "description": "The name of the schema in the database.", - "examples": [ - "dbo" - ] - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "host", - "database", - "schema" - ] + "then": { + "$ref": "#/$defs/SqlserverServer" + } }, { - "type": "object", - "title": "SnowflakeServer", - "properties": { - "type": { - "type": "string", - "enum": [ - "snowflake" - ], - "description": "The type of the data product technology that implements the data contract." - }, - "account": { - "type": "string", - "description": "An optional string describing the server." - }, - "database": { - "type": "string", - "description": "An optional string describing the server." + "if": { + "properties": { + "type": { + "const": "snowflake" + } }, - "schema": { - "type": "string", - "description": "An optional string describing the server." - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "account", - "database", - "schema" - ] + "then": { + "$ref": "#/$defs/SnowflakeServer" + } }, { - "type": "object", - "title": "DatabricksServer", - "properties": { - "type": { - "type": "string", - "const": "databricks", - "description": "The type of the data product technology that implements the data contract." 
- }, - "host": { - "type": "string", - "description": "The Databricks host", - "examples": [ - "dbc-abcdefgh-1234.cloud.databricks.com" - ] - }, - "catalog": { - "type": "string", - "description": "The name of the Hive or Unity catalog" + "if": { + "properties": { + "type": { + "const": "databricks" + } }, - "schema": { - "type": "string", - "description": "The schema name in the catalog" - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "catalog", - "schema" - ] + "then": { + "$ref": "#/$defs/DatabricksServer" + } }, { - "type": "object", - "title": "DataframeServer", - "properties": { - "type": { - "type": "string", - "const": "dataframe", - "description": "The type of the data product technology that implements the data contract." - } + "if": { + "properties": { + "type": { + "const": "dataframe" + } + }, + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type" - ] + "then": { + "$ref": "#/$defs/DataframeServer" + } }, { - "type": "object", - "title": "GlueServer", - "properties": { - "type": { - "type": "string", - "const": "glue", - "description": "The type of the data product technology that implements the data contract." - }, - "account": { - "type": "string", - "description": "The AWS Glue account", - "examples": [ - "1234-5678-9012" - ] - }, - "database": { - "type": "string", - "description": "The AWS Glue database name", - "examples": [ - "my_database" - ] - }, - "location": { - "type": "string", - "format": "uri", - "description": "The AWS S3 path. 
Must be in the form of a URL.", - "examples": [ - "s3://datacontract-example-orders-latest/data/{model}" - ] + "if": { + "properties": { + "type": { + "const": "glue" + } }, - "format": { - "type": "string", - "description": "The format of the files", - "examples": [ - "parquet", - "csv", - "json", - "delta" - ] - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "account", - "database" - ] + "then": { + "$ref": "#/$defs/GlueServer" + } }, { - "type": "object", - "title": "PostgresServer", - "properties": { - "type": { - "type": "string", - "const": "postgres", - "description": "The type of the data product technology that implements the data contract." - }, - "host": { - "type": "string", - "description": "The host to the database server", - "examples": [ - "localhost" - ] - }, - "port": { - "type": "integer", - "description": "The port to the database server." - }, - "database": { - "type": "string", - "description": "The name of the database.", - "examples": [ - "postgres" - ] + "if": { + "properties": { + "type": { + "const": "postgres" + } }, - "schema": { - "type": "string", - "description": "The name of the schema in the database.", - "examples": [ - "public" - ] - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "host", - "port", - "database", - "schema" - ] + "then": { + "$ref": "#/$defs/PostgresServer" + } }, { - "type": "object", - "title": "OracleServer", - "properties": { - "type": { - "type": "string", - "const": "oracle", - "description": "The type of the data product technology that implements the data contract." 
- }, - "host": { - "type": "string", - "description": "The host to the oracle server", - "examples": [ - "localhost" - ] - }, - "port": { - "type": "integer", - "description": "The port to the oracle server.", - "examples": [ - 1523 - ] + "if": { + "properties": { + "type": { + "const": "oracle" + } }, - "serviceName": { - "type": "string", - "description": "The name of the service.", - "examples": [ - "service" - ] - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "host", - "port", - "serviceName" - ] + "then": { + "$ref": "#/$defs/OracleServer" + } }, { - "type": "object", - "title": "KafkaServer", - "description": "Kafka Server", - "properties": { - "type": { - "type": "string", - "enum": [ - "kafka" - ], - "description": "The type of the data product technology that implements the data contract." - }, - "host": { - "type": "string", - "description": "The bootstrap server of the kafka cluster." - }, - "topic": { - "type": "string", - "description": "The topic name." + "if": { + "properties": { + "type": { + "const": "kafka" + } }, - "format": { - "type": "string", - "description": "The format of the message. Examples: json, avro, protobuf. Default: json.", - "default": "json" - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "host", - "topic" - ] + "then": { + "$ref": "#/$defs/KafkaServer" + } }, { - "type": "object", - "title": "PubSubServer", - "properties": { - "type": { - "type": "string", - "enum": [ - "pubsub" - ], - "description": "The type of the data product technology that implements the data contract." - }, - "project": { - "type": "string", - "description": "The GCP project name." + "if": { + "properties": { + "type": { + "const": "pubsub" + } }, - "topic": { - "type": "string", - "description": "The topic name." 
- } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "project", - "topic" - ] + "then": { + "$ref": "#/$defs/PubSubServer" + } }, { - "type": "object", - "title": "KinesisDataStreamsServer", - "description": "Kinesis Data Streams Server", - "properties": { - "type": { - "type": "string", - "enum": [ - "kinesis" - ], - "description": "The type of the data product technology that implements the data contract." - }, - "stream": { - "type": "string", - "description": "The name of the Kinesis data stream." - }, - "region": { - "type": "string", - "description": "AWS region.", - "examples": [ - "eu-west-1" - ] + "if": { + "properties": { + "type": { + "const": "kinesis" + } }, - "format": { - "type": "string", - "description": "The format of the record", - "examples": [ - "json", - "avro", - "protobuf" - ] - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "stream" - ] + "then": { + "$ref": "#/$defs/KinesisDataStreamsServer" + } }, { - "type": "object", - "title": "TrinoServer", - "properties": { - "type": { - "type": "string", - "const": "trino", - "description": "The type of the data product technology that implements the data contract." - }, - "host": { - "type": "string", - "description": "The host to the database server", - "examples": [ - "localhost" - ] - }, - "port": { - "type": "integer", - "description": "The port to the database server." 
- }, - "catalog": { - "type": "string", - "description": "The name of the catalog.", - "examples": [ - "hive" - ] + "if": { + "properties": { + "type": { + "const": "trino" + } }, - "schema": { - "type": "string", - "description": "The name of the schema in the database.", - "examples": [ - "my_schema" - ] - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "host", - "port", - "catalog", - "schema" - ] + "then": { + "$ref": "#/$defs/TrinoServer" + } }, { - "type": "object", - "title": "LocalServer", - "properties": { - "type": { - "type": "string", - "enum": [ - "local" - ], - "description": "The type of the data product technology that implements the data contract." - }, - "path": { - "type": "string", - "description": "The relative or absolute path to the data file(s).", - "examples": [ - "./folder/data.parquet", - "./folder/*.parquet" - ] + "if": { + "properties": { + "type": { + "const": "local" + } }, - "format": { - "type": "string", - "description": "The format of the file(s)", - "examples": [ - "json", - "parquet", - "delta", - "csv" - ] - } + "required": [ + "type" + ] }, - "additionalProperties": true, - "required": [ - "type", - "path", - "format" - ] + "then": { + "$ref": "#/$defs/LocalServer" + } } ] - }, - "description": "Information about the servers." + } }, "terms": { "type": "object", @@ -831,7 +405,10 @@ "title": { "type": "string", "description": "An optional string providing a human readable name for the model. Especially useful if the model name is cryptic or contains abbreviations.", - "examples": ["Purchase Orders", "Air Shipments"] + "examples": [ + "Purchase Orders", + "Air Shipments" + ] }, "fields": { "description": "Specifies a field in the data model. Use the field name (e.g., the column name) as the key.", @@ -933,7 +510,7 @@ "type": "string", "description": "A regular expression the value must match. 
Only applies to string types.", "examples": [ - "^[a-zA-Z0-9_-]+$" + "^[a-zA-Z0-9_-]+$" ] }, "minimum": { @@ -1060,8 +637,6 @@ "object", "array", "null" - - ] }, "properties": { @@ -1533,6 +1108,547 @@ "bytes", "null" ] + }, + "BaseServer": { + "type": "object", + "properties": { + "description": { + "type": "string", + "description": "An optional string describing the servers." + }, + "environment": { + "type": "string", + "description": "The environment in which the servers are running. Examples: prod, sit, stg." + }, + "type": { + "type": "string", + "description": "The type of the data product technology that implements the data contract.", + "enum": [ + "bigquery", + "BigQuery", + "s3", + "sftp", + "redshift", + "azure", + "sqlserver", + "snowflake", + "databricks", + "dataframe", + "glue", + "postgres", + "oracle", + "kafka", + "pubsub", + "kinesis", + "trino", + "local" + ] + } + }, + "additionalProperties": true, + "required": [ + "type" + ] + }, + "BigQueryServer": { + "type": "object", + "title": "BigQueryServer", + "properties": { + "project": { + "type": "string", + "description": "The GCP project name." + }, + "dataset": { + "type": "string", + "description": "The GCP dataset name." + } + }, + "required": [ + "project", + "dataset" + ] + }, + "S3Server": { + "type": "object", + "title": "S3Server", + "properties": { + "location": { + "type": "string", + "format": "uri", + "description": "S3 URL, starting with `s3://`", + "examples": [ + "s3://datacontract-example-orders-latest/data/{model}/*.json" + ] + }, + "endpointUrl": { + "type": "string", + "format": "uri", + "description": "The server endpoint for S3-compatible servers.", + "examples": [ + "https://minio.example.com" + ] + }, + "format": { + "type": "string", + "enum": [ + "parquet", + "delta", + "json", + "csv" + ], + "description": "File format." + }, + "delimiter": { + "type": "string", + "enum": [ + "new_line", + "array" + ], + "description": "Only for format = json. 
How multiple json documents are delimited within one file" + } + }, + "required": [ + "location" + ] + }, + "SftpServer": { + "type": "object", + "title": "SftpServer", + "properties": { + "location": { + "type": "string", + "format": "uri", + "pattern": "^sftp://.*", + "description": "SFTP URL, starting with `sftp://`", + "examples": [ + "sftp://123.123.12.123/{model}/*.json" + ] + }, + "format": { + "type": "string", + "enum": [ + "parquet", + "delta", + "json", + "csv" + ], + "description": "File format." + }, + "delimiter": { + "type": "string", + "enum": [ + "new_line", + "array" + ], + "description": "Only for format = json. How multiple json documents are delimited within one file" + } + }, + "required": [ + "location" + ] + }, + "RedshiftServer": { + "type": "object", + "title": "RedshiftServer", + "properties": { + "account": { + "type": "string", + "description": "An optional string describing the server." + }, + "database": { + "type": "string", + "description": "An optional string describing the server." + }, + "schema": { + "type": "string", + "description": "An optional string describing the server." + } + }, + "required": [ + "account", + "database", + "schema" + ] + }, + "AzureServer": { + "type": "object", + "title": "AzureServer", + "properties": { + "location": { + "type": "string", + "format": "uri", + "description": "Fully qualified path to Azure Blob Storage or Azure Data Lake Storage (ADLS), supports globs.", + "examples": [ + "az://my_storage_account_name.blob.core.windows.net/my_container/path/*.parquet", + "abfss://my_storage_account_name.dfs.core.windows.net/my_container_name/path/*.parquet" + ] + }, + "format": { + "type": "string", + "enum": [ + "parquet", + "delta", + "json", + "csv" + ], + "description": "File format." + }, + "delimiter": { + "type": "string", + "enum": [ + "new_line", + "array" + ], + "description": "Only for format = json. 
How multiple json documents are delimited within one file" + } + }, + "required": [ + "location", + "format" + ] + }, + "SqlserverServer": { + "type": "object", + "title": "SqlserverServer", + "properties": { + "host": { + "type": "string", + "description": "The host to the database server", + "examples": [ + "localhost" + ] + }, + "port": { + "type": "integer", + "description": "The port to the database server.", + "default": 1433, + "examples": [ + 1433 + ] + }, + "database": { + "type": "string", + "description": "The name of the database.", + "examples": [ + "database" + ] + }, + "schema": { + "type": "string", + "description": "The name of the schema in the database.", + "examples": [ + "dbo" + ] + } + }, + "required": [ + "host", + "database", + "schema" + ] + }, + "SnowflakeServer": { + "type": "object", + "title": "SnowflakeServer", + "properties": { + "account": { + "type": "string", + "description": "An optional string describing the server." + }, + "database": { + "type": "string", + "description": "An optional string describing the server." + }, + "schema": { + "type": "string", + "description": "An optional string describing the server." 
+ } + }, + "required": [ + "account", + "database", + "schema" + ] + }, + "DatabricksServer": { + "type": "object", + "title": "DatabricksServer", + "properties": { + "host": { + "type": "string", + "description": "The Databricks host", + "examples": [ + "dbc-abcdefgh-1234.cloud.databricks.com" + ] + }, + "catalog": { + "type": "string", + "description": "The name of the Hive or Unity catalog" + }, + "schema": { + "type": "string", + "description": "The schema name in the catalog" + } + }, + "required": [ + "catalog", + "schema" + ] + }, + "DataframeServer": { + "type": "object", + "title": "DataframeServer", + "required": [ + "type" + ] + }, + "GlueServer": { + "type": "object", + "title": "GlueServer", + "properties": { + "account": { + "type": "string", + "description": "The AWS Glue account", + "examples": [ + "1234-5678-9012" + ] + }, + "database": { + "type": "string", + "description": "The AWS Glue database name", + "examples": [ + "my_database" + ] + }, + "location": { + "type": "string", + "format": "uri", + "description": "The AWS S3 path. Must be in the form of a URL.", + "examples": [ + "s3://datacontract-example-orders-latest/data/{model}" + ] + }, + "format": { + "type": "string", + "description": "The format of the files", + "examples": [ + "parquet", + "csv", + "json", + "delta" + ] + } + }, + "required": [ + "account", + "database" + ] + }, + "PostgresServer": { + "type": "object", + "title": "PostgresServer", + "properties": { + "host": { + "type": "string", + "description": "The host to the database server", + "examples": [ + "localhost" + ] + }, + "port": { + "type": "integer", + "description": "The port to the database server." 
+ }, + "database": { + "type": "string", + "description": "The name of the database.", + "examples": [ + "postgres" + ] + }, + "schema": { + "type": "string", + "description": "The name of the schema in the database.", + "examples": [ + "public" + ] + } + }, + "required": [ + "host", + "port", + "database", + "schema" + ] + }, + "OracleServer": { + "type": "object", + "title": "OracleServer", + "properties": { + "host": { + "type": "string", + "description": "The host to the oracle server", + "examples": [ + "localhost" + ] + }, + "port": { + "type": "integer", + "description": "The port to the oracle server.", + "examples": [ + 1523 + ] + }, + "serviceName": { + "type": "string", + "description": "The name of the service.", + "examples": [ + "service" + ] + } + }, + "required": [ + "host", + "port", + "serviceName" + ] + }, + "KafkaServer": { + "type": "object", + "title": "KafkaServer", + "description": "Kafka Server", + "properties": { + "host": { + "type": "string", + "description": "The bootstrap server of the kafka cluster." + }, + "topic": { + "type": "string", + "description": "The topic name." + }, + "format": { + "type": "string", + "description": "The format of the message. Examples: json, avro, protobuf.", + "default": "json" + } + }, + "required": [ + "host", + "topic" + ] + }, + "PubSubServer": { + "type": "object", + "title": "PubSubServer", + "properties": { + "project": { + "type": "string", + "description": "The GCP project name." + }, + "topic": { + "type": "string", + "description": "The topic name." + } + }, + "required": [ + "project", + "topic" + ] + }, + "KinesisDataStreamsServer": { + "type": "object", + "title": "KinesisDataStreamsServer", + "description": "Kinesis Data Streams Server", + "properties": { + "stream": { + "type": "string", + "description": "The name of the Kinesis data stream." 
+ }, + "region": { + "type": "string", + "description": "AWS region.", + "examples": [ + "eu-west-1" + ] + }, + "format": { + "type": "string", + "description": "The format of the record", + "examples": [ + "json", + "avro", + "protobuf" + ] + } + }, + "required": [ + "stream" + ] + }, + "TrinoServer": { + "type": "object", + "title": "TrinoServer", + "properties": { + "host": { + "type": "string", + "description": "The Trino host URL.", + "examples": [ + "localhost" + ] + }, + "port": { + "type": "integer", + "description": "The Trino port." + }, + "catalog": { + "type": "string", + "description": "The name of the catalog.", + "examples": [ + "hive" + ] + }, + "schema": { + "type": "string", + "description": "The name of the schema in the database.", + "examples": [ + "my_schema" + ] + } + }, + "required": [ + "host", + "port", + "catalog", + "schema" + ] + }, + "LocalServer": { + "type": "object", + "title": "LocalServer", + "properties": { + "path": { + "type": "string", + "description": "The relative or absolute path to the data file(s).", + "examples": [ + "./folder/data.parquet", + "./folder/*.parquet" + ] + }, + "format": { + "type": "string", + "description": "The format of the file(s)", + "examples": [ + "json", + "parquet", + "delta", + "csv" + ] + } + }, + "required": [ + "path", + "format" + ] } } } From 60d314600e81fcecf0c3e02ee9243c512ff0dc76 Mon Sep 17 00:00:00 2001 From: jochen Date: Mon, 9 Sep 2024 12:55:50 +0200 Subject: [PATCH 16/31] Fix example --- examples/orders-latest/datacontract.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/orders-latest/datacontract.yaml b/examples/orders-latest/datacontract.yaml index 1e27fce..1809bc2 100644 --- a/examples/orders-latest/datacontract.yaml +++ b/examples/orders-latest/datacontract.yaml @@ -1,4 +1,4 @@ -dataContractSpecification: 1.0.1 +dataContractSpecification: 1.1.0 id: urn:datacontract:checkout:orders-latest info: title: Orders Latest From 
ad2b5c52b198adcd64c9f9c60529212eb9412f64 Mon Sep 17 00:00:00 2001 From: jochen Date: Mon, 9 Sep 2024 16:16:18 +0200 Subject: [PATCH 17/31] Server Role resolves #81 --- README.md | 18 +++++++++++++----- datacontract.schema.json | 20 ++++++++++++++++++++ 2 files changed, 33 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index b70e287..6daa141 100644 --- a/README.md +++ b/README.md @@ -349,11 +349,12 @@ This object _MAY_ be extended with [Specification Extensions](#specification-ext The fields are dependent on the defined type. -| Field | Type | Description | -|-------------|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| type | `string` | REQUIRED. The type of the data product technology that implements the data contract.
Well-known server types are: `bigquery`, `s3`, `glue`, `redshift`, `azure`, `sqlserver`, `snowflake`, `databricks`, `postgres`, `oracle`, `kafka`, `pubsub`, `sftp`, `kinesis`, `trino`, `local` | +| description | `string` | An optional string describing the server. | +| environment | `string` | An optional string describing the environment, e.g., prod, sit, stg. | +| roles | Array of `Server Role Object` | An optional array of roles that are available and can be requested to access the server for role-based access control. E.g. separate roles for different regions or sensitive data. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). @@ -530,6 +531,13 @@ servers: | path | `string` | The relative or absolute path to the data file(s), such as `./folder/data.parquet`. | | format | `string` | The format of the file(s), such as `parquet`, `delta`, `csv`, or `json`. | +#### Server Role Object + +| Field | Type | Description | +|-------------|----------|--------------------------------------------------------------| +| name | `string` | Name of the role | +| description | `string` | A description of the role and what access the role provides. | + ### Terms Object The terms and conditions of the data contract. diff --git a/datacontract.schema.json b/datacontract.schema.json index 5c35fbf..e9d8689 100644 --- a/datacontract.schema.json +++ b/datacontract.schema.json @@ -1143,6 +1143,26 @@ "trino", "local" ] + }, + "roles": { + "description": "An optional array of roles that are available and can be requested to access the server for role-based access control. E.g. separate roles for different regions or sensitive data.", + "type": "array", + "items": { + "type": "object", + "properties": { + "name": { + "type": "string", + "description": "The name of the role." + }, + "description": { + "type": "string", + "description": "A description of the role and what access the role provides."
+ } + }, + "required": [ + "name" + ] + } } }, "additionalProperties": true, From bc8263cc4890ba4195c93a98200fd02d29101b99 Mon Sep 17 00:00:00 2001 From: jochen Date: Mon, 9 Sep 2024 16:54:22 +0200 Subject: [PATCH 18/31] Change example format to examples array --- CHANGELOG.md | 1 + README.md | 118 +++++++++-------------- examples/orders-latest/datacontract.yaml | 69 +++++++------ 3 files changed, 79 insertions(+), 109 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 3c85d46..b5576a5 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,6 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added - Data quality on model and field level ([#55](https://github.com/datacontract/datacontract-specification/issues/55)) +- Field and definition `examples` as array of any type, instead of `example` as a single value - Support for server-specific data types as config map ([#63](https://github.com/datacontract/datacontract-specification/issues/63)) - AWS Glue Catalog server support - sftp server support diff --git a/README.md b/README.md index b70e287..1a6cac5 100644 --- a/README.md +++ b/README.md @@ -94,12 +94,14 @@ models: description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. type: timestamp required: true - example: "2024-09-09T08:30:00Z" + examples: + - "2024-09-09T08:30:00Z" order_total: description: Total amount the smallest monetary unit (e.g., cents). type: long required: true - example: "9999" + examples: + - "9999" customer_id: description: Unique identifier for the customer. type: text @@ -113,8 +115,8 @@ models: pii: true classification: sensitive quality: - - name: Verified email address - description: The email address was verified by a user with double opt-in. + - type: text + name: The email address was verified by a user processed_timestamp: description: The timestamp when the record was processed by the data platform. 
type: timestamp @@ -123,16 +125,28 @@ models: jsonType: string jsonFormat: date-time quality: - - name: Completeness check - description: If there is a gap of orders longer than one hour, it clearly indicates a problem. - sql: | + - type: sql + description: The maximum duration between two orders should be less than 3600 seconds + query: | SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration - FROM {orders} + FROM orders must_be_less_than: 3600 - - name: Number of rows + - type: row_count engine: soda - type: row_count must_be_greater_than: 5 + examples: + - | + order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp + "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z" + "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z" + "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z" + "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" + "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" + "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z" + "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z" + "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z" + "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z" + "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z" line_items: description: A single article that is part of an order.
type: table @@ -149,6 +163,19 @@ models: sku: description: The purchased article number $ref: '#/definitions/sku' + examples: + - | + lines_item_id,order_id,sku + "LI-1","1001","5901234123457" + "LI-2","1001","4001234567890" + "LI-3","1002","5901234123457" + "LI-4","1002","2001234567893" + "LI-5","1003","4001234567890" + "LI-6","1003","5001234567892" + "LI-7","1004","5901234123457" + "LI-8","1005","2001234567893" + "LI-9","1005","5001234567892" + "LI-10","1005","6001234567891" definitions: order_id: domain: checkout @@ -157,7 +184,8 @@ definitions: type: text format: uuid description: An internal ID that identifies an order in the online shop. - example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2 + examples: + - 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2 pii: true classification: restricted tags: @@ -168,7 +196,8 @@ definitions: title: Stock Keeping Unit type: text pattern: ^[A-Za-z0-9]{8,14}$ - example: "96385074" + examples: + - "96385074" description: | A Stock Keeping Unit (SKU) is an internal unique identifier for an article. It is typically associated with an article's barcode, such as the EAN/GTIN. @@ -176,37 +205,6 @@ definitions: wikipedia: https://en.wikipedia.org/wiki/Stock_keeping_unit tags: - inventory -examples: - - type: csv # csv, json, yaml, custom - model: orders - description: An example list of order records. 
- data: | # expressed as string or inline yaml or via "$ref: data.csv" - order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp - "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z" - "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z" - "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z" - "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" - "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" - "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z" - "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z" - "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z" - "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z" - "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z" - - type: csv - model: line_items - description: An example list of line items. 
- data: | - lines_item_id,order_id,sku - "LI-1","1001","5901234123457" - "LI-2","1001","4001234567890" - "LI-3","1002","5901234123457" - "LI-4","1002","2001234567893" - "LI-5","1003","4001234567890" - "LI-6","1003","5001234567892" - "LI-7","1004","5901234123457" - "LI-8","1005","2001234567893" - "LI-9","1005","5001234567892" - "LI-10","1005","6001234567891" servicelevels: availability: description: The server is available during support hours @@ -282,7 +280,6 @@ Specification - [Model Object](#model-object) - [Field Object](#field-object) - [Definition Object](#definition-object) -- [Example Object](#example-object) - [Service Level Object](#service-levels-object) - [Quality Object](#quality-object) - [Data Types](#data-types) @@ -306,7 +303,6 @@ It is _RECOMMENDED_ that the root document be named: `datacontract.yaml`. | terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | | models | Map[`string`, [Model Object](#model-object)] | Specifies the logical data model. | | definitions | Map[`string`, [Definition Object](#definition-object)] | Specifies definitions. | -| examples | Array of [Example Objects](#example-object) | Specifies example data sets for the data model. The specification supports different example types. | | servicelevels | [Service Levels Object](#service-levels-object) | Specifies the service level of the provided data | | links | Map[`string`, `string`] | Additional external documentation links. | | tags | Array of `string` | Custom metadata to provide additional context. | @@ -559,6 +555,7 @@ The name of the data model (table name) is defined by the key that refers to thi | config | [Config Object](#config-object) | Any additional key-value pairs that might be useful for further tooling. | | quality | Array of [Quality Object](#quality-object) | Specifies the quality attributes on model level. | +| examples | Array of `string` | Specifies example data sets for the model. Typically in CSV or JSON format. 
| This object _MAY_ be extended with [Specification Extensions](#specification-extensions). @@ -588,7 +585,8 @@ The Field Objects describes one field (column, property, nested field) of a data | exclusiveMinimum | `number` | A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | maximum | `number` | A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | exclusiveMaximum | `number` | A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | -| example | `string` | An example value. | +| example | `string` | DEPRECATED, use examples. An example value. | +| examples | Array of Any | A list of example values. | | pii | `boolean` | An indication, if this field contains Personal Identifiable Information (PII). | | classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. Examples may be: `sensitive`, `restricted`, `internal`, `public`. | | tags | Array of `string` | Custom metadata to provide additional context. | @@ -642,34 +640,6 @@ Models fields can refer to definitions using the `$ref` field to link to existin This object _MAY_ be extended with [Specification Extensions](#specification-extensions). -### Example Object - -| Field | Type | Description | -|-------------|----------|-----------------------------------------------------------------------------------------------------------------------------------------| -| type | `string` | The type of the data product technology that implements the data contract. Well-known server types are: `csv`, `json`, `yaml`, `custom` | -| description | `string` | An optional string describing the example. | -| model | `string` | The reference to the model in the schema, e.g. a table name. 
| -| data | `string` | Example data for this model. | - -Example: - -```yaml -examples: -- type: csv - model: orders - data: |- - order_id,order_timestamp,order_total - "1001","2023-09-09T08:30:00Z",2500 - "1002","2023-09-08T15:45:00Z",1800 - "1003","2023-09-07T12:15:00Z",3200 - "1004","2023-09-06T19:20:00Z",1500 - "1005","2023-09-05T10:10:00Z",4200 - "1006","2023-09-04T14:55:00Z",2800 - "1007","2023-09-03T21:05:00Z",1900 - "1008","2023-09-02T17:40:00Z",3600 - "1009","2023-09-01T09:25:00Z",3100 - "1010","2023-08-31T22:50:00Z",2700 -``` ### Service Levels Object diff --git a/examples/orders-latest/datacontract.yaml b/examples/orders-latest/datacontract.yaml index 1809bc2..b901950 100644 --- a/examples/orders-latest/datacontract.yaml +++ b/examples/orders-latest/datacontract.yaml @@ -50,12 +50,14 @@ models: description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. type: timestamp required: true - example: "2024-09-09T08:30:00Z" + examples: + - "2024-09-09T08:30:00Z" order_total: description: Total amount the smallest monetary unit (e.g., cents). type: long required: true - example: "9999" + examples: + - "9999" customer_id: description: Unique identifier for the customer. 
type: text @@ -88,6 +90,19 @@ models: - type: row_count engine: soda must_be_greater_than: 5 + examples: + - | + order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp + "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z" + "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z" + "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z" + "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" + "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" + "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z" + "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z" + "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z" + "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z" + "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z" line_items: description: A single article that is part of an order. type: table @@ -104,6 +119,19 @@ models: sku: description: The purchased article number $ref: '#/definitions/sku' + examples: + - | + lines_item_id,order_id,sku + "LI-1","1001","5901234123457" + "LI-2","1001","4001234567890" + "LI-3","1002","5901234123457" + "LI-4","1002","2001234567893" + "LI-5","1003","4001234567890" + "LI-6","1003","5001234567892" + "LI-7","1004","5901234123457" + "LI-8","1005","2001234567893" + "LI-9","1005","5001234567892" + "LI-10","1005","6001234567891" definitions: order_id: domain: checkout @@ -112,7 +140,8 @@ definitions: type: text format: uuid description: An internal ID that identifies an order in the online shop. 
- example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2 + examples: + - 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2 pii: true classification: restricted tags: @@ -123,7 +152,8 @@ definitions: title: Stock Keeping Unit type: text pattern: ^[A-Za-z0-9]{8,14}$ - example: "96385074" + examples: + - "96385074" description: | A Stock Keeping Unit (SKU) is an internal unique identifier for an article. It is typically associated with an article's barcode, such as the EAN/GTIN. @@ -131,37 +161,6 @@ definitions: wikipedia: https://en.wikipedia.org/wiki/Stock_keeping_unit tags: - inventory -examples: - - type: csv # csv, json, yaml, custom - model: orders - description: An example list of order records. - data: | # expressed as string or inline yaml or via "$ref: data.csv" - order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp - "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z" - "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z" - "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z" - "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" - "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" - "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z" - "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z" - "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z" - "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z" - "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z" - - type: csv - model: line_items - description: An example list of line items. 
- data: | - lines_item_id,order_id,sku - "LI-1","1001","5901234123457" - "LI-2","1001","4001234567890" - "LI-3","1002","5901234123457" - "LI-4","1002","2001234567893" - "LI-5","1003","4001234567890" - "LI-6","1003","5001234567892" - "LI-7","1004","5901234123457" - "LI-8","1005","2001234567893" - "LI-9","1005","5001234567892" - "LI-10","1005","6001234567891" servicelevels: availability: description: The server is available during support hours From 1475cfef22108c0c00eea5027dbaff0c3ab8b526 Mon Sep 17 00:00:00 2001 From: Simon Harrer Date: Thu, 12 Sep 2024 13:58:39 +0200 Subject: [PATCH 19/31] Deprecate definiton.domain, use definiton.id instead. --- datacontract.schema.json | 3 ++- definition.schema.json | 10 +++++++++- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/datacontract.schema.json b/datacontract.schema.json index 844afc4..1a98b43 100644 --- a/datacontract.schema.json +++ b/datacontract.schema.json @@ -1086,7 +1086,8 @@ "domain": { "type": "string", "description": "The domain in which this definition is valid.", - "default": "global" + "default": "global", + "deprecationMessage": "This field is deprecated. Encode the domain into the ID using slashes." }, "name": { "type": "string", diff --git a/definition.schema.json b/definition.schema.json index e07bccd..c93447b 100644 --- a/definition.schema.json +++ b/definition.schema.json @@ -6,7 +6,15 @@ "domain": { "type": "string", "description": "The domain in which this definition is valid.", - "default": "global" + "default": "global", + "deprecationMessage": "This field is deprecated. Encode the domain into the ID." + }, + "id": { + "type": "string", + "description": "A unique identifier for this definition. 
Encode the domain into the ID, separated by slashes.", + "examples": [ + "checkout/order_id" + ] }, "name": { "type": "string", From 21de1b0be37f720f7f7b2bc70eb52bd22e503363 Mon Sep 17 00:00:00 2001 From: Simon Harrer Date: Thu, 12 Sep 2024 13:59:56 +0200 Subject: [PATCH 20/31] UPDATE --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index a419951..c829425 100644 --- a/README.md +++ b/README.md @@ -608,7 +608,7 @@ Models fields can refer to definitions using the `$ref` field to link to existin |------------------|----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | name | `string` | REQUIRED. The technical name of this definition. | | type | [Data Type](#data-types) | REQUIRED. The logical data type | -| domain | `string` | The domain in which this definition is valid. Default: `global`. | +| domain | `string` | DEPRECATED. Use definition id instead. The domain in which this definition is valid. Default: `global`. | | title | `string` | The business name of this definition. | | description | `string` | Clear and concise explanations related to the domain | | enum | array of `string` | A value must be equal to one of the elements in this array value. Only evaluated if the value is not null. 
| From beed4acb20835b346e8edddc6d26d328ff7ba71f Mon Sep 17 00:00:00 2001 From: jochen Date: Thu, 12 Sep 2024 18:40:45 +0200 Subject: [PATCH 21/31] Fix threshold keys --- README.md | 16 ++++++++-------- examples/orders-latest/datacontract.yaml | 18 +++++++++--------- 2 files changed, 17 insertions(+), 17 deletions(-) diff --git a/README.md b/README.md index 186b799..1ae98c0 100644 --- a/README.md +++ b/README.md @@ -56,12 +56,6 @@ info: contact: name: John Doe (Data Product Owner) url: https://teams.microsoft.com/l/channel/example/checkout -tags: - - checkout - - orders - - s3 -links: - datacontractCli: https://cli.datacontract.com servers: production: type: s3 @@ -130,10 +124,10 @@ models: query: | SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration FROM orders - must_be_less_than: 3600 + mustBeLessThan: 3600 - type: row_count engine: soda - must_be_greater_than: 5 + mustBeGreaterThan: 5 examples: - | order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp @@ -237,6 +231,12 @@ servicelevels: cron: 0 0 * * 0 recoveryTime: 24 hours recoveryPoint: 1 week +tags: + - checkout + - orders + - s3 +links: + datacontractCli: https://cli.datacontract.com ``` Data Contract CLI diff --git a/examples/orders-latest/datacontract.yaml b/examples/orders-latest/datacontract.yaml index b901950..6caf815 100644 --- a/examples/orders-latest/datacontract.yaml +++ b/examples/orders-latest/datacontract.yaml @@ -12,12 +12,6 @@ info: contact: name: John Doe (Data Product Owner) url: https://teams.microsoft.com/l/channel/example/checkout -tags: - - checkout - - orders - - s3 -links: - datacontractCli: https://cli.datacontract.com servers: production: type: s3 @@ -86,10 +80,10 @@ models: query: | SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration FROM orders - must_be_less_than: 3600 + mustBeLessThan: 3600 - type: 
row_count engine: soda - must_be_greater_than: 5 + mustBeGreaterThan: 5 examples: - | order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp @@ -192,4 +186,10 @@ servicelevels: interval: weekly cron: 0 0 * * 0 recoveryTime: 24 hours - recoveryPoint: 1 week \ No newline at end of file + recoveryPoint: 1 week +tags: + - checkout + - orders + - s3 +links: + datacontractCli: https://cli.datacontract.com From a84508ed967452ed261e5dd3654f439bb9c3d814 Mon Sep 17 00:00:00 2001 From: jochen Date: Thu, 12 Sep 2024 19:27:45 +0200 Subject: [PATCH 22/31] Fix example --- CHANGELOG.md | 1 + README.md | 71 +++++++++++++----------- datacontract.schema.json | 2 +- examples/orders-latest/datacontract.yaml | 25 ++++++--- 4 files changed, 59 insertions(+), 40 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index b5576a5..857447e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -27,6 +27,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Removed +- `definitions.domain` removed (use a hierarchical structure instead) - `quality` on top-level removed - `schema` removed in favor of encoding any physical schema configuration in the `model` using the `config` map at the field level and supporting import/export ([#21](https://github.com/datacontract/datacontract-specification/issues/21)). diff --git a/README.md b/README.md index be1eb26..1466210 100644 --- a/README.md +++ b/README.md @@ -52,7 +52,6 @@ info: All orders since 2020-01-01. Orders with their line items are in their current state (no history included). owner: Checkout Team - slackChannel: "#checkout" contact: name: John Doe (Data Product Owner) url: https://teams.microsoft.com/l/channel/example/checkout @@ -90,12 +89,20 @@ models: required: true examples: - "2024-09-09T08:30:00Z" + tags: ["business-timestamp"] order_total: description: Total amount the smallest monetary unit (e.g., cents). 
type: long required: true examples: - "9999" + quality: + - type: sql + description: 95% of all order total values are expected to be between 10 and 499 EUR. + query: | + SELECT quantile_cont(order_total, 0.95) AS percentile_95 + FROM orders + mustBeBetween: [1000, 49900] customer_id: description: Unique identifier for the customer. type: text @@ -125,22 +132,25 @@ models: SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration FROM orders mustBeLessThan: 3600 - - type: row_count - engine: soda + - type: sql + description: Row Count + query: | + SELECT count(*) as row_count + FROM orders mustBeGreaterThan: 5 examples: - | - order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp - "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z" - "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z" - "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z" - "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" - "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" - "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z" - "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z" - "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z" - "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z" - "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z" + order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp + 
"1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z" + "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z" + "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z" + "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" + "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" + "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z" + "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z" + "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z" + "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z" + "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z" line_items: description: A single article that is part of an order. 
type: table @@ -152,27 +162,26 @@ models: unique: true primary: true order_id: - $ref: '#/definitions/order_id' + $ref: '#/definitions/checkout/order_id' references: orders.order_id sku: description: The purchased article number - $ref: '#/definitions/sku' + $ref: '#/definitions/checkout/sku' examples: - - | - lines_item_id,order_id,sku - "LI-1","1001","5901234123457" - "LI-2","1001","4001234567890" - "LI-3","1002","5901234123457" - "LI-4","1002","2001234567893" - "LI-5","1003","4001234567890" - "LI-6","1003","5001234567892" - "LI-7","1004","5901234123457" - "LI-8","1005","2001234567893" - "LI-9","1005","5001234567892" - "LI-10","1005","6001234567891" + - | + lines_item_id,order_id,sku + "LI-1","1001","5901234123457" + "LI-2","1001","4001234567890" + "LI-3","1002","5901234123457" + "LI-4","1002","2001234567893" + "LI-5","1003","4001234567890" + "LI-6","1003","5001234567892" + "LI-7","1004","5901234123457" + "LI-8","1005","2001234567893" + "LI-9","1005","5001234567892" + "LI-10","1005","6001234567891" definitions: - order_id: - domain: checkout + checkout/order_id: name: order_id title: Order ID type: text @@ -184,7 +193,7 @@ definitions: classification: restricted tags: - orders - sku: + checkout/sku: domain: inventory name: sku title: Stock Keeping Unit diff --git a/datacontract.schema.json b/datacontract.schema.json index dec038a..1c77c3a 100644 --- a/datacontract.schema.json +++ b/datacontract.schema.json @@ -653,7 +653,7 @@ "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", "type": "object", "propertyNames": { - "pattern": "^[a-zA-Z0-9_-]+$" + "pattern": "^[a-zA-Z0-9/_-]+$" }, "additionalProperties": { "type": "object", diff --git a/examples/orders-latest/datacontract.yaml b/examples/orders-latest/datacontract.yaml index 6caf815..a589b31 100644 --- a/examples/orders-latest/datacontract.yaml +++ b/examples/orders-latest/datacontract.yaml @@ -8,7 +8,6 @@ info: All orders since 
2020-01-01. Orders with their line items are in their current state (no history included). owner: Checkout Team - slackChannel: "#checkout" contact: name: John Doe (Data Product Owner) url: https://teams.microsoft.com/l/channel/example/checkout @@ -46,12 +45,20 @@ models: required: true examples: - "2024-09-09T08:30:00Z" + tags: ["business-timestamp"] order_total: description: Total amount the smallest monetary unit (e.g., cents). type: long required: true examples: - "9999" + quality: + - type: sql + description: 95% of all order total values are expected to be between 10 and 499 EUR. + query: | + SELECT quantile_cont(order_total, 0.95) AS percentile_95 + FROM orders + mustBeBetween: [1000, 49900] customer_id: description: Unique identifier for the customer. type: text @@ -81,8 +88,11 @@ models: SELECT MAX(EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp)))) AS max_duration FROM orders mustBeLessThan: 3600 - - type: row_count - engine: soda + - type: sql + description: Row Count + query: | + SELECT count(*) as row_count + FROM orders mustBeGreaterThan: 5 examples: - | @@ -108,11 +118,11 @@ models: unique: true primary: true order_id: - $ref: '#/definitions/order_id' + $ref: '#/definitions/checkout/order_id' references: orders.order_id sku: description: The purchased article number - $ref: '#/definitions/sku' + $ref: '#/definitions/checkout/sku' examples: - | lines_item_id,order_id,sku @@ -127,8 +137,7 @@ models: "LI-9","1005","5001234567892" "LI-10","1005","6001234567891" definitions: - order_id: - domain: checkout + checkout/order_id: name: order_id title: Order ID type: text @@ -140,7 +149,7 @@ definitions: classification: restricted tags: - orders - sku: + checkout/sku: domain: inventory name: sku title: Stock Keeping Unit From 59cebdb4c1f716ff0b83c36d99d230f2526fb0f2 Mon Sep 17 00:00:00 2001 From: jochen Date: Fri, 13 Sep 2024 17:49:19 +0200 Subject: [PATCH 23/31] Prepare release v1.1.0 --- CHANGELOG.md | 5 +- README.md 
| 209 ++++++++++---- datacontract.init.yaml | 12 +- datacontract.schema.json | 339 +++++++++++++++++------ definition.schema.json | 18 +- examples/orders-latest/datacontract.yaml | 40 ++- 6 files changed, 443 insertions(+), 180 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 857447e..861ea16 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,7 +11,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added - Data quality on model and field level ([#55](https://github.com/datacontract/datacontract-specification/issues/55)) -- Field and definition `examples` as array of any type, instead of `example` as a single value +- Lineage support ([#90](https://github.com/datacontract/datacontract-specification/issues/90)) +- Field and definition `examples` as array of any type, instead of `example` as a single value ([#29](https://github.com/datacontract/datacontract-specification/issues/29)) - Support for server-specific data types as config map ([#63](https://github.com/datacontract/datacontract-specification/issues/63)) - AWS Glue Catalog server support - sftp server support @@ -28,7 +29,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Removed - `definitions.domain` removed (use a hierarchical structure instead) +- `definitions.name` removed (use a hierarchical structure instead) - `quality` on top-level removed +- `examples` on top-level removed - `schema` removed in favor of encoding any physical schema configuration in the `model` using the `config` map at the field level and supporting import/export ([#21](https://github.com/datacontract/datacontract-specification/issues/21)). 
diff --git a/README.md b/README.md index 1466210..61c7ffa 100644 --- a/README.md +++ b/README.md @@ -39,7 +39,7 @@ Version Example --- -[![Data Contract Catalog](https://img.shields.io/badge/Data%20Contract-Catalog-blue)](https://datacontract.com/examples/index.html) +View in [Data Contract Catalog](https://datacontract.com/examples/index.html) ```yaml dataContractSpecification: 1.1.0 @@ -63,6 +63,11 @@ servers: format: json delimiter: new_line description: "One folder per model. One file per day." + roles: + - name: analyst_us + description: Access to the data for US region + - name: analyst_cn + description: Access to the data for China region terms: usage: | Data can be used for reports, analytics and machine learning use cases. @@ -71,6 +76,12 @@ terms: Not suitable for real-time use cases. Data may not be used to identify individual customers. Max data processing per day: 10 TiB + policies: + - name: privacy-policy + url: https://example.com/privacy-policy + - name: license + description: External data is licensed under agreement 1234. + url: https://example.com/license/1234 billing: 5000 USD per month noticePeriod: P3M models: @@ -79,10 +90,10 @@ models: type: table fields: order_id: - $ref: '#/definitions/order_id' + $ref: '#/definitions/checkout/order_id' required: true unique: true - primary: true + primaryKey: true order_timestamp: description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. type: timestamp @@ -95,7 +106,7 @@ models: type: long required: true examples: - - "9999" + - 9999 quality: - type: sql description: 95% of all order total values are expected to be between 10 and 499 EUR. @@ -117,7 +128,12 @@ models: classification: sensitive quality: - type: text - name: The email address was verified by a user + description: The email address is not verified and may be invalid. 
+ lineage: + inputFields: + - namespace: com.example.service.checkout + name: checkout_db.orders + field: email_address processed_timestamp: description: The timestamp when the record was processed by the data platform. type: timestamp @@ -140,49 +156,47 @@ models: mustBeGreaterThan: 5 examples: - | - order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp - "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z" - "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z" - "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z" - "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" - "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" - "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z" - "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z" - "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z" - "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z" - "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z" + order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp + "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z" + "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z" + "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z" + "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" + 
"1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z" + "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z" + "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z" + "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z" + "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z" + "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z" line_items: description: A single article that is part of an order. type: table fields: - lines_item_id: + line_item_id: type: text description: Primary key of the lines_item_id table required: true - unique: true - primary: true order_id: $ref: '#/definitions/checkout/order_id' references: orders.order_id sku: description: The purchased article number - $ref: '#/definitions/checkout/sku' + $ref: '#/definitions/inventory/sku' + primaryKey: ["order_id", "line_item_id"] examples: - - | - lines_item_id,order_id,sku - "LI-1","1001","5901234123457" - "LI-2","1001","4001234567890" - "LI-3","1002","5901234123457" - "LI-4","1002","2001234567893" - "LI-5","1003","4001234567890" - "LI-6","1003","5001234567892" - "LI-7","1004","5901234123457" - "LI-8","1005","2001234567893" - "LI-9","1005","5001234567892" - "LI-10","1005","6001234567891" + - | + line_item_id,order_id,sku + "LI-1","1001","5901234123457" + "LI-2","1001","4001234567890" + "LI-3","1002","5901234123457" + "LI-4","1002","2001234567893" + "LI-5","1003","4001234567890" + "LI-6","1003","5001234567892" + "LI-7","1004","5901234123457" + "LI-8","1005","2001234567893" + "LI-9","1005","5001234567892" + "LI-10","1005","6001234567891" definitions: checkout/order_id: - name: order_id title: Order ID type: text format: uuid @@ -193,9 +207,7 @@ definitions: classification: restricted tags: - orders - checkout/sku: - domain: 
inventory - name: sku + inventory/sku: title: Stock Keeping Unit type: text pattern: ^[A-Za-z0-9]{8,14}$ @@ -291,6 +303,7 @@ Specification - [Definition Object](#definition-object) - [Service Level Object](#service-levels-object) - [Quality Object](#quality-object) +- [Lineage Object](#lineage-object) - [Data Types](#data-types) - [Specification Extensions](#specification-extensions) @@ -354,12 +367,12 @@ This object _MAY_ be extended with [Specification Extensions](#specification-ext The fields are dependent on the defined type. -| Field | Type | Description | -|-------------|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| type | `string` | REQUIRED. The type of the data product technology that implements the data contract. Well-known server types are: `bigquery`, `s3`, `glue`, `redshift`, `azure`, `sqlserver`, `snowflake`, `databricks`, `postgres`, `oracle`, `kafka`, `pubsub`, `sftp`, `kinesis`, `trino`, `local` | -| description | `string` | An optional string describing the server. | -| environment | `string` | An optional string describing the environment, e.g., prod, sit, stg. | -| roles | Array of `Server Role Object` | An optional array of roles that are available and can be requested to access the server for role-based access control. E.g. separate roles for different regions or sensitive data. 
| +| Field | Type | Description | +|-------------|----------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| type | `string` | REQUIRED. The type of the data product technology that implements the data contract. Well-known server types are: `bigquery`, `s3`, `glue`, `redshift`, `azure`, `sqlserver`, `snowflake`, `databricks`, `postgres`, `oracle`, `kafka`, `pubsub`, `sftp`, `kinesis`, `trino`, `local` | +| description | `string` | An optional string describing the server. | +| environment | `string` | An optional string describing the environment, e.g., prod, sit, stg. | +| roles | Array of [Server Role Object](#server-role-object) | An optional array of roles that are available and can be requested to access the server for role-based access control. E.g. separate roles for different regions or sensitive data. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). @@ -547,15 +560,24 @@ servers: The terms and conditions of the data contract. -| Field | Type | Description | -|--------------|----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| usage | `string` | The usage describes the way the data is expected to be used. Can contain business and technical information. | -| limitations | `string` | The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for. | -| billing | `string` | The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use. 
| -| noticePeriod | `string` | The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., `P3M` for a period of three months. | +| Field | Type | Description | +|--------------|------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| usage | `string` | The usage describes the way the data is expected to be used. Can contain business and technical information. | +| limitations | `string` | The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for. | +| policies | Array of [Policy Object](#policy-object) | A list of policies, licenses, standards, that are applicable for this data contract and that must be acknowledged by data consumers. | +| billing | `string` | The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use. | +| noticePeriod | `string` | The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., `P3M` for a period of three months. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). +#### Policy Object + +| Field | Type | Description | +|-------------|----------|-----------------------------------| +| name | `string` | Name of the policy. | +| description | `string` | A description of the policy. | +| url | `string` | An URL that refers to the policy. | + ### Model Object @@ -569,10 +591,10 @@ The name of the data model (table name) is defined by the key that refers to thi | description | `string` | An optional string describing the data model. | | title | `string` | An optional string for the title of the data model. 
Especially useful if the name of the model is cryptic or contains abbreviations. | | fields | Map[`string`, [Field Object](#field-object)] | The fields (e.g. columns) of the data model. | -| config | [Config Object](#config-object) | Any additional key-value pairs that might be useful for further tooling. | - +| primaryKey | Array of `string` | If the primary key is a compound key, list the field names that constitute the primary key. Alternative to field-level `primaryKey`. | | quality | Array of [Quality Object](#quality-object) | Specifies the quality attributes on model level. | -| examples | Array of `string` | Specifies example data sets for the model. Typical in CSV or JSON format. | +| examples | Array of `Any` | Specifies example data sets for the model. | +| config | [Config Object](#config-object) | Any additional key-value pairs that might be useful for further tooling. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). @@ -589,7 +611,7 @@ The Field Objects describes one field (column, property, nested field) of a data | title | `string` | An optional string providing a human readable name for the field. Especially useful if the field name is cryptic or contains abbreviations. | | enum | array of `string` | A value must be equal to one of the elements in this array value. Only evaluated if the value is not null. | | required | `boolean` | An indication, if this field must contain a value and may not be null. Default: `false` | -| primary | `boolean` | If this field is a primary key. Default: `false` | +| primaryKey | `boolean` | If this field is a primary key. Default: `false` | | references | `string` | The reference to a field in another model. E.g. use 'orders.order_id' to reference the order_id field of the model orders. Think of defining a foreign key relationship. | | unique | `boolean` | An indication, if the value must be unique within the model. 
Default: `false` | | format | `string` | `email`: A value must be complaint to [RFC 5321, section 4.1.2](https://www.rfc-editor.org/info/rfc5321).
`uri`: A value must be complaint to [RFC 3986](https://www.rfc-editor.org/info/rfc3986).
`uuid`: A value must be complaint to [RFC 4122](https://www.rfc-editor.org/info/rfc4122). Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | @@ -602,7 +624,7 @@ The Field Objects describes one field (column, property, nested field) of a data | exclusiveMinimum | `number` | A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | maximum | `number` | A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | exclusiveMaximum | `number` | A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | -| example | `string` | DEPRECATED, use examples. An example value. | +| ~~example~~ | `string` | DEPRECATED, use examples. An example value. | | examples | Array of Any | A list of example values. | | pii | `boolean` | An indication, if this field contains Personal Identifiable Information (PII). | | classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. Examples may be: `sensitive`, `restricted`, `internal`, `public`. | @@ -613,9 +635,9 @@ The Field Objects describes one field (column, property, nested field) of a data | items | [Field Object](#field-object) | The type of the elements in the array. Use only when type is `array`. | | keys | [Field Object](#field-object) | Describes the key structure of a map. Defaults to `type: string` if a map is defined as type. Not all server types support different key types. Use only when type is `map`. | | values | [Field Object](#field-object) | Describes the value structure of a map. Use only when type is `map`. | -| config | [Config Object](#config-object) | Any additional key-value pairs that might be useful for further tooling. 
| - | quality | Array of [Quality Object](#quality-object) | Specifies the quality attributes on field level. | +| lineage | [Lineage Object](#lineage-object) | Provides information where the data comes from. | +| config | [Config Object](#config-object) | Any additional key-value pairs that might be useful for further tooling. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). @@ -628,9 +650,7 @@ Models fields can refer to definitions using the `$ref` field to link to existin | Field | Type | Description | |------------------|----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| name | `string` | REQUIRED. The technical name of this definition. | | type | [Data Type](#data-types) | REQUIRED. The logical data type | -| domain | `string` | DEPRECATED. Use definition id instead. The domain in which this definition is valid. Default: `global`. | | title | `string` | The business name of this definition. | | description | `string` | Clear and concise explanations related to the domain | | enum | array of `string` | A value must be equal to one of the elements in this array value. Only evaluated if the value is not null. | @@ -644,7 +664,7 @@ Models fields can refer to definitions using the `$ref` field to link to existin | exclusiveMinimum | `number` | A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | maximum | `number` | A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. 
Only applies to numeric values. |
 | exclusiveMaximum | `number` | A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values. |
-| example | `string` | An example value. |
+| examples | Array of Any | A list of example values. |
 | pii | `boolean` | An indication, if this field contains Personal Identifiable Information (PII). |
 | classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. |
 | tags | Array of `string` | Custom metadata to provide additional context. |
@@ -829,7 +849,7 @@ An individual SQL query that returns a single number that can be compared with a
 | mustBeLessThan | `integer` | The threshold to check the return value of the query |
 | mustBeLessThanOrEqualTo | `integer` | The threshold to check the return value of the query |
 | mustBeBetween | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. |
-| mustBeNotBetween | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. |
+| mustNotBeBetween | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. |
 
 In the query the following placeholders can be used:
 
@@ -962,6 +982,77 @@ models:
 ```
 
 
+### Lineage Object
+
+Field level lineage provides optional fine-grained information about where the data comes from and how it was transformed.
+
+The lineage object is based on the OpenLineage [Column Level Lineage Dataset Facet](https://openlineage.io/docs/spec/facets/dataset-facets/column_lineage_facet) to describe the input fields.
+
+
+
+| Field | Type | Description |
+|-------------|---------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| inputFields | Array of [InputField Object](#inputfield-object) | The input fields refer to specific fields, columns, or data points from source systems or other data contracts that feed into a particular transformation, calculation, or final result. |
+
+
+#### InputField Object
+
+| Field | Type | Description |
+|-----------------|----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| namespace | `string` | The input dataset namespace, such as the name of the source system or the domain of another data contract. Examples: `com.example.crm`, `checkout`, `snowflake://{account name}`. [More on namespace](https://openlineage.io/blog/whats-in-a-namespace/#namespaces-in-the-spec) |
+| name | `string` | The input dataset name, such as a reference to a data contract, a fully qualified table name, a Kafka topic. |
+| field | `string` | The input field name, such as the field in an upstream data contract, a table column or a JSON Path. |
+| transformations | Array of [Transformation Object](#transformation-object) | Optional. This describes how the input field data was used to generate the final result.
| + +#### Transformation Object + +| Field | Type | Description | +|-------------|-----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| type | `string` | Indicates how direct is the relationship e.g. in query. Allows values are: `DIRECT` and `INDIRECT`. | +| subtype | `string` | Optional. Contains more specific information about the transformation.
Allowed values for type `DIRECT`: `IDENTITY`, `TRANSFORMATION`, `AGGREGATION`.
 Allowed values for type `INDIRECT`: `JOIN`, `GROUP_BY`, `FILTER`, `SORT`, `WINDOW`, `CONDITIONAL`. |
+| description | `string` | Optional. A string representation of the transformation applied. |
+| masking | `boolean` | Optional. Boolean value indicating if the input value was obfuscated during the transformation. |
+
+
+Example:
+
+```yaml
+models:
+  orders:
+    fields:
+      order_id:
+        type: string
+        lineage:
+          inputFields:
+            - namespace: com.example.service.checkout
+              name: checkout_db.orders
+              field: order_id
+              transformations:
+                - type: DIRECT
+                  subtype: IDENTITY
+                  description: The order ID from the checkout order
+            - namespace: com.example.service.checkout
+              name: checkout_db.orders
+              field: order_timestamp
+              transformations:
+                - type: INDIRECT
+                  subtype: SORT
+      customer_email_address_hash:
+        type: string
+        lineage:
+          inputFields:
+            - namespace: com.example.service.checkout
+              name: checkout_db.orders
+              field: email_address
+              transformations:
+                - type: DIRECT
+                  subtype: TRANSFORMATION
+                  description: The email address from the checkout order, hashed with SHA-256
+                  masking: true
+
+```
+
+
+
 ### Config Object
 
 The config field can be used to set additional metadata that may be used by tools, e.g. to define a namespace for code generation, specify physical data types, toggle tests, etc.
diff --git a/datacontract.init.yaml b/datacontract.init.yaml index 4c6ed27..81d1c7d 100644 --- a/datacontract.init.yaml +++ b/datacontract.init.yaml @@ -1,4 +1,4 @@ -dataContractSpecification: 1.0.1 +dataContractSpecification: 1.1.0 id: my-data-contract-id info: title: My Data Contract @@ -55,16 +55,6 @@ info: # classification: -### examples - -#examples: -# - type: csv -# model: my_model -# data: |- -# id,timestamp,amount -# "1001","2023-09-09T08:30:00Z",2500 -# "1002","2023-09-08T15:45:00Z",1800 - ### servicelevels #servicelevels: diff --git a/datacontract.schema.json b/datacontract.schema.json index 1c77c3a..d52f6b1 100644 --- a/datacontract.schema.json +++ b/datacontract.schema.json @@ -33,7 +33,7 @@ "status": { "type": "string", "description": "The status of the data contract. Can be proposed, in development, active, retired.", - "x-extensible-enum": [ + "examples": [ "proposed", "in development", "active", @@ -366,6 +366,35 @@ "type": "string", "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." }, + "policies": { + "type": "array", + "items": { + "type": "object", + "properties": { + "type": { + "type": "string", + "description": "The type of the policy.", + "examples": [ + "privacy", + "security", + "retention", + "compliance" + ] + }, + "description": { + "type": "string", + "description": "A description of the policy." + }, + "url": { + "type": "string", + "format": "uri", + "description": "A URL to the policy document." + } + }, + "additionalProperties": true + }, + "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." + }, "billing": { "type": "string", "description": "The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use." 
@@ -450,6 +479,10 @@ "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "primary": { + "type": "boolean", + "deprecationMessage": "Use the primaryKey field instead." + }, + "primaryKey": { "type": "boolean", "default": false, "description": "If this field is a primary key." @@ -531,7 +564,12 @@ }, "example": { "type": "string", - "description": "An example value for this field." + "description": "An example value for this field.", + "deprecationMessage": "Use the examples field instead." + }, + "examples": { + "type": "array", + "description": "A examples value for this field." }, "pii": { "type": "boolean", @@ -575,6 +613,15 @@ "type": "string", "description": "A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition." }, + "quality": { + "type": "array", + "items": { + "$ref": "#/$defs/Quality" + } + }, + "lineage": { + "$ref": "#/$defs/Lineage" + }, "config": { "type": "object", "description": "Additional metadata for field configuration.", @@ -626,6 +673,16 @@ } } }, + "primaryKey": { + "type": "array", + "items": { + "type": "string" + }, + "description": "The compound primary key of the model." + }, + "examples": { + "type": "array" + }, "config": { "type": "object", "description": "Additional metadata for model configuration.", @@ -667,7 +724,8 @@ }, "name": { "type": "string", - "description": "The technical name of this definition." + "description": "The technical name of this definition.", + "deprecationMessage": "This field is deprecated. Encode the name into the ID using slashes." }, "title": { "type": "string", @@ -744,7 +802,12 @@ }, "example": { "type": "string", - "description": "An example value." + "description": "An example value.", + "deprecationMessage": "Use the examples field instead." + }, + "examples": { + "type": "array", + "description": "Example value." 
}, "pii": { "type": "boolean", @@ -780,55 +843,10 @@ } }, "required": [ - "name", "type" ] } }, - "examples": { - "type": "array", - "items": { - "type": "object", - "properties": { - "type": { - "type": "string", - "title": "ExampleType", - "enum": [ - "csv", - "json", - "yaml", - "custom" - ], - "description": "The type of the example data. Well-known types are csv, json, yaml, custom." - }, - "description": { - "type": "string", - "description": "An optional string describing the example." - }, - "model": { - "type": "string", - "description": "The reference to the model in the schema, e.g., a table name." - }, - "data": { - "oneOf": [ - { - "type": "string", - "description": "Example data for this model." - }, - { - "type": "array", - "description": "Example data for this model in a structured format. Use this for type json or yaml." - } - ] - } - }, - "required": [ - "type", - "data" - ] - }, - "description": "The Examples Object is an array of Example Objects." - }, "servicelevels": { "type": "object", "description": "Specifies the service level agreements for the provided data, including availability, data retention policies, latency requirements, data freshness, update frequency, support availability, and backup policies.", @@ -1009,39 +1027,6 @@ } } }, - "quality": { - "type": "object", - "properties": { - "type": { - "type": "string", - "title": "QualityType", - "enum": [ - "SodaCL", - "montecarlo", - "great-expectations", - "custom" - ], - "description": "The type of the quality check. Typical values are SodaCL, montecarlo, great-expectations, custom." - }, - "specification": { - "oneOf": [ - { - "type": "string", - "description": "The specification of the quality attributes as a string." - }, - { - "type": "object", - "description": "The specification of the quality attributes as an object." - } - ] - } - }, - "required": [ - "type", - "specification" - ], - "description": "The quality object contains quality attributes and checks." 
- }, "links": { "type": "object", "description": "Links to external resources.", @@ -1670,6 +1655,194 @@ "path", "format" ] + }, + "Quality": { + "allOf": [ + { + "type": "object", + "properties": { + "type": { + "type": "string", + "description": "The type of quality check", + "enum": [ + "text", + "sql", + "custom" + ] + }, + "description": { + "type": "string", + "description": "A plain text describing the quality attribute in natural language." + } + }, + "required": [ + "type" + ] + }, + { + "if": { + "properties": { + "type": { + "const": "sql" + } + } + }, + "then": { + "properties": { + "query": { + "type": "string", + "description": "A SQL query that returns a single number to compare with the threshold." + }, + "mustBe": { + "type": "integer" + }, + "mustNotBe": { + "type": "integer" + }, + "mustBeGreaterThan": { + "type": "integer" + }, + "mustBeGreaterThanOrEqualTo": { + "type": "integer" + }, + "mustBeLessThan": { + "type": "integer" + }, + "mustBeLessThanOrEqualTo": { + "type": "integer" + }, + "mustBeBetween": { + "type": "array", + "items": { + "type": "integer" + }, + "minItems": 2, + "maxItems": 2 + }, + "mustNotBeBetween": { + "type": "array", + "items": { + "type": "integer" + }, + "minItems": 2, + "maxItems": 2 + } + }, + "required": [ + "query" + ] + } + }, + { + "if": { + "properties": { + "type": { + "const": "custom" + } + } + }, + "then": { + "properties": { + "description": { + "type": "string", + "description": "A plain text describing the quality attribute in natural language." + }, + "engine": { + "type": "string", + "examples": [ + "soda", + "great-expectations" + ], + "description": "The engine used for custom quality checks." + }, + "specification": { + "type": [ + "object", + "array", + "string" + ], + "description": "Engine-specific quality checks and expectations." 
+ } + }, + "required": [ + "engine" + ] + } + } + ] + }, + "Lineage": { + "type": "object", + "properties": { + "inputFields": { + "type": "array", + "items": { + "type": "object", + "properties": { + "namespace": { + "type": "string", + "description": "The input dataset namespace" + }, + "name": { + "type": "string", + "description": "The input dataset name" + }, + "field": { + "type": "string", + "description": "The input field" + }, + "transformations": { + "type": "array", + "items": { + "type": "object", + "properties": { + "type": { + "description": "The type of the transformation. Allowed values are: DIRECT, INDIRECT", + "type": "string" + }, + "subtype": { + "type": "string", + "description": "The subtype of the transformation" + }, + "description": { + "type": "string", + "description": "a string representation of the transformation applied" + }, + "masking": { + "type": "boolean", + "description": "is transformation masking the data or not" + } + }, + "required": [ + "type" + ], + "additionalProperties": true + } + } + }, + "additionalProperties": true, + "required": [ + "namespace", + "name", + "field" + ] + } + }, + "transformationDescription": { + "type": "string", + "description": "a string representation of the transformation applied", + "deprecated": true + }, + "transformationType": { + "type": "string", + "description": "IDENTITY|MASKED reflects a clearly defined behavior. 
IDENTITY: exact same as input; MASKED: no original data available (like a hash of PII for example)", + "deprecated": true + } + }, + "additionalProperties": true, + "required": [ + "inputFields" + ] } } -} +} \ No newline at end of file diff --git a/definition.schema.json b/definition.schema.json index c93447b..d0d30ac 100644 --- a/definition.schema.json +++ b/definition.schema.json @@ -3,12 +3,6 @@ "type": "object", "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", "properties": { - "domain": { - "type": "string", - "description": "The domain in which this definition is valid.", - "default": "global", - "deprecationMessage": "This field is deprecated. Encode the domain into the ID." - }, "id": { "type": "string", "description": "A unique identifier for this definition. Encode the domain into the ID, separated by slashes.", @@ -16,10 +10,6 @@ "checkout/order_id" ] }, - "name": { - "type": "string", - "description": "The technical name of this definition." - }, "title": { "type": "string", "description": "The business name of this definition." @@ -64,7 +54,12 @@ }, "example": { "type": "string", - "description": "An example value." + "description": "An example value for this field.", + "deprecationMessage": "Use the examples field instead." + }, + "examples": { + "type": "array", + "description": "A examples value for this field." }, "pii": { "type": "boolean", @@ -100,7 +95,6 @@ } }, "required": [ - "name", "type" ] } diff --git a/examples/orders-latest/datacontract.yaml b/examples/orders-latest/datacontract.yaml index a589b31..4e28a41 100644 --- a/examples/orders-latest/datacontract.yaml +++ b/examples/orders-latest/datacontract.yaml @@ -19,6 +19,11 @@ servers: format: json delimiter: new_line description: "One folder per model. One file per day." 
+ roles: + - name: analyst_us + description: Access to the data for US region + - name: analyst_cn + description: Access to the data for China region terms: usage: | Data can be used for reports, analytics and machine learning use cases. @@ -27,6 +32,12 @@ terms: Not suitable for real-time use cases. Data may not be used to identify individual customers. Max data processing per day: 10 TiB + policies: + - name: privacy-policy + url: https://example.com/privacy-policy + - name: license + description: External data is licensed under agreement 1234. + url: https://example.com/license/1234 billing: 5000 USD per month noticePeriod: P3M models: @@ -35,10 +46,10 @@ models: type: table fields: order_id: - $ref: '#/definitions/order_id' + $ref: '#/definitions/checkout/order_id' required: true unique: true - primary: true + primaryKey: true order_timestamp: description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. type: timestamp @@ -51,7 +62,7 @@ models: type: long required: true examples: - - "9999" + - 9999 quality: - type: sql description: 95% of all order total values are expected to be between 10 and 499 EUR. @@ -73,7 +84,12 @@ models: classification: sensitive quality: - type: text - name: The email address was verified by a user + description: The email address is not verified and may be invalid. + lineage: + inputFields: + - namespace: com.example.service.checkout + name: checkout_db.orders + field: email_address processed_timestamp: description: The timestamp when the record was processed by the data platform. type: timestamp @@ -111,21 +127,20 @@ models: description: A single article that is part of an order. 
type: table fields: - lines_item_id: + line_item_id: type: text description: Primary key of the lines_item_id table required: true - unique: true - primary: true order_id: $ref: '#/definitions/checkout/order_id' references: orders.order_id sku: description: The purchased article number - $ref: '#/definitions/checkout/sku' + $ref: '#/definitions/inventory/sku' + primaryKey: ["order_id", "line_item_id"] examples: - | - lines_item_id,order_id,sku + line_item_id,order_id,sku "LI-1","1001","5901234123457" "LI-2","1001","4001234567890" "LI-3","1002","5901234123457" @@ -138,7 +153,6 @@ models: "LI-10","1005","6001234567891" definitions: checkout/order_id: - name: order_id title: Order ID type: text format: uuid @@ -149,9 +163,7 @@ definitions: classification: restricted tags: - orders - checkout/sku: - domain: inventory - name: sku + inventory/sku: title: Stock Keeping Unit type: text pattern: ^[A-Za-z0-9]{8,14}$ @@ -201,4 +213,4 @@ tags: - orders - s3 links: - datacontractCli: https://cli.datacontract.com + datacontractCli: https://cli.datacontract.com \ No newline at end of file From 2d35ff1f930edc86434fe07a09a9aaa171e401dc Mon Sep 17 00:00:00 2001 From: jochen Date: Fri, 13 Sep 2024 17:52:03 +0200 Subject: [PATCH 24/31] Archive 0.9.3 version --- versions/0.9.3/README.md | 57 ++++--- versions/0.9.3/datacontract.init.yaml | 207 +++++++++++++----------- versions/0.9.3/datacontract.schema.json | 44 +++++ 3 files changed, 185 insertions(+), 123 deletions(-) diff --git a/versions/0.9.3/README.md b/versions/0.9.3/README.md index 6463be2..a90cbc3 100644 --- a/versions/0.9.3/README.md +++ b/versions/0.9.3/README.md @@ -2,7 +2,7 @@ Stars -Slack Status +Slack Status ![datacontract.png](images/datacontract.png) @@ -365,42 +365,44 @@ This object _MAY_ be extended with [Specification Extensions](#specification-ext #### S3 Server Object -| Field | Type | Description | 
-|-------------|----------|------------------------------------------------------------------------------------------------------------------| -| type | `string` | `s3` | -| location | `string` | S3 URL, starting with `s3://` | -| endpointUrl | `string` | The server endpoint for S3-compatible servers, such as `https://minio.example.com` | -| format | `string` | Format of files, such as `parquet`, `delta`, `json`, `csv` | -| delimiter | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array` | +| Field | Type | Description | +|-------------|----------|-------------------------------------------------------------------------------------------------------------------------| +| type | `string` | `s3` | +| location | `string` | S3 URL, starting with `s3://` | +| endpointUrl | `string` | The server endpoint for S3-compatible servers, such as MioIO or Google Cloud Storage, e.g., `https://minio.example.com` | +| format | `string` | Format of files, such as `parquet`, `delta`, `json`, `csv` | +| delimiter | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array` | -Example: +Example (AWS S3): ```yaml servers: production: type: s3 location: s3://acme-orders-prod/orders/ + format: json + delimiter: new_line ``` -#### AWS Glue Server Object +Example (MinIO): -| Field | Type | Description | -|----------|----------|------------------------------------------------------------| -| type | `string` | `glue` | -| account | `string` | REQUIRED. The AWS account, e.g., `1234-5678-9012` | -| database | `string` | REQUIRED. 
The AWS Glue Catalog database | -| location | `string` | S3 path, starting with `s3://` | -| format | `string` | Format of files, such as `parquet`, `delta`, `json`, `csv` | +```yaml +servers: + minio: + type: s3 + endpointUrl: http://localhost:9000 + location: s3://my-bucket/path/ + format: delta +``` -Example: +Example (Google Cloud Storage): ```yaml servers: - production: - type: glue - account: "1234-5678-9012" - database: acme-orders - location: s3://acme-orders-prod/orders/ + gcs: + type: s3 + endpointUrl: https://storage.googleapis.com + location: s3://my-bucket/path/*/*/*/*/*.parquet format: parquet ``` @@ -537,6 +539,8 @@ The terms and conditions of the data contract. | billing | `string` | The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use. | | noticePeriod | `string` | The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., `P3M` for a period of three months. | +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). + ### Model Object @@ -552,7 +556,7 @@ The name of the data model (table name) is defined by the key that refers to thi | fields | Map[`string`, [Field Object](#field-object)] | The fields (e.g. columns) of the data model. | | config | [Config Object](#config-object) | Any additional key-value pairs that might be useful for further tooling. | - +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Field Object @@ -591,6 +595,8 @@ The Field Objects describes one field (column, property, nested field) of a data | values | [Field Object](#field-object) | Describes the value structure of a map. Use only when type is `map`. | | config | [Config Object](#config-object) | Any additional key-value pairs that might be useful for further tooling. 
| +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). + ### Definition Object @@ -626,9 +632,10 @@ Models fields can refer to definitions using the `$ref` field to link to existin | keys | [Field Object](#field-object) | Describes the key structure of a map. Defaults to `type: string` if a map is defined as type. Not all server types support different key types. Use only when type is `map`. | | values | [Field Object](#field-object) | Describes the value structure of a map. Use only when type is `map`. | +This object _MAY_ be extended with [Specification Extensions](#specification-extensions). -### Schema Object +### Schema Object (DEPRECATED) The schema of the data contract describes the physical schema. The type of the schema depends on the data platform. diff --git a/versions/0.9.3/datacontract.init.yaml b/versions/0.9.3/datacontract.init.yaml index 29dbe19..cec58b6 100644 --- a/versions/0.9.3/datacontract.init.yaml +++ b/versions/0.9.3/datacontract.init.yaml @@ -1,98 +1,109 @@ -{ - "$schema": "http://json-schema.org/draft-07/schema#", - "type": "object", - "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", - "properties": { - "domain": { - "type": "string", - "description": "The domain in which this definition is valid.", - "default": "global" - }, - "name": { - "type": "string", - "description": "The technical name of this definition." - }, - "title": { - "type": "string", - "description": "The business name of this definition." - }, - "description": { - "type": "string", - "description": "Clear and concise explanations related to the domain." - }, - "type": { - "type": "string", - "description": "The logical data type." - }, - "minLength": { - "type": "integer", - "description": "A value must be greater than or equal to this value. Applies only to string types." 
- }, - "maxLength": { - "type": "integer", - "description": "A value must be less than or equal to this value. Applies only to string types." - }, - "format": { - "type": "string", - "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." - }, - "precision": { - "type": "integer", - "examples": [ - 38 - ], - "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." - }, - "scale": { - "type": "integer", - "examples": [ - 0 - ], - "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." - }, - "pattern": { - "type": "string", - "description": "A regular expression pattern the value must match. Applies only to string types." - }, - "example": { - "type": "string", - "description": "An example value." - }, - "pii": { - "type": "boolean", - "description": "Indicates if the field contains Personal Identifiable Information (PII)." - }, - "classification": { - "type": "string", - "description": "The data class defining the sensitivity level for this field." - }, - "tags": { - "type": "array", - "items": { - "type": "string" - }, - "description": "Custom metadata to provide additional context." 
- }, - "links": { - "type": "object", - "description": "Links to external resources.", - "minProperties": 1, - "propertyNames": { - "pattern": "^[a-zA-Z0-9_-]+$" - }, - "additionalProperties": { - "type": "string", - "title": "Link", - "description": "A URL to an external resource.", - "format": "uri", - "examples": [ - "https://example.com" - ] - } - } - }, - "required": [ - "name", - "type" - ] -} \ No newline at end of file +dataContractSpecification: 0.9.3 +id: my-data-contract-id +info: + title: My Data Contract + version: 0.0.1 +# description: +# owner: +# contact: +# name: +# url: +# email: + + +### servers + +#servers: +# production: +# type: s3 +# location: s3:// +# format: parquet +# delimiter: new_line + +### terms + +#terms: +# usage: +# limitations: +# billing: +# noticePeriod: + + +### models + +# models: +# my_model: +# description: +# type: +# fields: +# my_field: +# type: +# description: + + +### definitions + +# definitions: +# my_field: +# domain: +# name: +# title: +# type: +# description: +# example: +# pii: +# classification: + + +### examples + +#examples: +# - type: csv +# model: my_model +# data: |- +# id,timestamp,amount +# "1001","2023-09-09T08:30:00Z",2500 +# "1002","2023-09-08T15:45:00Z",1800 + +### servicelevels + +#servicelevels: +# availability: +# description: The server is available during support hours +# percentage: 99.9% +# retention: +# description: Data is retained for one year because! +# period: P1Y +# unlimited: false +# latency: +# description: Data is available within 25 hours after the order was placed +# threshold: 25h +# sourceTimestampField: orders.order_timestamp +# processedTimestampField: orders.processed_timestamp +# freshness: +# description: The age of the youngest row in a table. 
+# threshold: 25h +# timestampField: orders.order_timestamp +# frequency: +# description: Data is delivered once a day +# type: batch # or streaming +# interval: daily # for batch, either or cron +# cron: 0 0 * * * # for batch, either or interval +# support: +# description: The data is available during typical business hours at headquarters +# time: 9am to 5pm in EST on business days +# responseTime: 1h +# backup: +# description: Data is backed up once a week, every Sunday at 0:00 UTC. +# interval: weekly +# cron: 0 0 * * 0 +# recoveryTime: 24 hours +# recoveryPoint: 1 week + +### quality + +#quality: +# type: SodaCL +# specification: +# checks for my_model: |- +# - duplicate_count(id) = 0 \ No newline at end of file diff --git a/versions/0.9.3/datacontract.schema.json b/versions/0.9.3/datacontract.schema.json index a0904be..e1db717 100644 --- a/versions/0.9.3/datacontract.schema.json +++ b/versions/0.9.3/datacontract.schema.json @@ -169,6 +169,50 @@ "location" ] }, + { + "type": "object", + "title": "GcsServer", + "properties": { + "type": { + "type": "string", + "enum": [ + "gcs" + ], + "description": "The type of the data product technology that implements the data contract." + }, + "location": { + "type": "string", + "format": "uri", + "description": "The GS/GCS url to the data.", + "examples": [ + "gs://example-storage/data/*/*.json" + ] + }, + "format": { + "type": "string", + "enum": [ + "parquet", + "delta", + "json", + "csv" + ], + "description": "File format." + }, + "delimiter": { + "type": "string", + "enum": [ + "new_line", + "array" + ], + "description": "Only for format = json. 
How multiple json documents are delimited within one file" + } + }, + "additionalProperties": true, + "required": [ + "type", + "location" + ] + }, { "type": "object", "title": "SftpServer", From 3fe21f0f732284828b18c7f152c8d62c297ddd85 Mon Sep 17 00:00:00 2001 From: jochen Date: Mon, 16 Sep 2024 07:54:09 +0200 Subject: [PATCH 25/31] Update definitions --- README.md | 10 +++++----- examples/orders-latest/datacontract.yaml | 10 +++++----- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 61c7ffa..052425f 100644 --- a/README.md +++ b/README.md @@ -90,7 +90,7 @@ models: type: table fields: order_id: - $ref: '#/definitions/checkout/order_id' + $ref: '#/definitions/order_id' required: true unique: true primaryKey: true @@ -176,11 +176,11 @@ models: description: Primary key of the lines_item_id table required: true order_id: - $ref: '#/definitions/checkout/order_id' + $ref: '#/definitions/order_id' references: orders.order_id sku: description: The purchased article number - $ref: '#/definitions/inventory/sku' + $ref: '#/definitions/sku' primaryKey: ["order_id", "line_item_id"] examples: - | @@ -196,7 +196,7 @@ models: "LI-9","1005","5001234567892" "LI-10","1005","6001234567891" definitions: - checkout/order_id: + order_id: title: Order ID type: text format: uuid @@ -207,7 +207,7 @@ definitions: classification: restricted tags: - orders - inventory/sku: + sku: title: Stock Keeping Unit type: text pattern: ^[A-Za-z0-9]{8,14}$ diff --git a/examples/orders-latest/datacontract.yaml b/examples/orders-latest/datacontract.yaml index 4e28a41..013dc36 100644 --- a/examples/orders-latest/datacontract.yaml +++ b/examples/orders-latest/datacontract.yaml @@ -46,7 +46,7 @@ models: type: table fields: order_id: - $ref: '#/definitions/checkout/order_id' + $ref: '#/definitions/order_id' required: true unique: true primaryKey: true @@ -132,11 +132,11 @@ models: description: Primary key of the lines_item_id table required: true order_id: - $ref: 
'#/definitions/checkout/order_id' + $ref: '#/definitions/order_id' references: orders.order_id sku: description: The purchased article number - $ref: '#/definitions/inventory/sku' + $ref: '#/definitions/sku' primaryKey: ["order_id", "line_item_id"] examples: - | @@ -152,7 +152,7 @@ models: "LI-9","1005","5001234567892" "LI-10","1005","6001234567891" definitions: - checkout/order_id: + order_id: title: Order ID type: text format: uuid @@ -163,7 +163,7 @@ definitions: classification: restricted tags: - orders - inventory/sku: + sku: title: Stock Keeping Unit type: text pattern: ^[A-Za-z0-9]{8,14}$ From 9565da63177cdeb929723190428a6f379f5f05fd Mon Sep 17 00:00:00 2001 From: jochen Date: Mon, 16 Sep 2024 13:09:46 +0200 Subject: [PATCH 26/31] Update example --- README.md | 4 ++-- examples/orders-latest/datacontract.yaml | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 052425f..ce8c47c 100644 --- a/README.md +++ b/README.md @@ -46,7 +46,7 @@ dataContractSpecification: 1.1.0 id: urn:datacontract:checkout:orders-latest info: title: Orders Latest - version: 1.0.0 + version: 2.0.0 description: | Successful customer orders in the webshop. All orders since 2020-01-01. @@ -59,7 +59,7 @@ servers: production: type: s3 environment: prod - location: s3://datacontract-example-orders-latest/data/{model}/*.json + location: s3://datacontract-example-orders-latest/v2/{model}/*.json format: json delimiter: new_line description: "One folder per model. One file per day." diff --git a/examples/orders-latest/datacontract.yaml b/examples/orders-latest/datacontract.yaml index 013dc36..257f080 100644 --- a/examples/orders-latest/datacontract.yaml +++ b/examples/orders-latest/datacontract.yaml @@ -2,7 +2,7 @@ dataContractSpecification: 1.1.0 id: urn:datacontract:checkout:orders-latest info: title: Orders Latest - version: 1.0.0 + version: 2.0.0 description: | Successful customer orders in the webshop. All orders since 2020-01-01. 
@@ -15,7 +15,7 @@ servers: production: type: s3 environment: prod - location: s3://datacontract-example-orders-latest/data/{model}/*.json + location: s3://datacontract-example-orders-latest/v2/{model}/*.json format: json delimiter: new_line description: "One folder per model. One file per day." From 3cc9b2b4c922b7bd1efc33f4da95ef85377c4b62 Mon Sep 17 00:00:00 2001 From: jochen Date: Mon, 16 Sep 2024 13:26:28 +0200 Subject: [PATCH 27/31] Skip workflow --- .github/workflows/ci.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml index f573404..1861b5a 100644 --- a/.github/workflows/ci.yaml +++ b/.github/workflows/ci.yaml @@ -6,6 +6,7 @@ on: name: CI jobs: test: + if: false # skip as the example structure has changed with v1.1.0 runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 From 4575c33c3aff21c5db1b1f59d24cba03325847ef Mon Sep 17 00:00:00 2001 From: Simon Harrer Date: Mon, 16 Sep 2024 18:19:40 +0200 Subject: [PATCH 28/31] UPDATE --- CHANGELOG.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index 861ea16..c629d0e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -25,6 +25,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - Trino support - Field `type: map` support with properties `keys` and `values` - Definitions: `fields`, for type `object`, `record`, and `struct` +- Field `field.primaryKey` (Replaces `field.primary`) +- Field `model.primaryKey` to describe a composite primary key + ### Removed @@ -34,6 +37,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - `examples` on top-level removed - `schema` removed in favor of encoding any physical schema configuration in the `model` using the `config` map at the field level and supporting import/export ([#21](https://github.com/datacontract/datacontract-specification/issues/21)). 
+### Deprecated + +- `field.primary` (use `field.primaryKey` instead) + ## [0.9.3] - 2024-03-06 From 32ada883d7feee6bee90858f211ebc4d89e5954c Mon Sep 17 00:00:00 2001 From: jochenchrist Date: Sun, 13 Oct 2024 14:44:25 +0200 Subject: [PATCH 29/31] Update README.md --- README.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index ce8c47c..b154ffa 100644 --- a/README.md +++ b/README.md @@ -894,7 +894,7 @@ Soda checks can be applied on model and field level. | type | `string` | `custom` | | description | `string` | Optional. A plain text describing the quality attribute in natural language. | | engine | `string` | `soda` | -| specification | `object` | A check type as defined in the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) | +| implementation | `object` | A check type as defined in the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) | See the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) for all possible types and configuration values. @@ -912,7 +912,7 @@ models: - type: custom description: This is a check on field level engine: soda - specification: + implementation: type: no_duplicate_values carrier: type: string @@ -922,7 +922,7 @@ models: - type: custom description: This is a check on model level engine: soda - specification: + implementation: type: duplicate_percent columns: - carrier @@ -931,7 +931,7 @@ models: - type: custom description: This is a check on model level engine: soda - specification: + implementation: type: row_count must_be_greater_than: 500000 ``` @@ -946,7 +946,7 @@ Expectations are applied on model level. |---------------|----------|-----------------------------------------------------------------------------------------------------| | description | `string` | Optional. A plain text describing the quality attribute in natural language. 
| | engine | `string` | `great-expectations` | -| specification | `object` | An expectation type as listed in [Expectation](https://greatexpectations.io/expectations/) as YAML. | +| implementation | `object` | An expectation type as listed in [Expectation](https://greatexpectations.io/expectations/) as YAML. | Example: @@ -956,7 +956,7 @@ models: quality: - type: custom engine: great-expectations - specification: + implementation: expectation_type: expect_table_row_count_to_be_between kwargs: min_value: 10000 @@ -966,7 +966,7 @@ models: - type: custom engine: great-expectations description: "Check that passenger_count values are between 1 and 6." - specification: + implementation: expectation_type: expect_column_values_to_be_between kwargs: column: passenger_count From 41b1fcfeab10730eaf6fdc72b1282c605b8a908c Mon Sep 17 00:00:00 2001 From: jochen Date: Sun, 27 Oct 2024 07:59:25 +0100 Subject: [PATCH 30/31] Add library to quality --- datacontract.schema.json | 69 +++++++++++++++++++++++++++++++++++++--- 1 file changed, 65 insertions(+), 4 deletions(-) diff --git a/datacontract.schema.json b/datacontract.schema.json index d52f6b1..9f74b7d 100644 --- a/datacontract.schema.json +++ b/datacontract.schema.json @@ -1666,6 +1666,7 @@ "description": "The type of quality check", "enum": [ "text", + "library", "sql", "custom" ] @@ -1674,10 +1675,7 @@ "type": "string", "description": "A plain text describing the quality attribute in natural language." } - }, - "required": [ - "type" - ] + } }, { "if": { @@ -1733,6 +1731,69 @@ ] } }, + { + "if": { + "properties": { + "type": { + "const": "library" + } + } + }, + "then": { + "properties": { + "rule": { + "type": "string", + "description": "Define a data quality check based on the predefined rules as per ODCS.", + "examples": ["duplicateCount", "validValues", "rowCount"] + }, + "mustBe": { + "description": "Must be equal to the value to be valid. When using numbers, it is equivalent to '='." 
+ }, + "mustNotBe": { + "description": "Must not be equal to the value to be valid. When using numbers, it is equivalent to '!='." + }, + "mustBeGreaterThan": { + "type": "number", + "description": "Must be greater than the value to be valid. It is equivalent to '>'." + }, + "mustBeGreaterOrEqualTo": { + "type": "number", + "description": "Must be greater than or equal to the value to be valid. It is equivalent to '>='." + }, + "mustBeLessThan": { + "type": "number", + "description": "Must be less than the value to be valid. It is equivalent to '<'." + }, + "mustBeLessOrEqualTo": { + "type": "number", + "description": "Must be less than or equal to the value to be valid. It is equivalent to '<='." + }, + "mustBeBetween": { + "type": "array", + "description": "Must be between the two numbers to be valid. Smallest number first in the array.", + "minItems": 2, + "maxItems": 2, + "uniqueItems": true, + "items": { + "type": "number" + } + }, + "mustNotBeBetween": { + "type": "array", + "description": "Must not be between the two numbers to be valid. 
Smallest number first in the array.", + "minItems": 2, + "maxItems": 2, + "uniqueItems": true, + "items": { + "type": "number" + } + } + }, + "required": [ + "rule" + ] + } + }, { "if": { "properties": { From 169a89b1a2625f405b84606d6b5bcb4b7626d894 Mon Sep 17 00:00:00 2001 From: jochen Date: Wed, 30 Oct 2024 10:18:52 +0100 Subject: [PATCH 31/31] Update 0.9.3 --- versions/0.9.3/README.md | 59 +++++++++++++++++++++---- versions/0.9.3/datacontract.schema.json | 26 +++++++++++ 2 files changed, 76 insertions(+), 9 deletions(-) diff --git a/versions/0.9.3/README.md b/versions/0.9.3/README.md index a90cbc3..9fdbbd1 100644 --- a/versions/0.9.3/README.md +++ b/versions/0.9.3/README.md @@ -277,7 +277,7 @@ Specification - [Model Object](#model-object) - [Field Object](#field-object) - [Definition Object](#definition-object) -- [Schema Object](#schema-object) +- [Schema Object (DEPRECATED)](#schema-object-deprecated) - [Example Object](#example-object) - [Service Level Object](#service-levels-object) - [Quality Object](#quality-object) @@ -302,7 +302,7 @@ It is _RECOMMENDED_ that the root document be named: `datacontract.yaml`. | terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | | models | Map[`string`, [Model Object](#model-object)] | Specifies the logical data model. | | definitions | Map[`string`, [Definition Object](#definition-object)] | Specifies definitions. | -| schema | [Schema Object](#schema-object) | Specifies the physical schema. The specification supports different schema format. | +| schema | [Schema Object (DEPRECATED)](#schema-object-deprecated) | Specifies the physical schema. The specification supports different schema format. | | examples | Array of [Example Objects](#example-object) | Specifies example data sets for the data model. The specification supports different example types. 
| | servicelevels | [Service Levels Object](#service-levels-object) | Specifies the service level of the provided data | | quality | [Quality Object](#quality-object) | Specifies the quality attributes and checks. The specification supports different quality check DSLs. | @@ -409,12 +409,53 @@ servers: #### Redshift Server Object -| Field | Type | Description | -|----------|----------|-------------| -| type | `string` | `redshift` | -| account | `string` | | -| database | `string` | | -| schema | `string` | | +| Field | Type | Description | +|-------------------|----------|---------------------------------------------------------------------------------------------------------------------| +| type | `string` | `redshift` | +| account | `string` | | +| database | `string` | | +| schema | `string` | | +| clusterIdentifier | `string` | Identifier of the cluster.
Example: `analytics-cluster` | +| host | `string` | Host of the cluster.
Example: `analytics-cluster.example.eu-west-1.redshift.amazonaws.com` | +| port | `number` | Port of the cluster.
Example: `5439` | +| endpoint | `string` | Endpoint of the cluster.
Example: `analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/analytics` | + +Example, specifying an endpoint: + +```yaml +servers: + analytics: + type: redshift + account: '123456789012' + database: analytics + schema: analytics + endpoint: analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/analytics +``` + +Example, specifying the cluster identifier: + +```yaml +servers: + analytics: + type: redshift + account: '123456789012' + database: analytics + schema: analytics + clusterIdentifier: analytics-cluster +``` + +Example, specifying the cluster host: + +```yaml +servers: + analytics: + type: redshift + account: '123456789012' + database: analytics + schema: analytics + host: analytics-cluster.example.eu-west-1.redshift.amazonaws.com + port: 5439 +``` #### Azure Server Object @@ -878,7 +919,7 @@ One can either describe each service level informally using the `description` fi |--------------|-----------------------------------------------|-------------------------------------------------------------------------| | availability | [Availability Object](#availability-object) | The promised uptime of the system that provides the data | | retention | [Retention Object](#retention-object) | The period how long data will be available. | -| latency | [Latency Object](#latency-object) | The maximum amount of time from the the source to its destination. | +| latency | [Latency Object](#latency-object) | The maximum amount of time from the source to its destination. | | freshness | [Freshness Object](#freshness-object) | The maximum age of the youngest entry. | | frequency | [Frequency Object](#frequency-object) | The update frequency. | | support | [Support Object](#support-object) | The times when support is provided. 
| diff --git a/versions/0.9.3/datacontract.schema.json b/versions/0.9.3/datacontract.schema.json index e1db717..02e69ef 100644 --- a/versions/0.9.3/datacontract.schema.json +++ b/versions/0.9.3/datacontract.schema.json @@ -272,6 +272,10 @@ "type": "string", "description": "An optional string describing the server." }, + "host": { + "type": "string", + "description": "An optional string describing the host name." + }, "database": { "type": "string", "description": "An optional string describing the server." @@ -279,6 +283,28 @@ "schema": { "type": "string", "description": "An optional string describing the server." + }, + "clusterIdentifier": { + "type": "string", + "description": "An optional string describing the cluster's identifier.", + "examples": [ + "redshift-prod-eu", + "analytics-cluster" + ] + }, + "port": { + "type": "integer", + "description": "An optional integer describing the cluster's port.", + "examples": [ + 5439 + ] + }, + "endpoint": { + "type": "string", + "description": "An optional string describing the cluster's endpoint.", + "examples": [ + "analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/analytics" + ] } }, "additionalProperties": true,