Commit

Update readme
jochenchrist committed Jan 27, 2024
1 parent e8312a9 commit 1d324d6
Showing 3 changed files with 65 additions and 33 deletions.
4 changes: 2 additions & 2 deletions Dockerfile
@@ -29,8 +29,8 @@ RUN groupadd -r datacontract
RUN useradd -r -g datacontract datacontract
USER datacontract

WORKDIR /app
RUN chown -R datacontract:datacontract /app
WORKDIR /datacontract
RUN chown -R datacontract:datacontract /datacontract

ENV PYTHONUNBUFFERED=1

86 changes: 59 additions & 27 deletions README.md
@@ -7,42 +7,60 @@
<img alt="Stars" src="https://img.shields.io/github/stars/datacontract/cli" /></a>
</p>

The `datacontract` CLI lets you work with your `datacontract.yaml` files locally, and in your CI pipeline. It uses the [Data Contract Specification](https://datacontract.com/) to validate the contract, connect to your data sources and execute tests. The CLI is open source and written in Python. It can be used as a CLI tool or directly as a Python library.
The `datacontract` CLI is an open source command-line tool for working with [Data Contracts](https://datacontract.com/).
It uses data contract YAML files to lint the data contract, connect to data sources and execute schema and quality tests, detect breaking changes, and export to different formats. The tool is written in Python. It can be used as a standalone CLI tool, in a CI/CD pipeline, or directly as a Python library.

> **_NOTE:_** This project has been migrated from Go to Python, which makes it possible to use `datacontract` within Python code as a library, but it comes with some [breaking changes](CHANGELOG.md). If you still rely on the Go version, it has been [forked](https://github.com/datacontract/cli-go).

## Usage

`datacontract` usually works with a `datacontract.yaml` file in your current working directory. You can specify a different file or URL as an additional argument.
## Getting started

Let's use [pip](https://pip.pypa.io/en/stable/getting-started/) to install the CLI.
```bash
# create a new data contract from example
$ datacontract init --template https://raw.githubusercontent.com/datacontract/cli/main/tests/examples/s3-json-remote/datacontract.yaml
$ pip install datacontract-cli
```

# execute schema and quality checks
$ datacontract test
Now, let's look at this data contract: https://raw.githubusercontent.com/datacontract/datacontract-specification/main/examples/covid-cases/datacontract.yaml

We have a _servers_ section with endpoint details to the (public) S3 bucket, a _model_ for the structure of the data, and _quality_ attributes that describe the expected freshness and number of rows.

This data contract contains all the information needed to connect to S3 and check whether the actual data meets the defined schema and quality requirements.
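
As a rough illustration of the three sections described above, here is a minimal sketch that checks whether a parsed data contract contains them. It assumes the YAML file has already been loaded into a Python dictionary (for example with a YAML parser) and that the top-level keys are `servers`, `models`, and `quality`, as in the example contract. The helper `missing_sections` is hypothetical and not part of the `datacontract` CLI.

```python
# Hypothetical helper for illustration only; not part of the datacontract CLI.
# Assumption: the contract's top-level keys are "servers", "models", "quality".
REQUIRED_SECTIONS = ("servers", "models", "quality")


def missing_sections(contract: dict) -> list[str]:
    """Return the required top-level sections that are absent or empty."""
    return [name for name in REQUIRED_SECTIONS if not contract.get(name)]


# Trimmed-down stand-in for the parsed example contract.
example = {
    "dataContractSpecification": "0.9.2",
    "id": "covid_cases",
    "servers": {"s3-json": {"type": "s3", "format": "json"}},
    "models": {"covid_cases": {"fields": {"fips": {"type": "string"}}}},
    "quality": {"type": "SodaCL"},
}

print(missing_sections(example))                # []
print(missing_sections({"id": "covid_cases"}))  # ['servers', 'models', 'quality']
```

This only checks that the sections exist. Validating their contents is what `datacontract lint` and `datacontract test` are for.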

We run the tests:

```bash
$ datacontract test https://raw.githubusercontent.com/datacontract/datacontract-specification/main/examples/covid-cases/datacontract.yaml
# returns: 🟢 data contract is valid. Tested 12 checks.
```

## Advanced Usage
Voilà, the CLI tested that the _datacontract.yaml_ itself is valid, that all records comply with the schema, and that all quality attributes are met.

## Usage

```bash
# lint the data contract
# create a new data contract from example and write it to datacontract.yaml
$ datacontract init datacontract.yaml

# lint the datacontract.yaml
$ datacontract lint datacontract.yaml

# execute schema and quality checks
$ datacontract test datacontract.yaml

# find differences between two data contracts (Coming Soon)
$ datacontract diff datacontract-v1.yaml datacontract-v2.yaml

# fail pipeline on breaking changes (Coming Soon)
$ datacontract breaking datacontract-v1.yaml datacontract-v2.yaml

# export model as jsonschema
$ datacontract export --format jsonschema
$ datacontract export --format jsonschema datacontract.yaml

# export model as dbt (Coming Soon)
$ datacontract export --format dbt
$ datacontract export --format dbt datacontract.yaml

# import protobuf as model (Coming Soon)
$ datacontract import --format protobuf --source my_protobuf_file.proto
$ datacontract import --format protobuf --source my_protobuf_file.proto datacontract.yaml
```

## Programmatic (Python)
@@ -57,43 +75,57 @@ if not run.has_passed():
```

## Scenario: Integration with Data Mesh Manager

If you use [Data Mesh Manager](https://datamesh-manager.com/), you can use the data contract URL and append the `--publish` option to send and display the test results. Set an environment variable for your API key.

```bash
# Fetch current data contract, execute tests on production, and publish result to data mesh manager
$ export DATAMESH_MANAGER_API_KEY=xxx
$ datacontract test https://demo.datamesh-manager.com/demo279750347121/datacontracts/4df9d6ee-e55d-4088-9598-b635b2fdcbbc/datacontract.yaml --server production --publish
```
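
If the publish step above is driven from a Python script (for example in a CI job), the command line can be assembled as shown in this minimal sketch. It uses only the `test` subcommand, the `--server` and `--publish` options, and the `DATAMESH_MANAGER_API_KEY` variable shown above. The helper name `build_test_command` is hypothetical, and actually running the command assumes the `datacontract` CLI is installed and on the `PATH`.

```python
import os
import shlex
from typing import Optional


def build_test_command(
    contract: str, server: Optional[str] = None, publish: bool = False
) -> list[str]:
    """Assemble the argument list for a `datacontract test` invocation."""
    argv = ["datacontract", "test", contract]
    if server:
        argv += ["--server", server]
    if publish:
        argv.append("--publish")
    return argv


argv = build_test_command("datacontract.yaml", server="production", publish=True)

# Pass the API key through the environment rather than on the command line.
env = dict(os.environ, DATAMESH_MANAGER_API_KEY="xxx")

print(shlex.join(argv))
# datacontract test datacontract.yaml --server production --publish

# To actually run it (assumes the CLI is installed):
# import subprocess
# subprocess.run(argv, env=env, check=True)
```

Building the arguments as a list avoids shell quoting problems when the contract is given as a URL.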

## Scenario: CI/CD testing for breaking changes
```bash
# fail pipeline on breaking changes in the data contract yaml (coming soon)
$ datacontract breaking datacontract.yaml https://raw.githubusercontent.com/datacontract/cli/main/examples/my-data-contract-id_v0.0.1.yaml
```




## Installation

### Pip
Choose the most appropriate installation method for your needs:

### pip
Python 3.11 recommended.

```bash
pip install datacontract-cli
```

### pipx
pipx installs into an isolated environment.
```bash
pipx install datacontract-cli
```

[//]: # (### Homebrew)
### Homebrew (coming soon)

[//]: # (```bash)
```bash
brew install datacontract/brew/datacontract
```

[//]: # (brew install datacontract/brew/datacontract)
### Docker (coming soon)

[//]: # (```)
```bash
docker pull datacontract/cli
docker run --rm -v ${PWD}:/datacontract datacontract/cli
```

## Documentation

### Tests

Data Contract CLI can connect to data sources and run schema and quality tests to verify that the data contract is valid.

```
datacontract test
```bash
$ datacontract test --server production datacontract.yaml
```

To connect to a data source, the CLI uses the `servers` block in the datacontract.yaml. In addition, credentials, such as usernames and passwords, may be defined with environment variables.
@@ -174,8 +206,8 @@ python3 -m twine upload --repository testpypi dist/*
Docker Build

```
docker build -t datacontract .
docker run --rm -v ${PWD}:/app datacontract
docker build -t datacontract/cli .
docker run --rm -v ${PWD}:/datacontract datacontract/cli
```

## Contribution
8 changes: 4 additions & 4 deletions tests/examples/s3-json-remote/datacontract.yaml
@@ -1,5 +1,5 @@
dataContractSpecification: 0.9.2
id: enigma_jhu
id: covid_cases
info:
title: COVID-19 cases
description: Johns Hopkins University Consolidated data on COVID-19 cases, sourced from Enigma
@@ -9,13 +9,13 @@ info:
data-explorer: https://dj2taa9i652rf.cloudfront.net/
data: https://covid19-lake.s3.us-east-2.amazonaws.com/enigma-jhu/json/part-00000-adec1cd2-96df-4c6b-a5f2-780f092951ba-c000.json
servers:
s3:
s3-json:
type: s3
location: s3://covid19-lake/enigma-jhu/json/*.json
format: json
delimiter: new_line
models:
enigma_jhu:
covid_cases:
fields:
fips:
type: string
@@ -47,6 +47,6 @@ models:
quality:
type: SodaCL
specification:
checks for enigma_jhu:
checks for covid_cases:
- freshness(last_update::datetime) < 5000d # dataset is not updated anymore
- row_count > 1000
