
Data Caterer - Test Data Management Tool


Overview

A test data management tool for any data source, batch or real-time, with automated data generation, validation and cleanup.

Basic data flow for Data Caterer

Generate data for databases, files, messaging systems or HTTP requests via the UI, Scala/Java SDK or YAML input, executed via Spark. Run data validations after generating data to ensure it is consumed correctly. Clean up generated data, or consumed data in downstream data sources, to keep your environments tidy. Define alerts to get notified when failures occur, and deep dive into issues from the generated report.

Full docs can be found here.

A demo of the UI can be found here.

Scala/Java examples can be found here.

Features

Basic flow

Quick start

  1. Mac download
  2. Windows download
    1. After downloading, go to the 'Downloads' folder and select 'Extract All' on data-caterer-windows
    2. Double-click 'DataCaterer-1.0.0' to install Data Caterer
    3. Click 'More info' then, at the bottom, click 'Run anyway'
    4. Go to the '/Program Files/DataCaterer' folder and run the DataCaterer application
    5. If your browser doesn't open automatically, go to http://localhost:9898 in your preferred browser
  3. Linux download
  4. Docker
    docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer-basic:0.11.9
    Open localhost:9898.

Run Scala/Java examples

git clone [email protected]:data-catering/data-caterer-example.git
cd data-caterer-example && ./run.sh
# check the results at docker/sample/report/index.html
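
For orientation, a plan in the Scala SDK looks roughly like the sketch below. This is a minimal sketch assuming the documented builder API; the class name, output path and field definitions are illustrative, and package or method names may differ between versions.

import io.github.datacatering.datacaterer.api.PlanRun
import io.github.datacatering.datacaterer.api.model.DoubleType

class AccountPlanRun extends PlanRun {
  // generate 100 JSON records with pattern-based and faker-based values
  val accountTask = json("account_info", "/opt/app/data/json")
    .schema(
      field.name("account_id").regex("ACC[0-9]{8}"),
      field.name("name").expression("#{Name.name}"),            // real-looking name via DataFaker
      field.name("amount").`type`(DoubleType).min(10).max(1000)
    )
    .count(count.records(100))

  execute(accountTask)
}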

Integrations

Supported data sources

Data Caterer supports the data sources below. Additional data sources can be added on demand; check here for the full roadmap.

Data Source Type: Data Sources
Cloud Storage: AWS S3, Azure Blob Storage, GCP Cloud Storage
Database: Cassandra, MySQL, Postgres, Elasticsearch, MongoDB
File: CSV, Delta Lake, JSON, Iceberg, ORC, Parquet, Hudi
HTTP: REST API
Messaging: Kafka, Solace, ActiveMQ, Pulsar, RabbitMQ
Metadata: Great Expectations, Marquez, OpenAPI/Swagger, OpenMetadata, Open Data Contract Standard (ODCS), Amundsen, Datahub, Data Contract CLI, Solace Event Portal
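
In the Scala SDK, each supported data source has a corresponding connection builder, used inside a PlanRun class. A minimal sketch, assuming placeholder hostnames, paths and connection names (all targets below are illustrative):

// hypothetical connections; URLs, paths and names are placeholders
val postgresAccounts = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account.balances")

val csvTransactions = csv("transactions", "/opt/app/data/csv/transactions")

val kafkaAccountEvents = kafka("account_events", "localhost:9092")
  .topic("account-topic")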

Supported use cases

  1. Insert into single data sink
  2. Insert into multiple data sinks
    1. Foreign keys associated between data sources
    2. Number of records per column value
  3. Set random seed at column and whole data generation level
  4. Generate real-looking data (via DataFaker) and edge cases
    1. Names, addresses, places etc.
    2. Edge cases for each data type (e.g. newline character in string, maximum integer, NaN, 0)
    3. Nullability
  5. Send events progressively
  6. Automatically insert data into data source
    1. Read metadata from the data source and insert into all sub data sources (e.g. tables)
    2. Get statistics from existing data in the data source, if it exists
  7. Track and delete generated data
  8. Extract data profiling and metadata from given data sources
    1. Calculate the total number of combinations
  9. Validate data (see the sketch after this list)
    1. Basic column validations (not null, contains, equals, greater than)
    2. Aggregate validations (group by account_id and sum amounts should be less than 100, each account should have at least one transaction)
    3. Upstream data source validations (generate data and then check same data is inserted in another data source with potential transformations)
    4. Column name validations (check count and ordering of column names)
  10. Data migration validations
    1. Ensure row counts are equal
    2. Check both data sources have same values for key columns
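
As a rough illustration of the validation styles in point 9, a hedged sketch using the documented validation builders inside a PlanRun (field names and thresholds are illustrative; exact method names may vary between versions):

// attach validations to a data source; names and limits are placeholders
val validateAccounts = json("account_info", "/opt/app/data/json")
  .validations(
    validation.col("account_id").isNotNull,                        // basic column validation
    validation.col("amount").lessThan(100),
    validation.groupBy("account_id").sum("amount").lessThan(100),  // aggregate validation
    validation.columnNames.countEqual(3)                           // column name validation
  )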

Run Configurations

Different ways to run Data Caterer based on your use case:

Types of run configurations

Sponsorship

Data Caterer follows a sponsorware model: all features are available to sponsors, while the core features in this project are open for anyone to use, fork, update or improve as the open core.

Sponsors gain access to additional features. Find out more details here about sponsorship.

This is inspired by the mkdocs-material project, which follows the same model.

Contributing

View details here about how you can contribute to the project.

Additional Details

Design

Design motivations and details can be found here.

Roadmap

Check here for the full list.

UI

  1. Allow the application to run with the UI enabled
  2. Runs as a long-lived app with a UI that interacts with the existing app as a single container
  3. Ability to run as a UI, a Spark job, or both
  4. Persist data in files or a database (Postgres)
  5. The UI shows the history of data generation/validation runs and lets you delete generated data, create new scenarios and define data connections

Distribution

Docker
gradle clean :api:shadowJar :app:shadowJar
docker build --build-arg "APP_VERSION=0.7.0" --build-arg "SPARK_VERSION=3.5.0" --no-cache -t datacatering/data-caterer:0.7.0 .
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone -v data-caterer-data:/opt/data-caterer --name datacaterer datacatering/data-caterer:0.7.0
# open localhost:9898
Jpackage
JPACKAGE_BUILD=true gradle clean :api:shadowJar :app:shadowJar
# Mac
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-mac.cfg"
# Windows
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-windows.cfg"
# Linux
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-linux.cfg"
Java 17 VM Options (required so Spark can access internal JDK modules on Java 17):
--add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED

-Dlog4j.configurationFile=classpath:log4j2.properties

