diff --git a/docs/sponsor.md b/docs/sponsor.md
index 43164df9..c7809b68 100644
--- a/docs/sponsor.md
+++ b/docs/sponsor.md
@@ -9,6 +9,9 @@ maintaining, documenting and updating it. I hope that it will help with develope
 by saving time and effort, allowing you to focus on what is important. If you fall under this boat, please consider
 sponsorship to allow me to further maintain and upgrade the solution. Any contributions are much appreciated.
 
+If you want to use this project for open source applications, [please contact me](#contact), as I would be
+happy to contribute.
+
 This is inspired by the [mkdocs-material project](https://github.com/squidfunk/mkdocs-material) that
 [follows the same model](https://squidfunk.github.io/mkdocs-material/insiders/).
 
diff --git a/docs/use-case/blog/shift-left-data-quality.md b/docs/use-case/blog/shift-left-data-quality.md
index 919a1152..21c8330b 100644
--- a/docs/use-case/blog/shift-left-data-quality.md
+++ b/docs/use-case/blog/shift-left-data-quality.md
@@ -11,7 +11,7 @@ quality at the forefront of the development process.
 
 ``` mermaid
 graph LR
-  subgraph badQualityData[<b>Manually generated data, data quality always passes</b>]
+  subgraph badQualityData[<b>Manually generated data, limited data scenarios</b>]
   local[<b>Local</b>\nManual test, unit test]
   dev[<b>Dev</b>\nManual test, integration test]
   stg[<b>Staging</b>\nSanity checks]
@@ -34,7 +34,7 @@ graph LR
 
 ``` mermaid
 graph LR
-  subgraph qualityData[<b>Reliable data for testing anywhere</b>]
+  subgraph qualityData[<b>Reliable data for testing anywhere<br>Common testing tool</b>]
   direction LR
   local[<b>Local</b>\nManual test, unit test]
   dev[<b>Dev</b>\nManual test, integration test]
@@ -81,7 +81,6 @@ complex data flows, validate data sources, and ensure data quality before it rea
 6. **Improved Collaboration:**
     - Facilitate collaboration between developers, testers, and data professionals by providing a common platform for
       early data validation.
-    - No need to rely on seeking domain expertise or external teams for data testing.
 
 ## Realizing the Vision of Proactive Data Quality
 
diff --git a/docs/use-case/roadmap.md b/docs/use-case/roadmap.md
index 9cb48c1d..90d460d6 100644
--- a/docs/use-case/roadmap.md
+++ b/docs/use-case/roadmap.md
@@ -2,22 +2,22 @@
 
 Items below summarise the roadmap of Data Caterer. As each task gets completed, it will be documented and linked.
 
-| Feature                                | Description                                                                                                                                         | Sub Tasks                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
-|----------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Data source support                    | Batch or real time data sources that can be added to Data Caterer. Support data sources that users want                                             | - AWS, GCP and Azure related data services (:white_check_mark: [cloud storage](../setup/advanced.md#cloud-storage))<br>- Deltalake<br>- RabbitMQ<br>- ActiveMQ<br>- MongoDB<br>- Elasticsearch<br>- Snowflake<br>- Databricks<br>- Pulsar                                                                                                                                                                                                                                                                                             |
-| Metadata discovery                     | Allow for schema and data profiling from external metadata sources                                                                                  | - :white_check_mark: [HTTP (OpenAPI spec)](../setup/guide/data-source/http.md)<br>- JMS<br>- Read from samples<br>- :white_check_mark: [OpenLineage metadata (Marquez)](../setup/guide/data-source/marquez-metadata-source.md)<br>- :white_check_mark: [OpenMetadata](../setup/guide/data-source/open-metadata-source.md)<br>- ODCS (Open Data Contract Standard)<br>- Amundsen<br>- Datahub<br>- Solace Event Portal<br>- Airflow<br>- DBT                                                                                           |
-| Developer API                          | Scala/Java interface for developers/testers to create data generation and validation tasks                                                          | - :white_check_mark: [Scala](https://github.com/data-catering/data-caterer-example)<br>- :white_check_mark: [Java](https://github.com/data-catering/data-caterer-example)                                                                                                                                                                                                                                                                                                                                                             |
-| Report generation                      | Generate a report that summarises the data generation or validation results                                                                         | - :white_check_mark: [Report for data generated and validation rules](../sample/report/html/index.html)                                                                                                                                                                                                                                                                                                                                                                                                                               |
-| UI portal                              | Allow users to access a UI to input data generation or validation tasks. Also be able to view report results                                        | - Metadata stored in database<br>- Store data generation/validation run information in file/database                                                                                                                                                                                                                                                                                                                                                                                                                                  |                                  
-| Integration with data validation tools | Derive data validation rules from existing data validation tools                                                                                    | - [Great Expectation](https://greatexpectations.io/)<br>- [DBT constraints](https://docs.getdbt.com/reference/resource-properties/constraints)<br>- [SodaCL](https://docs.soda.io/soda-cl/soda-cl-overview.html)<br>- [MonteCarlo](https://docs.getmontecarlo.com/docs/monitors-as-code)                                                                                                                                                                                                                                              |
-| Data validation rule suggestions       | Based on metadata, generate data validation rules appropriate for the dataset                                                                       | - :white_check_mark: Suggest basic data validations (yet to document)                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
-| Wait conditions before data validation | Define certain conditions to be met before starting data validations                                                                                | - :white_check_mark: [Webhook](../setup/validation.md#webhook)<br>- :white_check_mark: [File exists](../setup/validation.md#file-exists)<br>- :white_check_mark: [Data exists via SQL expression](../setup/validation.md#data-exists)<br>- :white_check_mark: [Pause](../setup/validation.md#pause)                                                                                                                                                                                                                                   |
-| Validation types                       | Ability to define simple/complex data validations                                                                                                   | - :white_check_mark: [Basic validations](../setup/validation/basic-validation.md)<br>- :white_check_mark: [Aggregates](../setup/validation/group-by-validation.md) (sum of amount per account is > 500)<br>- Ordering (transactions are ordered by date)<br>- :white_check_mark: [Relationship](../setup/validation/upstream-data-source-validation.md) (at least one account entry in history table per account in accounts table)<br>- Data profile (how close the generated data profile is compared to the expected data profile) |
-| Data generation record count           | Generate scenarios where there are one to many, many to many situations relating to record count. Also ability to cover all edge cases or scenarios | - Cover all possible cases (i.e. record for each combination of oneOf values, positive/negative values etc.)<br>- Ability to override edge cases                                                                                                                                                                                                                                                                                                                                                                                      |
-| Alerting                               | When tasks have completed, ability to define alerts based on certain conditions                                                                     | - Slack<br>- Email                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
-| Metadata enhancements                  | Based on data profiling or inference, can add to existing metadata                                                                                  | - PII detection (can integrate with [Presidio](https://microsoft.github.io/presidio/analyzer/))<br>- Relationship detection across data sources<br>- SQL generation<br>- Ordering information                                                                                                                                                                                                                                                                                                                                         |
-| Data cleanup                           | Ability to clean up generated data                                                                                                                  | - :white_check_mark: [Clean up generated data](../setup/guide/scenario/delete-generated-data.md)<br>- Clean up data in consumer data sinks<br>- Clean up data from real time sources (i.e. DELETE HTTP endpoint, delete events in JMS)                                                                                                                                                                                                                                                                                                |
-| Trial version                          | Trial version of the full app for users to test out all the features                                                                                | - :white_check_mark: [Trial app to try out all features](../get-started/docker.md#paid-version-trial)                                                                                                                                                                                                                                                                                                                                                                                                                                 |
-| Code generation                        | Based on metadata or existing classes, code for data generation and validation could be generated                                                   | - Code generation<br>- Schema generation from Scala/Java class                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
-| Real time response data validations    | Ability to define data validations based on the response from real time data sources (e.g. HTTP response)                                           | - HTTP response data validation                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+| Feature                                | Description                                                                                                                                         | Sub Tasks                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
+|----------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Data source support                    | Batch or real time data sources that can be added to Data Caterer. Support data sources that users want                                             | - AWS, GCP and Azure related data services (:white_check_mark: [cloud storage](../setup/advanced.md#cloud-storage))<br>- Deltalake<br>- RabbitMQ<br>- ActiveMQ<br>- MongoDB<br>- Elasticsearch<br>- Snowflake<br>- Databricks<br>- Pulsar                                                                                                                                                                                                                                                                                                                                                           |
+| Metadata discovery                     | Allow for schema and data profiling from external metadata sources                                                                                  | - :white_check_mark: [HTTP (OpenAPI spec)](../setup/guide/data-source/http.md)<br>- JMS<br>- Read from samples<br>- :white_check_mark: [OpenLineage metadata (Marquez)](../setup/guide/data-source/marquez-metadata-source.md)<br>- :white_check_mark: [OpenMetadata](../setup/guide/data-source/open-metadata-source.md)<br>- ODCS (Open Data Contract Standard)<br>- Amundsen<br>- Datahub<br>- Solace Event Portal<br>- Airflow<br>- DBT                                                                                                                                                         |
+| Developer API                          | Scala/Java interface for developers/testers to create data generation and validation tasks                                                          | - :white_check_mark: [Scala](https://github.com/data-catering/data-caterer-example)<br>- :white_check_mark: [Java](https://github.com/data-catering/data-caterer-example)                                                                                                                                                                                                                                                                                                                                                                                                                           |
+| Report generation                      | Generate a report that summarises the data generation or validation results                                                                         | - :white_check_mark: [Report for data generated and validation rules](../sample/report/html/index.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+| UI portal                              | Allow users to access a UI to input data generation or validation tasks. Also be able to view report results                                        | - Metadata stored in database<br>- Store data generation/validation run information in file/database                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                                  
+| Integration with data validation tools | Derive data validation rules from existing data validation tools                                                                                    | - [Great Expectations](https://greatexpectations.io/)<br>- [DBT constraints](https://docs.getdbt.com/reference/resource-properties/constraints)<br>- [SodaCL](https://docs.soda.io/soda-cl/soda-cl-overview.html)<br>- [MonteCarlo](https://docs.getmontecarlo.com/docs/monitors-as-code)                                                                                                                                                                                                                                                                                                           |
+| Data validation rule suggestions       | Based on metadata, generate data validation rules appropriate for the dataset                                                                       | - :white_check_mark: Suggest basic data validations (yet to document)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
+| Wait conditions before data validation | Define certain conditions to be met before starting data validations                                                                                | - :white_check_mark: [Webhook](../setup/validation.md#webhook)<br>- :white_check_mark: [File exists](../setup/validation.md#file-exists)<br>- :white_check_mark: [Data exists via SQL expression](../setup/validation.md#data-exists)<br>- :white_check_mark: [Pause](../setup/validation.md#pause)                                                                                                                                                                                                                                                                                                 |
+| Validation types                       | Ability to define simple/complex data validations                                                                                                   | - :white_check_mark: [Basic validations](../setup/validation/basic-validation.md)<br>- :white_check_mark: [Aggregates](../setup/validation/group-by-validation.md) (sum of amount per account is > 500)<br>- Ordering (transactions are ordered by date)<br>- :white_check_mark: [Relationship](../setup/validation/upstream-data-source-validation.md) (at least one account entry in history table per account in accounts table)<br>- Data profile (how close the generated data profile is compared to the expected data profile)<br>- Column name (check column count, column names, ordering) |
+| Data generation record count           | Generate scenarios where there are one to many, many to many situations relating to record count. Also ability to cover all edge cases or scenarios | - Cover all possible cases (i.e. record for each combination of oneOf values, positive/negative values etc.)<br>- Ability to override edge cases                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
+| Alerting                               | When tasks have completed, ability to define alerts based on certain conditions                                                                     | - Slack<br>- Email                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+| Metadata enhancements                  | Based on data profiling or inference, can add to existing metadata                                                                                  | - PII detection (can integrate with [Presidio](https://microsoft.github.io/presidio/analyzer/))<br>- Relationship detection across data sources<br>- SQL generation<br>- Ordering information                                                                                                                                                                                                                                                                                                                                                                                                       |
+| Data cleanup                           | Ability to clean up generated data                                                                                                                  | - :white_check_mark: [Clean up generated data](../setup/guide/scenario/delete-generated-data.md)<br>- Clean up data in consumer data sinks<br>- Clean up data from real time sources (i.e. DELETE HTTP endpoint, delete events in JMS)                                                                                                                                                                                                                                                                                                                                                              |
+| Trial version                          | Trial version of the full app for users to test out all the features                                                                                | - :white_check_mark: [Trial app to try out all features](../get-started/docker.md#paid-version-trial)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
+| Code generation                        | Based on metadata or existing classes, code for data generation and validation could be generated                                                   | - Code generation<br>- Schema generation from Scala/Java class                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+| Real time response data validations    | Ability to define data validations based on the response from real time data sources (e.g. HTTP response)                                           | - HTTP response data validation                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
 
diff --git a/site/search/search_index.json b/site/search/search_index.json
index 6d89a8f8..4de04d62 100644
--- a/site/search/search_index.json
+++ b/site/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Home","text":"Data Caterer is a metadata-driven data generation and  testing tool that aids in creating production-like data across both batch and event data systems. Run data validations  to ensure your systems have ingested it as expected, then clean up the data afterwards. Simplify your data testing Take away the pain and complexity of your data landscape and let Data Caterer handle it <p> Try now </p> Data testing is difficult and fragmented <ul> <li>Data being sent via messages, HTTP requests or files and getting stored in databases, file systems, etc.</li> <li>Maintaining and updating tests with the latest schemas and business definitions</li> <li>Different testing tools for services, jobs or data sources</li> <li>Complex relationships between datasets and fields</li> <li>Different scenarios, permutations, combinations and edge cases to cover</li> </ul> Current solutions only cover half the story <ul> <li>Specific testing frameworks that support one or limited number of data sources or transport protocols</li> <li>Under utilizing metadata from data catalogs or metadata discovery services</li> <li>Testing teams having difficulties understanding when failures occur</li> <li>Integration tests relying on external teams/services</li> <li>Manually generating data, or worse, copying/masking production data into lower environments</li> <li>Observability pushes towards being reactive rather than proactive</li> </ul> <p> Try now </p> What you need is a reliable tool that can handle changes to your data landscape <p> </p> <p>With Data Caterer, you get:</p> <ul> <li>Ability to connect to any type of data source: files, SQL or no-SQL databases, messaging systems, HTTP</li> <li>Discover metadata from your existing infrastructure and services</li> <li>Gain confidence that bugs do 
not propagate to production</li> <li>Be proactive in ensuring changes do not affect other data producers or consumers</li> <li>Configurability to run the way you want</li> </ul> <p> Try now </p>"},{"location":"#tech-summary","title":"Tech Summary","text":"<p>Use the Java, Scala API, or YAML files to help with setup or customisation that are all run via a Docker image. Want to  get into details? Checkout the setup pages here to get code examples and guides that will take you  through scenarios and data sources.</p> <p>Main features include:</p> <ul> <li> Metadata discovery</li> <li> Batch and  event data generation</li> <li> Maintain referential integrity across any dataset</li> <li> Create custom data generation scenarios</li> <li> Clean up generated data</li> <li> Validate data</li> <li> Suggest data validations</li> </ul> <p></p> <p>Check other run configurations here.</p>"},{"location":"#what-is-it","title":"What is it","text":"<ul> <li> <p> Data generation and testing tool</p> <p>Generate production like data to be consumed and validated.</p> </li> <li> <p> Designed for any data source</p> <p>We aim to support pushing data to any data source, in any format.</p> </li> <li> <p> Low/no code solution</p> <p>Can use the tool via either Scala, Java or YAML. Connect to data or metadata sources to generate data and validate.</p> </li> <li> <p> Developer productivity tool</p> <p>If you are a new developer or seasoned veteran, cut down on your feedback loop when developing with data.</p> </li> </ul>"},{"location":"#what-it-is-not","title":"What it is not","text":"<ul> <li> <p> Metadata storage/platform</p> <p>You could store and use metadata within the data generation/validation tasks but is not the recommended approach. 
Rather, this metadata should be gathered from existing services who handle metadata on behalf of Data Caterer.</p> </li> <li> <p> Data contract</p> <p>The focus of Data Caterer is on the data generation and testing, which can include details about how the data looks like and how it behaves. But it does not encompass all the additional metadata that comes with a data contract such as SLAs, security, etc.</p> </li> <li> <p> Metrics from load testing</p> <p>Although millions of records can be generated, there are limited capabilities in terms of metric capturing.</p> </li> </ul> <p> Try now </p> Data Catering vs Other tools vs In-house <p> Data Catering Other tools In-house Data flow Batch and events generation with validation Batch generation only or validation only Depends on architecture and design Time to results 1 day 1+ month to integrate, deploy and onboard 1+ month to build and deploy Solution Connect with your existing data ecosystem, automatic generation and validation Manual UI data entry or via SDK Depends on engineer(s) building it <p></p>"},{"location":"about/","title":"About","text":"<p>Hi, my name is Peter. I am a independent Software Developer, mainly focussing on data related services. My experience can be found on my LinkedIn.</p> <p>I have created Data Caterer to help serve individuals and companies with data generation and data testing. It is a complex area that has many edge cases or intricacies that are hard to summarise or turn into something actionable and repeatable. Through the use of metadata, Data Caterer can help simplify your data testing, simulating production environment data, aid in data debugging, or whatever your data use case may be.</p> <p>Given that it is going to save you and your team time and money, please help in considering financial support. 
This will help the product grow into a sustainable and feature-full service.</p>"},{"location":"about/#contact","title":"Contact","text":"<p>Please contact Peter Flook via Slack or via email <code>peter.flook@data.catering</code> if you have any questions or queries.</p>"},{"location":"about/#terms-of-service","title":"Terms of service","text":"<p>Terms of service can be found here.</p>"},{"location":"about/#privacy-policy","title":"Privacy policy","text":"<p>Privacy policy can be found here.</p>"},{"location":"sponsor/","title":"Sponsor","text":"<p>To have access to all the features of Data Caterer, you can subscribe according to your situation. You will not be charged by usage. As you continue to subscribe, you will have access to the latest version of Data Caterer as new bug fixes and features get published.</p> <p>This has been a passion project of mine where I have spent countless hours thinking of the idea, implementing,  maintaining, documenting and updating it. I hope that it will help with developers and companies with their testing  by saving time and effort, allowing you to focus on what is important. If you fall under this boat, please consider sponsorship to allow me to further maintain and upgrade the solution. 
Any contributions are much appreciated.</p> <p>This is inspired by the mkdocs-material project that follows the same model.</p>"},{"location":"sponsor/#features","title":"Features","text":"<ul> <li> Metadata discovery</li> <li> All data sources (see here for all data sources)</li> <li> Batch and  Event generation</li> <li> Auto generation from data connections or metadata sources</li> <li> Suggest data validations</li> <li> Clean up generated data</li> <li> Run as many times as you want, not charged by usage</li> </ul>"},{"location":"sponsor/#tiers","title":"Tiers","text":""},{"location":"sponsor/#manage-subscription","title":"Manage Subscription","text":"<p>Manage via this link</p>"},{"location":"sponsor/#contact","title":"Contact","text":"<p>Please contact Peter Flook via Slack or via email <code>peter.flook@data.catering</code> if you have any questions or queries.</p>"},{"location":"use-case/","title":"Use cases","text":""},{"location":"use-case/#replicate-production-in-lower-environment","title":"Replicate production in lower environment","text":"<p>Having a stable and reliable test environment is a challenge for a number of companies, especially where teams are asynchronously deploying and testing changes at faster rates. Data Caterer can help alleviate these issues by doing the following:</p> <ol> <li>Generates data with the latest schema changes and production like field values</li> <li>Run as a job on a daily/regular basis to replicate production traffic or data flows</li> <li>Validate data to ensure your system runs as expected</li> <li>Clean up data to avoid build up of generated data</li> </ol> <p></p>"},{"location":"use-case/#local-development","title":"Local development","text":"<p>Similar to the above, being able to replicate production like data in your local environment can be key to developing more reliable code as you can test directly against data in your local computer. 
This has a number of benefits including:</p> <ol> <li>Fewer assumptions or ambiguities when the developer codes</li> <li>Direct feedback loop in local computer rather than waiting for test environment for more reliable test data</li> <li>No domain expertise required to understand the data</li> <li>Easy for new developers to be onboarded and developing/testing code for jobs/services</li> </ol>"},{"location":"use-case/#systemintegration-testing","title":"System/integration testing","text":"<p>When working with third-party, external or internal data providers, it can be difficult to have all setup ready to produce reliable data that abides by relationship contracts between each of the systems. You have to rely on these data providers in order for you to run your tests which may not align to their priorities. With Data Caterer, you can generate the same data that they would produce, along with maintaining referential integrity across the data providers, so that you can run your tests without relying on their systems being up and reliable in their corresponding lower environments.</p>"},{"location":"use-case/#scenario-testing","title":"Scenario testing","text":"<p>If you want to set up particular data scenarios, you can customise the generated data to fit your scenario. Once the data gets generated and is consumed, you can also run validations to ensure your system has consumed the data correctly. These scenarios can be put together from existing tasks or data sources can be enabled/disabled based on your requirement. Built into Data Caterer and controlled via feature flags, is the ability to test edge cases based on the data type of the fields used for data generation (<code>enableEdgeCases</code> flag within <code>&lt;field&gt;.generator.options</code>, see more here).</p>"},{"location":"use-case/#data-debugging","title":"Data debugging","text":"<p>When data related issues occur in production, it may be difficult to replicate in a lower or local environment. 
It could be related to specific fields not containing expected results, size of data is too large or missing corresponding referenced data. This becomes key to resolving the issue as you can directly code against the exact data scenario and have confidence that your code changes will fix the problem. Data Caterer can be used to generate the appropriate data in whichever environment you want to test your changes against.</p>"},{"location":"use-case/#data-profiling","title":"Data profiling","text":"<p>When using Data Caterer with the feature flag <code>enableGeneratePlanAndTasks</code> enabled (see here), metadata relating all the fields defined in the data sources you have configured will be generated via data profiling. You can run this as a standalone job (can disable <code>enableGenerateData</code>)  so that you can focus on the profile of the data you are utilising. This can be run against your production data sources  to ensure the metadata can be used to accurately generate data in other environments. This is a key feature of Data  Caterer as no direct production connections need to be maintained to generate data in other environments (which can  lead to serious concerns about data security as seen here).</p>"},{"location":"use-case/#schema-gathering","title":"Schema gathering","text":"<p>When using Data Caterer with the feature flag <code>enableGeneratePlanAndTasks</code> enabled (see here), all schemas of the data sources defined will be tracked in a common format (as tasks). 
This data, along with the data profiling metadata, could then feed back into your schema registries to help keep them up to date with your system.</p>"},{"location":"get-started/docker/","title":"Run Data Caterer","text":""},{"location":"get-started/docker/#quick-start","title":"Quick start","text":"<p>Ensure you have <code>docker</code> installed and running.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\ncd data-caterer-example &amp;&amp; ./run.sh\n#check results under docker/sample/report/index.html folder\n</code></pre>"},{"location":"get-started/docker/#report","title":"Report","text":"<p>Check the report generated under <code>docker/data/custom/report/index.html</code>.</p> <p>Sample report can also be seen here</p>"},{"location":"get-started/docker/#paid-version-trial","title":"Paid Version Trial","text":"<p>30 day trial of the paid version can be accessed via these steps:</p> <ol> <li>Join the Slack Data Catering Slack group here</li> <li>Get an API_KEY by using slash command <code>/token</code> in the Slack group (will only be visible to you)</li> <li> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\ncd data-caterer-example &amp;&amp; export DATA_CATERING_API_KEY=&lt;insert api key&gt;\n./run.sh\n</code></pre> </li> </ol> <p>If you want to check how long your trial has left, you can check back in the Slack group or type <code>/token</code> again.</p>"},{"location":"get-started/docker/#guided-tour","title":"Guided tour","text":"<p>Check out the starter guide here that will take your through step by step. 
You can also check the other guides here to see the other possibilities of what Data Caterer can achieve for you.</p>"},{"location":"legal/privacy-policy/","title":"Privacy Policy","text":"<p>Last updated September 25, 2023</p>"},{"location":"legal/privacy-policy/#data-caterer-policy-on-privacy-of-customer-personal-information","title":"Data Caterer Policy on Privacy of Customer Personal Information","text":"<p>Peter John Flook is committed to protecting the privacy and security of your personal information obtained by reason of your use of Data Caterer. This policy explains the types of customer personal information we collect, how it is used, and the steps we take to ensure your personal information is handled appropriately.</p>"},{"location":"legal/privacy-policy/#who-is-peter-john-flook","title":"Who is Peter John Flook?","text":"<p>For purposes of this Privacy Policy, \u201cPeter John Flook\u201d means Peter John Flook, the company developing and providing Data Caterer and related websites and services.</p>"},{"location":"legal/privacy-policy/#what-is-personal-information","title":"What is personal information?","text":"<p>Personal information is information that refers to an individual specifically and is recorded in any form. Personal information includes such things as age, income, date of birth, ethnic origin and credit records. Information about individuals contained in the following documents is not considered personal information:</p> <ul> <li>public telephone directories, where the subscriber can refuse to be listed</li> <li>professional and business directories available to the public</li> <li>public registries and court records</li> <li>other publicly available printed and electronic publications</li> </ul>"},{"location":"legal/privacy-policy/#we-are-accountable-to-you","title":"We are accountable to you","text":"<p>Peter John Flook is responsible for all personal information under its control. 
Our team is accountable for compliance with these privacy and security principles.</p>"},{"location":"legal/privacy-policy/#we-let-you-know-why-we-collect-and-use-your-personal-information-and-get-your-consent","title":"We let you know why we collect and use your personal information and get your consent","text":"<p>Peter John Flook identifies the purpose for which your personal information is collected and will be used or disclosed. If that purpose is not listed below we will do this before or at the time the information is actually being collected. You will be deemed to consent to our use of your personal information for the purpose of:</p> <ul> <li>communicating with you generally</li> <li>processing your purchases</li> <li>processing and keeping track of transactions and reporting back to you</li> <li>protecting against fraud or error</li> <li>providing product and services requested by you</li> <li>recommending products and services that Peter John Flook believes will be of interest and provide value to you</li> <li>fulfilling any other purpose that would be reasonably apparent to the average person at the time we collect it from   you</li> </ul> <p>Otherwise, Peter John Flook will obtain your express consent (by verbal, written or electronic agreement) to collect, use or disclose your personal information. You can change your consent preferences at any time by contacting Peter John Flook (please refer to the \u201cHow to contact us\u201d section below).</p>"},{"location":"legal/privacy-policy/#we-limit-collection-of-your-personal-information","title":"We limit collection of your personal information","text":"<p>Peter John Flook collects only the information required to provide products and services to you. Peter John Flook will collect personal information only by clear, fair and lawful means.</p> <p>We receive and store any information you enter on our website or give us in any other way. 
You can choose not to provide certain information, but then you might not be able to take advantage of many of our features.</p> <p>Peter John Flook does not receive or store personal content saved to your local device while using Data Caterer.</p> <p>We also receive and store certain types of information whenever you interact with us.</p>"},{"location":"legal/privacy-policy/#information-provided-to-stripe","title":"Information provided to Stripe","text":"<p>All purchases that are made through this site are processed securely and externally by Stripe. Unless you expressly consent otherwise, we do not see or have access to any personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address).</p>"},{"location":"legal/privacy-policy/#we-limit-disclosure-and-retention-of-your-personal-information","title":"We limit disclosure and retention of your personal information","text":"<p>Peter John Flook does not disclose personal information to any organization or person for any reason except the following:</p> <p>We employ other companies and individuals to perform functions on our behalf. Examples include fulfilling orders, delivering packages, sending postal mail and e-mail, removing repetitive information from customer lists, analyzing data, providing marketing assistance, processing credit card payments, and providing customer service. They have access to personal information needed to perform their functions, but may not use it for other purposes. We may use service providers located outside of Australia, and, if applicable, your personal information may be processed and stored in other countries and therefore may be subject to disclosure under the laws of those countries. As we continue to develop our business, we might sell or buy stores, subsidiaries, or business units. 
In such transactions, customer information generally is one of the transferred business assets but remains subject to the promises made in any pre-existing Privacy Notice (unless, of course, the customer consents otherwise). Also, in the unlikely event that Peter John Flook or substantially all of its assets are acquired, customer information of course will be one of the transferred assets. You are deemed to consent to disclosure of your personal information for those purposes. If your personal information is shared with third parties, those third parties are bound by appropriate agreements with Peter John Flook to secure and protect the confidentiality of your personal information.</p> <p>Peter John Flook retains your personal information only as long as it is required for our business relationship or as required by federal and provincial laws.</p>"},{"location":"legal/privacy-policy/#we-keep-your-personal-information-up-to-date-and-accurate","title":"We keep your personal information up to date and accurate","text":"<p>Peter John Flook keeps your personal information up to date, accurate and relevant for its intended use.</p> <p>You may request access to the personal information we have on record in order to review and amend the information, as appropriate. In circumstances where your personal information has been provided by a third party, we will refer you to that party (e.g. credit bureaus). 
To access your personal information, refer to the \u201cHow to contact us\u201d section below.</p>"},{"location":"legal/privacy-policy/#the-security-of-your-personal-information-is-a-priority-for-peter-john-flook","title":"The security of your personal information is a priority for Peter John Flook","text":"<p>We take steps to safeguard your personal information, regardless of the format in which it is held, including:</p> <p>physical security measures such as restricted access facilities and locked filing cabinets electronic security measures for computerized personal information such as password protection, database encryption and personal identification numbers. We work to protect the security of your information during transmission by using \u201cTransport Layer Security\u201d (TLS) protocol. organizational processes such as limiting access to your personal information to a selected group of individuals contractual obligations with third parties who need access to your personal information requiring them to protect and secure your personal information It\u2019s important for you to protect against unauthorized access to your password and your computer. Be sure to sign off when you\u2019ve finished using any shared computer.</p>"},{"location":"legal/privacy-policy/#what-about-third-party-advertisers-and-links-to-other-websites","title":"What About Third-Party Advertisers and Links to Other Websites?","text":"<p>Our site may include third-party advertising and links to other websites. We do not provide any personally identifiable customer information to these advertisers or third-party websites.</p> <p>These third-party websites and advertisers, or Internet advertising companies working on their behalf, sometimes use technology to send (or \u201cserve\u201d) the advertisements that appear on our website directly to your browser. They automatically receive your IP address when this happens. 
They may also use cookies, JavaScript, web beacons (also known as action tags or single-pixel gifs), and other technologies to measure the effectiveness of their ads and to personalize advertising content. We do not have access to or control over cookies or other features that they may use, and the information practices of these advertisers and third-party websites are not covered by this Privacy Notice. Please contact them directly for more information about their privacy practices. In addition, the Network Advertising Initiative offers useful information about Internet advertising companies (also called \u201cad networks\u201d or \u201cnetwork advertisers\u201d), including information about how to opt-out of their information collection. You can access the Network Advertising Initiative at http://www.networkadvertising.org.</p>"},{"location":"legal/privacy-policy/#redirection-to-stripe","title":"Redirection to Stripe","text":"<p>In particular, when you submit an order to us, you may be automatically redirected to Stripe in order to complete the required payment. The payment page that is provided by Stripe is not part of this site. As noted above, we are not privy to any of the bank account, credit card or other personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address). We recommend that you refer to Stripe\u2019s privacy statement if you would like more information about how Stripe collects and handles your personal information.</p>"},{"location":"legal/privacy-policy/#we-are-open-about-our-privacy-and-security-policy","title":"We are open about our privacy and security policy","text":"<p>We are committed to providing you with understandable and easily available information about our policy and practices related to management of your personal information. 
This policy and any related information is available at all times on our website, https://data.catering/about/ under Privacy or on request. To contact us, refer to the \u201cHow to contact us\u201d section below.</p>"},{"location":"legal/privacy-policy/#we-provide-access-to-your-personal-information-stored-by-peter-john-flook","title":"We provide access to your personal information stored by Peter John Flook","text":"<p>You can request access to your personal information stored by Peter John Flook. To contact us, refer to the \u201cHow to contact us\u201d section below. Upon receiving such a request, Peter John Flook will:</p> <p>inform you about what type of personal information we have on record or in our control, how it is used and to whom it may have been disclosed provide you with access to your information so you can review and verify the accuracy and completeness and request changes to the information make any necessary updates to your personal information We respond to your questions, concerns and complaints about privacy Peter John Flook responds in a timely manner to your questions, concerns and complaints about the privacy of your personal information and our privacy policies and procedures.</p>"},{"location":"legal/privacy-policy/#how-to-contact-us","title":"How to contact us","text":"<ul> <li>by email at <code>peter.flook@data.catering</code></li> </ul> <p>Our business changes constantly, and this privacy notice will change also. We may e-mail periodic reminders of our notices and conditions, unless you have instructed us not to, but you should check our website frequently to see recent changes. 
We are, however, committed to protecting your information and will never materially change our policies and practices to make them less protective of customer information collected in the past without the consent of affected customers.</p>"},{"location":"legal/terms-of-service/","title":"Terms and Conditions","text":"<p>Last updated: September 25, 2023</p> <p>Please read these terms and conditions carefully before using Our Service.</p>"},{"location":"legal/terms-of-service/#interpretation-and-definitions","title":"Interpretation and Definitions","text":""},{"location":"legal/terms-of-service/#interpretation","title":"Interpretation","text":"<p>The words of which the initial letter is capitalized have meanings defined under the following conditions. The following definitions shall have the same meaning regardless of whether they appear in singular or in plural.</p>"},{"location":"legal/terms-of-service/#definitions","title":"Definitions","text":"<p>For the purposes of these Terms and Conditions:</p> <ul> <li>Application means the software program provided by the Company downloaded by You on any electronic device, named   Data Caterer</li> <li>Application Store means the digital distribution service operated and developed by Docker Inc. 
(\u201cDocker\u201d) in which   the Application has been downloaded.</li> <li>Affiliate means an entity that controls, is controlled by or is under common control with a party, where \"control\"   means ownership of 50% or more of the shares, equity interest or other securities entitled to vote for election of   directors or other managing authority.</li> <li>Country refers to: New South Wales, Australia</li> <li>Company (referred to as either \"the Company\", \"We\", \"Us\" or \"Our\" in this Agreement) refers to Peter John Flook (   ABN: 65153160916), 30 Anne William Drive, West Pennant Hills, 2125, NSW, Australia.</li> <li>Device means any device that can access the Service such as a computer, a cellphone or a digital tablet.</li> <li>Service refers to the Application.</li> <li>Terms and Conditions (also referred as \"Terms\") mean these Terms and Conditions that form the entire agreement   between You and the Company regarding the use of the Service.</li> <li>Third-party Social Media Service means any services or content (including data, information, products or services)   provided by a third party that may be displayed, included or made available by the Service.</li> <li>You means the individual accessing or using the Service, or the company, or other legal entity on behalf of which   such individual is accessing or using the Service, as applicable.</li> </ul>"},{"location":"legal/terms-of-service/#acknowledgment","title":"Acknowledgment","text":"<p>These are the Terms and Conditions governing the use of this Service and the agreement that operates between You and the Company. These Terms and Conditions set out the rights and obligations of all users regarding the use of the Service.</p> <p>Your access to and use of the Service is conditioned on Your acceptance of and compliance with these Terms and Conditions. 
These Terms and Conditions apply to all visitors, users and others who access or use the Service.</p> <p>By accessing or using the Service You agree to be bound by these Terms and Conditions. If You disagree with any part of these Terms and Conditions then You may not access the Service.</p> <p>You represent that you are over the age of 18. The Company does not permit those under 18 to use the Service.</p> <p>Your access to and use of the Service is also conditioned on Your acceptance of and compliance with the Privacy Policy of the Company. Our Privacy Policy describes Our policies and procedures on the collection, use and disclosure of Your personal information when You use the Application or the Website and tells You about Your privacy rights and how the law protects You. Please read Our Privacy Policy carefully before using Our Service.</p>"},{"location":"legal/terms-of-service/#links-to-other-websites","title":"Links to Other Websites","text":"<p>Our Service may contain links to third-party websites or services that are not owned or controlled by the Company.</p> <p>The Company has no control over, and assumes no responsibility for, the content, privacy policies, or practices of any third party websites or services. 
You further acknowledge and agree that the Company shall not be responsible or liable, directly or indirectly, for any damage or loss caused or alleged to be caused by or in connection with the use of or reliance on any such content, goods or services available on or through any such websites or services.</p> <p>We strongly advise You to read the terms and conditions and privacy policies of any third-party websites or services that You visit.</p>"},{"location":"legal/terms-of-service/#termination","title":"Termination","text":"<p>We may terminate or suspend Your access immediately, without prior notice or liability, for any reason whatsoever, including without limitation if You breach these Terms and Conditions.</p> <p>Upon termination, Your right to use the Service will cease immediately.</p>"},{"location":"legal/terms-of-service/#limitation-of-liability","title":"Limitation of Liability","text":"<p>Notwithstanding any damages that You might incur, the entire liability of the Company and any of its suppliers under any provision of these Terms and Your exclusive remedy for all the foregoing shall be limited to the amount actually paid by You through the Service or 100 USD if You haven't purchased anything through the Service.</p> <p>To the maximum extent permitted by applicable law, in no event shall the Company or its suppliers be liable for any special, incidental, indirect, or consequential damages whatsoever (including, but not limited to, damages for loss of profits, loss of data or other information, for business interruption, for personal injury, loss of privacy arising out of or in any way related to the use of or inability to use the Service, third-party software and/or third-party hardware used with the Service, or otherwise in connection with any provision of these Terms), even if the Company or any supplier has been advised of the possibility of such damages and even if the remedy fails of its essential purpose.</p> <p>Some states do not allow the 
exclusion of implied warranties or limitation of liability for incidental or consequential damages, which means that some of the above limitations may not apply. In these states, each party's liability will be limited to the greatest extent permitted by law.</p>"},{"location":"legal/terms-of-service/#as-is-and-as-available-disclaimer","title":"\"AS IS\" and \"AS AVAILABLE\" Disclaimer","text":"<p>The Service is provided to You \"AS IS\" and \"AS AVAILABLE\" and with all faults and defects without warranty of any kind. To the maximum extent permitted under applicable law, the Company, on its own behalf and on behalf of its Affiliates and its and their respective licensors and service providers, expressly disclaims all warranties, whether express, implied, statutory or otherwise, with respect to the Service, including all implied warranties of merchantability, fitness for a particular purpose, title and non-infringement, and warranties that may arise out of course of dealing, course of performance, usage or trade practice. 
Without limitation to the foregoing, the Company provides no warranty or undertaking, and makes no representation of any kind that the Service will meet Your requirements, achieve any intended results, be compatible or work with any other software, applications, systems or services, operate without interruption, meet any performance or reliability standards or be error free or that any errors or defects can or will be corrected.</p> <p>Without limiting the foregoing, neither the Company nor any of the company's provider makes any representation or warranty of any kind, express or implied: (i) as to the operation or availability of the Service, or the information, content, and materials or products included thereon; (ii) that the Service will be uninterrupted or error-free; (iii) as to the accuracy, reliability, or currency of any information or content provided through the Service; or (iv) that the Service, its servers, the content, or e-mails sent from or on behalf of the Company are free of viruses, scripts, trojan horses, worms, malware, time-bombs or other harmful components.</p> <p>Some jurisdictions do not allow the exclusion of certain types of warranties or limitations on applicable statutory rights of a consumer, so some or all of the above exclusions and limitations may not apply to You. But in such a case the exclusions and limitations set forth in this section shall be applied to the greatest extent enforceable under applicable law.</p>"},{"location":"legal/terms-of-service/#governing-law","title":"Governing Law","text":"<p>The laws of the Country, excluding its conflicts of law rules, shall govern this Terms and Your use of the Service. 
Your use of the Application may also be subject to other local, state, national, or international laws.</p>"},{"location":"legal/terms-of-service/#disputes-resolution","title":"Disputes Resolution","text":"<p>If You have any concern or dispute about the Service, You agree to first try to resolve the dispute informally by contacting the Company.</p>"},{"location":"legal/terms-of-service/#for-european-union-eu-users","title":"For European Union (EU) Users","text":"<p>If You are a European Union consumer, you will benefit from any mandatory provisions of the law of the country in which you are resident in.</p>"},{"location":"legal/terms-of-service/#united-states-legal-compliance","title":"United States Legal Compliance","text":"<p>You represent and warrant that (i) You are not located in a country that is subject to the United States government embargo, or that has been designated by the United States government as a \"terrorist supporting\" country, and (ii) You are not listed on any United States government list of prohibited or restricted parties.</p>"},{"location":"legal/terms-of-service/#severability-and-waiver","title":"Severability and Waiver","text":""},{"location":"legal/terms-of-service/#severability","title":"Severability","text":"<p>If any provision of these Terms is held to be unenforceable or invalid, such provision will be changed and interpreted to accomplish the objectives of such provision to the greatest extent possible under applicable law and the remaining provisions will continue in full force and effect.</p>"},{"location":"legal/terms-of-service/#waiver","title":"Waiver","text":"<p>Except as provided herein, the failure to exercise a right or to require performance of an obligation under these Terms shall not affect a party's ability to exercise such right or require such performance at any time thereafter nor shall the waiver of a breach constitute a waiver of any subsequent 
breach.</p>"},{"location":"legal/terms-of-service/#translation-interpretation","title":"Translation Interpretation","text":"<p>These Terms and Conditions may have been translated if We have made them available to You on our Service. You agree that the original English text shall prevail in the case of a dispute.</p>"},{"location":"legal/terms-of-service/#changes-to-these-terms-and-conditions","title":"Changes to These Terms and Conditions","text":"<p>We reserve the right, at Our sole discretion, to modify or replace these Terms at any time. If a revision is material We will make reasonable efforts to provide at least 30 days' notice prior to any new terms taking effect. What constitutes a material change will be determined at Our sole discretion.</p> <p>By continuing to access or use Our Service after those revisions become effective, You agree to be bound by the revised terms. If You do not agree to the new terms, in whole or in part, please stop using the website and the Service.</p>"},{"location":"legal/terms-of-service/#contact-us","title":"Contact Us","text":"<p>If you have any questions about these Terms and Conditions, You can contact us:</p> <ul> <li>By email: peter.flook@data.catering</li> </ul>"},{"location":"setup/","title":"Setup","text":"<p>All the configurations and customisation related to Data Caterer can be found under here.</p>"},{"location":"setup/#guide","title":"Guide","text":"<p>If you want a guided tour of using the Java or Scala API, you can follow one of the guides found here.</p>"},{"location":"setup/#specific-configuration","title":"Specific Configuration","text":"<ul> <li> Configurations - Configurations relating to feature flags, folder pathways, metadata   analysis</li> <li> Connections - Explore the data source connections available</li> <li> Generators - Choose and configure the type of generator you want used for   fields</li> <li> Validations - How to validate data to ensure your system is performing as expected</li> <li> Foreign 
Keys - Define links between data elements across data sources</li> <li> Deployment - Deploy Data Caterer as a job to your chosen environment</li> <li> Advanced - Advanced usage of Data Caterer</li> </ul>"},{"location":"setup/#high-level-run-configurations","title":"High Level Run Configurations","text":""},{"location":"setup/advanced/","title":"Advanced use cases","text":""},{"location":"setup/advanced/#special-data-formats","title":"Special data formats","text":"<p>Several options are available when data has to follow a certain format.</p> <ol> <li>Create expression datafaker<ol> <li>Can be used to create names, addresses, or anything that can be found    under here</li> </ol> </li> <li>Create regex</li> </ol>"},{"location":"setup/advanced/#foreign-keys-across-data-sets","title":"Foreign keys across data sets","text":"<p>Details for how you can configure foreign keys can be found here.</p>"},{"location":"setup/advanced/#edge-cases","title":"Edge cases","text":"<p>For each given data type, there are edge cases which can cause issues when your application processes the data. This can be controlled at a column level by including the following flags in the generator options:</p> JavaScalaYAML <pre><code>field()\n.name(\"amount\")\n.type(DoubleType.instance())\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n</code></pre> <pre><code>field\n.name(\"amount\")\n.`type`(DoubleType)\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n</code></pre> <pre><code>fields:\n- name: \"amount\"\ntype: \"double\"\ngenerator:\ntype: \"random\"\noptions:\nenableEdgeCases: \"true\"\nedgeCaseProb: 0.1\n</code></pre> <p>If you want to know all the possible edge cases for each data type, you can check the documentation here.</p>"},{"location":"setup/advanced/#scenario-testing","title":"Scenario testing","text":"<p>You can create specific scenarios by adjusting the metadata found in the plan and tasks to your liking. 
For example, suppose you have two data sources, a Postgres database and a parquet file, and you want to save account data into Postgres and transactions related to those accounts into a parquet file. You can alter the <code>status</code> column in the account data to only generate <code>open</code> accounts and define a foreign key between Postgres and parquet to ensure the same <code>account_id</code> is being used. Then in the parquet task, define 1 to 10 transactions per <code>account_id</code> to be generated.</p> <p>Postgres account generation example task Parquet transaction generation example task Plan</p>"},{"location":"setup/advanced/#cloud-storage","title":"Cloud storage","text":""},{"location":"setup/advanced/#data-source","title":"Data source","text":"<p>If you want to save the file types CSV, JSON, Parquet or ORC into cloud storage, you can do so by adding extra configurations. Below is an example for S3.</p> JavaScalaYAML <pre><code>var csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield().name(\"account_id\"),\n...\n);\n\nvar s3Configuration = configuration()\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n\nexecute(s3Configuration, csvTask);\n</code></pre> <pre><code>val csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield.name(\"account_id\"),\n...\n)\n\nval s3Configuration = 
configuration\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -&gt; \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -&gt; \"true\",\n\"spark.hadoop.fs.defaultFS\" -&gt; \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -&gt; \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -&gt; \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -&gt; \"secret_key\"\n))\n\nexecute(s3Configuration, csvTask)\n</code></pre> <pre><code>folders {\ngeneratedPlanAndTaskFolderPath = \"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig {\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n</code></pre>"},{"location":"setup/advanced/#storing-plantasks","title":"Storing plan/task(s)","text":"<p>You can generate and store the plan/task files inside either AWS S3, Azure Blob Storage or Google GCS. 
This can be controlled via configuration set in the <code>application.conf</code> file where you can set something like the below:</p> JavaScalaYAML <pre><code>configuration()\n.generatedReportsFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n</code></pre> <pre><code>configuration\n.generatedReportsFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -&gt; \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -&gt; \"true\",\n\"spark.hadoop.fs.defaultFS\" -&gt; \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -&gt; \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -&gt; \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -&gt; \"secret_key\"\n))\n</code></pre> <pre><code>folders {\ngeneratedPlanAndTaskFolderPath = 
\"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig {\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n</code></pre>"},{"location":"setup/configuration/","title":"Configuration","text":"<p>A number of configurations can be made and customised within Data Caterer to help control what gets run and/or where any metadata gets saved.</p> <p>These configurations are defined from within your Java or Scala class via <code>configuration</code> or for YAML file setup, <code>application.conf</code> file as seen  here.</p>"},{"location":"setup/configuration/#flags","title":"Flags","text":"<p>Flags are used to control which processes are executed when you run Data Caterer.</p> Config Default Paid Description <code>enableGenerateData</code> true N Enable/disable data generation <code>enableCount</code> true N Count the number of records generated. Can be disabled to improve performance <code>enableFailOnError</code> true N Whilst saving generated data, if there is an error, it will stop any further data from being generated <code>enableSaveReports</code> true N Enable/disable HTML reports summarising data generated, metadata of data generated (if <code>enableSinkMetadata</code> is enabled) and validation results (if <code>enableValidation</code> is enabled). 
Sample here <code>enableSinkMetadata</code> true N Run data profiling for the generated data. Shown in HTML reports if <code>enableSaveSinkMetadata</code> is enabled <code>enableValidation</code> false N Run validations as described in plan. Results can be viewed from logs or from HTML report if <code>enableSaveSinkMetadata</code> is enabled. Sample here <code>enableGeneratePlanAndTasks</code> false Y Enable/disable plan and task auto generation based off data source connections <code>enableRecordTracking</code> false Y Enable/disable which data records have been generated for any data source <code>enableDeleteGeneratedRecords</code> false Y Delete all generated records based off record tracking (if <code>enableRecordTracking</code> has been set to true) <code>enableGenerateValidations</code> false Y If enabled, it will generate validations based on the data sources defined. JavaScalaapplication.conf <pre><code>configuration()\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false);\n</code></pre> <pre><code>configuration\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false)\n</code></pre> <pre><code>flags {\n  enableCount = false\n  enableCount = ${?ENABLE_COUNT}\n  enableGenerateData = true\n  enableGenerateData = ${?ENABLE_GENERATE_DATA}\n  enableFailOnError = true\n  enableFailOnError = ${?ENABLE_FAIL_ON_ERROR}\n  enableGeneratePlanAndTasks = false\n  enableGeneratePlanAndTasks = ${?ENABLE_GENERATE_PLAN_AND_TASKS}\n  enableRecordTracking = false\n  enableRecordTracking = 
${?ENABLE_RECORD_TRACKING}\n  enableDeleteGeneratedRecords = false\n  enableDeleteGeneratedRecords = ${?ENABLE_DELETE_GENERATED_RECORDS}\n  enableGenerateValidations = false\n  enableGenerateValidations = ${?ENABLE_GENERATE_VALIDATIONS}\n}\n</code></pre>"},{"location":"setup/configuration/#folders","title":"Folders","text":"<p>Depending on which flags are enabled, there are folders that get used to save metadata, store HTML reports or track the records generated.</p> <p>These folder pathways can be defined as a cloud storage pathway (i.e. <code>s3a://my-bucket/task</code>).</p> Config Default Paid Description <code>planFilePath</code> /opt/app/plan/customer-create-plan.yaml N Plan file path to use when generating and/or validating data <code>taskFolderPath</code> /opt/app/task N Task folder path that contains all the task files (can have nested directories) <code>validationFolderPath</code> /opt/app/validation N Validation folder path that contains all the validation files (can have nested directories) <code>generatedReportsFolderPath</code> /opt/app/report N Where HTML reports get generated that contain information about data generated along with any validations performed <code>generatedPlanAndTaskFolderPath</code> /tmp Y Folder path where generated plan and task files will be saved <code>recordTrackingFolderPath</code> /opt/app/record-tracking Y Where record tracking parquet files get saved JavaScalaapplication.conf <pre><code>configuration()\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\");\n</code></pre> 
<pre><code>configuration\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\")\n</code></pre> <pre><code>folders {\n  planFilePath = \"/opt/app/custom/plan/postgres-plan.yaml\"\n  planFilePath = ${?PLAN_FILE_PATH}\n  taskFolderPath = \"/opt/app/custom/task\"\n  taskFolderPath = ${?TASK_FOLDER_PATH}\n  validationFolderPath = \"/opt/app/custom/validation\"\n  validationFolderPath = ${?VALIDATION_FOLDER_PATH}\n  generatedReportsFolderPath = \"/opt/app/custom/report\"\n  generatedReportsFolderPath = ${?GENERATED_REPORTS_FOLDER_PATH}\n  generatedPlanAndTaskFolderPath = \"/opt/app/custom/generated\"\n  generatedPlanAndTaskFolderPath = ${?GENERATED_PLAN_AND_TASK_FOLDER_PATH}\n  recordTrackingFolderPath = \"/opt/app/custom/record-tracking\"\n  recordTrackingFolderPath = ${?RECORD_TRACKING_FOLDER_PATH}\n}\n</code></pre>"},{"location":"setup/configuration/#metadata","title":"Metadata","text":"<p>When metadata gets generated, there are some configurations that can be altered to help with performance or accuracy related issues. Metadata gets generated from two processes: 1) if <code>enableGeneratePlanAndTasks</code> or 2) if <code>enableSinkMetadata</code> are enabled.</p> <p>During the generation of plan and tasks, data profiling is used to create the metadata for each of the fields defined in the data source. You may face issues if the number of records in the data source is large as data profiling is an expensive task. 
Similarly, it can be expensive when analysing the generated data if the number of records generated is large.</p> Config Default Paid Description <code>numRecordsFromDataSource</code> 10000 Y Number of records read in from the data source that could be used for data profiling <code>numRecordsForAnalysis</code> 10000 Y Number of records used for data profiling from the records gathered in <code>numRecordsFromDataSource</code> <code>oneOfMinCount</code> 1000 Y Minimum number of records required before considering if a field can be of type <code>oneOf</code> <code>oneOfDistinctCountVsCountThreshold</code> 0.2 Y Threshold ratio to determine if a field is of type <code>oneOf</code> (e.g. a field called <code>status</code> that only contains <code>open</code> or <code>closed</code>. Distinct count = 2, total count = 10, ratio = 2 / 10 = 0.2 therefore marked as <code>oneOf</code>) <code>numGeneratedSamples</code> 10 N Number of sample records from generated data to take. Shown in HTML report JavaScalaapplication.conf <pre><code>configuration()\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(0.2)\n.numGeneratedSamples(10);\n</code></pre> <pre><code>configuration\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(0.2)\n.numGeneratedSamples(10)\n</code></pre> <pre><code>metadata {\n  numRecordsFromDataSource = 10000\n  numRecordsForAnalysis = 10000\n  oneOfMinCount = 1000\n  oneOfDistinctCountVsCountThreshold = 0.2\n  numGeneratedSamples = 10\n}\n</code></pre>"},{"location":"setup/configuration/#generation","title":"Generation","text":"<p>When generating data, you may have some limitations such as limited CPU or memory, large number of data sources, or data sources prone to failure under load. 
To help alleviate these issues or speed up performance, you can control the number of records that get generated in each batch.</p> Config Default Paid Description <code>numRecordsPerBatch</code> 100000 N Number of records across all data sources to generate per batch <code>numRecordsPerStep</code> N Overrides the count defined in each step with this value if defined (e.g. if set to 1000, for each step, 1000 records will be generated) JavaScalaapplication.conf <pre><code>configuration()\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000);\n</code></pre> <pre><code>configuration\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000)\n</code></pre> <pre><code>generation {\n  numRecordsPerBatch = 100000\n  numRecordsPerStep = 1000\n}\n</code></pre>"},{"location":"setup/configuration/#runtime","title":"Runtime","text":"<p>Given Data Caterer uses Spark as the base framework for data processing, you can configure the job to your specifications via configuration as seen here.</p> JavaScalaapplication.conf <pre><code>configuration()\n.master(\"local[*]\")\n.runtimeConfig(Map.of(\"spark.driver.cores\", \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\", \"10g\");\n</code></pre> <pre><code>configuration\n.master(\"local[*]\")\n.runtimeConfig(Map(\"spark.driver.cores\" -&gt; \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\" -&gt; \"10g\")\n</code></pre> <pre><code>runtime {\n  master = \"local[*]\"\n  master = ${?DATA_CATERER_MASTER}\n  config {\n    \"spark.driver.cores\" = \"5\"\n    \"spark.driver.memory\" = \"10g\"\n  }\n}\n</code></pre>"},{"location":"setup/connection/","title":"Data Source Connections","text":"<p>Details of all the connection configuration supported can be found in the below subsections for each type of connection.</p> <p>These configurations can be done via API or from configuration. 
Examples of both are shown for each data source below.</p>"},{"location":"setup/connection/#supported-data-connections","title":"Supported Data Connections","text":"Data Source Type Data Source Sponsor Database Postgres, MySQL, Cassandra N File CSV, JSON, ORC, Parquet N Messaging Kafka, Solace Y HTTP REST API Y Metadata Marquez, OpenMetadata, OpenAPI/Swagger Y"},{"location":"setup/connection/#api","title":"API","text":"<p>All connection details require a name. Depending on the data source, you can define additional options which may be used by the driver or connector for connecting to the data source.</p>"},{"location":"setup/connection/#configuration-file","title":"Configuration file","text":"<p>All connection details follow the same pattern.</p> <pre><code>&lt;connection format&gt; {\n    &lt;connection name&gt; {\n        &lt;key&gt; = &lt;value&gt;\n    }\n}\n</code></pre> <p>Overriding configuration</p> <p>When defining a configuration value that can be defined by a system property or environment variable at runtime, you can define that via the following:</p> <pre><code>url = \"localhost\"\nurl = ${?POSTGRES_URL}\n</code></pre> <p>The above defines that if there is a system property or environment variable named <code>POSTGRES_URL</code>, then that value will be used for the <code>url</code>, otherwise, it will default to <code>localhost</code>.</p>"},{"location":"setup/connection/#data-sources","title":"Data sources","text":"<p>To find examples of a task for each type of data source, please check out this page.</p>"},{"location":"setup/connection/#file","title":"File","text":"<p>Linked here is a list of generic options that can be included as part of your file data source configuration if required. 
Links to specific file type configurations can be found below.</p>"},{"location":"setup/connection/#csv","title":"CSV","text":"JavaScalaapplication.conf <pre><code>csv(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>csv(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>csv {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?CSV_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for CSV can be found here</p>"},{"location":"setup/connection/#json","title":"JSON","text":"JavaScalaapplication.conf <pre><code>json(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>json(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>json {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?JSON_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for JSON can be found here</p>"},{"location":"setup/connection/#orc","title":"ORC","text":"JavaScalaapplication.conf <pre><code>orc(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>orc(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>orc {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?ORC_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for ORC can be found here</p>"},{"location":"setup/connection/#parquet","title":"Parquet","text":"JavaScalaapplication.conf <pre><code>parquet(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>parquet(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>parquet {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?PARQUET_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for Parquet can be found 
here</p>"},{"location":"setup/connection/#delta-not-supported-yet","title":"Delta (not supported yet)","text":"JavaScalaapplication.conf <pre><code>delta(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>delta(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>delta {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?DELTA_PATH}\n  }\n}\n</code></pre>"},{"location":"setup/connection/#rmdbs","title":"RDBMS","text":"<p>Follows the same configuration used by Spark as found here. A sample can be found below.</p> JavaScalaapplication.conf <pre><code>postgres(\n\"customer_postgres\",                            //name\n\"jdbc:postgresql://localhost:5432/customer\",    //url\n\"postgres\",                                     //username\n\"postgres\"                                      //password\n)\n</code></pre> <pre><code>postgres(\n\"customer_postgres\",                            //name\n\"jdbc:postgresql://localhost:5432/customer\",    //url\n\"postgres\",                                     //username\n\"postgres\"                                      //password\n)\n</code></pre> <pre><code>jdbc {\n    customer_postgres {\n        url = \"jdbc:postgresql://localhost:5432/customer\"\n        url = ${?POSTGRES_URL}\n        user = \"postgres\"\n        user = ${?POSTGRES_USERNAME}\n        password = \"postgres\"\n        password = ${?POSTGRES_PASSWORD}\n        driver = \"org.postgresql.Driver\"\n    }\n}\n</code></pre> <p>Ensure that the user has write permission so that it can save data to the target tables.</p> SQL Permission Statements <pre><code>GRANT INSERT ON &lt;schema&gt;.&lt;table&gt; TO &lt;user&gt;;\n</code></pre>"},{"location":"setup/connection/#postgres","title":"Postgres","text":"<p>See the example API or config definition for the Postgres connection above.</p>"},{"location":"setup/connection/#permissions","title":"Permissions","text":"<p>Following 
permissions are required when generating plan and tasks:</p> SQL Permission Statements <pre><code>GRANT SELECT ON information_schema.tables TO &lt;user&gt;;\nGRANT SELECT ON information_schema.columns TO &lt;user&gt;;\nGRANT SELECT ON information_schema.key_column_usage TO &lt;user&gt;;\nGRANT SELECT ON information_schema.table_constraints TO &lt;user&gt;;\nGRANT SELECT ON information_schema.constraint_column_usage TO &lt;user&gt;;\n</code></pre>"},{"location":"setup/connection/#mysql","title":"MySQL","text":"JavaScalaapplication.conf <pre><code>mysql(\n\"customer_mysql\",                       //name\n\"jdbc:mysql://localhost:3306/customer\", //url\n\"root\",                                 //username\n\"root\"                                  //password\n)\n</code></pre> <pre><code>mysql(\n\"customer_mysql\",                       //name\n\"jdbc:mysql://localhost:3306/customer\", //url\n\"root\",                                 //username\n\"root\"                                  //password\n)\n</code></pre> <pre><code>jdbc {\n    customer_mysql {\n        url = \"jdbc:mysql://localhost:3306/customer\"\n        user = \"root\"\n        password = \"root\"\n        driver = \"com.mysql.cj.jdbc.Driver\"\n    }\n}\n</code></pre>"},{"location":"setup/connection/#permissions_1","title":"Permissions","text":"<p>The following permissions are required when generating plan and tasks:</p> SQL Permission Statements <pre><code>GRANT SELECT ON information_schema.columns TO &lt;user&gt;;\nGRANT SELECT ON information_schema.statistics TO &lt;user&gt;;\nGRANT SELECT ON information_schema.key_column_usage TO &lt;user&gt;;\n</code></pre>"},{"location":"setup/connection/#cassandra","title":"Cassandra","text":"<p>Follows the same configuration as defined by the Spark Cassandra Connector as found here</p> JavaScalaapplication.conf <pre><code>cassandra(\n\"customer_cassandra\",   //name\n\"localhost:9042\",       //url\n\"cassandra\",            //username\n\"cassandra\",            
//password\nMap.of()                //optional additional connection options\n)\n</code></pre> <pre><code>cassandra(\n\"customer_cassandra\",   //name\n\"localhost:9042\",       //url\n\"cassandra\",            //username\n\"cassandra\",            //password\nMap()                //optional additional connection options\n)\n</code></pre> <pre><code>org.apache.spark.sql.cassandra {\n    customer_cassandra {\n        spark.cassandra.connection.host = \"localhost\"\n        spark.cassandra.connection.host = ${?CASSANDRA_HOST}\n        spark.cassandra.connection.port = \"9042\"\n        spark.cassandra.connection.port = ${?CASSANDRA_PORT}\n        spark.cassandra.auth.username = \"cassandra\"\n        spark.cassandra.auth.username = ${?CASSANDRA_USERNAME}\n        spark.cassandra.auth.password = \"cassandra\"\n        spark.cassandra.auth.password = ${?CASSANDRA_PASSWORD}\n    }\n}\n</code></pre>"},{"location":"setup/connection/#permissions_2","title":"Permissions","text":"<p>Ensure that the user has write permission so that it can save data to the target tables.</p> CQL Permission Statements <pre><code>GRANT INSERT ON &lt;schema&gt;.&lt;table&gt; TO &lt;user&gt;;\n</code></pre> <p>The following permissions are required when enabling <code>configuration.enableGeneratePlanAndTasks(true)</code> as it will gather metadata information about tables and columns from the below tables.</p> CQL Permission Statements <pre><code>GRANT SELECT ON system_schema.tables TO &lt;user&gt;;\nGRANT SELECT ON system_schema.columns TO &lt;user&gt;;\n</code></pre>"},{"location":"setup/connection/#kafka","title":"Kafka","text":"<p>Define your Kafka bootstrap server to connect and send generated data to corresponding topics. The topic is set at the step level. 
Further details can be found here</p> JavaScalaapplication.conf <pre><code>kafka(\n\"customer_kafka\",   //name\n\"localhost:9092\"    //url\n)\n</code></pre> <pre><code>kafka(\n\"customer_kafka\",   //name\n\"localhost:9092\"    //url\n)\n</code></pre> <pre><code>kafka {\n    customer_kafka {\n        kafka.bootstrap.servers = \"localhost:9092\"\n        kafka.bootstrap.servers = ${?KAFKA_BOOTSTRAP_SERVERS}\n    }\n}\n</code></pre> <p>When defining your schema for pushing data to Kafka, it follows a specific top level schema. An example can be found here. You can define the key, value, headers, partition or topic by following the linked schema.</p>"},{"location":"setup/connection/#jms","title":"JMS","text":"<p>Uses JNDI lookup to send messages to a JMS queue. Ensure that the messaging system you are using has your queue/topic registered via JNDI, otherwise a connection cannot be created.</p> JavaScalaapplication.conf <pre><code>solace(\n\"customer_solace\",                                      //name\n\"smf://localhost:55554\",                                //url\n\"admin\",                                                //username\n\"admin\",                                                //password\n\"default\",                                              //vpn name\n\"/jms/cf/default\",                                      //connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\"   //initial context factory\n)\n</code></pre> <pre><code>solace(\n\"customer_solace\",                                      //name\n\"smf://localhost:55554\",                                //url\n\"admin\",                                                //username\n\"admin\",                                                //password\n\"default\",                                              //vpn name\n\"/jms/cf/default\",                                      //connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\"   //initial context factory\n)\n</code></pre> 
<pre><code>jms {\n    customer_solace {\n        initialContextFactory = \"com.solacesystems.jndi.SolJNDIInitialContextFactory\"\n        connectionFactory = \"/jms/cf/default\"\n        url = \"smf://localhost:55555\"\n        url = ${?SOLACE_URL}\n        user = \"admin\"\n        user = ${?SOLACE_USER}\n        password = \"admin\"\n        password = ${?SOLACE_PASSWORD}\n        vpnName = \"default\"\n        vpnName = ${?SOLACE_VPN}\n    }\n}\n</code></pre>"},{"location":"setup/connection/#http","title":"HTTP","text":"<p>Define any username and/or password needed for the HTTP requests. The url is defined in the tasks to allow for generated data to be populated in the url.</p> JavaScalaapplication.conf <pre><code>http(\n\"customer_api\", #name\n\"admin\",        #username\n\"admin\"         #password\n)\n</code></pre> <pre><code>http(\n\"customer_api\", #name\n\"admin\",        #username\n\"admin\"         #password\n)\n</code></pre> <pre><code>http {\n    customer_api {\n        user = \"admin\"\n        user = ${?HTTP_USER}\n        password = \"admin\"\n        password = ${?HTTP_PASSWORD}\n    }\n}\n</code></pre>"},{"location":"setup/deployment/","title":"Deployment","text":"<p>Two main ways to deploy and run Data Caterer:</p> <ul> <li>Docker</li> <li>Helm</li> </ul>"},{"location":"setup/deployment/#docker","title":"Docker","text":"<p>To package up your class along with the Data Caterer base image, you can follow the Dockerfile that is created for you here.</p> <p>Then you can run the following:</p> <pre><code>./gradlew clean build\ndocker build -t &lt;my_image_name&gt;:&lt;my_image_tag&gt; .\n</code></pre>"},{"location":"setup/deployment/#helm","title":"Helm","text":"<p>Link to sample helm on GitHub here</p> <p>Update the configuration to your own data connections and configuration or own image created from above.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\nhelm install data-caterer 
./data-caterer-example/helm/data-caterer\n</code></pre>"},{"location":"setup/design/","title":"Design","text":"<p>This document shows the thought process behind the design of Data Caterer to give you insight into how and why it became what it is today. It also serves as a reference for future design decisions, which will be recorded here, making this a living document.</p>"},{"location":"setup/design/#motivation","title":"Motivation","text":"<p>The main difficulties that I faced as a developer and team lead relating to testing were:</p> <ul> <li>Difficulty in testing with multiple data sources, both batch and real time</li> <li>Reliance on other teams for stable environments or domain knowledge</li> <li>Test environments with no reliable or consistent data flows</li> <li>Complex data masking/anonymization solutions</li> <li>Relying on production data (potential privacy and data breach issues)</li> <li>Cost of data production issues can be very high</li> <li>Unknown unknowns staying hidden until problems occur in production</li> <li>Underutilised metadata</li> </ul>"},{"location":"setup/design/#guiding-principles","title":"Guiding Principles","text":"<p>These difficulties helped form the basis of the principles that Data Caterer should follow:</p> <ul> <li>Data source agnostic: Connect to any batch or real time data sources for data generation or validation</li> <li>Configurable: Run the application the way you want</li> <li>Extensible: Allow for new innovations to seamlessly integrate with Data Caterer</li> <li>Integrate with existing solutions: Utilise existing metadata to make it easy for users to use straight away</li> <li>Secure: No production connections required, metadata based solution</li> <li>Fast: Give developers fast feedback loops to encourage them to thoroughly test data flows</li> </ul>"},{"location":"setup/design/#high-level-flow","title":"High level flow","text":"<pre><code>graph LR\n  subgraph userTasks [User 
Configuration]\n  dataGen[Data Generation]\n  dataValid[Data Validation]\n  runConf[Runtime Config]\n  end\n\n  subgraph dataProcessor [Processor]\n  dataCaterer[Data Caterer]\n  end\n\n  subgraph existingMetadata [Metadata]\n  metadataService[Metadata Services]\n  metadataDataSource[Data Sources]\n  end\n\n  subgraph output [Output]\n  outputDataSource[Data Sources]\n  report[Report]\n  end\n\n  dataGen --&gt; dataCaterer\n  dataValid --&gt; dataCaterer\n  runConf --&gt; dataCaterer\n  direction TB\n  dataCaterer -.-&gt; metadataService\n  dataCaterer -.-&gt; metadataDataSource\n  direction LR\n  dataCaterer ---&gt; outputDataSource\n  dataCaterer ---&gt; report</code></pre> <ol> <li>User Configuration<ol> <li>Users define data generation, validation and runtime configuration</li> </ol> </li> <li>Processor<ol> <li>Engine will take user configuration to decide how to run</li> <li>User defined configuration merged with metadata from external sources</li> </ol> </li> <li>Metadata<ol> <li>Automatically retrieve schema, data profiling, relationship or validation rule metadata from data sources or metadata services</li> </ol> </li> <li>Output<ol> <li>Execute data generation and validation tasks on data sources</li> <li>Generate report summarising outcome</li> </ol> </li> </ol>"},{"location":"setup/foreign-key/","title":"Foreign Keys","text":"<p>Foreign keys can be defined to represent the relationships between datasets where values are required to match for particular columns.</p>"},{"location":"setup/foreign-key/#single-column","title":"Single column","text":"<p>Define a column in one data source to match against another column. 
Below example shows a <code>postgres</code> data source with two tables, <code>accounts</code> and <code>transactions</code> that have a foreign key for <code>account_id</code>.</p> JavaScalaYAML <pre><code>var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList.of(Map.entry(postgresTxn, \"account_id\"))\n);\n</code></pre> <pre><code>val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList(postgresTxn -&gt; \"account_id\")\n)\n</code></pre> <pre><code>---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"full_name\"\n---\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"postgres.accounts.account_id\":\n- \"postgres.transactions.account_id\"\n</code></pre>"},{"location":"setup/foreign-key/#multiple-columns","title":"Multiple columns","text":"<p>You may have a scenario where multiple columns need to be aligned. 
From the same example, we want <code>account_id</code> and <code>name</code> from <code>accounts</code> to match with <code>account_id</code> and <code>full_name</code> to match in <code>transactions</code> respectively.</p> JavaScalaYAML <pre><code>var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(postgresTxn, List.of(\"account_id\", \"full_name\")))\n);\n</code></pre> <pre><code>val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(postgresTxn -&gt; List(\"account_id\", \"full_name\"))\n)\n</code></pre> <pre><code>---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"full_name\"\n---\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_postgres.transactions.account_id,full_name\"\n</code></pre>"},{"location":"setup/foreign-key/#nested-column","title":"Nested column","text":"<p>Your schema structure can have nested fields which can 
also be referenced as foreign keys. But to do so, you need to create a proxy field that gets omitted from the final saved data.</p> <p>In the example below, the nested <code>customer_details.name</code> field inside the <code>json</code> task needs to match with <code>name</code> from <code>postgres</code>. A new field in the <code>json</code> called <code>_txn_name</code> is used as a temporary column to facilitate the foreign key definition.</p> JavaScalaYAML <pre><code>var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").sql(\"_txn_name\"), #nested field will get value from '_txn_name'\n...\n),\nfield().name(\"_txn_name\").omit(true)       #value will not be included in output\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(jsonTask, List.of(\"account_id\", \"_txn_name\")))\n);\n</code></pre> <pre><code>val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nvar jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").sql(\"_txn_name\"), #nested field will get value from '_txn_name'\n...\n), field.name(\"_txn_name\").omit(true)       #value will not be included in output\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(jsonTask -&gt; List(\"account_id\", \"_txn_name\"))\n)\n</code></pre> <pre><code>---\n#postgres task yaml\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n---\n#json 
task yaml\nname: \"json_data\"\nsteps:\n- name: \"transactions\"\ntype: \"json\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"_txn_name\"\ngenerator:\noptions:\nomit: true\n- name: \"customer_details\"\nschema:\nfields:\n- name: \"name\"\ngenerator:\ntype: \"sql\"\noptions:\nsql: \"_txn_name\"\n\n---\n#plan yaml\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n- name: \"json_data\"\ndataSourceName: \"my_json\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_json.transactions.account_id,_txn_name\"\n</code></pre>"},{"location":"setup/validation/","title":"Validations","text":"<p>Validations can be used to run data checks after you have run the data generator or even as a standalone task. A report summarising the success or failure of the validations is produced and can be examined for further investigation.</p> <ul> <li>Basic - Basic column level validations</li> <li>Group by/Aggregate - Run aggregates over grouped data, then validate</li> <li>Upstream data source - Ensure record values exist in datasets based on other data sources or data generated</li> <li>Data Profile (Coming soon) - Score how close the data profile of generated data is against the target data profile</li> </ul>"},{"location":"setup/validation/#define-validations","title":"Define Validations","text":"<p>Full example validation can be found below. 
For more details, check out each of the subsections defined further below.</p> JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().col(\"amount\").lessThan(100),\nvalidation().col(\"year\").isEqual(2021).errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation().col(\"name\").matches(\"Peter .*\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n)\n.validationWait(waitCondition().pause(1));\n\nvar conf = configuration().enableValidation(true);\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validations(\nvalidation.col(\"amount\").lessThan(100),\nvalidation.col(\"year\").isEqual(2021).errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation.col(\"name\").matches(\"Peter .*\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n)  .validationWait(waitCondition.pause(1))\n\nval conf = configuration.enableValidation(true)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount &lt; 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1   #equivalent to if error percentage is &gt; 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200   #equivalent to if number of errors is &gt; 200, then fail\ndescription: \"Should be lots of Peters\"\nwaitCondition:\npauseInSeconds: 1\n</code></pre>"},{"location":"setup/validation/#wait-condition","title":"Wait Condition","text":"<p>Once data has been generated, you may want to wait for a certain condition to be met before starting the data validations. 
This can be via:</p> <ul> <li>Pause for seconds</li> <li>When file is available</li> <li>Data exists</li> <li>Webhook</li> </ul>"},{"location":"setup/validation/#pause","title":"Pause","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().pause(1));\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.pause(1))\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npauseInSeconds: 1\n</code></pre>"},{"location":"setup/validation/#data-exists","title":"Data exists","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWaitDataExists(\"updated_date &gt; DATE('2023-01-01')\");\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWaitDataExists(\"updated_date &gt; DATE('2023-01-01')\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"transactions\"\noptions:\npath: \"/tmp/csv\"\nexpr: \"updated_date &gt; DATE('2023-01-01')\"\n</code></pre>"},{"location":"setup/validation/#webhook","title":"Webhook","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\")); //by default, GET request successful when 200 status code\n\n//or\n\nvar csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202));  //successful if 200 or 202 status code\n\n//or\n\nvar csvTxnsWithExistingHttpConnection = csv(\"transactions\", \"/tmp/csv\", 
Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"my_http\", \"http://localhost:8080/finished\"));  //use connection configuration from existing 'my_http' connection definition\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\"))  //by default, GET request successful when 200 status code\n\n//or\n\nval csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202)) //successful if 200 or 202 status code\n\n//or\n\nval csvTxnsWithExistingHttpConnection = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.webhook(\"my_http\", \"http://localhost:8080/finished\")) //use connection configuration from existing 'my_http' connection definition\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\" #by default, GET request successful when 200 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\"\nmethod: \"GET\"\nstatusCodes: [200, 202] #successful if 200 or 202 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"my_http\" #use connection configuration from existing 'my_http' connection definition\nurl: \"http://localhost:8080/finished\"\n</code></pre>"},{"location":"setup/validation/#file-exists","title":"File exists","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().file(\"/tmp/json\"));\n</code></pre> <pre><code>val csvTxns = 
csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.file(\"/tmp/json\"))\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npath: \"/tmp/json\"\n</code></pre>"},{"location":"setup/validation/#report","title":"Report","text":"<p>Once run, it will produce a report like this.</p>"},{"location":"setup/generator/count/","title":"Record Count","text":"<p>There are options related to controlling the number of records generated that can help in generating the scenarios or data required.</p>"},{"location":"setup/generator/count/#record-count_1","title":"Record Count","text":"<p>Record count is the simplest option: you define the total number of records you require for that particular step. For example, the step below generates 1000 records for the CSV file.</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\n</code></pre>"},{"location":"setup/generator/count/#generated-count","title":"Generated Count","text":"<p>As with most things in Data Caterer, the count can be generated based on some metadata. 
For example, if I wanted to generate between 1000 and 2000 records, I could define that by the below configuration:</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator().min(1000).max(2000));\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator.min(1000).max(2000))\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\ngenerator:\ntype: \"random\"\noptions:\nmin: 1000\nmax: 2000\n</code></pre>"},{"location":"setup/generator/count/#per-column-count","title":"Per Column Count","text":"<p>When defining a per column count, this allows you to generate records \"per set of columns\". This means that for a given set of columns, it will generate a particular amount of records per combination of values for those columns.  </p> <p>One example of this would be when generating transactions relating to a customer, a customer may be defined by columns <code>account_id, name</code>. A number of transactions would be generated per <code>account_id, name</code>.  </p> <p>You can also use a combination of the above two methods to generate the number of records per column.</p>"},{"location":"setup/generator/count/#records","title":"Records","text":"<p>When defining a base number of records within the <code>perColumn</code> configuration, it translates to creating <code>(count.records * count.recordsPerColumn)</code> records. This is a fixed number of records that will be generated each time, with no variation between runs.</p> <p>In the example below, we have <code>count.records = 1000</code> and <code>count.recordsPerColumn = 2</code>. 
Which means that <code>1000 * 2 = 2000</code> records will be generated in total.</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\nrecords: 2\ncolumnNames:\n- \"account_id\"\n- \"name\"\n</code></pre>"},{"location":"setup/generator/count/#generated","title":"Generated","text":"<p>You can also define a generator for the count per column. This can be used in scenarios where you want a variable number of records per set of columns.</p> <p>In the example below, it will generate between <code>(count.records * count.perColumnGenerator.generator.min) = (1000 * 1) = 1000</code> and <code>(count.records * count.perColumnGenerator.generator.max) = (1000 * 2) = 2000</code> records.</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumnGenerator(generator().min(1).max(2), \"account_id\", \"name\")\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumnGenerator(generator.min(1).max(2), \"account_id\", \"name\")\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\ncolumnNames:\n- \"account_id\"\n- \"name\"\ngenerator:\ntype: \"random\"\noptions:\nmin: 1\nmax: 
2\n</code></pre>"},{"location":"setup/generator/data-generator/","title":"Data Generators","text":""},{"location":"setup/generator/data-generator/#data-types","title":"Data Types","text":"<p>Below is a list of all supported data types for generating data:</p> Data Type Spark Data Type Options Description string StringType <code>minLen, maxLen, expression, enableNull</code> integer IntegerType <code>min, max, stddev, mean</code> long LongType <code>min, max, stddev, mean</code> short ShortType <code>min, max, stddev, mean</code> decimal(precision, scale) DecimalType(precision, scale) <code>min, max, stddev, mean</code> double DoubleType <code>min, max, stddev, mean</code> float FloatType <code>min, max, stddev, mean</code> date DateType <code>min, max, enableNull</code> timestamp TimestampType <code>min, max, enableNull</code> boolean BooleanType binary BinaryType <code>minLen, maxLen, enableNull</code> byte ByteType array ArrayType <code>arrayMinLen, arrayMaxLen, arrayType</code> _ StructType Implicitly supported when a schema is defined for a field"},{"location":"setup/generator/data-generator/#options","title":"Options","text":""},{"location":"setup/generator/data-generator/#all-data-types","title":"All data types","text":"<p>Some options are available to use for all types of data generators. Below is the list along with example and descriptions:</p> Option Default Example Description <code>enableEdgeCase</code> false <code>enableEdgeCase: \"true\"</code> Enable/disable generated data to contain edge cases based on the data type. For example, integer data type has edge cases of (Int.MaxValue, Int.MinValue and 0) <code>edgeCaseProbability</code> 0.0 <code>edgeCaseProb: \"0.1\"</code> Probability of generating a random edge case value if <code>enableEdgeCase</code> is true <code>isUnique</code> false <code>isUnique: \"true\"</code> Enable/disable generated data to be unique for that column. 
Errors will be thrown when it is unable to generate unique data <code>seed</code> <code>seed: \"1\"</code> Defines the random seed for generating data for that particular column. It will override any seed defined at a global level <code>sql</code> <code>sql: \"CASE WHEN amount &lt; 10 THEN true ELSE false END\"</code> Define any SQL statement for generating that columns value. Computation occurs after all non-SQL fields are generated. This means any columns used in the SQL cannot be based on other SQL generated columns. Data type of generated value from SQL needs to match data type defined for the field"},{"location":"setup/generator/data-generator/#string","title":"String","text":"Option Default Example Description <code>minLen</code> 1 <code>minLen: \"2\"</code> Ensures that all generated strings have at least length <code>minLen</code> <code>maxLen</code> 10 <code>maxLen: \"15\"</code> Ensures that all generated strings have at most length <code>maxLen</code> <code>expression</code> <code>expression: \"#{Name.name}\"</code><code>expression:\"#{Address.city}/#{Demographic.maritalStatus}\"</code> Will generate a string based on the faker expression provided. 
All possible faker expressions can be found here Expression has to be in format <code>#{&lt;faker expression name&gt;}</code> <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", \"\u0130yi g\u00fcnler\", \"\u0421\u043f\u0430\u0441\u0438\u0431\u043e\", \"\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1\", \"\u0635\u0628\u0627\u062d \u0627\u0644\u062e\u064a\u0631\", \" F\u00f6rl\u00e5t\", \"\u4f60\u597d\u5417\", \"Nh\u00e0 v\u1ec7 sinh \u1edf \u0111\u00e2u\", \"\u3053\u3093\u306b\u3061\u306f\", \"\u0928\u092e\u0938\u094d\u0924\u0947\", \"\u0532\u0561\u0580\u0565\u0582\", \"\u0417\u0434\u0440\u0430\u0432\u0435\u0439\u0442\u0435\")</p>"},{"location":"setup/generator/data-generator/#sample","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield()\n.name(\"name\")\n.type(StringType.instance())\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield\n.name(\"name\")\n.`type`(StringType)\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\nschema:\nfields:\n- name: \"name\"\ntype: \"string\"\ngenerator:\noptions:\nexpression: \"#{Name.name}\"\nenableNull: true\nnullProb: 0.1\nminLength: 4\nmaxLength: 20\n</code></pre>"},{"location":"setup/generator/data-generator/#numeric","title":"Numeric","text":"<p>For all the numeric data types, there are 4 
options to choose from: min, minValue, max and maxValue. Generally speaking, you only need to define one of min or minValue, and likewise one of max or maxValue. There are two options for each because, when metadata is automatically gathered, we gather the statistics of the observed min and max values. It will also attempt to gather any restriction on the min or max value as defined by the data source (e.g. max value as per database type).</p>"},{"location":"setup/generator/data-generator/#integerlongshort","title":"Integer/Long/Short","text":"Option Default Example Description <code>min</code> 0 <code>min: \"2\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> 1000 <code>max: \"25\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>stddev</code> 1.0 <code>stddev: \"2.0\"</code> Standard deviation for normal distributed data <code>mean</code> <code>max - min</code> <code>mean: \"5.0\"</code> Mean for normal distributed data <p>Edge cases Integer: (2147483647, -2147483648, 0) Edge cases Long: (9223372036854775807, -9223372036854775808, 0) Edge cases Short: (32767, -32768, 0)</p>"},{"location":"setup/generator/data-generator/#sample_1","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"year\").type(IntegerType.instance()).min(2020).max(2023),\nfield().name(\"customer_id\").type(LongType.instance()),\nfield().name(\"customer_group\").type(ShortType.instance())\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"year\").`type`(IntegerType).min(2020).max(2023),\nfield.name(\"customer_id\").`type`(LongType),\nfield.name(\"customer_group\").`type`(ShortType)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"year\"\ntype: 
\"integer\"\ngenerator:\noptions:\nmin: 2020\nmax: 2023\n- name: \"customer_id\"\ntype: \"long\"\n- name: \"customer_group\"\ntype: \"short\"\n</code></pre>"},{"location":"setup/generator/data-generator/#decimal","title":"Decimal","text":"Option Default Example Description <code>min</code> 0 <code>min: \"2\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> 1000 <code>max: \"25\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>stddev</code> 1.0 <code>stddev: \"2.0\"</code> Standard deviation for normal distributed data <code>mean</code> <code>max - min</code> <code>mean: \"5.0\"</code> Mean for normal distributed data <code>numericPrecision</code> 10 <code>precision: \"25\"</code> The maximum number of digits <code>numericScale</code> 0 <code>scale: \"25\"</code> The number of digits on the right side of the decimal point (has to be less than or equal to precision) <p>Edge cases Decimal: (9223372036854775807, -9223372036854775808, 0)</p>"},{"location":"setup/generator/data-generator/#sample_2","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"balance\").type(DecimalType.instance()).numericPrecision(10).numericScale(5)\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"balance\").`type`(DecimalType).numericPrecision(10).numericScale(5)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"balance\"\ntype: \"decimal\"\ngenerator:\noptions:\nprecision: 10\nscale: 5\n</code></pre>"},{"location":"setup/generator/data-generator/#doublefloat","title":"Double/Float","text":"Option Default Example Description <code>min</code> 0.0 <code>min: \"2.1\"</code> Ensures that all generated values are greater than or equal to <code>min</code> 
<code>max</code> 1000.0 <code>max: \"25.9\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>stddev</code> 1.0 <code>stddev: \"2.0\"</code> Standard deviation for normal distributed data <code>mean</code> <code>max - min</code> <code>mean: \"5.0\"</code> Mean for normal distributed data <p>Edge cases Double: (+infinity, 1.7976931348623157e+308, 4.9e-324, 0.0, -0.0, -1.7976931348623157e+308, -infinity, NaN) Edge cases Float: (+infinity, 3.4028235e+38, 1.4e-45, 0.0, -0.0, -3.4028235e+38, -infinity, NaN)</p>"},{"location":"setup/generator/data-generator/#sample_3","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"amount\").type(DoubleType.instance())\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"amount\").`type`(DoubleType)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"amount\"\ntype: \"double\"\n</code></pre>"},{"location":"setup/generator/data-generator/#date","title":"Date","text":"Option Default Example Description <code>min</code> now() - 365 days <code>min: \"2023-01-31\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> now() <code>max: \"2023-12-31\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (0001-01-01, 1582-10-15, 1970-01-01, 9999-12-31) (reference)</p>"},{"location":"setup/generator/data-generator/#sample_4","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", 
\"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_date\").type(DateType.instance()).min(java.sql.Date.valueOf(\"2020-01-01\"))\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_date\").`type`(DateType).min(java.sql.Date.valueOf(\"2020-01-01\"))\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"created_date\"\ntype: \"date\"\ngenerator:\noptions:\nmin: \"2020-01-01\"\n</code></pre>"},{"location":"setup/generator/data-generator/#timestamp","title":"Timestamp","text":"Option Default Example Description <code>min</code> now() - 365 days <code>min: \"2023-01-31 23:10:10\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> now() <code>max: \"2023-12-31 23:10:10\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (0001-01-01 00:00:00, 1582-10-15 23:59:59, 1970-01-01 00:00:00, 9999-12-31 23:59:59)</p>"},{"location":"setup/generator/data-generator/#sample_5","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_time\").type(TimestampType.instance()).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_time\").`type`(TimestampType).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: 
\"created_time\"\ntype: \"timestamp\"\ngenerator:\noptions:\nmin: \"2020-01-01 00:00:00\"\n</code></pre>"},{"location":"setup/generator/data-generator/#binary","title":"Binary","text":"Option Default Example Description <code>minLen</code> 1 <code>minLen: \"2\"</code> Ensures that all generated array of bytes have at least length <code>minLen</code> <code>maxLen</code> 20 <code>maxLen: \"15\"</code> Ensures that all generated array of bytes have at most length <code>maxLen</code> <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", -128, 127)</p>"},{"location":"setup/generator/data-generator/#sample_6","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"payload\").type(BinaryType.instance())\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"payload\").`type`(BinaryType)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"payload\"\ntype: \"binary\"\n</code></pre>"},{"location":"setup/generator/data-generator/#array","title":"Array","text":"Option Default Example Description <code>arrayMinLen</code> 0 <code>arrayMinLen: \"2\"</code> Ensures that all generated arrays have at least length <code>arrayMinLen</code> <code>arrayMaxLen</code> 5 <code>arrayMaxLen: \"15\"</code> Ensures that all generated arrays have at most length <code>arrayMaxLen</code> <code>arrayType</code> <code>arrayType: \"double\"</code> Inner data type of the array. Optional when using Java/Scala API. 
Allows for nested data types to be defined like struct <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true"},{"location":"setup/generator/data-generator/#sample_7","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"last_5_amounts\").type(ArrayType.instance()).arrayType(\"double\")\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"last_5_amounts\").`type`(ArrayType).arrayType(\"double\")\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"last_5_amounts\"\ntype: \"array&lt;double&gt;\"\n</code></pre>"},{"location":"setup/generator/report/","title":"Report","text":"<p>Data Caterer can be configured to produce a report of the data generated to help users understand what was run, how much data was generated, where it was generated, validation results and any associated metadata. </p>"},{"location":"setup/generator/report/#sample","title":"Sample","text":"<p>Once run, it will produce a report like this.</p>"},{"location":"setup/guide/","title":"Guides","text":"<p>Below is a list of guides you can follow to create your data generation for your use case.</p> <p>For any of the paid tier guides, you can use the trial version of the app to try it out. 
Details on how to get the trial can be found here.</p>"},{"location":"setup/guide/#scenarios","title":"Scenarios","text":"<ul> <li>First Data Generation - If you are new, this is the place to start</li> <li>Multiple Records Per Column Value - How you can generate multiple records per set of columns</li> <li>Foreign Keys Across Data Sources - Generate matching values across generated data sets</li> <li>Data Validations - Run data validations after generating data</li> <li>Auto Generate From Data Connection - Automatically generating data from just defining data sources</li> <li>Delete Generated Data - Delete the generated data whilst leaving other data</li> <li>Generate Batch and Event Data - Generate matching batch and event data</li> </ul>"},{"location":"setup/guide/#data-sources","title":"Data Sources","text":"<ul> <li>Files (CSV, JSON, ORC, Parquet) - Generate data for popular file formats</li> <li>Postgres - JDBC Postgres tables</li> <li>Cassandra - Cassandra tables</li> <li>Kafka - Kafka topics</li> <li>Solace - Solace messages</li> <li>Marquez - Generate data based on metadata in Marquez</li> <li>OpenMetadata - Generate data based on metadata in OpenMetadata</li> <li>HTTP - HTTP requests</li> <li>Files (Fixed width) - (Soon to document) A variant of CSV but with no separator</li> <li>MySql - (Soon to document) JDBC MySql tables</li> </ul>"},{"location":"setup/guide/#yaml-files","title":"YAML Files","text":""},{"location":"setup/guide/#base-concept","title":"Base Concept","text":"<p>The execution of the data generator is based on the concept of plans and tasks. A plan represents the set of tasks that need to be executed, along with other information that spans across tasks, such as foreign keys between data sources. A task represents the component(s) of a data source and its associated metadata so that it understands what the data should look like and how many steps (sub data sources) there are (e.g. tables in a database, topics in Kafka). 
Tasks can define one or more steps.</p>"},{"location":"setup/guide/#plan","title":"Plan","text":""},{"location":"setup/guide/#foreign-keys","title":"Foreign Keys","text":"<p>Define foreign keys across data sources in your plan to ensure generated data can match Link to associated task 1 Link to associated task 2</p>"},{"location":"setup/guide/#task","title":"Task","text":"Data Source Type Data Source Sample Task Notes Database Postgres Sample Database MySQL Sample Database Cassandra Sample File CSV Sample File JSON Sample Contains nested schemas and use of SQL for generated values File Parquet Sample Partition by year column Kafka Kafka Sample Specific base schema to be used, define headers, key, value, etc. JMS Solace Sample JSON formatted message HTTP PUT Sample JSON formatted PUT body"},{"location":"setup/guide/#configuration","title":"Configuration","text":"<p>Basic configuration</p>"},{"location":"setup/guide/#docker-compose","title":"Docker-compose","text":"<p>To see how it runs against different data sources, you can run using <code>docker-compose</code> and set <code>DATA_SOURCE</code> like below</p> <pre><code>./gradlew build\ncd docker\nDATA_SOURCE=postgres docker-compose up -d datacaterer\n</code></pre> <p>Can set it to one of the following:</p> <ul> <li>postgres</li> <li>mysql</li> <li>cassandra</li> <li>solace</li> <li>kafka</li> <li>http</li> </ul>"},{"location":"setup/guide/data-source/cassandra/","title":"Cassandra","text":"<p>Info</p> <p>Writing data to Cassandra is a paid feature. Try the free trial here.</p> <p>Creating a data generator for Cassandra. 
You will build a Docker image that will be able to populate data in Cassandra for the tables you configure.</p>"},{"location":"setup/guide/data-source/cassandra/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> <li>Cassandra</li> </ul>"},{"location":"setup/guide/data-source/cassandra/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre> <p>If you already have a Cassandra instance running, you can skip to this step.</p>"},{"location":"setup/guide/data-source/cassandra/#cassandra-setup","title":"Cassandra Setup","text":"<p>Next, let's make sure you have an instance of Cassandra up and running in your local environment. This will make it easy for us to iterate and check our changes.</p> <pre><code>cd docker\ndocker-compose up -d cassandra\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#permissions","title":"Permissions","text":"<p>Let's make a new user that has the required permissions needed to push data into the Cassandra tables we want.</p> CQL Permission Statements <pre><code>GRANT INSERT ON &lt;schema&gt;.&lt;table&gt; TO data_caterer_user;\n</code></pre> <p>Following permissions are required when enabling <code>configuration.enableGeneratePlanAndTasks(true)</code> as it will gather metadata information about tables and columns from the below tables.</p> CQL Permission Statements <pre><code>GRANT SELECT ON system_schema.tables TO data_caterer_user;\nGRANT SELECT ON system_schema.columns TO data_caterer_user;\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedCassandraJavaPlan.java</code></li> <li>Scala: 
<code>src/main/scala/com/github/pflooky/plan/MyAdvancedCassandraPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedCassandraJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedCassandraPlan extends PlanRun {\n}\n</code></pre> <p>This class is where we define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/data-source/cassandra/#connection-configuration","title":"Connection Configuration","text":"<p>Within our class, we can start by defining the connection properties to connect to Cassandra.</p> JavaScala <pre><code>var accountTask = cassandra(\n\"customer_cassandra\",   //name\n\"localhost:9042\",       //url\n\"cassandra\",            //username\n\"cassandra\",            //password\nMap.of()                //optional additional connection options\n);\n</code></pre> <p>Additional options such as SSL configuration, etc can be found here.</p> <pre><code>val accountTask = cassandra(\n\"customer_cassandra\",   //name\n\"localhost:9042\",       //url\n\"cassandra\",            //username\n\"cassandra\",            //password\nMap()                   //optional additional connection options\n)\n</code></pre> <p>Additional options such as SSL configuration, etc can be found here.</p>"},{"location":"setup/guide/data-source/cassandra/#schema","title":"Schema","text":"<p>Let's create a task for inserting data into the <code>account.accounts</code> and <code>account.account_status_history</code> tables as defined under <code>docker/data/cql/customer.cql</code>. These tables should already be set up for you if you followed this step. 
We can check if the tables are already set up via the following command:</p> <pre><code>docker exec docker-cassandraserver-1 cqlsh -e 'describe account.accounts; describe account.account_status_history;'\n</code></pre> <p>Here we should see some output that looks like the below. This tells us what schema we need to follow when generating data. We need to define that schema alongside any metadata that constrains which values the generated data can contain.</p> <pre><code>CREATE TABLE account.accounts (\naccount_id text PRIMARY KEY,\n    amount double,\n    created_by text,\n    name text,\n    open_time timestamp,\n    status text\n)...\n\nCREATE TABLE account.account_status_history (\naccount_id text,\n    eod_date date,\n    status text,\n    updated_by text,\n    updated_time timestamp,\n    PRIMARY KEY (account_id, eod_date)\n)...\n</code></pre> <p>Trimming the connection details to work with the docker-compose Cassandra, we have a base Cassandra connection to define the table and schema required. Let's define each field along with its corresponding data type. You will notice that the <code>text</code> fields do not have a data type defined. 
This is because the default data type is <code>StringType</code> which corresponds to <code>text</code> in Cassandra.</p> JavaScala <pre><code>{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n}\n</code></pre> <pre><code>val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#field-metadata","title":"Field Metadata","text":"<p>We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining metadata that gives the data generator guidelines to follow when generating data.</p>"},{"location":"setup/guide/data-source/cassandra/#account_id","title":"account_id","text":"<p><code>account_id</code> follows a particular pattern where it starts with <code>ACC</code> and has 8 digits after it. This can be defined via a regex like below. 
We also mark it as the primary key to ensure that unique values are generated.</p> JavaScala <pre><code>field().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n</code></pre> <pre><code>field.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#amount","title":"amount","text":"<p>For <code>amount</code>, the numbers shouldn't be too large, so we can define a min and max for the generated numbers to be between <code>1</code> and <code>1000</code>.</p> JavaScala <pre><code>field().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\n</code></pre> <pre><code>field.name(\"amount\").`type`(DoubleType).min(1).max(1000),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#name","title":"name","text":"<p><code>name</code> is a string that also follows a certain pattern, so we could also define a regex, but here we will choose to leverage the DataFaker library and create an <code>expression</code> to generate real-looking names. All possible faker expressions can be found here.</p> JavaScala <pre><code>field().name(\"name\").expression(\"#{Name.name}\"),\n</code></pre> <pre><code>field.name(\"name\").expression(\"#{Name.name}\"),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#open_time","title":"open_time","text":"<p><code>open_time</code> is a timestamp that we want to have a value greater than a specific date. 
We can define a min date by using <code>java.sql.Date</code> like below.</p> JavaScala <pre><code>field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre> <pre><code>field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#status","title":"status","text":"<p><code>status</code> is a field that can only take one of four values: <code>open, closed, suspended or pending</code>.</p> JavaScala <pre><code>field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre> <pre><code>field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#created_by","title":"created_by","text":"<p><code>created_by</code> is a field that is based on the <code>status</code> field, where it follows the logic: <code>if status is open or closed, then it is created_by eod else created_by event</code>. 
This can be achieved by defining a SQL expression like below.</p> JavaScala <pre><code>field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <pre><code>field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <p>Putting all the fields together, our class should now look like this.</p> JavaScala <pre><code>var accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n</code></pre> <pre><code>val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. 
We will also enable the unique check to ensure any unique fields will have unique values generated.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>accountTask</code>, we have to call <code>execute</code>. Our full plan run will look like this.</p> JavaScala <pre><code>public class MyAdvancedCassandraJavaPlan extends PlanRun {\n{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n</code></pre> <pre><code>class MyAdvancedCassandraPlan extends PlanRun {\nval accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' 
END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#run","title":"Run","text":"<p>Now we can run via the script <code>./run.sh</code> that is in the top level directory of the <code>data-caterer-example</code> to run the class we just created.</p> <pre><code>./run.sh\n#input class MyAdvancedCassandraJavaPlan or MyAdvancedCassandraPlan\n#after completing\ndocker exec docker-cassandraserver-1 cqlsh -e 'select count(1) from account.accounts;select * from account.accounts limit 10;'\n</code></pre> <p>Your output should look like this.</p> <pre><code> count\n-------\n  1000\n\n(1 rows)\n\nWarnings :\nAggregation query used without partition key\n\n\n account_id  | amount    | created_by         | name                   | open_time                       | status\n-------------+-----------+--------------------+------------------------+---------------------------------+-----------\n ACC13554145 | 917.00418 | zb CVvbBTTzitjo5fK |          Jan Sanford I | 2023-06-21 21:50:10.463000+0000 | suspended\n ACC19154140 |  46.99177 |             VH88H9 |       Clyde Bailey PhD | 2023-07-18 11:33:03.675000+0000 |      open\n ACC50587836 |  774.9872 |         GENANwPm t |           Sang Monahan | 2023-03-21 00:16:53.308000+0000 |    closed\n ACC67619387 | 452.86706 |       5msTpcBLStTH |         Jewell Gerlach | 2022-10-18 19:13:07.606000+0000 | suspended\n ACC69889784 |  14.69298 |           WDmOh7NT |          Dale Schulist | 2022-10-25 12:10:52.239000+0000 | suspended\n ACC41977254 |  51.26492 |          J8jAKzvj2 |           Norma Nienow | 2023-08-19 18:54:39.195000+0000 | suspended\n 
ACC40932912 | 349.68067 |   SLcJgKZdLp5ALMyg | Vincenzo Considine III | 2023-05-16 00:22:45.991000+0000 |    closed\n ACC20642011 | 658.40713 |          clyZRD4fI |  Lannie McLaughlin DDS | 2023-05-11 23:14:30.249000+0000 |      open\n ACC74962085 | 970.98218 |       ZLETTSnj4NpD |          Ima Jerde DVM | 2023-05-07 10:01:56.218000+0000 |   pending\n ACC72848439 | 481.64267 |                 cc |        Kyla Deckow DDS | 2023-08-16 13:28:23.362000+0000 | suspended\n\n(10 rows)\n</code></pre> <p>Also check the HTML report, found at <code>docker/sample/report/index.html</code>, that gets generated to get an overview of what was executed.</p> <p></p>"},{"location":"setup/guide/data-source/http/","title":"HTTP Source","text":"<p>Info</p> <p>Generating data based on OpenAPI/Swagger document and pushing to HTTP endpoint is a paid feature. Try the free trial here.</p> <p>Creating a data generator based on an OpenAPI/Swagger document.</p> <p></p>"},{"location":"setup/guide/data-source/http/#requirements","title":"Requirements","text":"<ul> <li>10 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/data-source/http/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/data-source/http/#http-setup","title":"HTTP Setup","text":"<p>We will be using the http-bin docker image to help simulate a service with HTTP endpoints.</p> <p>Start it via:</p> <pre><code>cd docker\ndocker-compose up -d http\ndocker ps\n</code></pre>"},{"location":"setup/guide/data-source/http/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedHttpJavaPlanRun.java</code></li> <li>Scala: 
<code>src/main/scala/com/github/pflooky/plan/MyAdvancedHttpPlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedHttpJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedHttpPlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n</code></pre> <p>We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.</p>"},{"location":"setup/guide/data-source/http/#schema","title":"Schema","text":"<p>We can point the schema of a data source to an OpenAPI/Swagger document or URL. For this example, we will use the OpenAPI document found under <code>docker/mount/http/petstore.json</code> in the data-caterer-example repo. This is a simplified version of the original OpenAPI spec that can be found here.</p> <p>We have kept the following endpoints to test out:</p> <ul> <li>GET /pets - get all pets</li> <li>POST /pets - create a new pet</li> <li>GET /pets/{id} - get a pet by id</li> <li>DELETE /pets/{id} - delete a pet by id</li> </ul> JavaScala <pre><code>var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count().records(2));\n</code></pre> <pre><code>val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count.records(2))\n</code></pre> <p>The above defines that the schema will come from an OpenAPI document found at the path defined. 
It will then generate 2 requests per request method and endpoint combination.</p>"},{"location":"setup/guide/data-source/http/#run","title":"Run","text":"<p>Let's try running it and see what happens.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\n#after completing\ndocker logs -f docker-http-1\n</code></pre> <p>It should look something like this.</p> <pre><code>172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DeXQxFUHVja+EYm%26limit%3D33895 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DSXaFvAqwYGF%26tags%3DjdNRFONA%26limit%3D40975 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/kbH8D7rDuq HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/REsa0tnu7dvekGDvxR HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/EqrOr1dHFfKUjWb HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/7WG7JHPaNxP HTTP/1.1 200 Host: host.docker.internal}\n</code></pre> <p>Looks like we have some data now. But we can do better and add some enhancements to it.</p>"},{"location":"setup/guide/data-source/http/#foreign-keys","title":"Foreign keys","text":"<p>The four different requests that get sent could have the same <code>id</code> passed across to each of them if we define a foreign key relationship. This makes the requests closer to a real-life scenario, as pets get created and queried by a particular <code>id</code> value. We note that the <code>id</code> value is first used when a pet is created in the body of the POST request. 
Then it gets used as a path parameter in the DELETE and GET requests.</p> <p>To link them all together, we must follow a particular pattern when referring to request body, query parameter or path parameter columns.</p> HTTP Type Column Prefix Example Request Body <code>bodyContent</code> <code>bodyContent.id</code> Path Parameter <code>pathParam</code> <code>pathParamid</code> Query Parameter <code>queryParam</code> <code>queryParamid</code> Header <code>header</code> <code>headerContent_Type</code> <p>Also note that when creating a foreign field definition for an HTTP data source, we refer to a specific endpoint and method by following the pattern <code>{http method}{http path}</code>. For example, <code>POST/pets</code>. Let's apply this knowledge to link all the <code>id</code> values together.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"),     //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n);\n\nexecute(myPlan, conf, httpTask);\n</code></pre> <pre><code>val myPlan = plan.addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"),     //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n)\n\nexecute(myPlan, conf, httpTask)\n</code></pre> <p>Let's test it out by running it again.</p> <pre><code>./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n</code></pre> <pre><code>172.21.0.1 [06/Nov/2023:01:33:59 +0000] GET /anything/pets?limit%3D45971 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:00 +0000] GET /anything/pets?limit%3D62015 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:04 +0000] POST /anything/pets HTTP/1.1 200 
Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:05 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n</code></pre> <p>Now we have the same <code>id</code> values being produced across the POST, DELETE and GET requests! What if we knew that the <code>id</code> values should follow a particular pattern?</p>"},{"location":"setup/guide/data-source/http/#custom-metadata","title":"Custom metadata","text":"<p>Since we have defined a foreign key whose values originate from the POST request, we can update the metadata of the <code>id</code> column for the POST request and it will propagate to the other endpoints as well. 
Given the <code>id</code> column is a nested column as noted in the foreign key, we can alter its metadata via the following:</p> JavaScala <pre><code>var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field().name(\"bodyContent\").schema(field().name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count().records(2));\n</code></pre> <pre><code>val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field.name(\"bodyContent\").schema(field.name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count.records(2))\n</code></pre> <p>We first get the column <code>bodyContent</code>, then get its nested schema, select the column <code>id</code> and add metadata stating that <code>id</code> should follow the pattern <code>ID[0-9]{8}</code>.</p> <p>Let's run it again; we should now see some proper ID values.</p> <pre><code>./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n</code></pre> <pre><code>172.21.0.1 [06/Nov/2023:01:45:45 +0000] GET /anything/pets?tags%3D10fWnNoDz%26limit%3D66804 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:46 +0000] GET /anything/pets?tags%3DhyO6mI8LZUUpS HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:50 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:51 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n</code></pre> 
<p>Great! Now we have replicated a production-like flow of HTTP requests.</p>"},{"location":"setup/guide/data-source/http/#ordering","title":"Ordering","text":"<p>If you want to change the ordering of the requests, you can alter the order from within the OpenAPI/Swagger document. This is particularly useful when you want to simulate the same flow that users would take when utilising your application (i.e. create account, query account, update account).</p>"},{"location":"setup/guide/data-source/http/#rows-per-second","title":"Rows per second","text":"<p>By default, Data Caterer will push requests per method and endpoint at a rate of around 5 requests per second. If you want to alter this value, you can do so via the below configuration. The lowest supported rate is 1 request per second.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.api.model.Constants;\n\n...\nvar httpTask = http(\"my_http\", Map.of(Constants.ROWS_PER_SECOND(), \"1\"))\n...\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.model.Constants.ROWS_PER_SECOND\n\n...\nval httpTask = http(\"my_http\", options = Map(ROWS_PER_SECOND -&gt; \"1\"))\n...\n</code></pre> <p>Check out the full example under <code>AdvancedHttpPlanRun</code> in the example repo.</p>"},{"location":"setup/guide/data-source/kafka/","title":"Kafka","text":"<p>Info</p> <p>Writing data to Kafka is a paid feature. Try the free trial here.</p> <p>Creating a data generator for Kafka. 
You will build a Docker image that will be able to populate Kafka with data for the topics you configure.</p>"},{"location":"setup/guide/data-source/kafka/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> <li>Kafka</li> </ul>"},{"location":"setup/guide/data-source/kafka/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre> <p>If you already have a Kafka instance running, you can skip to this step.</p>"},{"location":"setup/guide/data-source/kafka/#kafka-setup","title":"Kafka Setup","text":"<p>Next, let's make sure you have an instance of Kafka up and running in your local environment. This will make it easy for us to iterate and check our changes.</p> <pre><code>cd docker\ndocker-compose up -d kafka\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedKafkaJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedKafkaPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedKafkaJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedKafkaPlan extends PlanRun {\n}\n</code></pre> <p>This class is where we define all of our configurations for generating data. 
There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/data-source/kafka/#connection-configuration","title":"Connection Configuration","text":"<p>Within our class, we can start by defining the connection properties to connect to Kafka.</p> JavaScala <pre><code>var accountTask = kafka(\n\"my_kafka\",       //name\n\"localhost:9092\", //url\nMap.of()          //optional additional connection options\n);\n</code></pre> <p>Additional options can be found here.</p> <pre><code>val accountTask = kafka(\n\"my_kafka\",       //name\n\"localhost:9092\", //url\nMap()             //optional additional connection options\n)\n</code></pre> <p>Additional options can be found here.</p>"},{"location":"setup/guide/data-source/kafka/#schema","title":"Schema","text":"<p>Let's create a task for inserting data into the <code>account-topic</code> that is already defined under <code>docker/data/kafka/setup_kafka.sh</code>. This topic should already be set up for you if you followed this step. We can check if the topic is set up already via the following command:</p> <pre><code>docker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n</code></pre> <p>Trimming the connection details to work with the docker-compose Kafka, we have a base Kafka connection to define the topic we will publish to. Let's define each field along with its corresponding data type. You will notice that the <code>text</code> fields do not have a data type defined. 
This is because the default data type is <code>StringType</code>.</p> JavaScala <pre><code>{\nvar kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield().name(\"key\").sql(\"content.account_id\"),\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()),  can define partition here\nfield().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n),\nfield().name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield().name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n)\n}\n</code></pre> <pre><code>val kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield.name(\"key\").sql(\"content.account_id\"),\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").type(IntegerType),  can define partition here\nfield.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n          |  NAMED_STRUCT('key', 'account-id', 
'value', TO_BINARY(content.account_id, 'utf-8')),\n          |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n          |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\nfield.name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield.name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#fields","title":"Fields","text":"<p>The schema defined for Kafka has a format that needs to be followed as noted above. Specifically, the required fields are: - value</p> <p>Whilst the other fields are optional: - key - partition - headers</p>"},{"location":"setup/guide/data-source/kafka/#headers","title":"headers","text":"<p><code>headers</code> follows a particular pattern where it is of type <code>array&lt;struct&lt;key: string,value: binary&gt;&gt;</code>. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that in the <code>value</code> part, it refers to <code>content.account_id</code> where <code>content</code> is another field defined at the top level of the schema. 
This allows you to reference other values that have already been generated.</p> JavaScala <pre><code>field().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n</code></pre> <pre><code>field.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n      |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n      |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n      |)\"\"\".stripMargin\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#transactions","title":"transactions","text":"<p><code>transactions</code> is an array that contains an inner structure of <code>txn_date</code> and <code>amount</code>. The size of the array generated can be controlled via <code>arrayMinLength</code> and <code>arrayMaxLength</code>.</p> JavaScala <pre><code>field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n</code></pre> <pre><code>field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#details","title":"details","text":"<p><code>details</code> is another example of a nested schema structure where it also has a nested structure itself in <code>updated_by</code>. 
One thing to note here is the <code>first_txn_date</code> field has a reference to the <code>content.transactions</code> array where it will  sort the array by <code>txn_date</code> and get the first element.</p> JavaScala <pre><code>field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n</code></pre> <pre><code>field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. 
We can control the output folder of that report via configurations.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>kafkaTask</code>, we have to call <code>execute</code>.</p>"},{"location":"setup/guide/data-source/kafka/#run","title":"Run","text":"<p>Now we can use the script <code>./run.sh</code>, found in the top level directory of <code>data-caterer-example</code>, to run the class we just created.</p> <pre><code>./run.sh\n#input class AdvancedKafkaJavaPlanRun or AdvancedKafkaPlanRun\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic --from-beginning\n</code></pre> <p>Your output should look like this.</p> <pre><code>{\"account_id\":\"ACC56292178\",\"year\":2022,\"amount\":18338.627721151555,\"details\":{\"name\":\"Isaias Reilly\",\"first_txn_date\":\"2021-01-22\",\"updated_by\":{\"user\":\"FgYXbKDWdhHVc3\",\"time\":\"2022-12-30T13:49:07.309Z\"}},\"transactions\":[{\"txn_date\":\"2021-01-22\",\"amount\":30556.52125487579},{\"txn_date\":\"2021-10-29\",\"amount\":39372.302259554635},{\"txn_date\":\"2021-10-29\",\"amount\":61887.31389495968}]}\n{\"account_id\":\"ACC37729457\",\"year\":2022,\"amount\":96885.31758764731,\"details\":{\"name\":\"Randell 
Witting\",\"first_txn_date\":\"2021-06-30\",\"updated_by\":{\"user\":\"HCKYEBHN8AJ3TB\",\"time\":\"2022-12-02T02:05:01.144Z\"}},\"transactions\":[{\"txn_date\":\"2021-06-30\",\"amount\":98042.09647765031},{\"txn_date\":\"2021-10-06\",\"amount\":41191.43564742036},{\"txn_date\":\"2021-11-16\",\"amount\":78852.08184809204},{\"txn_date\":\"2021-10-09\",\"amount\":13747.157653571106}]}\n{\"account_id\":\"ACC23127317\",\"year\":2023,\"amount\":81164.49304198896,\"details\":{\"name\":\"Jed Wisozk\",\"updated_by\":{\"user\":\"9MBFZZ\",\"time\":\"2023-07-12T05:56:52.397Z\"}},\"transactions\":[]}\n</code></pre> <p>Also check the generated HTML report, found at <code>docker/sample/report/index.html</code>, to get an overview of what was executed.</p> <p></p>"},{"location":"setup/guide/data-source/marquez-metadata-source/","title":"Metadata Source","text":"<p>Info</p> <p>Generating data based on an external metadata source is a paid feature. Try the free trial here.</p> <p>Creating a data generator for Postgres tables and a CSV file based on metadata stored in Marquez (which follows the OpenLineage API).</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#requirements","title":"Requirements","text":"<ul> <li>10 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/data-source/marquez-metadata-source/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/data-source/marquez-metadata-source/#marquez-setup","title":"Marquez Setup","text":"<p>You can follow the README found here to help with setting up Marquez in your local environment. 
This comes with an instance of Postgres which we will also be using as a data store for generated data.</p> <p>The command that was run for this example to help with setup of dummy data was <code>./docker/up.sh -a 5001 -m 5002 --seed</code>.</p> <p>Check that the following url shows some data like below once you click on <code>food_delivery</code> from the <code>ns</code> drop down in the top right corner.</p> <p></p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#postgres-setup","title":"Postgres Setup","text":"<p>Since we will also be using the Marquez Postgres instance as a data source, we will set up a separate database to store the generated data in via:</p> <pre><code>docker exec marquez-db psql -Upostgres -c 'CREATE DATABASE food_delivery'\n</code></pre>"},{"location":"setup/guide/data-source/marquez-metadata-source/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedMetadataSourceJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedMetadataSourcePlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n</code></pre> <p>We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily 
access.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#schema","title":"Schema","text":"<p>We can point the schema of a data source to our Marquez instance. For the Postgres data source, we will point to a <code>namespace</code>, which in Marquez or OpenLineage, represents a set of datasets. For the CSV data source, we will point to a specific <code>namespace</code> and <code>dataset</code>.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#single-schema","title":"Single Schema","text":"JavaScala <pre><code>var csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map.of(\"saveMode\", \"overwrite\", \"header\", \"true\"))\n.schema(metadataSource().marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count().records(10));\n</code></pre> <pre><code>val csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map(\"saveMode\" -&gt; \"overwrite\", \"header\" -&gt; \"true\"))\n.schema(metadataSource.marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count.records(10))\n</code></pre> <p>The above defines that the schema will come from Marquez, which is a type of metadata source that contains information about schemas. 
Specifically, it points to the <code>food_delivery</code> namespace and <code>public.delivery_7_days</code> dataset to retrieve the schema information from.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#multiple-schemas","title":"Multiple Schemas","text":"JavaScala <pre><code>var postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\", Map.of())\n.schema(metadataSource().marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count().records(10));\n</code></pre> <pre><code>val postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\")\n.schema(metadataSource.marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count.records(10))\n</code></pre> <p>We now have pointed this Postgres instance to produce multiple schemas that are defined under the <code>food_delivery</code> namespace. Also note that we are using database <code>food_delivery</code> in Postgres to push our generated data to, and we have set the number of records per sub data source (in this case, per table) to be 10.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#run","title":"Run","text":"<p>Let's run it and see what happens.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\n#after completing\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n</code></pre> <p>It should look something like this.</p> <pre><code> order_id |     order_placed_on     |   order_dispatched_on   |   order_delivered_on    |         customer_email         |                     customer_address                     | menu_id | restaurant_id |                        restaurant_address\n   | menu_item_id | category_id | discount_id | city_id | 
driver_id\n----------+-------------------------+-------------------------+-------------------------+--------------------------------+----------------------------------------------------------+---------+---------------+---------------------------------------------------------------\n---+--------------+-------------+-------------+---------+-----------\n    38736 | 2023-02-05 06:05:23.755 | 2023-09-08 04:29:10.878 | 2023-09-03 23:58:34.285 | april.skiles@hotmail.com       | 5018 Lang Dam, Gaylordfurt, MO 35172                     |   59841 |         30971 | Suite 439 51366 Bartoletti Plains, West Lashawndamouth, CA 242\n42 |        55697 |       36370 |       21574 |   88022 |     16569\n4376 | 2022-12-19 14:39:53.442 | 2023-08-30 07:40:06.948 | 2023-03-15 20:38:26.11  | adelina.balistreri@hotmail.com | Apt. 340 9146 Novella Motorway, East Troyhaven, UT 34773 |   66195 |         42765 | Suite 670 8956 Rob Fork, Rennershire, CA 04524\n|        26516 |       81335 |       87615 |   27433 |     45649\n11083 | 2022-10-30 12:46:38.692 | 2023-06-02 13:05:52.493 | 2022-11-27 18:38:07.873 | johnny.gleason@gmail.com       | Apt. 385 99701 Lemke Place, New Irvin, RI 73305          |   66427 |         44438 | 1309 Danny Cape, Weimanntown, AL 15865\n|        41686 |       36508 |       34498 |   24191 |     92405\n58759 | 2023-07-26 14:32:30.883 | 2022-12-25 11:04:08.561 | 2023-04-21 17:43:05.86  | isabelle.ohara@hotmail.com     | 2225 Evie Lane, South Ardella, SD 90805                  |   27106 |         25287 | Suite 678 3731 Dovie Park, Port Luigi, ID 08250\n|        94205 |       66207 |       81051 |   52553 |     27483\n</code></pre> <p>You can also try query some other tables. 
Let's also check what is in the CSV file.</p> <pre><code>$ head docker/sample/csv/part-0000*\nmenu_item_id,category_id,discount_id,city_id,driver_id,order_id,order_placed_on,order_dispatched_on,order_delivered_on,customer_email,customer_address,menu_id,restaurant_id,restaurant_address\n72248,37098,80135,45888,5036,11090,2023-09-20T05:33:08.036+08:00,2023-05-16T23:10:57.119+08:00,2023-05-01T22:02:23.272+08:00,demetrice.rohan@hotmail.com,\"406 Harmony Rue, Wisozkburgh, MD 12282\",33762,9042,\"Apt. 751 0796 Ellan Flats, Lake Chetville, WI 81957\"\n41644,40029,48565,83373,89919,58359,2023-04-18T06:28:26.194+08:00,2022-10-15T18:17:48.998+08:00,2023-02-06T17:02:04.104+08:00,joannie.okuneva@yahoo.com,\"Suite 889 022 Susan Lane, Zemlakport, OR 56996\",27467,6216,\"Suite 016 286 Derick Grove, Dooleytown, NY 14664\"\n49299,53699,79675,40821,61764,72234,2023-07-16T21:33:48.739+08:00,2023-02-14T21:23:10.265+08:00,2023-09-18T02:08:51.433+08:00,ina.heller@yahoo.com,\"Suite 600 86844 Heller Island, New Celestinestad, DE 42622\",48002,12462,\"5418 Okuneva Mountain, East Blairchester, MN 04060\"\n83197,86141,11085,29944,81164,65382,2023-01-20T06:08:25.981+08:00,2023-01-11T13:24:32.968+08:00,2023-09-09T02:30:16.890+08:00,lakisha.bashirian@yahoo.com,\"Suite 938 534 Theodore Lock, Port Caitlynland, LA 67308\",69109,47727,\"4464 Stewart Tunnel, Marguritemouth, AR 56791\"\n</code></pre> <p>Looks like we have some data now. But we can do better and add some enhancements to it.</p> <p>What if we wanted the same records in Postgres <code>public.delivery_7_days</code> to also show up in the CSV file? That's where we can use a foreign key definition.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#foreign-key","title":"Foreign Key","text":"<p>We can take a look at the report (under <code>docker/sample/report/index.html</code>) to see what we need to do to create the  foreign key. 
From the overview, you should see under <code>Tasks</code> there is a <code>my_postgres</code> task which has  <code>food_delivery_public.delivery_7_days</code> as a step. Click on the link for <code>food_delivery_public.delivery_7_days</code> and it  will take us to a page where we can find out about the columns used in this table. Click on the <code>Fields</code> button on the  far right to see.</p> <p>We can copy all or a subset of the fields that we want matched across the CSV file and Postgres. For this example, we will  take all the fields.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\npostgresTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask);\n</code></pre> <pre><code>val foreignCols = List(\"order_id\", \"order_placed_on\", \"order_dispatched_on\", \"order_delivered_on\", \"customer_email\",\n\"customer_address\", \"menu_id\", \"restaurant_id\", \"restaurant_address\", \"menu_item_id\", \"category_id\", \"discount_id\",\n\"city_id\", \"driver_id\")\n\nval myPlan = plan.addForeignKeyRelationships(\ncsvTask, foreignCols,\nList(foreignField(postgresTask, \"food_delivery_public.delivery_7_days\", foreignCols))\n)\n\nval conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask)\n</code></pre> <p>Notice how we have defined the <code>csvTask</code> and <code>foreignCols</code> as the main foreign key but for <code>postgresTask</code>, we had to  define it as a <code>foreignField</code>. This is because <code>postgresTask</code> has multiple tables within it, and we only want to define our foreign key with respect to the <code>public.delivery_7_days</code> table. We use the step name (can be seen from the report)  to specify the table to target. 
</p> <p>To test this out, we will truncate the <code>public.delivery_7_days</code> table in Postgres first, and then try run again.</p> <pre><code>docker exec marquez-db psql -Upostgres -d food_delivery -c 'TRUNCATE public.delivery_7_days'\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n</code></pre> <pre><code> order_id |     order_placed_on     |   order_dispatched_on   |   order_delivered_on    |        customer_email        |\ncustomer_address                     | menu_id | restaurant_id |                   restaurant_address                   | menu\n_item_id | category_id | discount_id | city_id | driver_id\n----------+-------------------------+-------------------------+-------------------------+------------------------------+-------------\n--------------------------------------------+---------+---------------+--------------------------------------------------------+-----\n---------+-------------+-------------+---------+-----------\n    53333 | 2022-10-15 08:40:23.394 | 2023-01-23 09:42:48.397 | 2023-08-12 08:50:52.397 | normand.aufderhar@gmail.com  | Apt. 036 449\n27 Wilderman Forge, Marvinchester, CT 15952 |   40412 |         70130 | Suite 146 98176 Schaden Village, Grahammouth, SD 12354 |\n90141 |       44210 |       83966 |   78614 |     77449\n</code></pre> <p>Let's grab the first email from the Postgres table and check whether the same record exists in the CSV file.</p> <pre><code>$ cat docker/sample/csv/part-0000* | grep normand.aufderhar\n90141,44210,83966,78614,77449,53333,2022-10-15T08:40:23.394+08:00,2023-01-23T09:42:48.397+08:00,2023-08-12T08:50:52.397+08:00,normand.aufderhar@gmail.com,\"Apt. 036 44927 Wilderman Forge, Marvinchester, CT 15952\",40412,70130,\"Suite 146 98176 Schaden Village, Grahammouth, SD 12354\"\n</code></pre> <p>Great! 
Now we have the ability to get schema information from an external source, add our own foreign keys and generate  data.</p> <p>Check out the full example under <code>AdvancedMetadataSourcePlanRun</code> in the example repo.</p>"},{"location":"setup/guide/data-source/open-metadata-source/","title":"OpenMetadata Source","text":"<p>Info</p> <p>Generating data based on an external metadata source is a paid feature. Try the free trial here.</p> <p>Creating a data generator for a JSON file based on metadata stored in OpenMetadata.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#requirements","title":"Requirements","text":"<ul> <li>10 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/data-source/open-metadata-source/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/data-source/open-metadata-source/#openmetadata-setup","title":"OpenMetadata Setup","text":"<p>You can follow the local docker setup found here to help with setting up OpenMetadata in your local environment.</p> <p>If that page becomes outdated or the link doesn't work, below are the commands I used to run it:</p> <pre><code>mkdir openmetadata-docker &amp;&amp; cd openmetadata-docker\ncurl -sL https://github.com/open-metadata/OpenMetadata/releases/download/1.2.0-release/docker-compose.yml &gt; docker-compose.yml\ndocker compose -f docker-compose.yml up --detach\n</code></pre> <p>Check that the following url works and login with <code>admin:admin</code>. 
Then you should see some data  like below:</p> <p></p>"},{"location":"setup/guide/data-source/open-metadata-source/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedOpenMetadataSourceJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedOpenMetadataSourcePlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedOpenMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedOpenMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n</code></pre> <p>We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#schema","title":"Schema","text":"<p>We can point the schema of a data source to our OpenMetadata instance. 
We will use a JSON data source so that we can show how nested data types are handled and how we could customise it.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#single-schema","title":"Single Schema","text":"JavaScala <pre><code>import com.github.pflooky.datacaterer.api.model.Constants;\n...\n\nvar jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(metadataSource().openMetadataJava(\n\"http://localhost:8585/api\",                                                              //url\nConstants.OPEN_METADATA_AUTH_TYPE_OPEN_METADATA(),                                        //auth type\nMap.of(                                                                                   //additional options (including auth options)\nConstants.OPEN_METADATA_JWT_TOKEN(), \"abc123\",                                        //get from settings/bots/ingestion-bot\nConstants.OPEN_METADATA_TABLE_FQN(), \"sample_data.ecommerce_db.shopify.raw_customer\"  //table fully qualified name\n)\n))\n.count(count().records(10));\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.model.Constants.{OPEN_METADATA_AUTH_TYPE_OPEN_METADATA, OPEN_METADATA_JWT_TOKEN, OPEN_METADATA_TABLE_FQN, SAVE_MODE}\n...\n\nval jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -&gt; \"overwrite\"))\n.schema(metadataSource.openMetadata(\n\"http://localhost:8585/api\",                                                  //url\nOPEN_METADATA_AUTH_TYPE_OPEN_METADATA,                                        //auth type\nMap(                                                                          //additional options (including auth options)\nOPEN_METADATA_JWT_TOKEN -&gt; \"abc123\",                                        //get from settings/bots/ingestion-bot\nOPEN_METADATA_TABLE_FQN -&gt; \"sample_data.ecommerce_db.shopify.raw_customer\"  //table fully qualified name\n)\n))\n.count(count.records(10))\n</code></pre> <p>The above 
defines that the schema will come from OpenMetadata, which is a type of metadata source that contains information about schemas. Specifically, it points to the <code>sample_data.ecommerce_db.shopify.raw_customer</code> table. You can check out the schema here to see what it looks like.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#run","title":"Run","text":"<p>Let's try run and see what happens.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedOpenMetadataSourceJavaPlanRun or MyAdvancedOpenMetadataSourcePlanRun\n#after completing\ncat docker/sample/json/part-00000-*\n</code></pre> <p>It should look something like this.</p> <pre><code>{\n\"comments\": \"Mh6jqpD5e4M\",\n\"creditcard\": \"6771839575926717\",\n\"membership\": \"Za3wCQUl9E  EJj712\",\n\"orders\": [\n{\n\"product_id\": \"Aa6NG0hxfHVq\",\n\"price\": 16139,\n\"onsale\": false,\n\"tax\": 58134,\n\"weight\": 40734,\n\"others\": 45813,\n\"vendor\": \"Kh\"\n},\n{\n\"product_id\": \"zbHBY \",\n\"price\": 17903,\n\"onsale\": false,\n\"tax\": 39526,\n\"weight\": 9346,\n\"others\": 52035,\n\"vendor\": \"jbkbnXAa\"\n},\n{\n\"product_id\": \"5qs3gakppd7Nw5\",\n\"price\": 48731,\n\"onsale\": true,\n\"tax\": 81105,\n\"weight\": 2004,\n\"others\": 20465,\n\"vendor\": \"nozCDMSXRPH Ev\"\n},\n{\n\"product_id\": \"CA6h17ANRwvb\",\n\"price\": 62102,\n\"onsale\": true,\n\"tax\": 96601,\n\"weight\": 78849,\n\"others\": 79453,\n\"vendor\": \" ihVXEJz7E2EFS\"\n}\n],\n\"platform\": \"GLt9\",\n\"preference\": {\n\"key\": \"nmPmsPjg C\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Loren Bechtelar\",\n\"street_address\": \"Suite 526 293 Rohan Road, Wunschshire, NE 25532\",\n\"city\": \"South Norrisland\",\n\"postcode\": \"56863\"\n}\n],\n\"shipping_date\": \"2022-11-03\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"lance.murphy\",\n\"name\": \"Zane Brakus DVM\",\n\"sex\": \"7HcAaPiO\",\n\"address\": \"594 Loida Haven, Gilland, MA 26071\",\n\"mail\": 
\"Un3fhbvK2rEbenIYdnq\",\n\"birthdate\": \"2023-01-31\"\n}\n}\n</code></pre> <p>Looks like we have some data now. But we can do better and add some enhancements to it.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#custom-metadata","title":"Custom metadata","text":"<p>We can see from the data generated, that it isn't quite what we want. The metadata is not sufficient for us to produce production-like data yet. Let's try to add some enhancements to it.</p> <p>Let's make the <code>platform</code> field a choice field that can only be a set of certain values and the nested field <code>customer.sex</code> is also from a predefined set of values.</p> JavaScala <pre><code>var jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield().name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield().name(\"customer\").schema(field().name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count().records(10));\n</code></pre> <pre><code>val jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -&gt; \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield.name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield.name(\"customer\").schema(field.name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count.records(10))\n</code></pre> <p>Let's test it out by running it again</p> <pre><code>./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\ncat docker/sample/json/part-00000-*\n</code></pre> <pre><code>{\n\"comments\": \"vqbPUm\",\n\"creditcard\": \"6304867705548636\",\n\"membership\": \"GZ1xOnpZSUOKN\",\n\"orders\": [\n{\n\"product_id\": \"rgOokDAv\",\n\"price\": 77367,\n\"onsale\": false,\n\"tax\": 61742,\n\"weight\": 87855,\n\"others\": 26857,\n\"vendor\": \"04XHR64ImMr9T\"\n}\n],\n\"platform\": \"mobile\",\n\"preference\": {\n\"key\": \"IB5vNdWka\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Isiah 
Bins\",\n\"street_address\": \"36512 Ross Spurs, Hillhaven, IA 18760\",\n\"city\": \"Averymouth\",\n\"postcode\": \"75818\"\n},\n{\n\"name\": \"Scott Prohaska\",\n\"street_address\": \"26573 Haley Ports, Dariusland, MS 90642\",\n\"city\": \"Ashantimouth\",\n\"postcode\": \"31792\"\n},\n{\n\"name\": \"Rudolf Stamm\",\n\"street_address\": \"Suite 878 0516 Danica Path, New Christiaport, ID 10525\",\n\"city\": \"Doreathaport\",\n\"postcode\": \"62497\"\n}\n],\n\"shipping_date\": \"2023-08-24\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"jolie.cremin\",\n\"name\": \"Fay Klein\",\n\"sex\": \"O\",\n\"address\": \"Apt. 174 5084 Volkman Creek, Hillborough, PA 61959\",\n\"mail\": \"BiTmzb7\",\n\"birthdate\": \"2023-04-07\"\n}\n}\n</code></pre> <p>Great! Now we have the ability to get schema information from an external source, add our own metadata and generate  data.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#data-validation","title":"Data validation","text":"<p>Another aspect of OpenMetadata that can be leveraged is the definition of data quality rules. These rules can be  incorporated into your Data Caterer job as well by enabling data validations via <code>enableGenerateValidations</code> in  <code>configuration</code>.</p> JavaScala <pre><code>var conf = configuration().enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(conf, jsonTask);\n</code></pre> <pre><code>val conf = configuration.enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(conf, jsonTask)\n</code></pre> <p>Check out the full example under <code>AdvancedOpenMetadataSourcePlanRun</code> in the example repo.</p>"},{"location":"setup/guide/data-source/solace/","title":"Solace","text":"<p>Info</p> <p>Writing data to Solace is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator for Solace. You will build a Docker image that will be able to populate data in Solace for the queues/topics you configure.</p> <p></p>"},{"location":"setup/guide/data-source/solace/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> <li>Solace</li> </ul>"},{"location":"setup/guide/data-source/solace/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre> <p>If you already have a Solace instance running, you can skip to this step.</p>"},{"location":"setup/guide/data-source/solace/#solace-setup","title":"Solace Setup","text":"<p>Next, let's make sure you have an instance of Solace up and running in your local environment. This will make it easy for us to iterate and check our changes.</p> <pre><code>cd docker\ndocker-compose up -d solace\n</code></pre> <p>Open up localhost:8080, log in with <code>admin:admin</code> and check that the <code>default</code> VPN exists, like below. Notice that there are 2 queues/topics created. 
If you do not see 2 created, try to run the script found under <code>docker/data/solace/setup_solace.sh</code> and change the <code>host</code> to <code>localhost</code>.</p> <p></p>"},{"location":"setup/guide/data-source/solace/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedSolaceJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedSolacePlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedSolaceJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedSolacePlan extends PlanRun {\n}\n</code></pre> <p>This class is where we define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/data-source/solace/#connection-configuration","title":"Connection Configuration","text":"<p>Within our class, we can start by defining the connection properties to connect to Solace.</p> JavaScala <pre><code>var accountTask = solace(\n\"my_solace\",                        //name\n\"smf://host.docker.internal:55554\", //url\nMap.of()                            //optional additional connection options\n);\n</code></pre> <p>Additional connection options can be found here.</p> <pre><code>val accountTask = solace(\n\"my_solace\",                        //name\n\"smf://host.docker.internal:55554\", //url\nMap()                               //optional additional connection options\n)\n</code></pre> <p>Additional connection options can be found here.</p>"},{"location":"setup/guide/data-source/solace/#schema","title":"Schema","text":"<p>Let's create a task for inserting data into the <code>rest_test_queue</code> or 
<code>rest_test_topic</code> that is already created for us from this step.</p> <p>Trimming the connection details to work with the docker-compose Solace, we have a base Solace connection to define the JNDI destination we will publish to. Let's define each field along with their corresponding data type. You will notice that the <code>text</code> fields do not have a data type defined. This is because the default data type is <code>StringType</code>.</p> JavaScala <pre><code>{\nvar solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()),   //can define message JMS priority here\nfield().name(\"headers\")                                     //set message properties via headers field\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()).min(2021).max(2023),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n)\n)\n.count(count().records(10));\n}\n</code></pre> <pre><code>val 
solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").`type`(IntegerType),  //can define message JMS priority here\nfield.name(\"headers\")                           //set message properties via headers field\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n          |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n          |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n          |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\n).count(count.records(10))\n</code></pre>"},{"location":"setup/guide/data-source/solace/#fields","title":"Fields","text":"<p>The schema defined for Solace has a format that needs to be followed as noted above. 
Specifically, the required fields are:</p> <ul> <li>value</li> </ul> <p>Whilst the other fields are optional:</p> <ul> <li>partition - refers to JMS priority of the message</li> <li>headers - refers to JMS message properties</li> </ul>"},{"location":"setup/guide/data-source/solace/#headers","title":"headers","text":"<p><code>headers</code> follows a particular pattern where it is of type <code>HeaderType.getType</code> which, behind the scenes, translates to <code>array&lt;struct&lt;key: string,value: binary&gt;&gt;</code>. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that in the <code>value</code> part, it refers to <code>content.account_id</code> where <code>content</code> is another field defined at the top level of the schema. This allows you to reference other values that have already been generated.</p> JavaScala <pre><code>field().name(\"headers\")\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n</code></pre> <pre><code>field.name(\"headers\")\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n      |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n      |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n      |)\"\"\".stripMargin\n)\n</code></pre>"},{"location":"setup/guide/data-source/solace/#transactions","title":"transactions","text":"<p><code>transactions</code> is an array that contains an inner structure of <code>txn_date</code> and <code>amount</code>. 
The size of the array generated can be controlled via <code>arrayMinLength</code> and <code>arrayMaxLength</code>.</p> JavaScala <pre><code>field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n</code></pre> <pre><code>field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n</code></pre>"},{"location":"setup/guide/data-source/solace/#details","title":"details","text":"<p><code>details</code> is another example of a nested schema structure where it also has a nested structure itself in <code>updated_by</code>. One thing to note here is the <code>first_txn_date</code> field has a reference to the <code>content.transactions</code> array where it will sort the array by <code>txn_date</code> and get the first element.</p> JavaScala <pre><code>field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n</code></pre> <pre><code>field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n</code></pre>"},{"location":"setup/guide/data-source/solace/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. 
We can control the output folder of that report via configurations.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n</code></pre>"},{"location":"setup/guide/data-source/solace/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>solaceTask</code>, we have to call <code>execute</code>.</p>"},{"location":"setup/guide/data-source/solace/#run","title":"Run","text":"<p>Now we can run the class we just created via the script <code>./run.sh</code>, which is in the top-level directory of <code>data-caterer-example</code>.</p> <pre><code>./run.sh\n#input class MyAdvancedSolaceJavaPlan or MyAdvancedSolacePlan\n#after completing, check http://localhost:8080 from browser\n</code></pre> <p>Your output should look like this.</p> <p></p> <p>Unfortunately, there is no easy way to see the message content. You can check the message content from your application or service that consumes these messages.</p> <p>Also check the generated HTML report, found at <code>docker/sample/report/index.html</code>, to get an overview of what was executed. Or view the sample report found here.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/","title":"Auto Generate From Data Connection","text":"<p>Info</p> <p>Auto data generation from data connection is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator based on only a data connection to Postgres.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/auto-generate-connection/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/auto-generate-connection/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedAutomatedJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedAutomatedPlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedAutomatedJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (3)\n.enableUniqueCheck(true)                                                          (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedAutomatedPlanRun extends PlanRun {\n\nval autoRun = configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 
(2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (3)\n.enableUniqueCheck(true)                                                          (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n</code></pre> <p>In the above code, we note the following:</p> <ol> <li>Data source configuration to a Postgres data source called <code>my_postgres</code></li> <li>We have enabled the flag <code>enableGeneratePlanAndTasks</code> which tells Data Caterer to go to <code>my_postgres</code> and generate    data for all the tables found under the database <code>customer</code> (which is defined in the connection string).</li> <li>The config <code>generatedPlanAndTaskFolderPath</code> defines where the metadata that is gathered from <code>my_postgres</code> should be    saved so that we can reuse it later.</li> <li><code>enableUniqueCheck</code> is set to true to ensure that generated data is unique based on primary key or foreign key    definitions.</li> </ol> <p>Note</p> <p>Unique check will only ensure generated data is unique. Any existing data in your data source is not taken into account, so generated data may fail to insert depending on the data source restrictions.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#postgres-setup","title":"Postgres Setup","text":"<p>If you don't have your own Postgres up and running, you can set up and run an instance configured in the <code>docker</code> folder via:</p> <pre><code>cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n</code></pre> <p>This will create the tables found under <code>docker/data/sql/postgres/customer.sql</code>. You can change this file to contain your own tables. 
We can see there are 4 tables created for us, <code>accounts, balances, transactions and mapping</code>.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedAutomatedJavaPlanRun or MyAdvancedAutomatedPlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1;'\n</code></pre> <p>It should look something like this.</p> <pre><code>   id   | account_number  | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint |   customer_id_decimal    | customer_id_real | customer_id_double | open_date  |     open_timestamp      | last_opened_time |                                                           payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H            | SfA0eZJcTm | CuRw                    |              13 |                   42 |               6041 | 76987.745612542900000000 |         91866.78 |  66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736     | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n</code></pre> <p>The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.</p> <p>Also check the HTML report that gets generated under <code>docker/sample/report/index.html</code>. 
You can see a summary of what was generated along with other metadata.</p> <p>You can now play around with other tables or data sources and auto generate data for them.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/auto-generate-connection/#learn-from-existing-data","title":"Learn From Existing Data","text":"<p>If you have any existing data within your data source, Data Caterer will gather metadata about the existing data to help guide it when generating new data. There are configurations, found here, that can help tune the metadata analysis.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#filter-out-schematables","title":"Filter Out Schema/Tables","text":"<p>As part of your connection definition, you can define any schemas and/or tables you don't want to generate data for. In the example below, it will not generate any data for any tables under the <code>history</code> and <code>audit</code> schemas. 
Also, any table with the name <code>balances</code> or <code>transactions</code>, in any schema, will not have data generated.</p> JavaScala <pre><code>var autoRun = configuration()\n.postgres(\n\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap.of(\n\"filterOutSchema\", \"history, audit\",\n\"filterOutTable\", \"balances, transactions\"\n)\n);\n</code></pre> <pre><code>val autoRun = configuration\n.postgres(\n\"my_postgres\",\n\"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap(\n\"filterOutSchema\" -&gt; \"history, audit\",\n\"filterOutTable\" -&gt; \"balances, transactions\"\n)\n)\n</code></pre>"},{"location":"setup/guide/scenario/auto-generate-connection/#define-record-count","title":"Define record count","text":"<p>You can control the record count per sub data source via <code>numRecordsPerStep</code>.</p> JavaScala <pre><code>var autoRun = configuration()\n...\n.numRecordsPerStep(100);\n\nexecute(autoRun);\n</code></pre> <pre><code>val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n</code></pre>"},{"location":"setup/guide/scenario/batch-and-event/","title":"Generate Batch and Event Data","text":"<p>Info</p> <p>Generating event data is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator for a Kafka topic with matching records in a CSV file.</p>"},{"location":"setup/guide/scenario/batch-and-event/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/batch-and-event/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/batch-and-event/#kafka-setup","title":"Kafka Setup","text":"<p>If you don't have your own Kafka up and running, you can set up and run an instance configured in the <code>docker</code> folder via:</p> <pre><code>cd docker\ndocker-compose up -d kafka\ndocker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n</code></pre> <p>Let's create a task for inserting data into the <code>account-topic</code> that is already defined under <code>docker/data/kafka/setup_kafka.sh</code>.</p>"},{"location":"setup/guide/scenario/batch-and-event/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedBatchEventJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedBatchEventPlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedBatchEventJavaPlanRun extends PlanRun {\n{\nvar kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedBatchEventPlanRun extends PlanRun {\nval kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n}\n</code></pre> <p>We will borrow the Kafka 
task that is already defined under the class <code>AdvancedKafkaPlanRun</code> or <code>AdvancedKafkaJavaPlanRun</code>. You can go through the Kafka guide here for more details.</p>"},{"location":"setup/guide/scenario/batch-and-event/#schema","title":"Schema","text":"<p>Let us set up the corresponding schema for the CSV file where we want to match the values that are generated for the Kafka messages.</p> JavaScala <pre><code>var kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n\nvar csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield().name(\"account_number\"),\nfield().name(\"year\"),\nfield().name(\"name\"),\nfield().name(\"payload\")\n);\n</code></pre> <pre><code>val kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n\nval csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield.name(\"account_number\"),\nfield.name(\"year\"),\nfield.name(\"name\"),\nfield.name(\"payload\")\n)\n</code></pre> <p>This is a simple schema where we want to use the values and metadata that are already defined in the <code>kafkaTask</code> to determine what the data will look like for the CSV file. Even if we defined some metadata here, it would be overridden when we define our foreign key relationships.</p>"},{"location":"setup/guide/scenario/batch-and-event/#foreign-keys","title":"Foreign Keys","text":"<p>From the above CSV schema, we note the following against the Kafka schema:</p> <ul> <li><code>account_number</code> in CSV needs to match with the <code>account_id</code> in Kafka<ul> <li>We see that <code>account_id</code> is referred to in the <code>key</code> column as <code>field.name(\"key\").sql(\"content.account_id\")</code></li> </ul> </li> <li><code>year</code> needs to match with <code>content.year</code> in Kafka, which is a nested field<ul> <li>We can only do foreign key relationships with top level fields, not nested fields. 
So we define a new column   called <code>tmp_year</code> which will not appear in the final output for the Kafka messages but is used as an intermediate   step <code>field.name(\"tmp_year\").sql(\"content.year\").omit(true)</code></li> </ul> </li> <li><code>name</code> needs to match with <code>content.details.name</code> in Kafka, also a nested field<ul> <li>Using the same logic as above, we define a temporary column called <code>tmp_name</code> which will take the value of the   nested field but will be omitted <code>field.name(\"tmp_name\").sql(\"content.details.name\").omit(true)</code></li> </ul> </li> <li><code>payload</code> represents the whole JSON message sent to Kafka, which matches to <code>value</code> column</li> </ul> <p>Our foreign keys are therefore defined like below. Order is important when defining the list of columns. The index needs to match with the corresponding column in the other data source.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\nkafkaTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(myPlan, conf, kafkaTask, csvTask);\n</code></pre> <pre><code>val myPlan = plan.addForeignKeyRelationship(\nkafkaTask, List(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList(csvTask -&gt; List(\"account_number\", \"year\", \"name\", \"payload\"))\n)\n\nval conf = configuration.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(myPlan, conf, kafkaTask, csvTask)\n</code></pre>"},{"location":"setup/guide/scenario/batch-and-event/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedBatchEventJavaPlanRun or MyAdvancedBatchEventPlanRun\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic 
--from-beginning\n</code></pre> <p>It should look something like this.</p> <pre><code>{\"account_id\":\"ACC03093143\",\"year\":2023,\"amount\":87990.37196728592,\"details\":{\"name\":\"Nadine Heidenreich Jr.\",\"first_txn_date\":\"2021-11-09\",\"updated_by\":{\"user\":\"YfEyJCe8ohrl0j IfyT\",\"time\":\"2022-09-26T20:47:53.404Z\"}},\"transactions\":[{\"txn_date\":\"2021-11-09\",\"amount\":97073.7914706189}]}\n{\"account_id\":\"ACC08764544\",\"year\":2021,\"amount\":28675.58758765888,\"details\":{\"name\":\"Delila Beer\",\"first_txn_date\":\"2021-05-19\",\"updated_by\":{\"user\":\"IzB5ksXu\",\"time\":\"2023-01-26T20:47:26.389Z\"}},\"transactions\":[{\"txn_date\":\"2021-10-01\",\"amount\":80995.23818711648},{\"txn_date\":\"2021-05-19\",\"amount\":92572.40049217848},{\"txn_date\":\"2021-12-11\",\"amount\":99398.79832225188}]}\n{\"account_id\":\"ACC62505420\",\"year\":2023,\"amount\":96125.3125884202,\"details\":{\"name\":\"Shawn Goodwin\",\"updated_by\":{\"user\":\"F3dqIvYp2pFtena4\",\"time\":\"2023-02-11T04:38:29.832Z\"}},\"transactions\":[]}\n</code></pre> <p>Let's also check if there is a corresponding record in the CSV file.</p> <pre><code>$ cat docker/sample/csv/account/part-0000* | grep ACC03093143\nACC03093143,2023,Nadine Heidenreich Jr.,\"{\\\"account_id\\\":\\\"ACC03093143\\\",\\\"year\\\":2023,\\\"amount\\\":87990.37196728592,\\\"details\\\":{\\\"name\\\":\\\"Nadine Heidenreich Jr.\\\",\\\"first_txn_date\\\":\\\"2021-11-09\\\",\\\"updated_by\\\":{\\\"user\\\":\\\"YfEyJCe8ohrl0j IfyT\\\",\\\"time\\\":\\\"2022-09-26T20:47:53.404Z\\\"}},\\\"transactions\\\":[{\\\"txn_date\\\":\\\"2021-11-09\\\",\\\"amount\\\":97073.7914706189}]}\"\n</code></pre> <p>Great! 
The account, year, name and payload look to all match up.</p>"},{"location":"setup/guide/scenario/batch-and-event/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/batch-and-event/#order-of-execution","title":"Order of execution","text":"<p>You may notice that the events are generated first, then the CSV file. This is because as part of the <code>execute</code> function, we passed in the <code>kafkaTask</code> first, before the <code>csvTask</code>. You can change the order of execution by passing in <code>csvTask</code> before <code>kafkaTask</code> into the <code>execute</code> function.</p>"},{"location":"setup/guide/scenario/data-validation/","title":"Data Validations","text":"<p>Creating a data validator for a JSON file.</p> <p></p>"},{"location":"setup/guide/scenario/data-validation/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/data-validation/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#data-setup","title":"Data Setup","text":"<p>To aid in showing the functionality of data validations, we will first generate some data that our validations will run against. 
Run the below command and it will generate JSON files under <code>docker/sample/json</code> folder.</p> <pre><code>./run.sh JsonPlan\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyValidationJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyValidationPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyValidationJavaPlan extends PlanRun {\n{\nvar jsonTask = json(\"my_json\", \"/opt/app/data/json\");\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false);\n\nexecute(config, jsonTask);\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyValidationPlan extends PlanRun {\nval jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false)\n\nexecute(config, jsonTask)\n}\n</code></pre> <p>As noted above, we create a JSON task that points to where the JSON data has been created at folder <code>/opt/app/data/json</code> . 
We also note that <code>enableValidation</code> is set to <code>true</code> and <code>enableGenerateData</code> to <code>false</code> to tell Data Caterer that we only want to validate data.</p>"},{"location":"setup/guide/scenario/data-validation/#validations","title":"Validations","text":"<p>For reference, the schema we will be validating against looks like the below.</p> <pre><code>.schema(\nfield.name(\"account_id\"),\n  field.name(\"year\").`type`(IntegerType),\n  field.name(\"balance\").`type`(DoubleType),\n  field.name(\"date\").`type`(DateType),\n  field.name(\"status\"),\n  field.name(\"update_history\").`type`(ArrayType)\n.schema(\nfield.name(\"updated_time\").`type`(TimestampType),\n      field.name(\"status\").oneOf(\"open\", \"closed\", \"pending\", \"suspended\"),\n    ),\n  field.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\n      field.name(\"age\").`type`(IntegerType),\n      field.name(\"city\").expression(\"#{Address.city}\")\n)\n)\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#basic-validation","title":"Basic Validation","text":"<p>Let's say our goal is to validate the <code>customer_details.name</code> field to ensure it conforms to the regex pattern <code>[A-Z][a-z]+ [A-Z][a-z]+</code>. Given the diversity in naming conventions across cultures and countries, variations such as middle names, suffixes, prefixes, or language-specific differences are tolerated to a certain extent. 
The validation allows for an acceptable error threshold before it is marked as failed.</p>"},{"location":"setup/guide/scenario/data-validation/#validation-criteria","title":"Validation Criteria","text":"<ul> <li>Field to Validate: <code>customer_details.name</code></li> <li>Regex Pattern: <code>[A-Z][a-z]+ [A-Z][a-z]+</code></li> <li>Error Tolerance: If more than 10% of records do not match the regex, then fail.</li> </ul>"},{"location":"setup/guide/scenario/data-validation/#considerations","title":"Considerations","text":"<ul> <li>Customisation<ul> <li>Adjust the regex pattern and error threshold based on your specific data schema and validation requirements.</li> <li>For the full list of types of basic validations that can be used, check this page.</li> </ul> </li> <li>Understanding Tolerance<ul> <li>Be mindful of the error threshold, as it directly influences what percentage of deviations from the pattern is acceptable.</li> </ul> </li> </ul> JavaScala <pre><code>validation().col(\"customer_details.name\")\n.matches(\"[A-Z][a-z]+ [A-Z][a-z]+\")\n.errorThreshold(0.1)                                      //&lt;=10% failure rate is acceptable\n.description(\"Names generally follow the same pattern\"),  //description to add context in the report or for other developers\n</code></pre> <pre><code>validation.col(\"customer_details.name\")\n.matches(\"[A-Z][a-z]+ [A-Z][a-z]+\")\n.errorThreshold(0.1)                                      //&lt;=10% failure rate is acceptable\n.description(\"Names generally follow the same pattern\"),  //description to add context in the report or for other developers\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#custom-validation","title":"Custom Validation","text":"<p>There will be situations where you have a complex data setup and require your own custom logic for data validation. You can achieve this by setting your own SQL expression that returns a boolean value. 
An example is seen below where we want to check that each entry in the array <code>update_history</code> has <code>updated_time</code> greater than a certain timestamp.</p> JavaScala <pre><code>validation().expr(\"FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP('2022-01-01 00:00:00'))\"),\n</code></pre> <pre><code>validation.expr(\"FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP('2022-01-01 00:00:00'))\"),\n</code></pre> <p>If you want to know what other SQL functions are available for you to use, check this page.</p>"},{"location":"setup/guide/scenario/data-validation/#group-by-validation","title":"Group By Validation","text":"<p>There are scenarios where you want to validate against grouped values or the whole dataset via aggregations. An example would be validating that the sum of each customer's transactions is greater than 0.</p>"},{"location":"setup/guide/scenario/data-validation/#validation-criteria_1","title":"Validation Criteria","text":"<p>Line 1: <code>validation.groupBy().count().isEqual(100)</code></p> <ul> <li>Method Chaining<ul> <li><code>groupBy()</code>: Groups by the whole dataset.</li> <li><code>count()</code>: Counts the number of records in the dataset.</li> <li><code>isEqual(100)</code>: Checks if the count is equal to 100.</li> </ul> </li> <li>Validation Rule<ul> <li>This line ensures that the count of the total dataset is exactly 100.</li> </ul> </li> </ul> <p>Line 2: <code>validation.groupBy(\"account_id\").max(\"balance\").lessThan(900)</code></p> <ul> <li>Method Chaining<ul> <li><code>groupBy(\"account_id\")</code>: Groups the data based on the <code>account_id</code> field.</li> <li><code>max(\"balance\")</code>: Calculates the maximum value of the <code>balance</code> field within each group.</li> <li><code>lessThan(900)</code>: Checks if the maximum balance in each group is less than 900.</li> </ul> </li> <li>Validation Rule<ul> <li>This line ensures that, for each group identified by <code>account_id</code>, the maximum balance is 
less than 900.</li> </ul> </li> </ul>"},{"location":"setup/guide/scenario/data-validation/#considerations_1","title":"Considerations","text":"<ul> <li>Adjust the <code>errorThreshold</code> or validation to your specification scenario. The full list   of types of validations can be found here.</li> <li>For the full list of types of group by validations that can be   used, check this page.</li> </ul> JavaScala <pre><code>validation().groupBy().count().isEqual(100),\nvalidation().groupBy(\"account_id\").max(\"balance\").lessThan(900)\n</code></pre> <pre><code>validation.groupBy().count().isEqual(100),\nvalidation.groupBy(\"account_id\").max(\"balance\").lessThan(900)\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#sample-validation","title":"Sample Validation","text":"<p>To try cover the majority of validation cases, the below has been created.</p> JavaScala <pre><code>var jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(\nvalidation().col(\"customer_details.name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.1).description(\"Names generally follow the same pattern\"),\nvalidation().col(\"date\").isNotNull().errorThreshold(10),\nvalidation().col(\"balance\").greaterThan(500),\nvalidation().expr(\"YEAR(date) == year\"),\nvalidation().col(\"status\").in(\"open\", \"closed\", \"pending\").errorThreshold(0.2).description(\"Could be new status introduced\"),\nvalidation().col(\"customer_details.age\").greaterThan(18),\nvalidation().expr(\"FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP('2022-01-01 00:00:00'))\"),\nvalidation().col(\"update_history\").greaterThanSize(2),\nvalidation().unique(\"account_id\"),\nvalidation().groupBy().count().isEqual(1000),\nvalidation().groupBy(\"account_id\").max(\"balance\").lessThan(900)\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false);\n\nexecute(config, jsonTask);\n</code></pre> 
<pre><code>val jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(\nvalidation.col(\"customer_details.name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.1).description(\"Names generally follow the same pattern\"),\nvalidation.col(\"date\").isNotNull.errorThreshold(10),\nvalidation.col(\"balance\").greaterThan(500),\nvalidation.expr(\"YEAR(date) == year\"),\nvalidation.col(\"status\").in(\"open\", \"closed\", \"pending\").errorThreshold(0.2).description(\"Could be new status introduced\"),\nvalidation.col(\"customer_details.age\").greaterThan(18),\nvalidation.expr(\"FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP('2022-01-01 00:00:00'))\"),\nvalidation.col(\"update_history\").greaterThanSize(2),\nvalidation.unique(\"account_id\"),\nvalidation.groupBy().count().isEqual(1000),\nvalidation.groupBy(\"account_id\").max(\"balance\").lessThan(900)\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false)\n\nexecute(config, jsonTask)\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>./run.sh\n#input class MyValidationJavaPlan or MyValidationPlan\n#after completing, check report at docker/sample/report/index.html\n</code></pre> <p>It should look something like this.</p> <p>Check the full example at <code>ValidationPlanRun</code> inside the examples repo.</p>"},{"location":"setup/guide/scenario/delete-generated-data/","title":"Delete Generated Data","text":"<p>Info</p> <p>Delete generated data is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator for Postgres and delete the generated data after using it.</p>"},{"location":"setup/guide/scenario/delete-generated-data/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/delete-generated-data/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/delete-generated-data/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedDeleteJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedDeletePlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedDeleteJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.enableRecordTracking(true)                                                       (3)\n.enableDeleteGeneratedRecords(false)                                              (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\")                         (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedDeletePlanRun extends PlanRun {\n\nval autoRun = 
configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.enableRecordTracking(true)                                                       (3)\n.enableDeleteGeneratedRecords(false)                                              (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\")                         (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n</code></pre> <p>In the above code we note the following:</p> <ol> <li>We have defined a Postgres connection called <code>my_postgres</code></li> <li><code>enableGeneratePlanAndTasks</code> is enabled to auto generate data for all tables under <code>customer</code> database</li> <li><code>enableRecordTracking</code> is enabled to ensure that all generated records are tracked. This will get used when we want    to delete data afterwards</li> <li><code>enableDeleteGeneratedRecords</code> is disabled for now. We want to see the generated data first and delete sometime after</li> <li><code>generatedPlanAndTaskFolderPath</code> is the folder path where we saved the metadata we have gathered from <code>my_postgres</code></li> <li><code>recordTrackingFolderPath</code> is the folder path where record tracking is maintained. 
We need to persist this data to    ensure it is still available when we want to delete data</li> </ol>"},{"location":"setup/guide/scenario/delete-generated-data/#postgres-setup","title":"Postgres Setup","text":"<p>If you don't have your own Postgres up and running, you can set up and run an instance configured in the <code>docker</code> folder via.</p> <pre><code>cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n</code></pre> <p>This will create the tables found under <code>docker/data/sql/postgres/customer.sql</code>. You can change this file to contain your own tables. We can see there are 4 tables created for us, <code>accounts, balances, transactions and mapping</code>.</p>"},{"location":"setup/guide/scenario/delete-generated-data/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\n</code></pre> <p>It should look something like this.</p> <pre><code>   id   | account_number  | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint |   customer_id_decimal    | customer_id_real | customer_id_double | open_date  |     open_timestamp      | last_opened_time |                                                           payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H            | SfA0eZJcTm | CuRw                    |   
           13 |                   42 |               6041 | 76987.745612542900000000 |         91866.78 |  66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736     | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n</code></pre> <p>The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.</p> <p>Check the number of records via:</p> <pre><code>docker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n#open report under docker/sample/report/index.html\n</code></pre>"},{"location":"setup/guide/scenario/delete-generated-data/#delete","title":"Delete","text":"<p>We are now at a stage where we want to delete the data that was generated. All we need to do is flip two flags.</p> <pre><code>.enableDeleteGeneratedRecords(true)\n.enableGenerateData(false)  //we need to explicitly disable generating data\n</code></pre> <p>Enable delete generated records and disable generating data. </p> <p>Before we run again, let us insert a record manually to see if that data will survive after running the job to delete the generated data.</p> <pre><code>docker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"insert into account.accounts (account_number) values ('my_account_number')\"\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"select count(1) from account.accounts\"\n</code></pre> <p>We now should have 1001 records in our <code>account.accounts</code> table. 
Let's delete the generated data now.</p> <pre><code>./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n</code></pre> <p>You should see that only 1 record is left, the one that we manually inserted. Great, now we can generate data reliably  and also be able to clean it up.</p>"},{"location":"setup/guide/scenario/delete-generated-data/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/delete-generated-data/#one-class-for-generating-another-for-deleting","title":"One class for generating, another for deleting?","text":"<p>Yes, this is possible. There are two requirements: - the connection names used need to be the same across both classes - <code>recordTrackingFolderPath</code> needs to be set to the same value</p>"},{"location":"setup/guide/scenario/delete-generated-data/#define-record-count","title":"Define record count","text":"<p>You can control the record count per sub data source via <code>numRecordsPerStep</code>.</p> JavaScala <pre><code>var autoRun = configuration()\n...\n.numRecordsPerStep(100)\n\nexecute(autoRun)\n</code></pre> <pre><code>val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/","title":"First Data Generation","text":"<p>Creating a data generator for a CSV file.</p> <p></p>"},{"location":"setup/guide/scenario/first-data-generation/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/first-data-generation/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the 
base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyCsvJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyCsvPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyCsvJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyCsvPlan extends PlanRun {\n}\n</code></pre> <p>This class is where we define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/scenario/first-data-generation/#connection-configuration","title":"Connection Configuration","text":"<p>When dealing with CSV files, we need to define a path for our generated CSV files to be saved at, along with any other high level configurations.</p> JavaScala <pre><code>csv(\n\"customer_accounts\",              //name\n\"/opt/app/data/customer/account\", //path\nMap.of(\"header\", \"true\")          //optional additional options\n)\n</code></pre> <p>Other additional options for CSV can be found here</p> <pre><code>csv(\n\"customer_accounts\",              //name\n\"/opt/app/data/customer/account\", //path\nMap(\"header\" -&gt; \"true\")           //optional additional options\n)\n</code></pre> <p>Other additional options for CSV can be found here</p>"},{"location":"setup/guide/scenario/first-data-generation/#schema","title":"Schema","text":"<p>The CSV file that we generate should adhere to a defined schema where we can also define data types.</p> <p>Let's define each field along with its corresponding data type. 
You will notice that the <code>string</code> fields do not have a data type defined. This is because the default data type is <code>StringType</code>.</p> JavaScala <pre><code>var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"balance\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n</code></pre> <pre><code>val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"balance\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#field-metadata","title":"Field Metadata","text":"<p>We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata attributes that add guidelines that the data generator will understand when generating data.</p>"},{"location":"setup/guide/scenario/first-data-generation/#account_id","title":"account_id","text":"<ul> <li><code>account_id</code> follows a particular pattern that where it starts with <code>ACC</code> and has 8 digits after it.   This can be defined via a regex like below. 
We also mark the field as unique to ensure that unique values are generated.</li> </ul> JavaScala <pre><code>field().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n</code></pre> <pre><code>field.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#balance","title":"balance","text":"<ul> <li><code>balance</code> should not be too large, so we define a min and max so that the generated numbers fall between <code>1</code> and <code>1000</code>.</li> </ul> JavaScala <pre><code>field().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\n</code></pre> <pre><code>field.name(\"balance\").`type`(DoubleType).min(1).max(1000),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#name","title":"name","text":"<ul> <li><code>name</code> is a string that also follows a certain pattern, so we could define a regex, but here we will leverage the DataFaker library and create an <code>expression</code> to generate realistic-looking names. All possible faker expressions can be found here</li> </ul> JavaScala <pre><code>field().name(\"name\").expression(\"#{Name.name}\"),\n</code></pre> <pre><code>field.name(\"name\").expression(\"#{Name.name}\"),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#open_time","title":"open_time","text":"<ul> <li><code>open_time</code> is a timestamp that we want to have a value greater than a specific date. 
We can define a min date by using <code>java.sql.Date</code> like below.</li> </ul> JavaScala <pre><code>field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre> <pre><code>field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#status","title":"status","text":"<ul> <li><code>status</code> is a field that can only take one of four values, <code>open, closed, suspended or pending</code>.</li> </ul> JavaScala <pre><code>field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre> <pre><code>field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#created_by","title":"created_by","text":"<ul> <li><code>created_by</code> is a field that is based on the <code>status</code> field where it follows the logic: <code>if status is open or closed, then it is created_by eod else created_by event</code>. 
This can be achieved by defining a SQL expression like below.</li> </ul> JavaScala <pre><code>field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <pre><code>field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <p>Putting it all the fields together, our class should now look like this.</p> JavaScala <pre><code>var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n</code></pre> <pre><code>val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#record-count","title":"Record Count","text":"<p>We only want to generate 100 records, so that we can see what the output looks like. This is controlled at the <code>accountTask</code> level like below. 
If you want to generate more records, set it to the value you want.</p> JavaScala <pre><code>var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().records(100));\n</code></pre> <pre><code>val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.records(100))\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report is generated that summarises the actions performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure any unique fields will have unique values generated.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>accountTask</code>, we have to call <code>execute</code>. 
So our full plan run will look like this.</p> JavaScala <pre><code>public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n</code></pre> <pre><code>class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#run","title":"Run","text":"<p>Now we can run via the script <code>./run.sh</code> that is in the top level directory of the <code>data-caterer-example</code> to run the class we just created.</p> <pre><code>./run.sh\n#input class MyCsvJavaPlan or 
MyCsvPlan\n#after completing\nhead docker/sample/customer/account/part-00000*\n</code></pre> <p>Your output should look like this.</p> <pre><code>account_id,balance,created_by,name,open_time,status\nACC06192462,853.9843359645766,eod,Hoyt Kertzmann MD,2023-07-22T11:17:01.713Z,closed\nACC15350419,632.5969895326234,eod,Dr. Claude White,2022-12-13T21:57:56.840Z,open\nACC25134369,592.0958847218986,eod,Fabian Rolfson,2023-04-26T04:54:41.068Z,open\nACC48021786,656.6413439322964,eod,Dewayne Stroman,2023-05-17T06:31:27.603Z,open\nACC26705211,447.2850352884595,event,Garrett Funk,2023-07-14T03:50:22.746Z,pending\nACC03150585,750.4568929015996,event,Natisha Reichel,2023-04-11T11:13:10.080Z,suspended\nACC29834210,686.4257811608622,event,Gisele Ondricka,2022-11-15T22:09:41.172Z,suspended\nACC39373863,583.5110618128994,event,Thaddeus Ortiz,2022-09-30T06:33:57.193Z,suspended\nACC39405798,989.2623959059525,eod,Shelby Reinger,2022-10-23T17:29:17.564Z,open\n</code></pre> <p>Also check the HTML report, found at <code>docker/sample/report/index.html</code>, that gets generated to get an overview of what was executed.</p> <p></p>"},{"location":"setup/guide/scenario/first-data-generation/#join-with-another-csv","title":"Join With Another CSV","text":"<p>Now that we have generated some accounts, let's also try to generate a set of transactions for those accounts in CSV format as well. 
The transactions could be in any other format, but to keep this simple, we will continue using CSV.</p> <p>We can define our schema the same way along with any additional metadata.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n);\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#records-per-column","title":"Records Per Column","text":"<p>Usually, for a given <code>account_id, full_name</code>, there should be multiple records for it as we want to simulate a customer having multiple transactions. 
We can achieve this by defining the number of records to generate in the <code>count</code> function.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#random-records-per-column","title":"Random Records Per Column","text":"<p>Above, you will notice that we are generating 5 records per <code>account_id, full_name</code>. This is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate this by defining a random number of records per column.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n</code></pre> <p>Here we set the minimum number of records per column to be 0 and the maximum to 5.</p>"},{"location":"setup/guide/scenario/first-data-generation/#foreign-key","title":"Foreign Key","text":"<p>In this scenario, we want the <code>account_id</code> in <code>account</code> to match the same column values in <code>transaction</code>. 
We also want to match <code>name</code> in <code>account</code> to <code>full_name</code> in <code>transaction</code>. This can be done via plan configuration like below.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"), //the task and columns we want linked\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\"))) //list of other tasks and their respective column names we want matched\n);\n</code></pre> <pre><code>val myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"),  //the task and columns we want linked\nList(transactionTask -&gt; List(\"account_id\", \"full_name\"))  //list of other tasks and their respective column names we want matched\n)\n</code></pre> <p>Now, stitching it all together for the <code>execute</code> function, our final plan should look like this.</p> JavaScala <pre><code>public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count().records(100));\n\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", 
\"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nvar myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\")))\n);\n\nexecute(myPlan, config, accountTask, transactionTask);\n}\n}\n</code></pre> <pre><code>class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count.records(100))\n\nval transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n\nval config = 
configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nval myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"),\nList(transactionTask -&gt; List(\"account_id\", \"full_name\"))\n)\n\nexecute(myPlan, config, accountTask, transactionTask)\n}\n</code></pre> <p>Let's try running it again.</p> <pre><code>#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyCsvJavaPlan or MyCsvPlan\n#after completing, let's pick an account and check the transactions for that account\naccount=$(tail -1 docker/sample/customer/account/part-00000* | awk -F \",\" '{print $1 \",\" $4}')\n#quote the variable as the name contains spaces\necho \"$account\"\ncat docker/sample/customer/transaction/part-00000* | grep \"$account\"\n</code></pre> <p>It should look something like this.</p> <pre><code>ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n</code></pre> <p>Congratulations! You have now made a data generator that has simulated a real world data scenario. You can check the <code>DocumentationJavaPlanRun.java</code> or <code>DocumentationPlanRun.scala</code> files as well to check that your plan is the same.</p> <p>We can now look to consume this CSV data from a job or service. Usually, once we have consumed the data, we would also want to check and validate that our consumer has correctly ingested the data.</p>"},{"location":"setup/guide/scenario/first-data-generation/#validate","title":"Validate","text":"<p>In this scenario, our consumer will read in the CSV file, do some transformations, and then save the data to Postgres. 
Let's try to configure data validations for the data that gets pushed into Postgres.</p>"},{"location":"setup/guide/scenario/first-data-generation/#postgres-setup","title":"Postgres setup","text":"<p>First, we define our connection properties for Postgres. 
You can check out the full options available here.</p> JavaScala <pre><code>var postgresValidateTask = postgres(\n\"my_postgres\",                                          //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\",                                             //username\n\"password\"                                              //password\n).table(\"account\", \"transactions\");\n</code></pre> <pre><code>val postgresValidateTask = postgres(\n\"my_postgres\",                                          //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\",                                             //username\n\"password\"                                              //password\n).table(\"account\", \"transactions\")\n</code></pre> <p>We can connect and access the data inside the table <code>account.transactions</code>. Now to define our data validations.</p>"},{"location":"setup/guide/scenario/first-data-generation/#validations","title":"Validations","text":"<p>For full information about validation options and configurations, check here. 
Below, we have an example that should give you a good understanding of what validations are possible.</p> JavaScala <pre><code>var postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation().col(\"account_id\").isNotNull(),\nvalidation().col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation().col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation().expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation().unique(\"account_id\", \"name\"),\nvalidation().groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n);\n</code></pre> <pre><code>val postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation.col(\"account_id\").isNotNull,\nvalidation.col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation.col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation.expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation.unique(\"account_id\", \"name\"),\nvalidation.groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#name_1","title":"name","text":"<p>For all values in the <code>name</code> column, we check if they match the regex <code>[A-Z][a-z]+ [A-Z][a-z]+</code>. As we know in the real world, names do not always follow the same pattern, so we allow for an <code>errorThreshold</code> before marking the validation as failed. 
Here, we define the <code>errorThreshold</code> to be <code>0.2</code>, which means the validation fails if the error percentage is greater than 20%. We also add a helpful description so other developers/users can understand the context of the validation.</p>"},{"location":"setup/guide/scenario/first-data-generation/#balance_1","title":"balance","text":"<p>We check that all <code>balance</code> values are greater than or equal to 0. This time, we have a slightly different <code>errorThreshold</code> as it is set to <code>10</code>, which means the validation fails if the number of errors is greater than 10.</p>"},{"location":"setup/guide/scenario/first-data-generation/#expr","title":"expr","text":"<p>Sometimes, we may need to include the values of multiple columns to validate a certain condition. This is where we can use <code>expr</code> to define a SQL expression that returns a boolean. In this scenario, we check that if the <code>status</code> column has the value <code>closed</code>, then <code>close_date</code> is not null; otherwise, <code>close_date</code> is null.</p>"},{"location":"setup/guide/scenario/first-data-generation/#unique","title":"unique","text":"<p>We check whether the combination of <code>account_id</code> and <code>name</code> is unique within the dataset. You can define one or more columns for <code>unique</code> validations.</p>"},{"location":"setup/guide/scenario/first-data-generation/#groupby","title":"groupBy","text":"<p>There may be some business rule that states the number of <code>login_retry</code> should be less than 10 for each account. We can check this via a group by validation where we group by the <code>account_id, name</code>, take the maximum value for <code>login_retry</code> per <code>account_id,name</code> combination, then check if it is less than 10.</p> <p>You can now look to play around with other configurations or data sources to meet your needs. 
Also, make sure to explore the docs further as it can guide you on what can be configured.</p>"},{"location":"setup/guide/scenario/records-per-column/","title":"Multiple Records Per Column","text":"<p>Creating a data generator for a CSV file where there are multiple records per column values.</p>"},{"location":"setup/guide/scenario/records-per-column/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/records-per-column/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/records-per-column/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyMultipleRecordsPerColJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyMultipleRecordsPerColPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyMultipleRecordsPerColJavaPlan extends PlanRun {\n{\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, transactionTask);\n}\n}\n</code></pre> <pre><code>import 
com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyMultipleRecordsPerColPlan extends PlanRun {\n\nval transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"full_name\").expression(\"#{Name.name}\"),\nfield.name(\"amount\").`type`(DoubleType.instance).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType.instance).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType.instance).sql(\"DATE(time)\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(config, transactionTask)\n}\n</code></pre>"},{"location":"setup/guide/scenario/records-per-column/#record-count","title":"Record Count","text":"<p>By default, tasks will generate 1000 records. You can alter this value via the <code>count</code> configuration which can be applied to individual tasks. For example, in Scala, <code>csv(...).count(count.records(100))</code> to generate only 100 records.</p>"},{"location":"setup/guide/scenario/records-per-column/#records-per-column","title":"Records Per Column","text":"<p>In this scenario, for a given <code>account_id, full_name</code>, there should be multiple records for it as we want to simulate a customer having multiple transactions. 
We can achieve this by defining the number of records to generate in the <code>count</code> function.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n</code></pre> <p>This will generate <code>1000 * 5 = 5000</code> records: the default record count (1000) is used, and for each <code>account_id, full_name</code> from those initial 1000 records, 5 records are generated.</p>"},{"location":"setup/guide/scenario/records-per-column/#random-records-per-column","title":"Random Records Per Column","text":"<p>Generating 5 records per column is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. 
We can accommodate this by defining a random number of records per column.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n</code></pre> <p>Here we set the minimum number of records per column to be 0 and the maximum to 5. This will follow a uniform distribution so the average number of records per account is 2.5. 
We could also define other metadata, just like we did with fields, when defining the generator. For example, we could set <code>standardDeviation</code> and <code>mean</code> for the number of records generated per column to follow a normal distribution.</p>"},{"location":"setup/guide/scenario/records-per-column/#run","title":"Run","text":"<p>Let's try running it.</p> <pre><code>#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyMultipleRecordsPerColJavaPlan or MyMultipleRecordsPerColPlan\n#after completing\nhead docker/sample/customer/transaction/part-00000*\n</code></pre> <p>It should look something like this.</p> <pre><code>ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n</code></pre> <p>You can now look to play around with other count configurations found here.</p>"},{"location":"setup/validation/basic-validation/","title":"Basic Validations","text":"<p>Run validations on a column to ensure the values adhere to your requirements. You can also define complex validation logic via SQL expressions if needed (see here).</p>"},{"location":"setup/validation/basic-validation/#equal","title":"Equal","text":"<p>Ensure all data in column is equal to certain value. Value can be of any data type. 
Can use <code>isEqualCol</code> to define SQL expression that can reference other columns.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isEqual(2021),\nvalidation().col(\"year\").isEqualCol(\"YEAR(date)\"),\n</code></pre> <pre><code>validation.col(\"year\").isEqual(2021),\nvalidation.col(\"year\").isEqualCol(\"YEAR(date)\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year == 2021\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-equal","title":"Not Equal","text":"<p>Ensure all data in column is not equal to certain value. 
Value can be of any data type. Can use <code>isNotEqualCol</code> to define SQL expression that can reference other columns.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isNotEqual(2021),\nvalidation().col(\"year\").isNotEqualCol(\"YEAR(date)\"),\n</code></pre> <pre><code>validation.col(\"year\").isNotEqual(2021),\nvalidation.col(\"year\").isNotEqualCol(\"YEAR(date)\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year != 2021\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#null","title":"Null","text":"<p>Ensure all data in column is null.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isNull()\n</code></pre> <pre><code>validation.col(\"year\").isNull\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNULL(year)\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-null","title":"Not Null","text":"<p>Ensure all data in column is not null.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isNotNull()\n</code></pre> <pre><code>validation.col(\"year\").isNotNull\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNOTNULL(year)\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#contains","title":"Contains","text":"<p>Ensure all data in column contains certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"name\").contains(\"peter\")\n</code></pre> <pre><code>validation.col(\"name\").contains(\"peter\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"CONTAINS(name, 'peter')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-contains","title":"Not Contains","text":"<p>Ensure all data in column does not contain certain string. 
Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"name\").notContains(\"peter\")\n</code></pre> <pre><code>validation.col(\"name\").notContains(\"peter\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!CONTAINS(name, 'peter')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#unique","title":"Unique","text":"<p>Ensure all data in column is unique.</p> JavaScalaYAML <pre><code>validation().unique(\"account_id\", \"name\")\n</code></pre> <pre><code>validation.unique(\"account_id\", \"name\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- unique: [\"account_id\", \"name\"]\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than","title":"Less Than","text":"<p>Ensure all data in column is less than certain value. Can use <code>lessThanCol</code> to define SQL expression that can reference  other columns.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").lessThan(100),\nvalidation().col(\"amount\").lessThanCol(\"balance + 1\"),\n</code></pre> <pre><code>validation.col(\"amount\").lessThan(100),\nvalidation.col(\"amount\").lessThanCol(\"balance + 1\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &lt; 100\"\n- expr: \"amount &lt; balance + 1\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than-or-equal","title":"Less Than Or Equal","text":"<p>Ensure all data in column is less than or equal to certain value. 
Can use <code>lessThanOrEqualCol</code> to define SQL expression that can reference other columns.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").lessThanOrEqual(100),\nvalidation().col(\"amount\").lessThanOrEqualCol(\"balance + 1\"),\n</code></pre> <pre><code>validation.col(\"amount\").lessThanOrEqual(100),\nvalidation.col(\"amount\").lessThanOrEqualCol(\"balance + 1\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &lt;= 100\"\n- expr: \"amount &lt;= balance + 1\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than","title":"Greater Than","text":"<p>Ensure all data in column is greater than certain value. Can use <code>greaterThanCol</code> to define SQL expression that can reference other columns.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").greaterThan(100),\nvalidation().col(\"amount\").greaterThanCol(\"balance\"),\n</code></pre> <pre><code>validation.col(\"amount\").greaterThan(100),\nvalidation.col(\"amount\").greaterThanCol(\"balance\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &gt; 100\"\n- expr: \"amount &gt; balance\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than-or-equal","title":"Greater Than Or Equal","text":"<p>Ensure all data in column is greater than or equal to certain value. 
Can use <code>greaterThanOrEqualCol</code> to define SQL  expression that can reference other columns.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").greaterThanOrEqual(100),\nvalidation().col(\"amount\").greaterThanOrEqualCol(\"balance\"),\n</code></pre> <pre><code>validation.col(\"amount\").greaterThanOrEqual(100),\nvalidation.col(\"amount\").greaterThanOrEqualCol(\"balance\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &gt;= 100\"\n- expr: \"amount &gt;= balance\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#between","title":"Between","text":"<p>Ensure all data in column is between two values. Can use <code>betweenCol</code> to define SQL expression that references other  columns.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").between(100, 200),\nvalidation().col(\"amount\").betweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n</code></pre> <pre><code>validation.col(\"amount\").between(100, 200),\nvalidation.col(\"amount\").betweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount BETWEEN 100 AND 200\"\n- expr: \"amount BETWEEN balance * 0.9 AND balance * 1.1\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-between","title":"Not Between","text":"<p>Ensure all data in column is not between two values. 
Can use <code>notBetweenCol</code> to define SQL expression that references other columns.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").notBetween(100, 200),\nvalidation().col(\"amount\").notBetweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n</code></pre> <pre><code>validation.col(\"amount\").notBetween(100, 200),\nvalidation.col(\"amount\").notBetweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount NOT BETWEEN 100 AND 200\"\n- expr: \"amount NOT BETWEEN balance * 0.9 AND balance * 1.1\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#in","title":"In","text":"<p>Ensure all data in column is in a set of defined values.</p> JavaScalaYAML <pre><code>validation().col(\"status\").in(\"open\", \"closed\")\n</code></pre> <pre><code>validation.col(\"status\").in(\"open\", \"closed\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"status IN ('open', 'closed')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#matches","title":"Matches","text":"<p>Ensure all data in column matches a certain regular expression.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").matches(\"ACC[0-9]{8}\")\n</code></pre> <pre><code>validation.col(\"account_id\").matches(\"ACC[0-9]{8}\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"REGEXP(account_id, 'ACC[0-9]{8}')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-matches","title":"Not Matches","text":"<p>Ensure all data in column does not match a certain regular expression.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").notMatches(\"^acc.*\")\n</code></pre> <pre><code>validation.col(\"account_id\").notMatches(\"^acc.*\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!REGEXP(account_id, 
'^acc.*')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#starts-with","title":"Starts With","text":"<p>Ensure all data in column starts with certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").startsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").startsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"STARTSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-starts-with","title":"Not Starts With","text":"<p>Ensure all data in column does not start with certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").notStartsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").notStartsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!STARTSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#ends-with","title":"Ends With","text":"<p>Ensure all data in column ends with certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").endsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").endsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ENDSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-ends-with","title":"Not Ends With","text":"<p>Ensure all data in column does not end with certain string. 
Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").notEndsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").notEndsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!ENDSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#size","title":"Size","text":"<p>Ensure all data in column has certain size. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").size(5)\n</code></pre> <pre><code>validation.col(\"transactions\").size(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) == 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-size","title":"Not Size","text":"<p>Ensure all data in column does not have certain size. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").notSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").notSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) != 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than-size","title":"Less Than Size","text":"<p>Ensure all data in column has size less than certain value. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").lessThanSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").lessThanSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &lt; 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than-or-equal-size","title":"Less Than Or Equal Size","text":"<p>Ensure all data in column has size less than or equal to certain value. 
Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").lessThanOrEqualSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").lessThanOrEqualSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &lt;= 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than-size","title":"Greater Than Size","text":"<p>Ensure all data in column has size greater than certain value. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").greaterThanSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").greaterThanSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &gt; 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than-or-equal-size","title":"Greater Than Or Equal Size","text":"<p>Ensure all data in column has size greater than or equal to certain value. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").greaterThanOrEqualSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").greaterThanOrEqualSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &gt;= 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#luhn-check","title":"Luhn Check","text":"<p>Ensure all data in column passes luhn check. 
Luhn check is used to validate credit card numbers and certain identification numbers (see here for more details).</p> JavaScalaYAML <pre><code>validation().col(\"credit_card\").luhnCheck()\n</code></pre> <pre><code>validation.col(\"credit_card\").luhnCheck\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"LUHN_CHECK(credit_card)\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#has-type","title":"Has Type","text":"<p>Ensure all data in column has certain data type.</p> JavaScalaYAML <pre><code>validation().col(\"id\").hasType(\"string\")\n</code></pre> <pre><code>validation.col(\"id\").hasType(\"string\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"TYPEOF(id) == 'string'\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#expression","title":"Expression","text":"<p>Ensure all data in column adheres to SQL expression defined that returns back a boolean. 
You can define complex logic in here that could combine multiple columns.</p> <p>For example, <code>CASE WHEN status == 'open' THEN balance &gt; 0 ELSE balance == 0 END</code> would check all rows with <code>status</code> open to have <code>balance</code> greater than 0, otherwise, check the <code>balance</code> is 0.</p> JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().expr(\"amount &lt; 100\"),\nvalidation().expr(\"year == 2021\").errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation().expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n);\n\nvar conf = configuration().enableValidation(true);\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validations(\nvalidation.expr(\"amount &lt; 100\"),\nvalidation.expr(\"year == 2021\").errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation.expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n)\n\nval conf = configuration.enableValidation(true)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount &lt; 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1   #equivalent to if error percentage is &gt; 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200   #equivalent to if number of errors is &gt; 200, then fail\ndescription: \"Should be lots of Peters\"\n\n#enableValidation inside application.conf\n</code></pre>"},{"location":"setup/validation/group-by-validation/","title":"Group By Validation","text":"<p>If you want to run aggregations based on a particular set of columns or just the whole dataset, you can do so via group by validations. 
An example would be checking that the sum of <code>amount</code> is less than 1000 per <code>account_id, year</code>. The validations applied can be one of the validations from the basic validation set found here.</p>"},{"location":"setup/validation/group-by-validation/#record-count","title":"Record count","text":"<p>Check the number of records across the whole dataset.</p> JavaScala <pre><code>validation().groupBy().count().lessThan(1000)\n</code></pre> <pre><code>validation.groupBy().count().lessThan(1000)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#record-count-per-group","title":"Record count per group","text":"<p>Check the number of records for each group.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").count().lessThan(10)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").count().lessThan(10)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#sum","title":"Sum","text":"<p>Check that the sum of a column's values for each group adheres to the validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#count","title":"Count","text":"<p>Check the count for each group adheres to validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#min","title":"Min","text":"<p>Check the min for each group adheres to validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").min(\"amount\").greaterThan(0)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", 
\"year\").min(\"amount\").greaterThan(0)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#max","title":"Max","text":"<p>Check the max for each group adheres to validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#average","title":"Average","text":"<p>Check the average for each group adheres to validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#standard-deviation","title":"Standard deviation","text":"<p>Check the standard deviation for each group adheres to validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").stddev(\"amount\").between(0.5, 0.6)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").stddev(\"amount\").between(0.5, 0.6)\n</code></pre>"},{"location":"setup/validation/upstream-data-source-validation/","title":"Upstream Data Source Validation","text":"<p>If you want to run data validations based on data generated or data from another data source, you can use the upstream data source validations. An example would be generating a Parquet file that gets ingested by a job and inserted into Postgres. The validations can then check for each <code>account_id</code> generated in the Parquet, it exists in <code>account_number</code> column in Postgres. 
The validations can be chained with basic and group by validations or even other upstream data sources, to cover any complex validations.</p>"},{"location":"setup/validation/upstream-data-source-validation/#basic-join","title":"Basic join","text":"<p>Join across datasets by particular columns. Then run validations on the joined dataset. You will notice that the data source name is prepended to the column names when joined (i.e. <code>my_first_json_customer_details</code>), to ensure column names do not clash and to make it obvious which columns are being validated.</p> <p>In the example below, we check that, for the same <code>account_id</code>, <code>customer_details.name</code> in the <code>my_first_json</code> dataset equals the <code>name</code> column in <code>my_second_json</code>.</p> JavaScala <pre><code>var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(\nvalidation().col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\")\n)\n);\n</code></pre> <pre><code>val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(\nvalidation.col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\")\n)\n)\n</code></pre>"},{"location":"setup/validation/upstream-data-source-validation/#join-expression","title":"Join 
expression","text":"<p>Define join expression to link two datasets together. This can be any SQL expression that returns a boolean value.  Useful in situations where join is based on transformations or complex logic.</p> <p>In the below example, we have to use <code>CONCAT</code> SQL function to combine <code>'ACC'</code> and <code>account_number</code> to join with  <code>account_id</code> column in <code>my_first_json</code> dataset.</p> JavaScala <pre><code>var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask)\n.joinExpr(\"my_first_json_account_id == CONCAT('ACC', account_number)\")\n.withValidation(\nvalidation().col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\")\n)\n);\n</code></pre> <pre><code>val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask)\n.joinExpr(\"my_first_json_account_id == CONCAT('ACC', account_number)\")\n.withValidation(\nvalidation.col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\")\n)\n)\n</code></pre>"},{"location":"setup/validation/upstream-data-source-validation/#different-join-type","title":"Different join type","text":"<p>By default, an outer join is used to gather columns from both datasets together for validation. 
But there may be  scenarios where you want to control the join type.</p> <p>Possible join types include: - inner - outer, full, fullouter, full_outer - leftouter, left, left_outer - rightouter, right, right_outer - leftsemi, left_semi, semi - leftanti, left_anti, anti - cross</p> <p>In the example below, we do an <code>anti</code> join by column <code>account_id</code> and check if there are no records. This essentially  checks that all <code>account_id</code>'s from <code>my_second_json</code> exist in <code>my_first_json</code>. The second validation also does something similar but does an <code>outer</code> join (by default) and checks that the joined dataset has 30 records.</p> JavaScala <pre><code>var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.joinType(\"anti\")\n.withValidation(validation().count().isEqual(0)),\nvalidation().upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation().count().isEqual(30))\n);\n</code></pre> <pre><code>val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", 
\"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.joinType(\"anti\")\n.withValidation(validation.count().isEqual(0)),\nvalidation.upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation.count().isEqual(30))\n)\n</code></pre>"},{"location":"setup/validation/upstream-data-source-validation/#join-then-group-by-validation","title":"Join then group by validation","text":"<p>We can apply aggregate or group by validations to the resulting joined dataset as the <code>withValidation</code> method accepts any type of validation.</p> <p>Here we group by <code>account_id, my_first_json_balance</code> to check that when the <code>amount</code> field is summed up per group, it is  between 0.8 and 1.2 times the balance.</p> JavaScala <pre><code>var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"balance\").type(DoubleType.instance()).min(10).max(1000),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask).joinColumns(\"account_id\")\n.withValidation(\nvalidation().groupBy(\"account_id\", \"my_first_json_balance\")\n.sum(\"amount\")\n.betweenCol(\"my_first_json_balance * 0.8\", \"my_first_json_balance * 1.2\")\n)\n);\n</code></pre> <pre><code>val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"balance\").`type`(DoubleType).min(10).max(1000),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", 
\"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask).joinColumns(\"account_id\")\n.withValidation(\nvalidation.groupBy(\"account_id\", \"my_first_json_balance\")\n.sum(\"amount\")\n.betweenCol(\"my_first_json_balance * 0.8\", \"my_first_json_balance * 1.2\")\n)\n)\n</code></pre>"},{"location":"setup/validation/upstream-data-source-validation/#chained-validations","title":"Chained validations","text":"<p>Given that the <code>withValidation</code> method accepts any other type of validation, you can chain other upstream data sources with it. Here we will show a third upstream data source being checked to ensure 30 records exists after joining them  together by <code>account_id</code>.</p> JavaScala <pre><code>var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"balance\").type(DoubleType.instance()).min(10).max(1000),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n)\n.count(count().records(10));\n\nvar thirdJsonTask = json(\"my_third_json\", \"/tmp/data/third_json\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"amount\").type(IntegerType.instance()).min(1).max(100),\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n.count(count().records(10));\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(\nvalidation().upstreamData(thirdJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation().count().isEqual(30))\n)\n);\n</code></pre> <pre><code>val firstJsonTask = json(\"my_first_json\", 
\"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"balance\").`type`(DoubleType).min(10).max(1000),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n.count(count.records(10))\n\nval thirdJsonTask = json(\"my_third_json\", \"/tmp/data/third_json\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"amount\").`type`(IntegerType).min(1).max(100),\nfield.name(\"name\").expression(\"#{Name.name}\"),\n)\n.count(count.records(10))\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(\nvalidation.upstreamData(thirdJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation.count().isEqual(30))\n)\n)\n</code></pre>"},{"location":"use-case/business-value/","title":"Business Value","text":"<p>Below is a list of the business-related benefits from using Data Caterer which may be applicable for your use case.</p> Problem Data Caterer Solution Resources Effects Reliable test data creation - Profile existing data- Create scenarios- Generate data Software Engineers, QA, Testers Cost reduction in labor, more time spent on development, more bugs caught before production Faster development cycles - Generate data in local, test, UAT, pre-prod- Run different scenarios Software Engineers, QA, Testers More defects caught in lower environments, features pushed to production faster, common framework used across all environments Data compliance - Profiling existing data- Generate based on metadata- No complex masking- No production data used in lower environments Audit and compliance No chance for production data breaches Storage costs - Delete generated data- Test specific scenarios Infrastructure Lower data storage costs, less time spent on data management and clean up Schema evolution - Create metadata from data 
sources- Generate data based off fresh metadata Software Engineers, QA, Testers Less time spent altering tests due to schema changes, ease of use between environments and application versions"},{"location":"use-case/comparison/","title":"Comparison to similar tools","text":"<p>I have tried to include all the companies found in the list here from Mostly AI blog post and used information that is publicly available.</p> <p>The companies/products not shown below either have:</p> <ul> <li>a website with insufficient information about the technology side of data generation/validation</li> <li>no/little documentation</li> <li>don't have a free, no sign-up version of their app to use</li> </ul>"},{"location":"use-case/comparison/#data-generation","title":"Data Generation","text":"Tool Description Cost Pros Cons Clearbox AI Python based data generation tool via ML Unclear  Python SDK UI interface Detect private data Report generation  Batch data only No data clean up Limited/no documentation Curiosity Software Platform solution for test data management Unclear  Extensive documentation Generate data based off test cases UI interface Web/API/UI/mobile testing  No quick start No SDK Many components that may not be required No event generation support DataCebo Synthetic Data Vault Python based data generation tool via ML Unclear  Python SDK Report generation Data quality checks Business logic constraints  No data connection support No data clean up No foreign key support Datafaker Realistic data generation library Free  SDK for many languages Simple, easy to use Extensible Open source Generate realistic values  No data connection support No data clean up No validation No foreign key support DBLDatagen Python based data generation tool Free  Python SDK Open source Good documentation Customisable scenarios Customisable column generation Generate from existing data/schemas Plugin third-party libraries  Limited support if issues Code required No data clean up No data validation 
Gatling HTTP API load testing tool Free (Open Source)Gatling Enterprise, usage based, starts from \u20ac89 per month, 1 user, 6.25 hours of testing  Kotlin, Java &amp; Scala SDK Widely used Open source Clear documentation Extensive testing/validation support Customisable scenarios Report generation  Only supports HTTP, JMS and JDBC No data clean up Data feeders not based off metadata Gretel Python based data generation tool via ML Usage based, starts from $295 per month, $2.20 per credit, assumed USD  CLI &amp; Python SDK UI interface Training and re-use of models Detect private data Customisable scenarios  Batch data only No relationships between data sources Only simple foreign key relations defined No data clean up Charge by usage Howso Python based data generation tool via ML Unclear  Python SDK Playground to try Open source library Customisable scenarios  No support for data sources No data validation No data clean up Mostly AI Python based data generation tool via ML Usage based, Enterprise 1 user, 100 columns, 100K rows $3,100 per month, assumed USD  Report generation Non-technical users can use UI Customisable scenarios  Charge by usage Batch data only No data clean up Confusing use of 'smart select' for multiple foreign keys Limited custom column generation logic Multiple deployment components No SDK Octopize Python based data generation tool via ML Unclear  Python &amp; R SDK Report generation API for metadata Customisable scenarios  Input data source is only CSV Multiple manual steps before starting Quickstart is not a quickstart Documentation lacks code examples Synthesized Python based data generation tool via ML Unclear  CLI &amp; Python SDK API for metadata IDE setup Data quality checks  Not sure what is SDK &amp; TDK Charge by usage No report of what was generated No relationships between data sources Tonic Platform solution for generating data Unclear  UI interface Good documentation Detect private data Support for encrypted columns Report 
generation Alerting  Batch data only Multiple deployment components No relationships between data sources No data validation No data clean up No SDK (only API) Difficult to embed complex business logic YData Python based data generation tool via ML. Platform solution as well Unclear  Python SDK Open source Detect private data Compare datasets Report generation  No data connection support Batch data only No data clean up Separate data generation and data validation No foreign key support"},{"location":"use-case/comparison/#use-of-ml-models","title":"Use of ML models","text":"<p>You may notice that the majority of data generators use machine learning (ML) models to learn from your existing datasets to generate new data. Below are some pros and cons to the approach.</p> <p>Pros</p> <ul> <li>Simple setup</li> <li>Ability to reproduce complex logic</li> <li>Flexible to accept all types of data</li> </ul> <p>Cons</p> <ul> <li>Long time for model learning</li> <li>Black box of logic</li> <li>Maintain, store and update of ML models</li> <li>Restriction on input data lengths</li> <li>May not maintain referential integrity</li> <li>Require deeper understanding of ML models for fine-tuning</li> <li>Accuracy may be worse than non-ML models</li> </ul>"},{"location":"use-case/roadmap/","title":"Roadmap","text":"<p>Items below summarise the roadmap of Data Caterer. As each task gets completed, it will be documented and linked.</p> Feature Description Sub Tasks Data source support Batch or real time data sources that can be added to Data Caterer. 
Support data sources that users want - AWS, GCP and Azure related data services ( cloud storage)- Deltalake- RabbitMQ- ActiveMQ- MongoDB- Elasticsearch- Snowflake- Databricks- Pulsar Metadata discovery Allow for schema and data profiling from external metadata sources -  HTTP (OpenAPI spec)- JMS- Read from samples-  OpenLineage metadata (Marquez)-  OpenMetadata- ODCS (Open Data Contract Standard)- Amundsen- Datahub- Solace Event Portal- Airflow- DBT Developer API Scala/Java interface for developers/testers to create data generation and validation tasks -  Scala-  Java Report generation Generate a report that summarises the data generation or validation results -  Report for data generated and validation rules UI portal Allow users to access a UI to input data generation or validation tasks. Also be able to view report results - Metadata stored in database- Store data generation/validation run information in file/database Integration with data validation tools Derive data validation rules from existing data validation tools - Great Expectation- DBT constraints- SodaCL- MonteCarlo Data validation rule suggestions Based on metadata, generate data validation rules appropriate for the dataset -  Suggest basic data validations (yet to document) Wait conditions before data validation Define certain conditions to be met before starting data validations -  Webhook-  File exists-  Data exists via SQL expression-  Pause Validation types Ability to define simple/complex data validations -  Basic validations-  Aggregates (sum of amount per account is &gt; 500)- Ordering (transactions are ordered by date)-  Relationship (at least one account entry in history table per account in accounts table)- Data profile (how close the generated data profile is compared to the expected data profile) Data generation record count Generate scenarios where there are one to many, many to many situations relating to record count. 
Also ability to cover all edge cases or scenarios - Cover all possible cases (i.e. record for each combination of oneOf values, positive/negative values etc.)- Ability to override edge cases Alerting When tasks have completed, ability to define alerts based on certain conditions - Slack- Email Metadata enhancements Based on data profiling or inference, can add to existing metadata - PII detection (can integrate with Presidio)- Relationship detection across data sources- SQL generation- Ordering information Data cleanup Ability to clean up generated data -  Clean up generated data- Clean up data in consumer data sinks- Clean up data from real time sources (i.e. DELETE HTTP endpoint, delete events in JMS) Trial version Trial version of the full app for users to test out all the features -  Trial app to try out all features Code generation Based on metadata or existing classes, code for data generation and validation could be generated - Code generation- Schema generation from Scala/Java class Real time response data validations Ability to define data validations based on the response from real time data sources (e.g. HTTP response) - HTTP response data validation"},{"location":"use-case/blog/shift-left-data-quality/","title":"Shifting Data Quality Left with Data Catering","text":""},{"location":"use-case/blog/shift-left-data-quality/#empowering-proactive-data-management","title":"Empowering Proactive Data Management","text":"<p>In the ever-evolving landscape of data-driven decision-making, ensuring data quality is non-negotiable. Traditionally, data quality has been a concern addressed late in the development lifecycle, often leading to reactive measures and increased costs. 
However, a paradigm shift is underway with the adoption of a \"shift left\" approach, placing data quality at the forefront of the development process.</p>"},{"location":"use-case/blog/shift-left-data-quality/#today","title":"Today","text":"<pre><code>graph LR\n  subgraph badQualityData[&lt;b&gt;Manually generated data, data quality always passes&lt;/b&gt;]\n  local[&lt;b&gt;Local&lt;/b&gt;\\nManual test, unit test]\n  dev[&lt;b&gt;Dev&lt;/b&gt;\\nManual test, integration test]\n  stg[&lt;b&gt;Staging&lt;/b&gt;\\nSanity checks]\n  end\n\n  subgraph qualityData[&lt;b&gt;Reliable data, the true test&lt;/b&gt;]\n  prod[&lt;b&gt;Production&lt;/b&gt;\\nData quality checks, monitoring, observaibility]\n  end\n\n  style badQualityData fill:#d9534f,fill-opacity:0.7\n  style qualityData fill:#5cb85c,fill-opacity:0.7\n\n  local --&gt; dev\n  dev --&gt; stg\n  stg --&gt; prod</code></pre>"},{"location":"use-case/blog/shift-left-data-quality/#with-data-caterer","title":"With Data Caterer","text":"<pre><code>graph LR\n  subgraph qualityData[&lt;b&gt;Reliable data for testing anywhere&lt;/b&gt;]\n  direction LR\n  local[&lt;b&gt;Local&lt;/b&gt;\\nManual test, unit test]\n  dev[&lt;b&gt;Dev&lt;/b&gt;\\nManual test, integration test]\n  stg[&lt;b&gt;Staging&lt;/b&gt;\\nSanity checks]\n  prod[&lt;b&gt;Production&lt;/b&gt;\\nData quality checks, monitoring, observaibility]\n  end\n\n  style qualityData fill:#5cb85c,fill-opacity:0.7\n\n  local --&gt; dev\n  dev --&gt; stg\n  stg --&gt; prod</code></pre>"},{"location":"use-case/blog/shift-left-data-quality/#understanding-the-shift-left-approach","title":"Understanding the Shift Left Approach","text":"<p>\"Shift left\" is a philosophy that advocates for addressing tasks and concerns earlier in the development lifecycle. Applied to data quality, it means tackling data issues as early as possible, ideally during the development and testing phases. 
This approach aims to catch data anomalies, inaccuracies, or inconsistencies before they propagate through the system, reducing the likelihood of downstream errors.</p>"},{"location":"use-case/blog/shift-left-data-quality/#data-caterer-the-catalyst-for-shifting-left","title":"Data Caterer: The Catalyst for Shifting Left","text":"<p>Enter Data Caterer, a metadata-driven data generation and validation tool designed to empower organizations in shifting data quality left. By incorporating Data Caterer into the early stages of development, teams can proactively test complex data flows, validate data sources, and ensure data quality before it reaches downstream processes.</p>"},{"location":"use-case/blog/shift-left-data-quality/#key-advantages-of-shifting-data-quality-left-with-data-caterer","title":"Key Advantages of Shifting Data Quality Left with Data Caterer","text":"<ol> <li>Early Issue Detection:<ul> <li>Identify data quality issues early in the development process, reducing the risk of errors downstream.</li> </ul> </li> <li>Proactive Validation:<ul> <li>Validate data sources and complex data flows in a simplified manner, promoting a proactive approach to data quality.</li> </ul> </li> <li>Efficient Testing Across Sources:<ul> <li>Seamlessly test data across various sources, including databases, file formats, HTTP, and messaging, all within    your local laptop or development environment.</li> <li>Fast feedback loop to motivate developers to ensure thorough testing of data scenarios.</li> </ul> </li> <li>Integration with Development Pipelines:<ul> <li>Easily integrate Data Caterer as a task in your development pipelines, ensuring that data quality is a continuous    consideration rather than an isolated event.</li> </ul> </li> <li>Integration with Existing Metadata:<ul> <li>By harnessing the power of existing metadata from data catalogs, schema registries, or other data validation tools,   Data Caterer streamlines the process, automating the generation and 
validation of your data effortlessly.</li> </ul> </li> <li>Improved Collaboration:<ul> <li>Facilitate collaboration between developers, testers, and data professionals by providing a common platform for   early data validation.</li> <li>No need to rely on seeking domain expertise or external teams for data testing.</li> </ul> </li> </ol>"},{"location":"use-case/blog/shift-left-data-quality/#realizing-the-vision-of-proactive-data-quality","title":"Realizing the Vision of Proactive Data Quality","text":"<p>As organizations strive for excellence in their data-driven endeavors, the shift left approach with Data Caterer becomes a strategic imperative. By instilling a proactive data quality culture, teams can minimize the risk of costly errors, enhance the reliability of their data, and streamline the entire development lifecycle.</p> <p>In conclusion, the marriage of the shift left philosophy and Data Caterer brings forth a new era of data management, where data quality is not just a final checkpoint but an integral part of every development milestone. Embrace the shift left approach with Data Caterer and empower your teams to build robust, high-quality data solutions from the very beginning.</p> <p>Shift Left, Validate Early, and Accelerate with Data Caterer.</p>"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Home","text":"Data Caterer is a metadata-driven data generation and  testing tool that aids in creating production-like data across both batch and event data systems. Run data validations  to ensure your systems have ingested it as expected, then clean up the data afterwards. Simplify your data testing Take away the pain and complexity of your data landscape and let Data Caterer handle it <p> Try now </p> Data testing is difficult and fragmented <ul> <li>Data being sent via messages, HTTP requests or files and getting stored in databases, file systems, etc.</li> <li>Maintaining and updating tests with the latest schemas and business definitions</li> <li>Different testing tools for services, jobs or data sources</li> <li>Complex relationships between datasets and fields</li> <li>Different scenarios, permutations, combinations and edge cases to cover</li> </ul> Current solutions only cover half the story <ul> <li>Specific testing frameworks that support one or limited number of data sources or transport protocols</li> <li>Under utilizing metadata from data catalogs or metadata discovery services</li> <li>Testing teams having difficulties understanding when failures occur</li> <li>Integration tests relying on external teams/services</li> <li>Manually generating data, or worse, copying/masking production data into lower environments</li> <li>Observability pushes towards being reactive rather than proactive</li> </ul> <p> Try now </p> What you need is a reliable tool that can handle changes to your data landscape <p> </p> <p>With Data Caterer, you get:</p> <ul> <li>Ability to connect to any type of data source: files, SQL or no-SQL databases, messaging systems, HTTP</li> <li>Discover metadata from your existing infrastructure and services</li> <li>Gain confidence that bugs do 
not propagate to production</li> <li>Be proactive in ensuring changes do not affect other data producers or consumers</li> <li>Configurability to run the way you want</li> </ul> <p> Try now </p>"},{"location":"#tech-summary","title":"Tech Summary","text":"<p>Use the Java, Scala API, or YAML files to help with setup or customisation that are all run via a Docker image. Want to  get into details? Checkout the setup pages here to get code examples and guides that will take you  through scenarios and data sources.</p> <p>Main features include:</p> <ul> <li> Metadata discovery</li> <li> Batch and  event data generation</li> <li> Maintain referential integrity across any dataset</li> <li> Create custom data generation scenarios</li> <li> Clean up generated data</li> <li> Validate data</li> <li> Suggest data validations</li> </ul> <p></p> <p>Check other run configurations here.</p>"},{"location":"#what-is-it","title":"What is it","text":"<ul> <li> <p> Data generation and testing tool</p> <p>Generate production like data to be consumed and validated.</p> </li> <li> <p> Designed for any data source</p> <p>We aim to support pushing data to any data source, in any format.</p> </li> <li> <p> Low/no code solution</p> <p>Can use the tool via either Scala, Java or YAML. Connect to data or metadata sources to generate data and validate.</p> </li> <li> <p> Developer productivity tool</p> <p>If you are a new developer or seasoned veteran, cut down on your feedback loop when developing with data.</p> </li> </ul>"},{"location":"#what-it-is-not","title":"What it is not","text":"<ul> <li> <p> Metadata storage/platform</p> <p>You could store and use metadata within the data generation/validation tasks but is not the recommended approach. 
Rather, this metadata should be gathered from existing services that handle metadata on behalf of Data Caterer.</p> </li> <li> <p> Data contract</p> <p>The focus of Data Caterer is on the data generation and testing, which can include details about what the data looks like and how it behaves. But it does not encompass all the additional metadata that comes with a data contract such as SLAs, security, etc.</p> </li> <li> <p> Metrics from load testing</p> <p>Although millions of records can be generated, there are limited capabilities in terms of metric capturing.</p> </li> </ul> <p> Try now </p> Data Catering vs Other tools vs In-house <p> Data Catering Other tools In-house Data flow Batch and events generation with validation Batch generation only or validation only Depends on architecture and design Time to results 1 day 1+ month to integrate, deploy and onboard 1+ month to build and deploy Solution Connect with your existing data ecosystem, automatic generation and validation Manual UI data entry or via SDK Depends on engineer(s) building it <p></p>"},{"location":"about/","title":"About","text":"<p>Hi, my name is Peter. I am an independent Software Developer, mainly focussing on data related services. My experience can be found on my LinkedIn.</p> <p>I have created Data Caterer to help serve individuals and companies with data generation and data testing. It is a complex area that has many edge cases or intricacies that are hard to summarise or turn into something actionable and repeatable. Through the use of metadata, Data Caterer can help simplify your data testing, simulate production environment data, aid in data debugging, or whatever your data use case may be.</p> <p>Given that it is going to save you and your team time and money, please consider providing financial support. 
This will help the product grow into a sustainable and feature-rich service.</p>"},{"location":"about/#contact","title":"Contact","text":"<p>Please contact Peter Flook via Slack or via email <code>peter.flook@data.catering</code> if you have any questions or queries.</p>"},{"location":"about/#terms-of-service","title":"Terms of service","text":"<p>Terms of service can be found here.</p>"},{"location":"about/#privacy-policy","title":"Privacy policy","text":"<p>Privacy policy can be found here.</p>"},{"location":"sponsor/","title":"Sponsor","text":"<p>To have access to all the features of Data Caterer, you can subscribe according to your situation. You will not be charged by usage. As you continue to subscribe, you will have access to the latest version of Data Caterer as new bug fixes and features get published.</p> <p>This has been a passion project of mine where I have spent countless hours thinking of the idea, implementing,  maintaining, documenting and updating it. I hope that it will help developers and companies with their testing  by saving time and effort, allowing you to focus on what is important. If you are in this boat, please consider sponsorship to allow me to further maintain and upgrade the solution. 
Any contributions are much appreciated.</p> <p>Those who are wanting to use this project for open source applications, please contact me as I would be  happy to contribute.</p> <p>This is inspired by the mkdocs-material project that follows the same model.</p>"},{"location":"sponsor/#features","title":"Features","text":"<ul> <li> Metadata discovery</li> <li> All data sources (see here for all data sources)</li> <li> Batch and  Event generation</li> <li> Auto generation from data connections or metadata sources</li> <li> Suggest data validations</li> <li> Clean up generated data</li> <li> Run as many times as you want, not charged by usage</li> </ul>"},{"location":"sponsor/#tiers","title":"Tiers","text":""},{"location":"sponsor/#manage-subscription","title":"Manage Subscription","text":"<p>Manage via this link</p>"},{"location":"sponsor/#contact","title":"Contact","text":"<p>Please contact Peter Flook via Slack or via email <code>peter.flook@data.catering</code> if you have any questions or queries.</p>"},{"location":"use-case/","title":"Use cases","text":""},{"location":"use-case/#replicate-production-in-lower-environment","title":"Replicate production in lower environment","text":"<p>Having a stable and reliable test environment is a challenge for a number of companies, especially where teams are asynchronously deploying and testing changes at faster rates. 
Data Caterer can help alleviate these issues by doing the following:</p> <ol> <li>Generates data with the latest schema changes and production like field values</li> <li>Run as a job on a daily/regular basis to replicate production traffic or data flows</li> <li>Validate data to ensure your system runs as expected</li> <li>Clean up data to avoid build up of generated data</li> </ol> <p></p>"},{"location":"use-case/#local-development","title":"Local development","text":"<p>Similar to the above, being able to replicate production like data in your local environment can be key to developing more reliable code as you can test directly against data in your local computer. This has a number of benefits including:</p> <ol> <li>Fewer assumptions or ambiguities when the developer codes</li> <li>Direct feedback loop in local computer rather than waiting for test environment for more reliable test data</li> <li>No domain expertise required to understand the data</li> <li>Easy for new developers to be onboarded and developing/testing code for jobs/services</li> </ol>"},{"location":"use-case/#systemintegration-testing","title":"System/integration testing","text":"<p>When working with third-party, external or internal data providers, it can be difficult to have all setup ready to produce reliable data that abides by relationship contracts between each of the systems. You have to rely on these data providers in order for you to run your tests which may not align to their priorities. With Data Caterer, you can generate the same data that they would produce, along with maintaining referential integrity across the data providers, so that you can run your tests without relying on their systems being up and reliable in their corresponding lower environments.</p>"},{"location":"use-case/#scenario-testing","title":"Scenario testing","text":"<p>If you want to set up particular data scenarios, you can customise the generated data to fit your scenario. 
Once the data gets generated and is consumed, you can also run validations to ensure your system has consumed the data correctly. These scenarios can be put together from existing tasks or data sources can be enabled/disabled based on your requirement. Built into Data Caterer and controlled via feature flags, is the ability to test edge cases based on the data type of the fields used for data generation (<code>enableEdgeCases</code> flag within <code>&lt;field&gt;.generator.options</code>, see more here).</p>"},{"location":"use-case/#data-debugging","title":"Data debugging","text":"<p>When data related issues occur in production, it may be difficult to replicate in a lower or local environment. It could be related to specific fields not containing expected results, size of data is too large or missing corresponding referenced data. This becomes key to resolving the issue as you can directly code against the exact data scenario and have confidence that your code changes will fix the problem. Data Caterer can be used to generate the appropriate data in whichever environment you want to test your changes against.</p>"},{"location":"use-case/#data-profiling","title":"Data profiling","text":"<p>When using Data Caterer with the feature flag <code>enableGeneratePlanAndTasks</code> enabled (see here), metadata relating all the fields defined in the data sources you have configured will be generated via data profiling. You can run this as a standalone job (can disable <code>enableGenerateData</code>)  so that you can focus on the profile of the data you are utilising. This can be run against your production data sources  to ensure the metadata can be used to accurately generate data in other environments. 
This is a key feature of Data  Caterer as no direct production connections need to be maintained to generate data in other environments (which can  lead to serious concerns about data security as seen here).</p>"},{"location":"use-case/#schema-gathering","title":"Schema gathering","text":"<p>When using Data Caterer with the feature flag <code>enableGeneratePlanAndTasks</code> enabled (see here), all schemas of the data sources defined will be tracked in a common format (as tasks). This data, along with the data profiling metadata, could then feed back into your schema registries to help keep them up to date with your system.</p>"},{"location":"get-started/docker/","title":"Run Data Caterer","text":""},{"location":"get-started/docker/#quick-start","title":"Quick start","text":"<p>Ensure you have <code>docker</code> installed and running.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\ncd data-caterer-example &amp;&amp; ./run.sh\n#check results under docker/sample/report/index.html folder\n</code></pre>"},{"location":"get-started/docker/#report","title":"Report","text":"<p>Check the report generated under <code>docker/data/custom/report/index.html</code>.</p> <p>Sample report can also be seen here</p>"},{"location":"get-started/docker/#paid-version-trial","title":"Paid Version Trial","text":"<p>30 day trial of the paid version can be accessed via these steps:</p> <ol> <li>Join the Slack Data Catering Slack group here</li> <li>Get an API_KEY by using slash command <code>/token</code> in the Slack group (will only be visible to you)</li> <li> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\ncd data-caterer-example &amp;&amp; export DATA_CATERING_API_KEY=&lt;insert api key&gt;\n./run.sh\n</code></pre> </li> </ol> <p>If you want to check how long your trial has left, you can check back in the Slack group or type <code>/token</code> again.</p>"},{"location":"get-started/docker/#guided-tour","title":"Guided 
tour","text":"<p>Check out the starter guide here that will take you through step by step. You can also check the other guides here to see the other possibilities of what Data Caterer can achieve for you.</p>"},{"location":"legal/privacy-policy/","title":"Privacy Policy","text":"<p>Last updated September 25, 2023</p>"},{"location":"legal/privacy-policy/#data-caterer-policy-on-privacy-of-customer-personal-information","title":"Data Caterer Policy on Privacy of Customer Personal Information","text":"<p>Peter John Flook is committed to protecting the privacy and security of your personal information obtained by reason of your use of Data Caterer. This policy explains the types of customer personal information we collect, how it is used, and the steps we take to ensure your personal information is handled appropriately.</p>"},{"location":"legal/privacy-policy/#who-is-peter-john-flook","title":"Who is Peter John Flook?","text":"<p>For purposes of this Privacy Policy, \u201cPeter John Flook\u201d means Peter John Flook, the company developing and providing Data Caterer and related websites and services.</p>"},{"location":"legal/privacy-policy/#what-is-personal-information","title":"What is personal information?","text":"<p>Personal information is information that refers to an individual specifically and is recorded in any form. Personal information includes such things as age, income, date of birth, ethnic origin and credit records. 
Information about individuals contained in the following documents is not considered personal information:</p> <ul> <li>public telephone directories, where the subscriber can refuse to be listed</li> <li>professional and business directories available to the public</li> <li>public registries and court records</li> <li>other publicly available printed and electronic publications</li> </ul>"},{"location":"legal/privacy-policy/#we-are-accountable-to-you","title":"We are accountable to you","text":"<p>Peter John Flook is responsible for all personal information under its control. Our team is accountable for compliance with these privacy and security principles.</p>"},{"location":"legal/privacy-policy/#we-let-you-know-why-we-collect-and-use-your-personal-information-and-get-your-consent","title":"We let you know why we collect and use your personal information and get your consent","text":"<p>Peter John Flook identifies the purpose for which your personal information is collected and will be used or disclosed. If that purpose is not listed below we will do this before or at the time the information is actually being collected. You will be deemed to consent to our use of your personal information for the purpose of:</p> <ul> <li>communicating with you generally</li> <li>processing your purchases</li> <li>processing and keeping track of transactions and reporting back to you</li> <li>protecting against fraud or error</li> <li>providing product and services requested by you</li> <li>recommending products and services that Peter John Flook believes will be of interest and provide value to you</li> <li>fulfilling any other purpose that would be reasonably apparent to the average person at the time we collect it from   you</li> </ul> <p>Otherwise, Peter John Flook will obtain your express consent (by verbal, written or electronic agreement) to collect, use or disclose your personal information. 
You can change your consent preferences at any time by contacting Peter John Flook (please refer to the \u201cHow to contact us\u201d section below).</p>"},{"location":"legal/privacy-policy/#we-limit-collection-of-your-personal-information","title":"We limit collection of your personal information","text":"<p>Peter John Flook collects only the information required to provide products and services to you. Peter John Flook will collect personal information only by clear, fair and lawful means.</p> <p>We receive and store any information you enter on our website or give us in any other way. You can choose not to provide certain information, but then you might not be able to take advantage of many of our features.</p> <p>Peter John Flook does not receive or store personal content saved to your local device while using Data Caterer.</p> <p>We also receive and store certain types of information whenever you interact with us.</p>"},{"location":"legal/privacy-policy/#information-provided-to-stripe","title":"Information provided to Stripe","text":"<p>All purchases that are made through this site are processed securely and externally by Stripe. Unless you expressly consent otherwise, we do not see or have access to any personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address).</p>"},{"location":"legal/privacy-policy/#we-limit-disclosure-and-retention-of-your-personal-information","title":"We limit disclosure and retention of your personal information","text":"<p>Peter John Flook does not disclose personal information to any organization or person for any reason except the following:</p> <p>We employ other companies and individuals to perform functions on our behalf. 
Examples include fulfilling orders, delivering packages, sending postal mail and e-mail, removing repetitive information from customer lists, analyzing data, providing marketing assistance, processing credit card payments, and providing customer service. They have access to personal information needed to perform their functions, but may not use it for other purposes. We may use service providers located outside of Australia, and, if applicable, your personal information may be processed and stored in other countries and therefore may be subject to disclosure under the laws of those countries. As we continue to develop our business, we might sell or buy stores, subsidiaries, or business units. In such transactions, customer information generally is one of the transferred business assets but remains subject to the promises made in any pre-existing Privacy Notice (unless, of course, the customer consents otherwise). Also, in the unlikely event that Peter John Flook or substantially all of its assets are acquired, customer information of course will be one of the transferred assets. You are deemed to consent to disclosure of your personal information for those purposes. If your personal information is shared with third parties, those third parties are bound by appropriate agreements with Peter John Flook to secure and protect the confidentiality of your personal information.</p> <p>Peter John Flook retains your personal information only as long as it is required for our business relationship or as required by federal and provincial laws.</p>"},{"location":"legal/privacy-policy/#we-keep-your-personal-information-up-to-date-and-accurate","title":"We keep your personal information up to date and accurate","text":"<p>Peter John Flook keeps your personal information up to date, accurate and relevant for its intended use.</p> <p>You may request access to the personal information we have on record in order to review and amend the information, as appropriate. 
In circumstances where your personal information has been provided by a third party, we will refer you to that party (e.g. credit bureaus). To access your personal information, refer to the \u201cHow to contact us\u201d section below.</p>"},{"location":"legal/privacy-policy/#the-security-of-your-personal-information-is-a-priority-for-peter-john-flook","title":"The security of your personal information is a priority for Peter John Flook","text":"<p>We take steps to safeguard your personal information, regardless of the format in which it is held, including:</p> <p>physical security measures such as restricted access facilities and locked filing cabinets electronic security measures for computerized personal information such as password protection, database encryption and personal identification numbers. We work to protect the security of your information during transmission by using \u201cTransport Layer Security\u201d (TLS) protocol. organizational processes such as limiting access to your personal information to a selected group of individuals contractual obligations with third parties who need access to your personal information requiring them to protect and secure your personal information It\u2019s important for you to protect against unauthorized access to your password and your computer. Be sure to sign off when you\u2019ve finished using any shared computer.</p>"},{"location":"legal/privacy-policy/#what-about-third-party-advertisers-and-links-to-other-websites","title":"What About Third-Party Advertisers and Links to Other Websites?","text":"<p>Our site may include third-party advertising and links to other websites. 
We do not provide any personally identifiable customer information to these advertisers or third-party websites.</p> <p>These third-party websites and advertisers, or Internet advertising companies working on their behalf, sometimes use technology to send (or \u201cserve\u201d) the advertisements that appear on our website directly to your browser. They automatically receive your IP address when this happens. They may also use cookies, JavaScript, web beacons (also known as action tags or single-pixel gifs), and other technologies to measure the effectiveness of their ads and to personalize advertising content. We do not have access to or control over cookies or other features that they may use, and the information practices of these advertisers and third-party websites are not covered by this Privacy Notice. Please contact them directly for more information about their privacy practices. In addition, the Network Advertising Initiative offers useful information about Internet advertising companies (also called \u201cad networks\u201d or \u201cnetwork advertisers\u201d), including information about how to opt-out of their information collection. You can access the Network Advertising Initiative at http://www.networkadvertising.org.</p>"},{"location":"legal/privacy-policy/#redirection-to-stripe","title":"Redirection to Stripe","text":"<p>In particular, when you submit an order to us, you may be automatically redirected to Stripe in order to complete the required payment. The payment page that is provided by Stripe is not part of this site. As noted above, we are not privy to any of the bank account, credit card or other personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address). 
We recommend that you refer to Stripe\u2019s privacy statement if you would like more information about how Stripe collects and handles your personal information.</p>"},{"location":"legal/privacy-policy/#we-are-open-about-our-privacy-and-security-policy","title":"We are open about our privacy and security policy","text":"<p>We are committed to providing you with understandable and easily available information about our policy and practices related to management of your personal information. This policy and any related information is available at all times on our website, https://data.catering/about/ under Privacy or on request. To contact us, refer to the \u201cHow to contact us\u201d section below.</p>"},{"location":"legal/privacy-policy/#we-provide-access-to-your-personal-information-stored-by-peter-john-flook","title":"We provide access to your personal information stored by Peter John Flook","text":"<p>You can request access to your personal information stored by Peter John Flook. To contact us, refer to the \u201cHow to contact us\u201d section below. Upon receiving such a request, Peter John Flook will:</p> <p>inform you about what type of personal information we have on record or in our control, how it is used and to whom it may have been disclosed provide you with access to your information so you can review and verify the accuracy and completeness and request changes to the information make any necessary updates to your personal information We respond to your questions, concerns and complaints about privacy Peter John Flook responds in a timely manner to your questions, concerns and complaints about the privacy of your personal information and our privacy policies and procedures.</p>"},{"location":"legal/privacy-policy/#how-to-contact-us","title":"How to contact us","text":"<ul> <li>by email at <code>peter.flook@data.catering</code></li> </ul> <p>Our business changes constantly, and this privacy notice will change also. 
We may e-mail periodic reminders of our notices and conditions, unless you have instructed us not to, but you should check our website frequently to see recent changes. We are, however, committed to protecting your information and will never materially change our policies and practices to make them less protective of customer information collected in the past without the consent of affected customers.</p>"},{"location":"legal/terms-of-service/","title":"Terms and Conditions","text":"<p>Last updated: September 25, 2023</p> <p>Please read these terms and conditions carefully before using Our Service.</p>"},{"location":"legal/terms-of-service/#interpretation-and-definitions","title":"Interpretation and Definitions","text":""},{"location":"legal/terms-of-service/#interpretation","title":"Interpretation","text":"<p>The words of which the initial letter is capitalized have meanings defined under the following conditions. The following definitions shall have the same meaning regardless of whether they appear in singular or in plural.</p>"},{"location":"legal/terms-of-service/#definitions","title":"Definitions","text":"<p>For the purposes of these Terms and Conditions:</p> <ul> <li>Application means the software program provided by the Company downloaded by You on any electronic device, named   Data Caterer</li> <li>Application Store means the digital distribution service operated and developed by Docker Inc. 
(\u201cDocker\u201d) in which the Application has been downloaded.</li> <li>Affiliate means an entity that controls, is controlled by or is under common control with a party, where \"control\" means ownership of 50% or more of the shares, equity interest or other securities entitled to vote for election of directors or other managing authority.</li> <li>Country refers to: New South Wales, Australia</li> <li>Company (referred to as either \"the Company\", \"We\", \"Us\" or \"Our\" in this Agreement) refers to Peter John Flook (ABN: 65153160916), 30 Anne William Drive, West Pennant Hills, 2125, NSW, Australia.</li> <li>Device means any device that can access the Service such as a computer, a cellphone or a digital tablet.</li> <li>Service refers to the Application.</li> <li>Terms and Conditions (also referred to as \"Terms\") mean these Terms and Conditions that form the entire agreement between You and the Company regarding the use of the Service.</li> <li>Third-party Social Media Service means any services or content (including data, information, products or services) provided by a third party that may be displayed, included or made available by the Service.</li> <li>You means the individual accessing or using the Service, or the company, or other legal entity on behalf of which such individual is accessing or using the Service, as applicable.</li> </ul>"},{"location":"legal/terms-of-service/#acknowledgment","title":"Acknowledgment","text":"<p>These are the Terms and Conditions governing the use of this Service and the agreement that operates between You and the Company. These Terms and Conditions set out the rights and obligations of all users regarding the use of the Service.</p> <p>Your access to and use of the Service is conditioned on Your acceptance of and compliance with these Terms and Conditions. 
These Terms and Conditions apply to all visitors, users and others who access or use the Service.</p> <p>By accessing or using the Service You agree to be bound by these Terms and Conditions. If You disagree with any part of these Terms and Conditions then You may not access the Service.</p> <p>You represent that you are over the age of 18. The Company does not permit those under 18 to use the Service.</p> <p>Your access to and use of the Service is also conditioned on Your acceptance of and compliance with the Privacy Policy of the Company. Our Privacy Policy describes Our policies and procedures on the collection, use and disclosure of Your personal information when You use the Application or the Website and tells You about Your privacy rights and how the law protects You. Please read Our Privacy Policy carefully before using Our Service.</p>"},{"location":"legal/terms-of-service/#links-to-other-websites","title":"Links to Other Websites","text":"<p>Our Service may contain links to third-party websites or services that are not owned or controlled by the Company.</p> <p>The Company has no control over, and assumes no responsibility for, the content, privacy policies, or practices of any third party websites or services. 
You further acknowledge and agree that the Company shall not be responsible or liable, directly or indirectly, for any damage or loss caused or alleged to be caused by or in connection with the use of or reliance on any such content, goods or services available on or through any such websites or services.</p> <p>We strongly advise You to read the terms and conditions and privacy policies of any third-party websites or services that You visit.</p>"},{"location":"legal/terms-of-service/#termination","title":"Termination","text":"<p>We may terminate or suspend Your access immediately, without prior notice or liability, for any reason whatsoever, including without limitation if You breach these Terms and Conditions.</p> <p>Upon termination, Your right to use the Service will cease immediately.</p>"},{"location":"legal/terms-of-service/#limitation-of-liability","title":"Limitation of Liability","text":"<p>Notwithstanding any damages that You might incur, the entire liability of the Company and any of its suppliers under any provision of these Terms and Your exclusive remedy for all the foregoing shall be limited to the amount actually paid by You through the Service or 100 USD if You haven't purchased anything through the Service.</p> <p>To the maximum extent permitted by applicable law, in no event shall the Company or its suppliers be liable for any special, incidental, indirect, or consequential damages whatsoever (including, but not limited to, damages for loss of profits, loss of data or other information, for business interruption, for personal injury, loss of privacy arising out of or in any way related to the use of or inability to use the Service, third-party software and/or third-party hardware used with the Service, or otherwise in connection with any provision of these Terms), even if the Company or any supplier has been advised of the possibility of such damages and even if the remedy fails of its essential purpose.</p> <p>Some states do not allow the 
exclusion of implied warranties or limitation of liability for incidental or consequential damages, which means that some of the above limitations may not apply. In these states, each party's liability will be limited to the greatest extent permitted by law.</p>"},{"location":"legal/terms-of-service/#as-is-and-as-available-disclaimer","title":"\"AS IS\" and \"AS AVAILABLE\" Disclaimer","text":"<p>The Service is provided to You \"AS IS\" and \"AS AVAILABLE\" and with all faults and defects without warranty of any kind. To the maximum extent permitted under applicable law, the Company, on its own behalf and on behalf of its Affiliates and its and their respective licensors and service providers, expressly disclaims all warranties, whether express, implied, statutory or otherwise, with respect to the Service, including all implied warranties of merchantability, fitness for a particular purpose, title and non-infringement, and warranties that may arise out of course of dealing, course of performance, usage or trade practice. 
Without limitation to the foregoing, the Company provides no warranty or undertaking, and makes no representation of any kind that the Service will meet Your requirements, achieve any intended results, be compatible or work with any other software, applications, systems or services, operate without interruption, meet any performance or reliability standards or be error free or that any errors or defects can or will be corrected.</p> <p>Without limiting the foregoing, neither the Company nor any of the company's provider makes any representation or warranty of any kind, express or implied: (i) as to the operation or availability of the Service, or the information, content, and materials or products included thereon; (ii) that the Service will be uninterrupted or error-free; (iii) as to the accuracy, reliability, or currency of any information or content provided through the Service; or (iv) that the Service, its servers, the content, or e-mails sent from or on behalf of the Company are free of viruses, scripts, trojan horses, worms, malware, time-bombs or other harmful components.</p> <p>Some jurisdictions do not allow the exclusion of certain types of warranties or limitations on applicable statutory rights of a consumer, so some or all of the above exclusions and limitations may not apply to You. But in such a case the exclusions and limitations set forth in this section shall be applied to the greatest extent enforceable under applicable law.</p>"},{"location":"legal/terms-of-service/#governing-law","title":"Governing Law","text":"<p>The laws of the Country, excluding its conflicts of law rules, shall govern this Terms and Your use of the Service. 
Your use of the Application may also be subject to other local, state, national, or international laws.</p>"},{"location":"legal/terms-of-service/#disputes-resolution","title":"Disputes Resolution","text":"<p>If You have any concern or dispute about the Service, You agree to first try to resolve the dispute informally by contacting the Company.</p>"},{"location":"legal/terms-of-service/#for-european-union-eu-users","title":"For European Union (EU) Users","text":"<p>If You are a European Union consumer, you will benefit from any mandatory provisions of the law of the country in which you are resident.</p>"},{"location":"legal/terms-of-service/#united-states-legal-compliance","title":"United States Legal Compliance","text":"<p>You represent and warrant that (i) You are not located in a country that is subject to the United States government embargo, or that has been designated by the United States government as a \"terrorist supporting\" country, and (ii) You are not listed on any United States government list of prohibited or restricted parties.</p>"},{"location":"legal/terms-of-service/#severability-and-waiver","title":"Severability and Waiver","text":""},{"location":"legal/terms-of-service/#severability","title":"Severability","text":"<p>If any provision of these Terms is held to be unenforceable or invalid, such provision will be changed and interpreted to accomplish the objectives of such provision to the greatest extent possible under applicable law and the remaining provisions will continue in full force and effect.</p>"},{"location":"legal/terms-of-service/#waiver","title":"Waiver","text":"<p>Except as provided herein, the failure to exercise a right or to require performance of an obligation under these Terms shall not affect a party's ability to exercise such right or require such performance at any time thereafter nor shall the waiver of a breach constitute a waiver of any subsequent 
breach.</p>"},{"location":"legal/terms-of-service/#translation-interpretation","title":"Translation Interpretation","text":"<p>These Terms and Conditions may have been translated if We have made them available to You on our Service. You agree that the original English text shall prevail in the case of a dispute.</p>"},{"location":"legal/terms-of-service/#changes-to-these-terms-and-conditions","title":"Changes to These Terms and Conditions","text":"<p>We reserve the right, at Our sole discretion, to modify or replace these Terms at any time. If a revision is material We will make reasonable efforts to provide at least 30 days' notice prior to any new terms taking effect. What constitutes a material change will be determined at Our sole discretion.</p> <p>By continuing to access or use Our Service after those revisions become effective, You agree to be bound by the revised terms. If You do not agree to the new terms, in whole or in part, please stop using the website and the Service.</p>"},{"location":"legal/terms-of-service/#contact-us","title":"Contact Us","text":"<p>If you have any questions about these Terms and Conditions, You can contact us:</p> <ul> <li>By email: peter.flook@data.catering</li> </ul>"},{"location":"setup/","title":"Setup","text":"<p>All the configurations and customisation related to Data Caterer can be found under here.</p>"},{"location":"setup/#guide","title":"Guide","text":"<p>If you want a guided tour of using the Java or Scala API, you can follow one of the guides found here.</p>"},{"location":"setup/#specific-configuration","title":"Specific Configuration","text":"<ul> <li> Configurations - Configurations relating to feature flags, folder pathways, metadata   analysis</li> <li> Connections - Explore the data source connections available</li> <li> Generators - Choose and configure the type of generator you want used for   fields</li> <li> Validations - How to validate data to ensure your system is performing as expected</li> <li> Foreign 
Keys - Define links between data elements across data sources</li> <li> Deployment - Deploy Data Caterer as a job to your chosen environment</li> <li> Advanced - Advanced usage of Data Caterer</li> </ul>"},{"location":"setup/#high-level-run-configurations","title":"High Level Run Configurations","text":""},{"location":"setup/advanced/","title":"Advanced use cases","text":""},{"location":"setup/advanced/#special-data-formats","title":"Special data formats","text":"<p>There are many options available for scenarios where data has to be in a certain format.</p> <ol> <li>Create expression datafaker<ol> <li>Can be used to create names, addresses, or anything that can be found    under here</li> </ol> </li> <li>Create regex</li> </ol>"},{"location":"setup/advanced/#foreign-keys-across-data-sets","title":"Foreign keys across data sets","text":"<p>Details for how you can configure foreign keys can be found here.</p>"},{"location":"setup/advanced/#edge-cases","title":"Edge cases","text":"<p>For each given data type, there are edge cases which can cause issues when your application processes the data. This can be controlled at a column level by including the following flag in the generator options:</p> JavaScalaYAML <pre><code>field()\n.name(\"amount\")\n.type(DoubleType.instance())\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n</code></pre> <pre><code>field\n.name(\"amount\")\n.`type`(DoubleType)\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n</code></pre> <pre><code>fields:\n  - name: \"amount\"\n    type: \"double\"\n    generator:\n      type: \"random\"\n      options:\n        enableEdgeCases: \"true\"\n        edgeCaseProb: 0.1\n</code></pre> <p>If you want to know all the possible edge cases for each data type, you can check the documentation here.</p>"},{"location":"setup/advanced/#scenario-testing","title":"Scenario testing","text":"<p>You can create specific scenarios by adjusting the metadata found in the plan and tasks to your liking. 
For example, suppose you have two data sources, a Postgres database and a parquet file, and you want to save account data into Postgres and transactions related to those accounts into a parquet file. You can alter the <code>status</code> column in the account data to only generate <code>open</code> accounts and define a foreign key between Postgres and parquet to ensure the same <code>account_id</code> is being used. Then in the parquet task, define 1 to 10 transactions per <code>account_id</code> to be generated.</p> <p>Postgres account generation example task Parquet transaction generation example task Plan</p>"},{"location":"setup/advanced/#cloud-storage","title":"Cloud storage","text":""},{"location":"setup/advanced/#data-source","title":"Data source","text":"<p>If you want to save the file types CSV, JSON, Parquet or ORC into cloud storage, you can do so by adding extra configuration. Below is an example for S3.</p> JavaScalaYAML <pre><code>var csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield().name(\"account_id\"),\n...\n);\n\nvar s3Configuration = configuration()\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n\nexecute(s3Configuration, csvTask);\n</code></pre> <pre><code>val csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield.name(\"account_id\"),\n...\n)\n\nval s3Configuration = 
configuration\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -&gt; \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -&gt; \"true\",\n\"spark.hadoop.fs.defaultFS\" -&gt; \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -&gt; \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -&gt; \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -&gt; \"secret_key\"\n))\n\nexecute(s3Configuration, csvTask)\n</code></pre> <pre><code>folders {\ngeneratedPlanAndTaskFolderPath = \"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig {\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n</code></pre>"},{"location":"setup/advanced/#storing-plantasks","title":"Storing plan/task(s)","text":"<p>You can generate and store the plan/task files inside either AWS S3, Azure Blob Storage or Google GCS. 
This can be controlled via configuration set in the <code>application.conf</code> file where you can set something like the below:</p> JavaScalaYAML <pre><code>configuration()\n.generatedPlanAndTaskFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n</code></pre> <pre><code>configuration\n.generatedPlanAndTaskFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -&gt; \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -&gt; \"true\",\n\"spark.hadoop.fs.defaultFS\" -&gt; \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -&gt; \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -&gt; \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -&gt; \"secret_key\"\n))\n</code></pre> <pre><code>folders {\ngeneratedPlanAndTaskFolderPath = 
\"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig {\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n</code></pre>"},{"location":"setup/configuration/","title":"Configuration","text":"<p>A number of configurations can be made and customised within Data Caterer to help control what gets run and/or where any metadata gets saved.</p> <p>These configurations are defined from within your Java or Scala class via <code>configuration</code> or for YAML file setup, <code>application.conf</code> file as seen  here.</p>"},{"location":"setup/configuration/#flags","title":"Flags","text":"<p>Flags are used to control which processes are executed when you run Data Caterer.</p> Config Default Paid Description <code>enableGenerateData</code> true N Enable/disable data generation <code>enableCount</code> true N Count the number of records generated. Can be disabled to improve performance <code>enableFailOnError</code> true N Whilst saving generated data, if there is an error, it will stop any further data from being generated <code>enableSaveReports</code> true N Enable/disable HTML reports summarising data generated, metadata of data generated (if <code>enableSinkMetadata</code> is enabled) and validation results (if <code>enableValidation</code> is enabled). 
Sample here <code>enableSinkMetadata</code> true N Run data profiling for the generated data. Shown in HTML reports if <code>enableSaveReports</code> is enabled <code>enableValidation</code> false N Run validations as described in plan. Results can be viewed from logs or from HTML report if <code>enableSaveReports</code> is enabled. Sample here <code>enableGeneratePlanAndTasks</code> false Y Enable/disable plan and task auto generation based off data source connections <code>enableRecordTracking</code> false Y Enable/disable which data records have been generated for any data source <code>enableDeleteGeneratedRecords</code> false Y Delete all generated records based off record tracking (if <code>enableRecordTracking</code> has been set to true) <code>enableGenerateValidations</code> false Y If enabled, it will generate validations based on the data sources defined. JavaScalaapplication.conf <pre><code>configuration()\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false);\n</code></pre> <pre><code>configuration\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false)\n</code></pre> <pre><code>flags {\n  enableCount = false\n  enableCount = ${?ENABLE_COUNT}\n  enableGenerateData = true\n  enableGenerateData = ${?ENABLE_GENERATE_DATA}\n  enableFailOnError = true\n  enableFailOnError = ${?ENABLE_FAIL_ON_ERROR}\n  enableGeneratePlanAndTasks = false\n  enableGeneratePlanAndTasks = ${?ENABLE_GENERATE_PLAN_AND_TASKS}\n  enableRecordTracking = false\n  enableRecordTracking = 
${?ENABLE_RECORD_TRACKING}\n  enableDeleteGeneratedRecords = false\n  enableDeleteGeneratedRecords = ${?ENABLE_DELETE_GENERATED_RECORDS}\n  enableGenerateValidations = false\n  enableGenerateValidations = ${?ENABLE_GENERATE_VALIDATIONS}\n}\n</code></pre>"},{"location":"setup/configuration/#folders","title":"Folders","text":"<p>Depending on which flags are enabled, there are folders that get used to save metadata, store HTML reports or track the records generated.</p> <p>These folder pathways can be defined as a cloud storage pathway (i.e. <code>s3a://my-bucket/task</code>).</p> Config Default Paid Description <code>planFilePath</code> /opt/app/plan/customer-create-plan.yaml N Plan file path to use when generating and/or validating data <code>taskFolderPath</code> /opt/app/task N Task folder path that contains all the task files (can have nested directories) <code>validationFolderPath</code> /opt/app/validation N Validation folder path that contains all the validation files (can have nested directories) <code>generatedReportsFolderPath</code> /opt/app/report N Where HTML reports get generated that contain information about data generated along with any validations performed <code>generatedPlanAndTaskFolderPath</code> /tmp Y Folder path where generated plan and task files will be saved <code>recordTrackingFolderPath</code> /opt/app/record-tracking Y Where record tracking parquet files get saved JavaScalaapplication.conf <pre><code>configuration()\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\");\n</code></pre> 
<pre><code>configuration\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\")\n</code></pre> <pre><code>folders {\n  planFilePath = \"/opt/app/custom/plan/postgres-plan.yaml\"\n  planFilePath = ${?PLAN_FILE_PATH}\n  taskFolderPath = \"/opt/app/custom/task\"\n  taskFolderPath = ${?TASK_FOLDER_PATH}\n  validationFolderPath = \"/opt/app/custom/validation\"\n  validationFolderPath = ${?VALIDATION_FOLDER_PATH}\n  generatedReportsFolderPath = \"/opt/app/custom/report\"\n  generatedReportsFolderPath = ${?GENERATED_REPORTS_FOLDER_PATH}\n  generatedPlanAndTaskFolderPath = \"/opt/app/custom/generated\"\n  generatedPlanAndTaskFolderPath = ${?GENERATED_PLAN_AND_TASK_FOLDER_PATH}\n  recordTrackingFolderPath = \"/opt/app/custom/record-tracking\"\n  recordTrackingFolderPath = ${?RECORD_TRACKING_FOLDER_PATH}\n}\n</code></pre>"},{"location":"setup/configuration/#metadata","title":"Metadata","text":"<p>When metadata gets generated, there are some configurations that can be altered to help with performance or accuracy related issues. Metadata gets generated from two processes: 1) if <code>enableGeneratePlanAndTasks</code> or 2) if <code>enableSinkMetadata</code> are enabled.</p> <p>During the generation of plan and tasks, data profiling is used to create the metadata for each of the fields defined in the data source. You may face issues if the number of records in the data source is large as data profiling is an expensive task. 
Similarly, it can be expensive when analysing the generated data if the number of records generated is large.</p> Config Default Paid Description <code>numRecordsFromDataSource</code> 10000 Y Number of records read in from the data source that could be used for data profiling <code>numRecordsForAnalysis</code> 10000 Y Number of records used for data profiling from the records gathered in <code>numRecordsFromDataSource</code> <code>oneOfMinCount</code> 1000 Y Minimum number of records required before considering if a field can be of type <code>oneOf</code> <code>oneOfDistinctCountVsCountThreshold</code> 0.2 Y Threshold ratio to determine if a field is of type <code>oneOf</code> (e.g. a field called <code>status</code> that only contains <code>open</code> or <code>closed</code>. Distinct count = 2, total count = 10, ratio = 2 / 10 = 0.2 therefore marked as <code>oneOf</code>) <code>numGeneratedSamples</code> 10 N Number of sample records from generated data to take. Shown in HTML report JavaScalaapplication.conf <pre><code>configuration()\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(0.2)\n.numGeneratedSamples(10);\n</code></pre> <pre><code>configuration\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(0.2)\n.numGeneratedSamples(10)\n</code></pre> <pre><code>metadata {\n  numRecordsFromDataSource = 10000\n  numRecordsForAnalysis = 10000\n  oneOfMinCount = 1000\n  oneOfDistinctCountVsCountThreshold = 0.2\n  numGeneratedSamples = 10\n}\n</code></pre>"},{"location":"setup/configuration/#generation","title":"Generation","text":"<p>When generating data, you may have some limitations such as limited CPU or memory, large number of data sources, or data sources prone to failure under load. 
To help alleviate these issues or speed up performance, you can control the number of records that get generated in each batch.</p> Config Default Paid Description <code>numRecordsPerBatch</code> 100000 N Number of records across all data sources to generate per batch <code>numRecordsPerStep</code> N Overrides the count defined in each step with this value if defined (i.e. if set to 1000, for each step, 1000 records will be generated) JavaScalaapplication.conf <pre><code>configuration()\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000);\n</code></pre> <pre><code>configuration\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000)\n</code></pre> <pre><code>generation {\n  numRecordsPerBatch = 100000\n  numRecordsPerStep = 1000\n}\n</code></pre>"},{"location":"setup/configuration/#runtime","title":"Runtime","text":"<p>Given Data Caterer uses Spark as the base framework for data processing, you can configure the job to your specifications via configuration as seen here.</p> JavaScalaapplication.conf <pre><code>configuration()\n.master(\"local[*]\")\n.runtimeConfig(Map.of(\"spark.driver.cores\", \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\", \"10g\");\n</code></pre> <pre><code>configuration\n.master(\"local[*]\")\n.runtimeConfig(Map(\"spark.driver.cores\" -&gt; \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\" -&gt; \"10g\")\n</code></pre> <pre><code>runtime {\n  master = \"local[*]\"\n  master = ${?DATA_CATERER_MASTER}\n  config {\n    \"spark.driver.cores\" = \"5\"\n    \"spark.driver.memory\" = \"10g\"\n  }\n}\n</code></pre>"},{"location":"setup/connection/","title":"Data Source Connections","text":"<p>Details of all the connection configuration supported can be found in the below subsections for each type of connection.</p> <p>These configurations can be done via API or from configuration. 
Examples of both are shown for each data source below.</p>"},{"location":"setup/connection/#supported-data-connections","title":"Supported Data Connections","text":"Data Source Type Data Source Sponsor Database Postgres, MySQL, Cassandra N File CSV, JSON, ORC, Parquet N Messaging Kafka, Solace Y HTTP REST API Y Metadata Marquez, OpenMetadata, OpenAPI/Swagger Y"},{"location":"setup/connection/#api","title":"API","text":"<p>All connection details require a name. Depending on the data source, you can define additional options which may be used by the driver or connector for connecting to the data source.</p>"},{"location":"setup/connection/#configuration-file","title":"Configuration file","text":"<p>All connection details follow the same pattern.</p> <pre><code>&lt;connection format&gt; {\n    &lt;connection name&gt; {\n        &lt;key&gt; = &lt;value&gt;\n    }\n}\n</code></pre> <p>Overriding configuration</p> <p>When defining a configuration value that can be defined by a system property or environment variable at runtime, you can define that via the following:</p> <pre><code>url = \"localhost\"\nurl = ${?POSTGRES_URL}\n</code></pre> <p>The above defines that if there is a system property or environment variable named <code>POSTGRES_URL</code>, then that value will be used for the <code>url</code>, otherwise, it will default to <code>localhost</code>.</p>"},{"location":"setup/connection/#data-sources","title":"Data sources","text":"<p>To find examples of a task for each type of data source, please check out this page.</p>"},{"location":"setup/connection/#file","title":"File","text":"<p>Linked here is a list of generic options that can be included as part of your file data source configuration if required. 
Links to specific file type configurations can be found below.</p>"},{"location":"setup/connection/#csv","title":"CSV","text":"JavaScalaapplication.conf <pre><code>csv(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>csv(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>csv {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?CSV_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for CSV can be found here</p>"},{"location":"setup/connection/#json","title":"JSON","text":"JavaScalaapplication.conf <pre><code>json(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>json(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>json {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?JSON_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for JSON can be found here</p>"},{"location":"setup/connection/#orc","title":"ORC","text":"JavaScalaapplication.conf <pre><code>orc(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>orc(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>orc {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?ORC_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for ORC can be found here</p>"},{"location":"setup/connection/#parquet","title":"Parquet","text":"JavaScalaapplication.conf <pre><code>parquet(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>parquet(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>parquet {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?PARQUET_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for Parquet can be found 
here</p>"},{"location":"setup/connection/#delta-not-supported-yet","title":"Delta (not supported yet)","text":"JavaScalaapplication.conf <pre><code>delta(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>delta(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>delta {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?DELTA_PATH}\n  }\n}\n</code></pre>"},{"location":"setup/connection/#rmdbs","title":"RMDBS","text":"<p>Follows the same configuration used by Spark as found here. Sample can be found below</p> JavaScalaapplication.conf <pre><code>postgres(\n\"customer_postgres\",                            #name\n\"jdbc:postgresql://localhost:5432/customer\",    #url\n\"postgres\",                                     #username\n\"postgres\"                                      #password\n)\n</code></pre> <pre><code>postgres(\n\"customer_postgres\",                            #name\n\"jdbc:postgresql://localhost:5432/customer\",    #url\n\"postgres\",                                     #username\n\"postgres\"                                      #password\n)\n</code></pre> <pre><code>jdbc {\n    customer_postgres {\n        url = \"jdbc:postgresql://localhost:5432/customer\"\n        url = ${?POSTGRES_URL}\n        user = \"postgres\"\n        user = ${?POSTGRES_USERNAME}\n        password = \"postgres\"\n        password = ${?POSTGRES_PASSWORD}\n        driver = \"org.postgresql.Driver\"\n    }\n}\n</code></pre> <p>Ensure that the user has write permission, so it is able to save the table to the target tables.</p> SQL Permission Statements <pre><code>GRANT INSERT ON &lt;schema&gt;.&lt;table&gt; TO &lt;user&gt;;\n</code></pre>"},{"location":"setup/connection/#postgres","title":"Postgres","text":"<p>Can see example API or Config definition for Postgres connection above.</p>"},{"location":"setup/connection/#permissions","title":"Permissions","text":"<p>Following 
permissions are required when generating plan and tasks:</p> SQL Permission Statements <pre><code>GRANT SELECT ON information_schema.tables TO &lt; user &gt;;\nGRANT SELECT ON information_schema.columns TO &lt; user &gt;;\nGRANT SELECT ON information_schema.key_column_usage TO &lt; user &gt;;\nGRANT SELECT ON information_schema.table_constraints TO &lt; user &gt;;\nGRANT SELECT ON information_schema.constraint_column_usage TO &lt; user &gt;;\n</code></pre>"},{"location":"setup/connection/#mysql","title":"MySQL","text":"JavaScalaapplication.conf <pre><code>mysql(\n\"customer_mysql\",                       #name\n\"jdbc:mysql://localhost:3306/customer\", #url\n\"root\",                                 #username\n\"root\"                                  #password\n)\n</code></pre> <pre><code>mysql(\n\"customer_mysql\",                       #name\n\"jdbc:mysql://localhost:3306/customer\", #url\n\"root\",                                 #username\n\"root\"                                  #password\n)\n</code></pre> <pre><code>jdbc {\n    customer_mysql {\n        url = \"jdbc:mysql://localhost:3306/customer\"\n        user = \"root\"\n        password = \"root\"\n        driver = \"com.mysql.cj.jdbc.Driver\"\n    }\n}\n</code></pre>"},{"location":"setup/connection/#permissions_1","title":"Permissions","text":"<p>Following permissions are required when generating plan and tasks:</p> SQL Permission Statements <pre><code>GRANT SELECT ON information_schema.columns TO &lt; user &gt;;\nGRANT SELECT ON information_schema.statistics TO &lt; user &gt;;\nGRANT SELECT ON information_schema.key_column_usage TO &lt; user &gt;;\n</code></pre>"},{"location":"setup/connection/#cassandra","title":"Cassandra","text":"<p>Follows same configuration as defined by the Spark Cassandra Connector as found here</p> JavaScalaapplication.conf <pre><code>cassandra(\n\"customer_cassandra\",   #name\n\"localhost:9042\",       #url\n\"cassandra\",            #username\n\"cassandra\",            
#password\nMap.of()                #optional additional connection options\n)\n</code></pre> <pre><code>cassandra(\n\"customer_cassandra\",   #name\n\"localhost:9042\",       #url\n\"cassandra\",            #username\n\"cassandra\",            #password\nMap()                #optional additional connection options\n)\n</code></pre> <pre><code>org.apache.spark.sql.cassandra {\n    customer_cassandra {\n        spark.cassandra.connection.host = \"localhost\"\n        spark.cassandra.connection.host = ${?CASSANDRA_HOST}\n        spark.cassandra.connection.port = \"9042\"\n        spark.cassandra.connection.port = ${?CASSANDRA_PORT}\n        spark.cassandra.auth.username = \"cassandra\"\n        spark.cassandra.auth.username = ${?CASSANDRA_USERNAME}\n        spark.cassandra.auth.password = \"cassandra\"\n        spark.cassandra.auth.password = ${?CASSANDRA_PASSWORD}\n    }\n}\n</code></pre>"},{"location":"setup/connection/#permissions_2","title":"Permissions","text":"<p>Ensure that the user has write permission, so it is able to save the table to the target tables.</p> CQL Permission Statements <pre><code>GRANT INSERT ON &lt;schema&gt;.&lt;table&gt; TO &lt;user&gt;;\n</code></pre> <p>Following permissions are required when enabling <code>configuration.enableGeneratePlanAndTasks(true)</code> as it will gather metadata information about tables and columns from the below tables.</p> CQL Permission Statements <pre><code>GRANT SELECT ON system_schema.tables TO &lt;user&gt;;\nGRANT SELECT ON system_schema.columns TO &lt;user&gt;;\n</code></pre>"},{"location":"setup/connection/#kafka","title":"Kafka","text":"<p>Define your Kafka bootstrap server to connect and send generated data to corresponding topics. Topic gets set at a step level. 
Further details can be found here</p> JavaScalaapplication.conf <pre><code>kafka(\n\"customer_kafka\",   #name\n\"localhost:9092\"    #url\n)\n</code></pre> <pre><code>kafka(\n\"customer_kafka\",   #name\n\"localhost:9092\"    #url\n)\n</code></pre> <pre><code>kafka {\n    customer_kafka {\n        kafka.bootstrap.servers = \"localhost:9092\"\n        kafka.bootstrap.servers = ${?KAFKA_BOOTSTRAP_SERVERS}\n    }\n}\n</code></pre> <p>When defining your schema for pushing data to Kafka, it follows a specific top level schema. An example can be found here . You can define the key, value, headers, partition or topic by following the linked schema.</p>"},{"location":"setup/connection/#jms","title":"JMS","text":"<p>Uses JNDI lookup to send messages to JMS queue. Ensure that the messaging system you are using has your queue/topic registered via JNDI otherwise a connection cannot be created.</p> JavaScalaapplication.conf <pre><code>solace(\n\"customer_solace\",                                      #name\n\"smf://localhost:55554\",                                #url\n\"admin\",                                                #username\n\"admin\",                                                #password\n\"default\",                                              #vpn name\n\"/jms/cf/default\",                                      #connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\"   #initial context factory\n)\n</code></pre> <pre><code>solace(\n\"customer_solace\",                                      #name\n\"smf://localhost:55554\",                                #url\n\"admin\",                                                #username\n\"admin\",                                                #password\n\"default\",                                              #vpn name\n\"/jms/cf/default\",                                      #connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\"   #initial context factory\n)\n</code></pre> 
<pre><code>jms {\n    customer_solace {\n        initialContextFactory = \"com.solacesystems.jndi.SolJNDIInitialContextFactory\"\n        connectionFactory = \"/jms/cf/default\"\n        url = \"smf://localhost:55555\"\n        url = ${?SOLACE_URL}\n        user = \"admin\"\n        user = ${?SOLACE_USER}\n        password = \"admin\"\n        password = ${?SOLACE_PASSWORD}\n        vpnName = \"default\"\n        vpnName = ${?SOLACE_VPN}\n    }\n}\n</code></pre>"},{"location":"setup/connection/#http","title":"HTTP","text":"<p>Define any username and/or password needed for the HTTP requests. The url is defined in the tasks to allow for generated data to be populated in the url.</p> JavaScalaapplication.conf <pre><code>http(\n\"customer_api\", #name\n\"admin\",        #username\n\"admin\"         #password\n)\n</code></pre> <pre><code>http(\n\"customer_api\", #name\n\"admin\",        #username\n\"admin\"         #password\n)\n</code></pre> <pre><code>http {\n    customer_api {\n        user = \"admin\"\n        user = ${?HTTP_USER}\n        password = \"admin\"\n        password = ${?HTTP_PASSWORD}\n    }\n}\n</code></pre>"},{"location":"setup/deployment/","title":"Deployment","text":"<p>Two main ways to deploy and run Data Caterer:</p> <ul> <li>Docker</li> <li>Helm</li> </ul>"},{"location":"setup/deployment/#docker","title":"Docker","text":"<p>To package up your class along with the Data Caterer base image, you can follow the Dockerfile that is created for you here.</p> <p>Then you can run the following:</p> <pre><code>./gradlew clean build\ndocker build -t &lt;my_image_name&gt;:&lt;my_image_tag&gt; .\n</code></pre>"},{"location":"setup/deployment/#helm","title":"Helm","text":"<p>Link to sample helm on GitHub here</p> <p>Update the configuration to your own data connections and configuration or own image created from above.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\nhelm install data-caterer 
./data-caterer-example/helm/data-caterer\n</code></pre>"},{"location":"setup/design/","title":"Design","text":"<p>This document describes the thought process behind the design of Data Caterer, to give you insight into how and why it became what it is today. It also serves as a reference for future design decisions, which will be recorded here, making this a living document.</p>"},{"location":"setup/design/#motivation","title":"Motivation","text":"<p>The main difficulties that I faced as a developer and team lead relating to testing were:</p> <ul> <li>Difficulty in testing with multiple data sources, both batch and real time</li> <li>Reliance on other teams for stable environments or domain knowledge</li> <li>Test environments with no reliable or consistent data flows</li> <li>Complex data masking/anonymization solutions</li> <li>Relying on production data (potential privacy and data breach issues)</li> <li>Cost of data production issues can be very high</li> <li>Unknown unknowns staying hidden until problems occur in production</li> <li>Underutilised metadata</li> </ul>"},{"location":"setup/design/#guiding-principles","title":"Guiding Principles","text":"<p>These difficulties helped form the basis of the principles that Data Caterer should follow:</p> <ul> <li>Data source agnostic: Connect to any batch or real time data sources for data generation or validation</li> <li>Configurable: Run the application the way you want</li> <li>Extensible: Allow for new innovations to seamlessly integrate with Data Caterer</li> <li>Integrate with existing solutions: Utilise existing metadata to make it easy for users to use straight away</li> <li>Secure: No production connections required, metadata based solution</li> <li>Fast: Give developers fast feedback loops to encourage them to thoroughly test data flows</li> </ul>"},{"location":"setup/design/#high-level-flow","title":"High level flow","text":"<pre><code>graph LR\n  subgraph userTasks [User 
Configuration]\n  dataGen[Data Generation]\n  dataValid[Data Validation]\n  runConf[Runtime Config]\n  end\n\n  subgraph dataProcessor [Processor]\n  dataCaterer[Data Caterer]\n  end\n\n  subgraph existingMetadata [Metadata]\n  metadataService[Metadata Services]\n  metadataDataSource[Data Sources]\n  end\n\n  subgraph output [Output]\n  outputDataSource[Data Sources]\n  report[Report]\n  end\n\n  dataGen --&gt; dataCaterer\n  dataValid --&gt; dataCaterer\n  runConf --&gt; dataCaterer\n  direction TB\n  dataCaterer -.-&gt; metadataService\n  dataCaterer -.-&gt; metadataDataSource\n  direction LR\n  dataCaterer ---&gt; outputDataSource\n  dataCaterer ---&gt; report</code></pre> <ol> <li>User Configuration<ol> <li>Users define data generation, validation and runtime configuration</li> </ol> </li> <li>Processor<ol> <li>Engine will take user configuration to decide how to run</li> <li>User defined configuration merged with metadata from external sources</li> </ol> </li> <li>Metadata<ol> <li>Automatically retrieve schema, data profiling, relationship or validation rule metadata from data sources or metadata services</li> </ol> </li> <li>Output<ol> <li>Execute data generation and validation tasks on data sources</li> <li>Generate report summarising outcome</li> </ol> </li> </ol>"},{"location":"setup/foreign-key/","title":"Foreign Keys","text":"<p>Foreign keys can be defined to represent the relationships between datasets where values are required to match for particular columns.</p>"},{"location":"setup/foreign-key/#single-column","title":"Single column","text":"<p>Define a column in one data source to match against another column. 
Below example shows a <code>postgres</code> data source with two tables, <code>accounts</code> and <code>transactions</code> that have a foreign key for <code>account_id</code>.</p> JavaScalaYAML <pre><code>var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList.of(Map.entry(postgresTxn, \"account_id\"))\n);\n</code></pre> <pre><code>val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList(postgresTxn -&gt; \"account_id\")\n)\n</code></pre> <pre><code>---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"full_name\"\n---\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"postgres.accounts.account_id\":\n- \"postgres.transactions.account_id\"\n</code></pre>"},{"location":"setup/foreign-key/#multiple-columns","title":"Multiple columns","text":"<p>You may have a scenario where multiple columns need to be aligned. 
From the same example, we want <code>account_id</code> and <code>name</code> from <code>accounts</code> to match <code>account_id</code> and <code>full_name</code> in <code>transactions</code> respectively.</p> JavaScalaYAML <pre><code>var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(postgresTxn, List.of(\"account_id\", \"full_name\")))\n);\n</code></pre> <pre><code>val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(postgresTxn -&gt; List(\"account_id\", \"full_name\"))\n)\n</code></pre> <pre><code>---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"full_name\"\n---\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_postgres.transactions.account_id,full_name\"\n</code></pre>"},{"location":"setup/foreign-key/#nested-column","title":"Nested column","text":"<p>Your schema structure can have nested fields which can 
also be referenced as foreign keys. But to do so, you need to create a proxy field that gets omitted from the final saved data.</p> <p>In the example below, the nested <code>customer_details.name</code> field inside the <code>json</code> task needs to match with <code>name</code> from <code>postgres</code>. A new field in the <code>json</code> called <code>_txn_name</code> is used as a temporary column to facilitate the foreign key definition.</p> JavaScalaYAML <pre><code>var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").sql(\"_txn_name\"), #nested field will get value from '_txn_name'\n...\n),\nfield().name(\"_txn_name\").omit(true)       #value will not be included in output\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(jsonTask, List.of(\"account_id\", \"_txn_name\")))\n);\n</code></pre> <pre><code>val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nvar jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").sql(\"_txn_name\"), #nested field will get value from '_txn_name'\n...\n), field.name(\"_txn_name\").omit(true)       #value will not be included in output\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(jsonTask -&gt; List(\"account_id\", \"_txn_name\"))\n)\n</code></pre> <pre><code>---\n#postgres task yaml\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n---\n#json 
task yaml\nname: \"json_data\"\nsteps:\n- name: \"transactions\"\ntype: \"json\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"_txn_name\"\ngenerator:\noptions:\nomit: true\n- name: \"customer_details\"\nschema:\nfields:\nname: \"name\"\ngenerator:\ntype: \"sql\"\noptions:\nsql: \"_txn_name\"\n\n---\n#plan yaml\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n- name: \"json_data\"\ndataSourceName: \"my_json\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_json.transactions.account_id,_txn_name\"\n</code></pre>"},{"location":"setup/validation/","title":"Validations","text":"<p>Validations can be used to run data checks after you have run the data generator or even as a standalone task. A report summarising the success or failure of the validations is produced and can be examined for further investigation.</p> <ul> <li>Basic - Basic column level validations</li> <li>Group by/Aggregate - Run aggregates over grouped data, then validate</li> <li>Upstream data source - Ensure record values exist in datasets based on other data sources or data generated</li> <li>[Data Profile (Coming soon)] - Score how close the data profile of generated data is against the target data profile</li> </ul>"},{"location":"setup/validation/#define-validations","title":"Define Validations","text":"<p>A full validation example can be found below. 
For more details, check out each of the subsections defined further below.</p> JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().col(\"amount\").lessThan(100),\nvalidation().col(\"year\").isEqual(2021).errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation().col(\"name\").matches(\"Peter .*\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n)\n.validationWait(waitCondition().pause(1));\n\nvar conf = configuration().enableValidation(true);\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validations(\nvalidation.col(\"amount\").lessThan(100),\nvalidation.col(\"year\").isEqual(2021).errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation.col(\"name\").matches(\"Peter .*\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n)  .validationWait(waitCondition.pause(1))\n\nval conf = configuration.enableValidation(true)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount &lt; 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1   #equivalent to if error percentage is &gt; 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200   #equivalent to if number of errors is &gt; 200, then fail\ndescription: \"Should be lots of Peters\"\nwaitCondition:\npauseInSeconds: 1\n</code></pre>"},{"location":"setup/validation/#wait-condition","title":"Wait Condition","text":"<p>Once data has been generated, you may want to wait for a certain condition to be met before starting the data validations. 
This can be via:</p> <ul> <li>Pause for seconds</li> <li>When file is available</li> <li>Data exists</li> <li>Webhook</li> </ul>"},{"location":"setup/validation/#pause","title":"Pause","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().pause(1));\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.pause(1))\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npauseInSeconds: 1\n</code></pre>"},{"location":"setup/validation/#data-exists","title":"Data exists","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWaitDataExists(\"updated_date &gt; DATE('2023-01-01')\");\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWaitDataExists(\"updated_date &gt; DATE('2023-01-01')\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"transactions\"\noptions:\npath: \"/tmp/csv\"\nexpr: \"updated_date &gt; DATE('2023-01-01')\"\n</code></pre>"},{"location":"setup/validation/#webhook","title":"Webhook","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\")); //by default, GET request successful when 200 status code\n\n//or\n\nvar csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202));  //successful if 200 or 202 status code\n\n//or\n\nvar csvTxnsWithExistingHttpConnection = csv(\"transactions\", \"/tmp/csv\", 
Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"my_http\", \"http://localhost:8080/finished\"));  //use connection configuration from existing 'my_http' connection definition\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\"))  //by default, GET request successful when 200 status code\n\n//or\n\nval csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202)) //successful if 200 or 202 status code\n\n//or\n\nval csvTxnsWithExistingHttpConnection = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.webhook(\"my_http\", \"http://localhost:8080/finished\")) //use connection configuration from existing 'my_http' connection definition\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\" #by default, GET request successful when 200 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\"\nmethod: \"GET\"\nstatusCodes: [200, 202] #successful if 200 or 202 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"my_http\" #use connection configuration from existing 'my_http' connection definition\nurl: \"http://localhost:8080/finished\"\n</code></pre>"},{"location":"setup/validation/#file-exists","title":"File exists","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().file(\"/tmp/json\"));\n</code></pre> <pre><code>val csvTxns = 
csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.file(\"/tmp/json\"))\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npath: \"/tmp/json\"\n</code></pre>"},{"location":"setup/validation/#report","title":"Report","text":"<p>Once run, it will produce a report like this.</p>"},{"location":"setup/generator/count/","title":"Record Count","text":"<p>Several options control the number of records generated, which helps in producing the scenarios or data you require.</p>"},{"location":"setup/generator/count/#record-count_1","title":"Record Count","text":"<p>Record count is the simplest option: you define the total number of records you require for that particular step. For example, the step below will generate 1000 records for the CSV file.</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\n</code></pre>"},{"location":"setup/generator/count/#generated-count","title":"Generated Count","text":"<p>Like most things in Data Caterer, the count can be generated based on some metadata. 
For example, if I wanted to generate between 1000 and 2000 records, I could define that by the below configuration:</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator().min(1000).max(2000));\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator.min(1000).max(2000))\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\ngenerator:\ntype: \"random\"\noptions:\nmin: 1000\nmax: 2000\n</code></pre>"},{"location":"setup/generator/count/#per-column-count","title":"Per Column Count","text":"<p>When defining a per column count, this allows you to generate records \"per set of columns\". This means that for a given set of columns, it will generate a particular amount of records per combination of values for those columns.  </p> <p>One example of this would be when generating transactions relating to a customer, a customer may be defined by columns <code>account_id, name</code>. A number of transactions would be generated per <code>account_id, name</code>.  </p> <p>You can also use a combination of the above two methods to generate the number of records per column.</p>"},{"location":"setup/generator/count/#records","title":"Records","text":"<p>When defining a base number of records within the <code>perColumn</code> configuration, it translates to creating <code>(count.records * count.recordsPerColumn)</code> records. This is a fixed number of records that will be generated each time, with no variation between runs.</p> <p>In the example below, we have <code>count.records = 1000</code> and <code>count.recordsPerColumn = 2</code>. 
Which means that <code>1000 * 2 = 2000</code> records will be generated in total.</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\nrecords: 2\ncolumnNames:\n- \"account_id\"\n- \"name\"\n</code></pre>"},{"location":"setup/generator/count/#generated","title":"Generated","text":"<p>You can also define a generator for the count per column. This can be used in scenarios where you want a variable number of records per set of columns.</p> <p>In the example below, it will generate between <code>(count.records * count.perColumnGenerator.generator.min) = (1000 * 1) = 1000</code> and <code>(count.records * count.perColumnGenerator.generator.max) = (1000 * 2) = 2000</code> records.</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumnGenerator(generator().min(1).max(2), \"account_id\", \"name\")\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumnGenerator(generator.min(1).max(2), \"account_id\", \"name\")\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\ncolumnNames:\n- \"account_id\"\n- \"name\"\ngenerator:\ntype: \"random\"\noptions:\nmin: 1\nmax: 
2\n</code></pre>"},{"location":"setup/generator/data-generator/","title":"Data Generators","text":""},{"location":"setup/generator/data-generator/#data-types","title":"Data Types","text":"<p>Below is a list of all supported data types for generating data:</p> Data Type Spark Data Type Options Description string StringType <code>minLen, maxLen, expression, enableNull</code> integer IntegerType <code>min, max, stddev, mean</code> long LongType <code>min, max, stddev, mean</code> short ShortType <code>min, max, stddev, mean</code> decimal(precision, scale) DecimalType(precision, scale) <code>min, max, stddev, mean</code> double DoubleType <code>min, max, stddev, mean</code> float FloatType <code>min, max, stddev, mean</code> date DateType <code>min, max, enableNull</code> timestamp TimestampType <code>min, max, enableNull</code> boolean BooleanType binary BinaryType <code>minLen, maxLen, enableNull</code> byte ByteType array ArrayType <code>arrayMinLen, arrayMaxLen, arrayType</code> _ StructType Implicitly supported when a schema is defined for a field"},{"location":"setup/generator/data-generator/#options","title":"Options","text":""},{"location":"setup/generator/data-generator/#all-data-types","title":"All data types","text":"<p>Some options are available to use for all types of data generators. Below is the list along with example and descriptions:</p> Option Default Example Description <code>enableEdgeCase</code> false <code>enableEdgeCase: \"true\"</code> Enable/disable generated data to contain edge cases based on the data type. For example, integer data type has edge cases of (Int.MaxValue, Int.MinValue and 0) <code>edgeCaseProbability</code> 0.0 <code>edgeCaseProb: \"0.1\"</code> Probability of generating a random edge case value if <code>enableEdgeCase</code> is true <code>isUnique</code> false <code>isUnique: \"true\"</code> Enable/disable generated data to be unique for that column. 
Errors will be thrown when it is unable to generate unique data <code>seed</code> <code>seed: \"1\"</code> Defines the random seed for generating data for that particular column. It will override any seed defined at a global level <code>sql</code> <code>sql: \"CASE WHEN amount &lt; 10 THEN true ELSE false END\"</code> Define any SQL statement for generating that columns value. Computation occurs after all non-SQL fields are generated. This means any columns used in the SQL cannot be based on other SQL generated columns. Data type of generated value from SQL needs to match data type defined for the field"},{"location":"setup/generator/data-generator/#string","title":"String","text":"Option Default Example Description <code>minLen</code> 1 <code>minLen: \"2\"</code> Ensures that all generated strings have at least length <code>minLen</code> <code>maxLen</code> 10 <code>maxLen: \"15\"</code> Ensures that all generated strings have at most length <code>maxLen</code> <code>expression</code> <code>expression: \"#{Name.name}\"</code><code>expression:\"#{Address.city}/#{Demographic.maritalStatus}\"</code> Will generate a string based on the faker expression provided. 
All possible faker expressions can be found here Expression has to be in format <code>#{&lt;faker expression name&gt;}</code> <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", \"\u0130yi g\u00fcnler\", \"\u0421\u043f\u0430\u0441\u0438\u0431\u043e\", \"\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1\", \"\u0635\u0628\u0627\u062d \u0627\u0644\u062e\u064a\u0631\", \" F\u00f6rl\u00e5t\", \"\u4f60\u597d\u5417\", \"Nh\u00e0 v\u1ec7 sinh \u1edf \u0111\u00e2u\", \"\u3053\u3093\u306b\u3061\u306f\", \"\u0928\u092e\u0938\u094d\u0924\u0947\", \"\u0532\u0561\u0580\u0565\u0582\", \"\u0417\u0434\u0440\u0430\u0432\u0435\u0439\u0442\u0435\")</p>"},{"location":"setup/generator/data-generator/#sample","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield()\n.name(\"name\")\n.type(StringType.instance())\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield\n.name(\"name\")\n.`type`(StringType)\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\nschema:\nfields:\n- name: \"name\"\ntype: \"string\"\ngenerator:\noptions:\nexpression: \"#{Name.name}\"\nenableNull: true\nnullProb: 0.1\nminLength: 4\nmaxLength: 20\n</code></pre>"},{"location":"setup/generator/data-generator/#numeric","title":"Numeric","text":"<p>For all the numeric data types, there are 4 
options to choose from: min, minValue, max and maxValue. Generally speaking, you only need to define one of min or minValue, and similarly one of max or maxValue. There are two options for each because, when metadata is automatically gathered, the statistics of the observed min and max values are collected. It will also attempt to gather any restriction on the min or max value defined by the data source (e.g. max value as per database type).</p>"},{"location":"setup/generator/data-generator/#integerlongshort","title":"Integer/Long/Short","text":"Option Default Example Description <code>min</code> 0 <code>min: \"2\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> 1000 <code>max: \"25\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>stddev</code> 1.0 <code>stddev: \"2.0\"</code> Standard deviation for normally distributed data <code>mean</code> <code>max - min</code> <code>mean: \"5.0\"</code> Mean for normally distributed data <p>Edge cases Integer: (2147483647, -2147483648, 0) Edge cases Long: (9223372036854775807, -9223372036854775808, 0) Edge cases Short: (32767, -32768, 0)</p>"},{"location":"setup/generator/data-generator/#sample_1","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"year\").type(IntegerType.instance()).min(2020).max(2023),\nfield().name(\"customer_id\").type(LongType.instance()),\nfield().name(\"customer_group\").type(ShortType.instance())\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"year\").`type`(IntegerType).min(2020).max(2023),\nfield.name(\"customer_id\").`type`(LongType),\nfield.name(\"customer_group\").`type`(ShortType)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"year\"\ntype: 
\"integer\"\ngenerator:\noptions:\nmin: 2020\nmax: 2023\n- name: \"customer_id\"\ntype: \"long\"\n- name: \"customer_group\"\ntype: \"short\"\n</code></pre>"},{"location":"setup/generator/data-generator/#decimal","title":"Decimal","text":"Option Default Example Description <code>min</code> 0 <code>min: \"2\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> 1000 <code>max: \"25\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>stddev</code> 1.0 <code>stddev: \"2.0\"</code> Standard deviation for normal distributed data <code>mean</code> <code>max - min</code> <code>mean: \"5.0\"</code> Mean for normal distributed data <code>numericPrecision</code> 10 <code>precision: \"25\"</code> The maximum number of digits <code>numericScale</code> 0 <code>scale: \"25\"</code> The number of digits on the right side of the decimal point (has to be less than or equal to precision) <p>Edge cases Decimal: (9223372036854775807, -9223372036854775808, 0)</p>"},{"location":"setup/generator/data-generator/#sample_2","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"balance\").type(DecimalType.instance()).numericPrecision(10).numericScale(5)\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"balance\").`type`(DecimalType).numericPrecision(10).numericScale(5)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"balance\"\ntype: \"decimal\"\ngenerator:\noptions:\nprecision: 10\nscale: 5\n</code></pre>"},{"location":"setup/generator/data-generator/#doublefloat","title":"Double/Float","text":"Option Default Example Description <code>min</code> 0.0 <code>min: \"2.1\"</code> Ensures that all generated values are greater than or equal to <code>min</code> 
<code>max</code> 1000.0 <code>max: \"25.9\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>stddev</code> 1.0 <code>stddev: \"2.0\"</code> Standard deviation for normal distributed data <code>mean</code> <code>max - min</code> <code>mean: \"5.0\"</code> Mean for normal distributed data <p>Edge cases Double: (+infinity, 1.7976931348623157e+308, 4.9e-324, 0.0, -0.0, -1.7976931348623157e+308, -infinity, NaN) Edge cases Float: (+infinity, 3.4028235e+38, 1.4e-45, 0.0, -0.0, -3.4028235e+38, -infinity, NaN)</p>"},{"location":"setup/generator/data-generator/#sample_3","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"amount\").type(DoubleType.instance())\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"amount\").`type`(DoubleType)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"amount\"\ntype: \"double\"\n</code></pre>"},{"location":"setup/generator/data-generator/#date","title":"Date","text":"Option Default Example Description <code>min</code> now() - 365 days <code>min: \"2023-01-31\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> now() <code>max: \"2023-12-31\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (0001-01-01, 1582-10-15, 1970-01-01, 9999-12-31) (reference)</p>"},{"location":"setup/generator/data-generator/#sample_4","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", 
\"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_date\").type(DateType.instance()).min(java.sql.Date.valueOf(\"2020-01-01\"))\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_date\").`type`(DateType).min(java.sql.Date.valueOf(\"2020-01-01\"))\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"created_date\"\ntype: \"date\"\ngenerator:\noptions:\nmin: \"2020-01-01\"\n</code></pre>"},{"location":"setup/generator/data-generator/#timestamp","title":"Timestamp","text":"Option Default Example Description <code>min</code> now() - 365 days <code>min: \"2023-01-31 23:10:10\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> now() <code>max: \"2023-12-31 23:10:10\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (0001-01-01 00:00:00, 1582-10-15 23:59:59, 1970-01-01 00:00:00, 9999-12-31 23:59:59)</p>"},{"location":"setup/generator/data-generator/#sample_5","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_time\").type(TimestampType.instance()).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_time\").`type`(TimestampType).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: 
\"created_time\"\ntype: \"timestamp\"\ngenerator:\noptions:\nmin: \"2020-01-01 00:00:00\"\n</code></pre>"},{"location":"setup/generator/data-generator/#binary","title":"Binary","text":"Option Default Example Description <code>minLen</code> 1 <code>minLen: \"2\"</code> Ensures that all generated array of bytes have at least length <code>minLen</code> <code>maxLen</code> 20 <code>maxLen: \"15\"</code> Ensures that all generated array of bytes have at most length <code>maxLen</code> <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", -128, 127)</p>"},{"location":"setup/generator/data-generator/#sample_6","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"payload\").type(BinaryType.instance())\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"payload\").`type`(BinaryType)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"payload\"\ntype: \"binary\"\n</code></pre>"},{"location":"setup/generator/data-generator/#array","title":"Array","text":"Option Default Example Description <code>arrayMinLen</code> 0 <code>arrayMinLen: \"2\"</code> Ensures that all generated arrays have at least length <code>arrayMinLen</code> <code>arrayMaxLen</code> 5 <code>arrayMaxLen: \"15\"</code> Ensures that all generated arrays have at most length <code>arrayMaxLen</code> <code>arrayType</code> <code>arrayType: \"double\"</code> Inner data type of the array. Optional when using Java/Scala API. 
Allows nested data types, such as struct, to be defined <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true"},{"location":"setup/generator/data-generator/#sample_7","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"last_5_amounts\").type(ArrayType.instance()).arrayType(\"double\")\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"last_5_amounts\").`type`(ArrayType).arrayType(\"double\")\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"last_5_amounts\"\ntype: \"array&lt;double&gt;\"\n</code></pre>"},{"location":"setup/generator/report/","title":"Report","text":"<p>Data Caterer can be configured to produce a report of the data generated to help users understand what was run, how much data was generated, where it was generated, validation results and any associated metadata.</p>"},{"location":"setup/generator/report/#sample","title":"Sample","text":"<p>Once run, it will produce a report like this.</p>"},{"location":"setup/guide/","title":"Guides","text":"<p>Below is a list of guides you can follow to create your data generation for your use case.</p> <p>For any of the paid tier guides, you can use the trial version of the app to try it out. 
Details on how to get the trial can be found here.</p>"},{"location":"setup/guide/#scenarios","title":"Scenarios","text":"<ul> <li>First Data Generation - If you are new, this is the place to start</li> <li>Multiple Records Per Column Value - How you can generate multiple records per set of columns</li> <li>Foreign Keys Across Data Sources - Generate matching values across generated data sets</li> <li>Data Validations - Run data validations after generating data</li> <li>Auto Generate From Data Connection - Automatically generate data by just defining data sources</li> <li>Delete Generated Data - Delete the generated data whilst leaving other data</li> <li>Generate Batch and Event Data - Generate matching batch and event data</li> </ul>"},{"location":"setup/guide/#data-sources","title":"Data Sources","text":"<ul> <li>Files (CSV, JSON, ORC, Parquet) - Generate data for popular file formats</li> <li>Postgres - JDBC Postgres tables</li> <li>Cassandra - Cassandra tables</li> <li>Kafka - Kafka topics</li> <li>Solace - Solace messages</li> <li>Marquez - Generate data based on metadata in Marquez</li> <li>OpenMetadata - Generate data based on metadata in OpenMetadata</li> <li>HTTP - HTTP requests</li> <li>Files (Fixed width) - (Soon to document) A variant of CSV but with no separator</li> <li>MySql - (Soon to document) JDBC MySql tables</li> </ul>"},{"location":"setup/guide/#yaml-files","title":"YAML Files","text":""},{"location":"setup/guide/#base-concept","title":"Base Concept","text":"<p>The execution of the data generator is based on the concept of plans and tasks. A plan represents the set of tasks that need to be executed, along with other information that spans across tasks, such as foreign keys between data sources. A task represents the component(s) of a data source and its associated metadata so that it understands what the data should look like and how many steps (sub data sources) there are (e.g. tables in a database, topics in Kafka). 
Tasks can define one or more steps.</p>"},{"location":"setup/guide/#plan","title":"Plan","text":""},{"location":"setup/guide/#foreign-keys","title":"Foreign Keys","text":"<p>Define foreign keys across data sources in your plan to ensure generated data can match Link to associated task 1 Link to associated task 2</p>"},{"location":"setup/guide/#task","title":"Task","text":"Data Source Type Data Source Sample Task Notes Database Postgres Sample Database MySQL Sample Database Cassandra Sample File CSV Sample File JSON Sample Contains nested schemas and use of SQL for generated values File Parquet Sample Partition by year column Kafka Kafka Sample Specific base schema to be used, define headers, key, value, etc. JMS Solace Sample JSON formatted message HTTP PUT Sample JSON formatted PUT body"},{"location":"setup/guide/#configuration","title":"Configuration","text":"<p>Basic configuration</p>"},{"location":"setup/guide/#docker-compose","title":"Docker-compose","text":"<p>To see how it runs against different data sources, you can run using <code>docker-compose</code> and set <code>DATA_SOURCE</code> like below</p> <pre><code>./gradlew build\ncd docker\nDATA_SOURCE=postgres docker-compose up -d datacaterer\n</code></pre> <p>Can set it to one of the following:</p> <ul> <li>postgres</li> <li>mysql</li> <li>cassandra</li> <li>solace</li> <li>kafka</li> <li>http</li> </ul>"},{"location":"setup/guide/data-source/cassandra/","title":"Cassandra","text":"<p>Info</p> <p>Writing data to Cassandra is a paid feature. Try the free trial here.</p> <p>Creating a data generator for Cassandra. 
You will build a Docker image that will be able to populate data in Cassandra for the tables you configure.</p>"},{"location":"setup/guide/data-source/cassandra/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> <li>Cassandra</li> </ul>"},{"location":"setup/guide/data-source/cassandra/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre> <p>If you already have a Cassandra instance running, you can skip to this step.</p>"},{"location":"setup/guide/data-source/cassandra/#cassandra-setup","title":"Cassandra Setup","text":"<p>Next, let's make sure you have an instance of Cassandra up and running in your local environment. This will make it easy for us to iterate and check our changes.</p> <pre><code>cd docker\ndocker-compose up -d cassandra\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#permissions","title":"Permissions","text":"<p>Let's make a new user that has the required permissions needed to push data into the Cassandra tables we want.</p> CQL Permission Statements <pre><code>GRANT INSERT ON &lt;schema&gt;.&lt;table&gt; TO data_caterer_user;\n</code></pre> <p>Following permissions are required when enabling <code>configuration.enableGeneratePlanAndTasks(true)</code> as it will gather metadata information about tables and columns from the below tables.</p> CQL Permission Statements <pre><code>GRANT SELECT ON system_schema.tables TO data_caterer_user;\nGRANT SELECT ON system_schema.columns TO data_caterer_user;\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedCassandraJavaPlan.java</code></li> <li>Scala: 
<code>src/main/scala/com/github/pflooky/plan/MyAdvancedCassandraPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedCassandraJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedCassandraPlan extends PlanRun {\n}\n</code></pre> <p>This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/data-source/cassandra/#connection-configuration","title":"Connection Configuration","text":"<p>Within our class, we can start by defining the connection properties to connect to Cassandra.</p> JavaScala <pre><code>var accountTask = cassandra(\n\"customer_cassandra\",   //name\n\"localhost:9042\",       //url\n\"cassandra\",            //username\n\"cassandra\",            //password\nMap.of()                //optional additional connection options\n)\n</code></pre> <p>Additional options such as SSL configuration, etc can be found here.</p> <pre><code>val accountTask = cassandra(\n\"customer_cassandra\",   //name\n\"localhost:9042\",       //url\n\"cassandra\",            //username\n\"cassandra\",            //password\nMap()                   //optional additional connection options\n)\n</code></pre> <p>Additional options such as SSL configuration, etc can be found here.</p>"},{"location":"setup/guide/data-source/cassandra/#schema","title":"Schema","text":"<p>Let's create a task for inserting data into the <code>account.accounts</code> and <code>account.account_status_history</code> tables as defined under<code>docker/data/cql/customer.cql</code>. This table should already be setup for you if you followed this step. 
We can check if the table is setup already via the following command:</p> <pre><code>docker exec host.docker.internal cqlsh -e 'describe account.accounts; describe account.account_status_history;'\n</code></pre> <p>Here we should see some output that looks like the below. This tells us what schema we need to follow when generating data. We need to define that alongside any metadata that is useful to add constraints on what are possible values the generated data should contain.</p> <pre><code>CREATE TABLE account.accounts (\naccount_id text PRIMARY KEY,\n    amount double,\n    created_by text,\n    name text,\n    open_time timestamp,\n    status text\n)...\n\nCREATE TABLE account.account_status_history (\naccount_id text,\n    eod_date date,\n    status text,\n    updated_by text,\n    updated_time timestamp,\n    PRIMARY KEY (account_id, eod_date)\n)...\n</code></pre> <p>Trimming the connection details to work with the docker-compose Cassandra, we have a base Cassandra connection to define the table and schema required. Let's define each field along with their corresponding data type. You will notice that the <code>text</code> fields do not have a data type defined. 
This is because the default data type is <code>StringType</code> which corresponds to <code>text</code> in Cassandra.</p> JavaScala <pre><code>{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n}\n</code></pre> <pre><code>val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#field-metadata","title":"Field Metadata","text":"<p>We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata that add guidelines that the data generator will understand when generating data.</p>"},{"location":"setup/guide/data-source/cassandra/#account_id","title":"account_id","text":"<p><code>account_id</code> follows a particular pattern that where it starts with <code>ACC</code> and has 8 digits after it. This can be defined via a regex like below. 
Alongside, we also mark it as the primary key to ensure that unique values are generated.</p> JavaScala <pre><code>field().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n</code></pre> <pre><code>field.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#amount","title":"amount","text":"<p><code>amount</code> shouldn't be too large, so we can define a min and max so that the generated numbers are between <code>1</code> and <code>1000</code>.</p> JavaScala <pre><code>field().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\n</code></pre> <pre><code>field.name(\"amount\").`type`(DoubleType).min(1).max(1000),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#name","title":"name","text":"<p><code>name</code> is a string that also follows a certain pattern, so we could also define a regex, but here we will leverage the DataFaker library and create an <code>expression</code> to generate realistic-looking names. All possible faker expressions can be found here.</p> JavaScala <pre><code>field().name(\"name\").expression(\"#{Name.name}\"),\n</code></pre> <pre><code>field.name(\"name\").expression(\"#{Name.name}\"),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#open_time","title":"open_time","text":"<p><code>open_time</code> is a timestamp that we want to have a value greater than a specific date. 
We can define a min date by using <code>java.sql.Date</code> like below.</p> JavaScala <pre><code>field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre> <pre><code>field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#status","title":"status","text":"<p><code>status</code> is a field that can only take one of four values: <code>open, closed, suspended or pending</code>.</p> JavaScala <pre><code>field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre> <pre><code>field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#created_by","title":"created_by","text":"<p><code>created_by</code> is a field that is based on the <code>status</code> field where it follows the logic: <code>if status is open or closed, then it is created_by eod else created_by event</code>. 
This can be achieved by defining a SQL expression like below.</p> JavaScala <pre><code>field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <pre><code>field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <p>Putting all the fields together, our class should now look like this.</p> JavaScala <pre><code>var accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n</code></pre> <pre><code>val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. 
We will also enable the unique check to ensure any unique fields will have unique values generated.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>accountTask</code>, we have to call <code>execute</code>. Our full plan run will look like this.</p> JavaScala <pre><code>public class MyAdvancedCassandraJavaPlan extends PlanRun {\n{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n</code></pre> <pre><code>class MyAdvancedCassandraPlan extends PlanRun {\nval accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' 
END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#run","title":"Run","text":"<p>Now we can run via the script <code>./run.sh</code> that is in the top level directory of the <code>data-caterer-example</code> to run the class we just created.</p> <pre><code>./run.sh\n#input class MyAdvancedCassandraJavaPlan or MyAdvancedCassandraPlan\n#after completing\ndocker exec docker-cassandraserver-1 cqlsh -e 'select count(1) from account.accounts;select * from account.accounts limit 10;'\n</code></pre> <p>Your output should look like this.</p> <pre><code> count\n-------\n  1000\n\n(1 rows)\n\nWarnings :\nAggregation query used without partition key\n\n\n account_id  | amount    | created_by         | name                   | open_time                       | status\n-------------+-----------+--------------------+------------------------+---------------------------------+-----------\n ACC13554145 | 917.00418 | zb CVvbBTTzitjo5fK |          Jan Sanford I | 2023-06-21 21:50:10.463000+0000 | suspended\n ACC19154140 |  46.99177 |             VH88H9 |       Clyde Bailey PhD | 2023-07-18 11:33:03.675000+0000 |      open\n ACC50587836 |  774.9872 |         GENANwPm t |           Sang Monahan | 2023-03-21 00:16:53.308000+0000 |    closed\n ACC67619387 | 452.86706 |       5msTpcBLStTH |         Jewell Gerlach | 2022-10-18 19:13:07.606000+0000 | suspended\n ACC69889784 |  14.69298 |           WDmOh7NT |          Dale Schulist | 2022-10-25 12:10:52.239000+0000 | suspended\n ACC41977254 |  51.26492 |          J8jAKzvj2 |           Norma Nienow | 2023-08-19 18:54:39.195000+0000 | suspended\n 
ACC40932912 | 349.68067 |   SLcJgKZdLp5ALMyg | Vincenzo Considine III | 2023-05-16 00:22:45.991000+0000 |    closed\n ACC20642011 | 658.40713 |          clyZRD4fI |  Lannie McLaughlin DDS | 2023-05-11 23:14:30.249000+0000 |      open\n ACC74962085 | 970.98218 |       ZLETTSnj4NpD |          Ima Jerde DVM | 2023-05-07 10:01:56.218000+0000 |   pending\n ACC72848439 | 481.64267 |                 cc |        Kyla Deckow DDS | 2023-08-16 13:28:23.362000+0000 | suspended\n\n(10 rows)\n</code></pre> <p>Also check the HTML report, found at <code>docker/sample/report/index.html</code>, that gets generated to get an overview of what was executed.</p> <p></p>"},{"location":"setup/guide/data-source/http/","title":"HTTP Source","text":"<p>Info</p> <p>Generating data based on OpenAPI/Swagger document and pushing to HTTP endpoint is a paid feature. Try the free trial here.</p> <p>Creating a data generator based on an OpenAPI/Swagger document.</p> <p></p>"},{"location":"setup/guide/data-source/http/#requirements","title":"Requirements","text":"<ul> <li>10 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/data-source/http/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/data-source/http/#http-setup","title":"HTTP Setup","text":"<p>We will be using the http-bin docker image to help simulate a service with HTTP endpoints.</p> <p>Start it via:</p> <pre><code>cd docker\ndocker-compose up -d http\ndocker ps\n</code></pre>"},{"location":"setup/guide/data-source/http/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedHttpJavaPlanRun.java</code></li> <li>Scala: 
<code>src/main/scala/com/github/pflooky/plan/MyAdvancedHttpPlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedHttpJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedHttpPlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n</code></pre> <p>We enable <code>enableGeneratePlanAndTasks</code> so that metadata can be read from external sources, and we save the reports under a folder we can easily access.</p>"},{"location":"setup/guide/data-source/http/#schema","title":"Schema","text":"<p>We can point the schema of a data source to an OpenAPI/Swagger document or URL. For this example, we will use the OpenAPI document found under <code>docker/mount/http/petstore.json</code> in the data-caterer-example repo. This is a simplified version of the original OpenAPI spec that can be found here.</p> <p>We have kept the following endpoints to test out:</p> <ul> <li>GET /pets - get all pets</li> <li>POST /pets - create a new pet</li> <li>GET /pets/{id} - get a pet by id</li> <li>DELETE /pets/{id} - delete a pet by id</li> </ul> JavaScala <pre><code>var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count().records(2));\n</code></pre> <pre><code>val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count.records(2))\n</code></pre> <p>The above specifies that the schema comes from the OpenAPI document at the given path. 
It will then generate 2 requests per request method and endpoint combination.</p>"},{"location":"setup/guide/data-source/http/#run","title":"Run","text":"<p>Let's run it and see what happens.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\n#after completing\ndocker logs -f docker-http-1\n</code></pre> <p>It should look something like this.</p> <pre><code>172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DeXQxFUHVja+EYm%26limit%3D33895 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DSXaFvAqwYGF%26tags%3DjdNRFONA%26limit%3D40975 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/kbH8D7rDuq HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/REsa0tnu7dvekGDvxR HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/EqrOr1dHFfKUjWb HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/7WG7JHPaNxP HTTP/1.1 200 Host: host.docker.internal}\n</code></pre> <p>Looks like we have some data now. But we can do better and add some enhancements to it.</p>"},{"location":"setup/guide/data-source/http/#foreign-keys","title":"Foreign keys","text":"<p>The four different requests that get sent could have the same <code>id</code> passed across to each of them if we define a foreign key relationship. This makes the flow more realistic, as pets get created and queried by a particular <code>id</code> value. Note that the <code>id</code> value is first used when a pet is created in the body of the POST request. 
Then it gets used as a path parameter in the DELETE and GET requests.</p> <p>To link them all together, we must follow a particular pattern when referring to request body, query parameter or path parameter columns. Each HTTP type has its own column prefix:</p> <ul> <li>Request Body: prefix <code>bodyContent</code>, for example <code>bodyContent.id</code></li> <li>Path Parameter: prefix <code>pathParam</code>, for example <code>pathParamid</code></li> <li>Query Parameter: prefix <code>queryParam</code>, for example <code>queryParamid</code></li> <li>Header: prefix <code>header</code>, for example <code>headerContent_Type</code></li> </ul> <p>Also note that when creating a foreign field definition for an HTTP data source, to refer to a specific endpoint and method, we have to follow the pattern of <code>{http method}{http path}</code>. For example, <code>POST/pets</code>. Let's apply this knowledge to link all the <code>id</code> values together.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"),     //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n);\n\nexecute(myPlan, conf, httpTask);\n</code></pre> <pre><code>val myPlan = plan.addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"),     //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n)\n\nexecute(myPlan, conf, httpTask)\n</code></pre> <p>Let's test it out by running it again.</p> <pre><code>./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n</code></pre> <pre><code>172.21.0.1 [06/Nov/2023:01:33:59 +0000] GET /anything/pets?limit%3D45971 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:00 +0000] GET /anything/pets?limit%3D62015 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:04 +0000] POST /anything/pets HTTP/1.1 200 
Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:05 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n</code></pre> <p>Now we have the same <code>id</code> values being produced across the POST, DELETE and GET requests! What if we knew that the <code>id</code> values should follow a particular pattern?</p>"},{"location":"setup/guide/data-source/http/#custom-metadata","title":"Custom metadata","text":"<p>Since the foreign key values originate from the POST request, we can update the metadata of the <code>id</code> column for the POST request and the change will propagate to the other endpoints as well. 
Given the <code>id</code> column is a nested column as noted in the foreign key, we can alter its metadata via the following:</p> JavaScala <pre><code>var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field().name(\"bodyContent\").schema(field().name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count().records(2));\n</code></pre> <pre><code>val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field.name(\"bodyContent\").schema(field.name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count.records(2))\n</code></pre> <p>We first get the column <code>bodyContent</code>, then get its nested schema and the column <code>id</code>, and add metadata stating that <code>id</code> should follow the pattern <code>ID[0-9]{8}</code>.</p> <p>Let's run it again. We should now see some proper ID values.</p> <pre><code>./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n</code></pre> <pre><code>172.21.0.1 [06/Nov/2023:01:45:45 +0000] GET /anything/pets?tags%3D10fWnNoDz%26limit%3D66804 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:46 +0000] GET /anything/pets?tags%3DhyO6mI8LZUUpS HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:50 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:51 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n</code></pre> 
<p>Great! Now we have replicated a production-like flow of HTTP requests.</p>"},{"location":"setup/guide/data-source/http/#ordering","title":"Ordering","text":"<p>If you wanted to change the ordering of the requests, you can alter the order from within the OpenAPI/Swagger document. This is particularly useful when you want to simulate the same flow that users would take when utilising your application (i.e. create account, query account, update account).</p>"},{"location":"setup/guide/data-source/http/#rows-per-second","title":"Rows per second","text":"<p>By default, Data Caterer will push requests per method and endpoint at a rate of around 5 requests per second. If you want to alter this value, you can do so via the below configuration. The lowest supported requests per second is 1.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.api.model.Constants;\n\n...\nvar httpTask = http(\"my_http\", Map.of(Constants.ROWS_PER_SECOND(), \"1\"))\n...\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.model.Constants.ROWS_PER_SECOND\n\n...\nval httpTask = http(\"my_http\", options = Map(ROWS_PER_SECOND -&gt; \"1\"))\n...\n</code></pre> <p>Check out the full example under <code>AdvancedHttpPlanRun</code> in the example repo.</p>"},{"location":"setup/guide/data-source/kafka/","title":"Kafka","text":"<p>Info</p> <p>Writing data to Kafka is a paid feature. Try the free trial here.</p> <p>Creating a data generator for Kafka. 
You will build a Docker image that will be able to populate data in Kafka for the topics you configure.</p>"},{"location":"setup/guide/data-source/kafka/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> <li>Kafka</li> </ul>"},{"location":"setup/guide/data-source/kafka/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre> <p>If you already have a Kafka instance running, you can skip to this step.</p>"},{"location":"setup/guide/data-source/kafka/#kafka-setup","title":"Kafka Setup","text":"<p>Next, let's make sure you have an instance of Kafka up and running in your local environment. This will make it easy for us to iterate and check our changes.</p> <pre><code>cd docker\ndocker-compose up -d kafka\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedKafkaJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedKafkaPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedKafkaJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedKafkaPlan extends PlanRun {\n}\n</code></pre> <p>This class is where we define all of our configurations for generating data. 
There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/data-source/kafka/#connection-configuration","title":"Connection Configuration","text":"<p>Within our class, we can start by defining the connection properties to connect to Kafka.</p> JavaScala <pre><code>var accountTask = kafka(\n\"my_kafka\",       //name\n\"localhost:9092\", //url\nMap.of()          //optional additional connection options\n);\n</code></pre> <p>Additional options can be found here.</p> <pre><code>val accountTask = kafka(\n\"my_kafka\",       //name\n\"localhost:9092\", //url\nMap()             //optional additional connection options\n)\n</code></pre> <p>Additional options can be found here.</p>"},{"location":"setup/guide/data-source/kafka/#schema","title":"Schema","text":"<p>Let's create a task for inserting data into the <code>account-topic</code> that is already defined under <code>docker/data/kafka/setup_kafka.sh</code>. This topic should already be set up for you if you followed this step. We can check if the topic is set up already via the following command:</p> <pre><code>docker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n</code></pre> <p>Trimming the connection details to work with the docker-compose Kafka, we have a base Kafka connection to define the topic we will publish to. Let's define each field along with its corresponding data type. You will notice that the <code>text</code> fields do not have a data type defined. 
This is because the default data type is <code>StringType</code>.</p> JavaScala <pre><code>{\nvar kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield().name(\"key\").sql(\"content.account_id\"),\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()),  can define partition here\nfield().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n),\nfield().name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield().name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n);\n}\n</code></pre> <pre><code>val kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield.name(\"key\").sql(\"content.account_id\"),\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").type(IntegerType),  can define partition here\nfield.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n          |  NAMED_STRUCT('key', 'account-id', 
'value', TO_BINARY(content.account_id, 'utf-8')),\n          |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n          |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\nfield.name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield.name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#fields","title":"Fields","text":"<p>The schema defined for Kafka must follow the format noted above. The only required field is <code>value</code>.</p> <p>The other fields are optional: <code>key</code>, <code>partition</code> and <code>headers</code>.</p>"},{"location":"setup/guide/data-source/kafka/#headers","title":"headers","text":"<p><code>headers</code> must be of type <code>array&lt;struct&lt;key: string,value: binary&gt;&gt;</code>. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that the <code>value</code> part refers to <code>content.account_id</code>, where <code>content</code> is another field defined at the top level of the schema. 
This allows you to reference other values that have already been generated.</p> JavaScala <pre><code>field().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n</code></pre> <pre><code>field.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n      |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n      |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n      |)\"\"\".stripMargin\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#transactions","title":"transactions","text":"<p><code>transactions</code> is an array that contains an inner structure of <code>txn_date</code> and <code>amount</code>. The size of the array generated can be controlled via <code>arrayMinLength</code> and <code>arrayMaxLength</code>.</p> JavaScala <pre><code>field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n</code></pre> <pre><code>field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#details","title":"details","text":"<p><code>details</code> is another example of a nested schema structure where it also has a nested structure itself in <code>updated_by</code>. 
One thing to note here is the <code>first_txn_date</code> field has a reference to the <code>content.transactions</code> array where it will  sort the array by <code>txn_date</code> and get the first element.</p> JavaScala <pre><code>field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n</code></pre> <pre><code>field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. 
We can control the output folder of that report via configurations.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>kafkaTask</code>, we have to call <code>execute</code>, passing in <code>config</code> and <code>kafkaTask</code>.</p>"},{"location":"setup/guide/data-source/kafka/#run","title":"Run","text":"<p>Now we can use the script <code>./run.sh</code>, found in the top level directory of the <code>data-caterer-example</code> repo, to run the class we just created.</p> <pre><code>./run.sh\n#input class MyAdvancedKafkaJavaPlan or MyAdvancedKafkaPlan\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic --from-beginning\n</code></pre> <p>Your output should look like this.</p> <pre><code>{\"account_id\":\"ACC56292178\",\"year\":2022,\"amount\":18338.627721151555,\"details\":{\"name\":\"Isaias Reilly\",\"first_txn_date\":\"2021-01-22\",\"updated_by\":{\"user\":\"FgYXbKDWdhHVc3\",\"time\":\"2022-12-30T13:49:07.309Z\"}},\"transactions\":[{\"txn_date\":\"2021-01-22\",\"amount\":30556.52125487579},{\"txn_date\":\"2021-10-29\",\"amount\":39372.302259554635},{\"txn_date\":\"2021-10-29\",\"amount\":61887.31389495968}]}\n{\"account_id\":\"ACC37729457\",\"year\":2022,\"amount\":96885.31758764731,\"details\":{\"name\":\"Randell 
Witting\",\"first_txn_date\":\"2021-06-30\",\"updated_by\":{\"user\":\"HCKYEBHN8AJ3TB\",\"time\":\"2022-12-02T02:05:01.144Z\"}},\"transactions\":[{\"txn_date\":\"2021-06-30\",\"amount\":98042.09647765031},{\"txn_date\":\"2021-10-06\",\"amount\":41191.43564742036},{\"txn_date\":\"2021-11-16\",\"amount\":78852.08184809204},{\"txn_date\":\"2021-10-09\",\"amount\":13747.157653571106}]}\n{\"account_id\":\"ACC23127317\",\"year\":2023,\"amount\":81164.49304198896,\"details\":{\"name\":\"Jed Wisozk\",\"updated_by\":{\"user\":\"9MBFZZ\",\"time\":\"2023-07-12T05:56:52.397Z\"}},\"transactions\":[]}\n</code></pre> <p>Also check the generated HTML report, found at <code>docker/sample/report/index.html</code>, for an overview of what was executed.</p> <p></p>"},{"location":"setup/guide/data-source/marquez-metadata-source/","title":"Metadata Source","text":"<p>Info</p> <p>Generating data based on an external metadata source is a paid feature. Try the free trial here.</p> <p>Creating a data generator for Postgres tables and a CSV file based on metadata stored in Marquez (which follows the OpenLineage API).</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#requirements","title":"Requirements","text":"<ul> <li>10 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/data-source/marquez-metadata-source/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/data-source/marquez-metadata-source/#marquez-setup","title":"Marquez Setup","text":"<p>You can follow the README found here to help with setting up Marquez in your local environment. 
This comes with an instance of Postgres which we will also be using as a data store for generated data.</p> <p>The command that was run for this example to help with setup of dummy data was <code>./docker/up.sh -a 5001 -m 5002 --seed</code>.</p> <p>Check that the following url shows some data like below once you click on <code>food_delivery</code> from the <code>ns</code> drop down in the top right corner.</p> <p></p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#postgres-setup","title":"Postgres Setup","text":"<p>Since we will also be using the Marquez Postgres instance as a data source, we will set up a separate database to store the generated data in via:</p> <pre><code>docker exec marquez-db psql -Upostgres -c 'CREATE DATABASE food_delivery'\n</code></pre>"},{"location":"setup/guide/data-source/marquez-metadata-source/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedMetadataSourceJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedMetadataSourcePlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n</code></pre> <p>We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily 
access.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#schema","title":"Schema","text":"<p>We can point the schema of a data source to our Marquez instance. For the Postgres data source, we will point to a <code>namespace</code>, which in Marquez or OpenLineage, represents a set of datasets. For the CSV data source, we will point to a specific <code>namespace</code> and <code>dataset</code>.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#single-schema","title":"Single Schema","text":"JavaScala <pre><code>var csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map.of(\"saveMode\", \"overwrite\", \"header\", \"true\"))\n.schema(metadataSource().marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count().records(10));\n</code></pre> <pre><code>val csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map(\"saveMode\" -&gt; \"overwrite\", \"header\" -&gt; \"true\"))\n.schema(metadataSource.marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count.records(10))\n</code></pre> <p>The above defines that the schema will come from Marquez, which is a type of metadata source that contains information about schemas. 
Specifically, it points to the <code>food_delivery</code> namespace and <code>public.delivery_7_days</code> dataset to retrieve the schema information from.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#multiple-schemas","title":"Multiple Schemas","text":"JavaScala <pre><code>var postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\", Map.of())\n.schema(metadataSource().marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count().records(10));\n</code></pre> <pre><code>val postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\")\n.schema(metadataSource.marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count.records(10))\n</code></pre> <p>We have now pointed this Postgres task at the multiple schemas that are defined under the <code>food_delivery</code> namespace. Also note that we are using the <code>food_delivery</code> database in Postgres to push our generated data to, and we have set the number of records per sub data source (in this case, per table) to 10.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#run","title":"Run","text":"<p>Let's run it and see what happens.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\n#after completing\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n</code></pre> <p>It should look something like this.</p> <pre><code> order_id |     order_placed_on     |   order_dispatched_on   |   order_delivered_on    |         customer_email         |                     customer_address                     | menu_id | restaurant_id |                        restaurant_address\n   | menu_item_id | category_id | discount_id | city_id | 
driver_id\n----------+-------------------------+-------------------------+-------------------------+--------------------------------+----------------------------------------------------------+---------+---------------+---------------------------------------------------------------\n---+--------------+-------------+-------------+---------+-----------\n    38736 | 2023-02-05 06:05:23.755 | 2023-09-08 04:29:10.878 | 2023-09-03 23:58:34.285 | april.skiles@hotmail.com       | 5018 Lang Dam, Gaylordfurt, MO 35172                     |   59841 |         30971 | Suite 439 51366 Bartoletti Plains, West Lashawndamouth, CA 242\n42 |        55697 |       36370 |       21574 |   88022 |     16569\n4376 | 2022-12-19 14:39:53.442 | 2023-08-30 07:40:06.948 | 2023-03-15 20:38:26.11  | adelina.balistreri@hotmail.com | Apt. 340 9146 Novella Motorway, East Troyhaven, UT 34773 |   66195 |         42765 | Suite 670 8956 Rob Fork, Rennershire, CA 04524\n|        26516 |       81335 |       87615 |   27433 |     45649\n11083 | 2022-10-30 12:46:38.692 | 2023-06-02 13:05:52.493 | 2022-11-27 18:38:07.873 | johnny.gleason@gmail.com       | Apt. 385 99701 Lemke Place, New Irvin, RI 73305          |   66427 |         44438 | 1309 Danny Cape, Weimanntown, AL 15865\n|        41686 |       36508 |       34498 |   24191 |     92405\n58759 | 2023-07-26 14:32:30.883 | 2022-12-25 11:04:08.561 | 2023-04-21 17:43:05.86  | isabelle.ohara@hotmail.com     | 2225 Evie Lane, South Ardella, SD 90805                  |   27106 |         25287 | Suite 678 3731 Dovie Park, Port Luigi, ID 08250\n|        94205 |       66207 |       81051 |   52553 |     27483\n</code></pre> <p>You can also try query some other tables. 
Let's also check what is in the CSV file.</p> <pre><code>$ head docker/sample/csv/part-0000*\nmenu_item_id,category_id,discount_id,city_id,driver_id,order_id,order_placed_on,order_dispatched_on,order_delivered_on,customer_email,customer_address,menu_id,restaurant_id,restaurant_address\n72248,37098,80135,45888,5036,11090,2023-09-20T05:33:08.036+08:00,2023-05-16T23:10:57.119+08:00,2023-05-01T22:02:23.272+08:00,demetrice.rohan@hotmail.com,\"406 Harmony Rue, Wisozkburgh, MD 12282\",33762,9042,\"Apt. 751 0796 Ellan Flats, Lake Chetville, WI 81957\"\n41644,40029,48565,83373,89919,58359,2023-04-18T06:28:26.194+08:00,2022-10-15T18:17:48.998+08:00,2023-02-06T17:02:04.104+08:00,joannie.okuneva@yahoo.com,\"Suite 889 022 Susan Lane, Zemlakport, OR 56996\",27467,6216,\"Suite 016 286 Derick Grove, Dooleytown, NY 14664\"\n49299,53699,79675,40821,61764,72234,2023-07-16T21:33:48.739+08:00,2023-02-14T21:23:10.265+08:00,2023-09-18T02:08:51.433+08:00,ina.heller@yahoo.com,\"Suite 600 86844 Heller Island, New Celestinestad, DE 42622\",48002,12462,\"5418 Okuneva Mountain, East Blairchester, MN 04060\"\n83197,86141,11085,29944,81164,65382,2023-01-20T06:08:25.981+08:00,2023-01-11T13:24:32.968+08:00,2023-09-09T02:30:16.890+08:00,lakisha.bashirian@yahoo.com,\"Suite 938 534 Theodore Lock, Port Caitlynland, LA 67308\",69109,47727,\"4464 Stewart Tunnel, Marguritemouth, AR 56791\"\n</code></pre> <p>Looks like we have some data now. But we can do better and add some enhancements to it.</p> <p>What if we wanted the same records in Postgres <code>public.delivery_7_days</code> to also show up in the CSV file? That's where we can use a foreign key definition.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#foreign-key","title":"Foreign Key","text":"<p>We can take a look at the report (under <code>docker/sample/report/index.html</code>) to see what we need to do to create the  foreign key. 
From the overview, you should see under <code>Tasks</code> there is a <code>my_postgres</code> task which has <code>food_delivery_public.delivery_7_days</code> as a step. Click on the link for <code>food_delivery_public.delivery_7_days</code> and it will take us to a page where we can find out about the columns used in this table. Click on the <code>Fields</code> button on the far right to see them.</p> <p>We can copy all or a subset of the fields that we want matched across the CSV file and Postgres. For this example, we will take all the fields.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\npostgresTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask);\n</code></pre> <pre><code>val foreignCols = List(\"order_id\", \"order_placed_on\", \"order_dispatched_on\", \"order_delivered_on\", \"customer_email\",\n\"customer_address\", \"menu_id\", \"restaurant_id\", \"restaurant_address\", \"menu_item_id\", \"category_id\", \"discount_id\",\n\"city_id\", \"driver_id\")\n\nval myPlan = plan.addForeignKeyRelationships(\ncsvTask, foreignCols,\nList(foreignField(postgresTask, \"food_delivery_public.delivery_7_days\", foreignCols))\n)\n\nval conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask)\n</code></pre> <p>Notice how we have defined the <code>csvTask</code> and <code>foreignCols</code> as the main foreign key but for <code>postgresTask</code>, we had to define it as a <code>foreignField</code>. This is because <code>postgresTask</code> has multiple tables within it, and we only want to define our foreign key with respect to the <code>public.delivery_7_days</code> table. We use the step name (which can be seen in the report) to specify the table to target.
</p> <p>To test this out, we will truncate the <code>public.delivery_7_days</code> table in Postgres first, and then try run again.</p> <pre><code>docker exec marquez-db psql -Upostgres -d food_delivery -c 'TRUNCATE public.delivery_7_days'\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n</code></pre> <pre><code> order_id |     order_placed_on     |   order_dispatched_on   |   order_delivered_on    |        customer_email        |\ncustomer_address                     | menu_id | restaurant_id |                   restaurant_address                   | menu\n_item_id | category_id | discount_id | city_id | driver_id\n----------+-------------------------+-------------------------+-------------------------+------------------------------+-------------\n--------------------------------------------+---------+---------------+--------------------------------------------------------+-----\n---------+-------------+-------------+---------+-----------\n    53333 | 2022-10-15 08:40:23.394 | 2023-01-23 09:42:48.397 | 2023-08-12 08:50:52.397 | normand.aufderhar@gmail.com  | Apt. 036 449\n27 Wilderman Forge, Marvinchester, CT 15952 |   40412 |         70130 | Suite 146 98176 Schaden Village, Grahammouth, SD 12354 |\n90141 |       44210 |       83966 |   78614 |     77449\n</code></pre> <p>Let's grab the first email from the Postgres table and check whether the same record exists in the CSV file.</p> <pre><code>$ cat docker/sample/csv/part-0000* | grep normand.aufderhar\n90141,44210,83966,78614,77449,53333,2022-10-15T08:40:23.394+08:00,2023-01-23T09:42:48.397+08:00,2023-08-12T08:50:52.397+08:00,normand.aufderhar@gmail.com,\"Apt. 036 44927 Wilderman Forge, Marvinchester, CT 15952\",40412,70130,\"Suite 146 98176 Schaden Village, Grahammouth, SD 12354\"\n</code></pre> <p>Great! 
Now we have the ability to get schema information from an external source, add our own foreign keys and generate  data.</p> <p>Check out the full example under <code>AdvancedMetadataSourcePlanRun</code> in the example repo.</p>"},{"location":"setup/guide/data-source/open-metadata-source/","title":"OpenMetadata Source","text":"<p>Info</p> <p>Generating data based on an external metadata source is a paid feature. Try the free trial here.</p> <p>Creating a data generator for a JSON file based on metadata stored in OpenMetadata.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#requirements","title":"Requirements","text":"<ul> <li>10 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/data-source/open-metadata-source/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/data-source/open-metadata-source/#openmetadata-setup","title":"OpenMetadata Setup","text":"<p>You can follow the local docker setup found here to help with setting up OpenMetadata in your local environment.</p> <p>If that page becomes outdated or the link doesn't work, below are the commands I used to run it:</p> <pre><code>mkdir openmetadata-docker &amp;&amp; cd openmetadata-docker\ncurl -sL https://github.com/open-metadata/OpenMetadata/releases/download/1.2.0-release/docker-compose.yml &gt; docker-compose.yml\ndocker compose -f docker-compose.yml up --detach\n</code></pre> <p>Check that the following url works and login with <code>admin:admin</code>. 
Then you should see some data  like below:</p> <p></p>"},{"location":"setup/guide/data-source/open-metadata-source/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedOpenMetadataSourceJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedOpenMetadataSourcePlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedOpenMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedOpenMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n</code></pre> <p>We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#schema","title":"Schema","text":"<p>We can point the schema of a data source to our OpenMetadata instance. 
We will use a JSON data source so that we can show how nested data types are handled and how we could customise it.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#single-schema","title":"Single Schema","text":"JavaScala <pre><code>import com.github.pflooky.datacaterer.api.model.Constants;\n...\n\nvar jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(metadataSource().openMetadataJava(\n\"http://localhost:8585/api\",                                                              //url\nConstants.OPEN_METADATA_AUTH_TYPE_OPEN_METADATA(),                                        //auth type\nMap.of(                                                                                   //additional options (including auth options)\nConstants.OPEN_METADATA_JWT_TOKEN(), \"abc123\",                                        //get from settings/bots/ingestion-bot\nConstants.OPEN_METADATA_TABLE_FQN(), \"sample_data.ecommerce_db.shopify.raw_customer\"  //table fully qualified name\n)\n))\n.count(count().records(10));\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.model.Constants.{OPEN_METADATA_AUTH_TYPE_OPEN_METADATA, OPEN_METADATA_JWT_TOKEN, OPEN_METADATA_TABLE_FQN, SAVE_MODE}\n...\n\nval jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -&gt; \"overwrite\"))\n.schema(metadataSource.openMetadata(\n\"http://localhost:8585/api\",                                                  //url\nOPEN_METADATA_AUTH_TYPE_OPEN_METADATA,                                        //auth type\nMap(                                                                          //additional options (including auth options)\nOPEN_METADATA_JWT_TOKEN -&gt; \"abc123\",                                        //get from settings/bots/ingestion-bot\nOPEN_METADATA_TABLE_FQN -&gt; \"sample_data.ecommerce_db.shopify.raw_customer\"  //table fully qualified name\n)\n))\n.count(count.records(10))\n</code></pre> <p>The above 
defines that the schema will come from OpenMetadata, which is a type of metadata source that contains information about schemas. Specifically, it points to the <code>sample_data.ecommerce_db.shopify.raw_customer</code> table. You can check out the schema here to see what it looks like.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#run","title":"Run","text":"<p>Let's try run and see what happens.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedOpenMetadataSourceJavaPlanRun or MyAdvancedOpenMetadataSourcePlanRun\n#after completing\ncat docker/sample/json/part-00000-*\n</code></pre> <p>It should look something like this.</p> <pre><code>{\n\"comments\": \"Mh6jqpD5e4M\",\n\"creditcard\": \"6771839575926717\",\n\"membership\": \"Za3wCQUl9E  EJj712\",\n\"orders\": [\n{\n\"product_id\": \"Aa6NG0hxfHVq\",\n\"price\": 16139,\n\"onsale\": false,\n\"tax\": 58134,\n\"weight\": 40734,\n\"others\": 45813,\n\"vendor\": \"Kh\"\n},\n{\n\"product_id\": \"zbHBY \",\n\"price\": 17903,\n\"onsale\": false,\n\"tax\": 39526,\n\"weight\": 9346,\n\"others\": 52035,\n\"vendor\": \"jbkbnXAa\"\n},\n{\n\"product_id\": \"5qs3gakppd7Nw5\",\n\"price\": 48731,\n\"onsale\": true,\n\"tax\": 81105,\n\"weight\": 2004,\n\"others\": 20465,\n\"vendor\": \"nozCDMSXRPH Ev\"\n},\n{\n\"product_id\": \"CA6h17ANRwvb\",\n\"price\": 62102,\n\"onsale\": true,\n\"tax\": 96601,\n\"weight\": 78849,\n\"others\": 79453,\n\"vendor\": \" ihVXEJz7E2EFS\"\n}\n],\n\"platform\": \"GLt9\",\n\"preference\": {\n\"key\": \"nmPmsPjg C\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Loren Bechtelar\",\n\"street_address\": \"Suite 526 293 Rohan Road, Wunschshire, NE 25532\",\n\"city\": \"South Norrisland\",\n\"postcode\": \"56863\"\n}\n],\n\"shipping_date\": \"2022-11-03\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"lance.murphy\",\n\"name\": \"Zane Brakus DVM\",\n\"sex\": \"7HcAaPiO\",\n\"address\": \"594 Loida Haven, Gilland, MA 26071\",\n\"mail\": 
\"Un3fhbvK2rEbenIYdnq\",\n\"birthdate\": \"2023-01-31\"\n}\n}\n</code></pre> <p>Looks like we have some data now. But we can do better and add some enhancements to it.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#custom-metadata","title":"Custom metadata","text":"<p>We can see from the data generated, that it isn't quite what we want. The metadata is not sufficient for us to produce production-like data yet. Let's try to add some enhancements to it.</p> <p>Let's make the <code>platform</code> field a choice field that can only be a set of certain values and the nested field <code>customer.sex</code> is also from a predefined set of values.</p> JavaScala <pre><code>var jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield().name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield().name(\"customer\").schema(field().name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count().records(10));\n</code></pre> <pre><code>val jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -&gt; \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield.name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield.name(\"customer\").schema(field.name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count.records(10))\n</code></pre> <p>Let's test it out by running it again</p> <pre><code>./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\ncat docker/sample/json/part-00000-*\n</code></pre> <pre><code>{\n\"comments\": \"vqbPUm\",\n\"creditcard\": \"6304867705548636\",\n\"membership\": \"GZ1xOnpZSUOKN\",\n\"orders\": [\n{\n\"product_id\": \"rgOokDAv\",\n\"price\": 77367,\n\"onsale\": false,\n\"tax\": 61742,\n\"weight\": 87855,\n\"others\": 26857,\n\"vendor\": \"04XHR64ImMr9T\"\n}\n],\n\"platform\": \"mobile\",\n\"preference\": {\n\"key\": \"IB5vNdWka\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Isiah 
Bins\",\n\"street_address\": \"36512 Ross Spurs, Hillhaven, IA 18760\",\n\"city\": \"Averymouth\",\n\"postcode\": \"75818\"\n},\n{\n\"name\": \"Scott Prohaska\",\n\"street_address\": \"26573 Haley Ports, Dariusland, MS 90642\",\n\"city\": \"Ashantimouth\",\n\"postcode\": \"31792\"\n},\n{\n\"name\": \"Rudolf Stamm\",\n\"street_address\": \"Suite 878 0516 Danica Path, New Christiaport, ID 10525\",\n\"city\": \"Doreathaport\",\n\"postcode\": \"62497\"\n}\n],\n\"shipping_date\": \"2023-08-24\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"jolie.cremin\",\n\"name\": \"Fay Klein\",\n\"sex\": \"O\",\n\"address\": \"Apt. 174 5084 Volkman Creek, Hillborough, PA 61959\",\n\"mail\": \"BiTmzb7\",\n\"birthdate\": \"2023-04-07\"\n}\n}\n</code></pre> <p>Great! Now we have the ability to get schema information from an external source, add our own metadata and generate  data.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#data-validation","title":"Data validation","text":"<p>Another aspect of OpenMetadata that can be leveraged is the definition of data quality rules. These rules can be  incorporated into your Data Caterer job as well by enabling data validations via <code>enableGenerateValidations</code> in  <code>configuration</code>.</p> JavaScala <pre><code>var conf = configuration().enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(conf, jsonTask);\n</code></pre> <pre><code>val conf = configuration.enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(conf, jsonTask)\n</code></pre> <p>Check out the full example under <code>AdvancedOpenMetadataSourcePlanRun</code> in the example repo.</p>"},{"location":"setup/guide/data-source/solace/","title":"Solace","text":"<p>Info</p> <p>Writing data to Solace is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator for Solace. You will build a Docker image that will be able to populate data in Solace for the queues/topics you configure.</p> <p></p>"},{"location":"setup/guide/data-source/solace/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> <li>Solace</li> </ul>"},{"location":"setup/guide/data-source/solace/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre> <p>If you already have a Solace instance running, you can skip to this step.</p>"},{"location":"setup/guide/data-source/solace/#solace-setup","title":"Solace Setup","text":"<p>Next, let's make sure you have an instance of Solace up and running in your local environment. This will make it easy for us to iterate and check our changes.</p> <pre><code>cd docker\ndocker-compose up -d solace\n</code></pre> <p>Open up localhost:8080, log in with <code>admin:admin</code> and check that the <code>default</code> VPN is there, like below. Notice there are 2 queues/topics created. 
If you do not see 2 created, try to run the script found under <code>docker/data/solace/setup_solace.sh</code> and change the <code>host</code> to <code>localhost</code>.</p> <p></p>"},{"location":"setup/guide/data-source/solace/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedSolaceJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedSolacePlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedSolaceJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedSolacePlan extends PlanRun {\n}\n</code></pre> <p>This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/data-source/solace/#connection-configuration","title":"Connection Configuration","text":"<p>Within our class, we can start by defining the connection properties to connect to Solace.</p> JavaScala <pre><code>var accountTask = solace(\n\"my_solace\",                        //name\n\"smf://host.docker.internal:55554\", //url\nMap.of()                            //optional additional connection options\n);\n</code></pre> <p>Additional connection options can be found here.</p> <pre><code>val accountTask = solace(\n\"my_solace\",                        //name\n\"smf://host.docker.internal:55554\", //url\nMap()                               //optional additional connection options\n)\n</code></pre> <p>Additional connection options can be found here.</p>"},{"location":"setup/guide/data-source/solace/#schema","title":"Schema","text":"<p>Let's create a task for inserting data into the <code>rest_test_queue</code> or 
<code>rest_test_topic</code> that is already created for us from this step.</p> <p>Trimming the connection details to work with the docker-compose Solace, we have a base Solace connection to define the JNDI destination we will publish to. Let's define each field along with their corresponding data type. You will notice that the <code>text</code> fields do not have a data type defined. This is because the default data type is <code>StringType</code>.</p> JavaScala <pre><code>{\nvar solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()),   //can define message JMS priority here\nfield().name(\"headers\")                                     //set message properties via headers field\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()).min(2021).max(2023),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n)\n)\n.count(count().records(10));\n}\n</code></pre> <pre><code>val 
solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").`type`(IntegerType),  //can define message JMS priority here\nfield.name(\"headers\")                           //set message properties via headers field\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n          |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n          |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n          |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\n).count(count.records(10))\n</code></pre>"},{"location":"setup/guide/data-source/solace/#fields","title":"Fields","text":"<p>The schema defined for Solace has a format that needs to be followed as noted above. 
Specifically, the required fields are:</p> <ul> <li>value</li> </ul> <p>The other fields are optional:</p> <ul> <li>partition - refers to JMS priority of the message</li> <li>headers - refers to JMS message properties</li> </ul>"},{"location":"setup/guide/data-source/solace/#headers","title":"headers","text":"<p><code>headers</code> follows a particular pattern where it is of type <code>HeaderType.getType</code> which, behind the scenes, translates to <code>array&lt;struct&lt;key: string,value: binary&gt;&gt;</code>. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that in the <code>value</code> part, it refers to <code>content.account_id</code> where <code>content</code> is another field defined at the top level of the schema. This allows you to reference other values that have already been generated.</p> JavaScala <pre><code>field().name(\"headers\")\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n</code></pre> <pre><code>field.name(\"headers\")\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n      |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n      |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n      |)\"\"\".stripMargin\n)\n</code></pre>"},{"location":"setup/guide/data-source/solace/#transactions","title":"transactions","text":"<p><code>transactions</code> is an array that contains an inner structure of <code>txn_date</code> and <code>amount</code>. 
The size of the array generated can be controlled via <code>arrayMinLength</code> and <code>arrayMaxLength</code>.</p> JavaScala <pre><code>field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n</code></pre> <pre><code>field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n</code></pre>"},{"location":"setup/guide/data-source/solace/#details","title":"details","text":"<p><code>details</code> is another example of a nested schema structure where it also has a nested structure itself in <code>updated_by</code>. One thing to note here is the <code>first_txn_date</code> field has a reference to the <code>content.transactions</code> array where it will sort the array by <code>txn_date</code> and get the first element.</p> JavaScala <pre><code>field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n</code></pre> <pre><code>field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n</code></pre>"},{"location":"setup/guide/data-source/solace/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. 
We can control the output folder of that report via configurations.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n</code></pre>"},{"location":"setup/guide/data-source/solace/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>solaceTask</code>, we have to call <code>execute</code>.</p>"},{"location":"setup/guide/data-source/solace/#run","title":"Run","text":"<p>Now we can run the class we just created via the script <code>./run.sh</code>, which is in the top level directory of <code>data-caterer-example</code>.</p> <pre><code>./run.sh\n#input class AdvancedSolaceJavaPlanRun or AdvancedSolacePlanRun\n#after completing, check http://localhost:8080 from browser\n</code></pre> <p>Your output should look like this.</p> <p></p> <p>Unfortunately, there is no easy way to see the message content. You can check the message content from your application or service that consumes these messages.</p> <p>Also check the HTML report, found at <code>docker/sample/report/index.html</code>, that gets generated to get an overview of what was executed. Or view the sample report found here.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/","title":"Auto Generate From Data Connection","text":"<p>Info</p> <p>Auto data generation from data connection is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator based on only a data connection to Postgres.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/auto-generate-connection/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/auto-generate-connection/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedAutomatedJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedAutomatedPlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedAutomatedJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (3)\n.enableUniqueCheck(true)                                                          (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedAutomatedPlanRun extends PlanRun {\n\nval autoRun = configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 
(2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (3)\n.enableUniqueCheck(true)                                                          (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n</code></pre> <p>In the above code, we note the following:</p> <ol> <li>Data source configuration to a Postgres data source called <code>my_postgres</code></li> <li>We have enabled the flag <code>enableGeneratePlanAndTasks</code> which tells Data Caterer to go to <code>my_postgres</code> and generate data for all the tables found under the database <code>customer</code> (which is defined in the connection string).</li> <li>The config <code>generatedPlanAndTaskFolderPath</code> defines where the metadata that is gathered from <code>my_postgres</code> should be saved so that we can re-use it later.</li> <li><code>enableUniqueCheck</code> is set to true to ensure that generated data is unique based on primary key or foreign key definitions.</li> </ol> <p>Note</p> <p>Unique check will only ensure generated data is unique. Any existing data in your data source is not taken into account, so generated data may fail to insert depending on the data source restrictions.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#postgres-setup","title":"Postgres Setup","text":"<p>If you don't have your own Postgres up and running, you can set up and run an instance configured in the <code>docker</code> folder with the commands below.</p> <pre><code>cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n</code></pre> <p>This will create the tables found under <code>docker/data/sql/postgres/customer.sql</code>. You can change this file to contain your own tables. 
We can see there are 4 tables created for us, <code>accounts, balances, transactions and mapping</code>.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedAutomatedJavaPlanRun or MyAdvancedAutomatedPlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1;'\n</code></pre> <p>It should look something like this.</p> <pre><code>   id   | account_number  | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint |   customer_id_decimal    | customer_id_real | customer_id_double | open_date  |     open_timestamp      | last_opened_time |                                                           payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H            | SfA0eZJcTm | CuRw                    |              13 |                   42 |               6041 | 76987.745612542900000000 |         91866.78 |  66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736     | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n</code></pre> <p>The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.</p> <p>Also check the HTML report that gets generated under <code>docker/sample/report/index.html</code>. 
You can see a summary of what was generated along with other metadata.</p> <p>You can now look to play around with other tables or data sources and auto generate for them.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/auto-generate-connection/#learn-from-existing-data","title":"Learn From Existing Data","text":"<p>If you have any existing data within your data source, Data Caterer will gather metadata about the existing data to help guide it when generating new data. There are configurations that can help tune the metadata analysis found here.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#filter-out-schematables","title":"Filter Out Schema/Tables","text":"<p>As part of your connection definition, you can define any schemas and/or tables you don't want to generate data for. In the example below, it will not generate any data for any tables under the <code>history</code> and <code>audit</code> schemas. 
Also, any table with the name <code>balances</code> or <code>transactions</code> in any schema will not have data generated.</p> JavaScala <pre><code>var autoRun = configuration()\n.postgres(\n\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap.of(\n\"filterOutSchema\", \"history, audit\",\n\"filterOutTable\", \"balances, transactions\")\n);\n</code></pre> <pre><code>val autoRun = configuration\n.postgres(\n\"my_postgres\",\n\"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap(\n\"filterOutSchema\" -&gt; \"history, audit\",\n\"filterOutTable\" -&gt; \"balances, transactions\")\n)\n</code></pre>"},{"location":"setup/guide/scenario/auto-generate-connection/#define-record-count","title":"Define record count","text":"<p>You can control the record count per sub data source via <code>numRecordsPerStep</code>.</p> JavaScala <pre><code>var autoRun = configuration()\n...\n.numRecordsPerStep(100);\n\nexecute(autoRun);\n</code></pre> <pre><code>val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n</code></pre>"},{"location":"setup/guide/scenario/batch-and-event/","title":"Generate Batch and Event Data","text":"<p>Info</p> <p>Generating event data is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator for a Kafka topic with matching records in a CSV file.</p>"},{"location":"setup/guide/scenario/batch-and-event/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/batch-and-event/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/batch-and-event/#kafka-setup","title":"Kafka Setup","text":"<p>If you don't have your own Kafka up and running, you can set up and run an instance configured in the <code>docker</code> folder with the commands below.</p> <pre><code>cd docker\ndocker-compose up -d kafka\ndocker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n</code></pre> <p>Let's create a task for inserting data into the <code>account-topic</code> that is already defined under <code>docker/data/kafka/setup_kafka.sh</code>.</p>"},{"location":"setup/guide/scenario/batch-and-event/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedBatchEventJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedBatchEventPlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedBatchEventJavaPlanRun extends PlanRun {\n{\nvar kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedBatchEventPlanRun extends PlanRun {\nval kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n}\n</code></pre> <p>We will borrow the Kafka 
task that is already defined under the class <code>AdvancedKafkaPlanRun</code> or <code>AdvancedKafkaJavaPlanRun</code>. You can go through the Kafka guide here for more details.</p>"},{"location":"setup/guide/scenario/batch-and-event/#schema","title":"Schema","text":"<p>Let us set up the corresponding schema for the CSV file where we want to match the values that are generated for the Kafka messages.</p> JavaScala <pre><code>var kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n\nvar csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield().name(\"account_number\"),\nfield().name(\"year\"),\nfield().name(\"name\"),\nfield().name(\"payload\")\n);\n</code></pre> <pre><code>val kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n\nval csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield.name(\"account_number\"),\nfield.name(\"year\"),\nfield.name(\"name\"),\nfield.name(\"payload\")\n)\n</code></pre> <p>This is a simple schema where we want to use the values and metadata that are already defined in the <code>kafkaTask</code> to determine what the data will look like for the CSV file. Even if we defined some metadata here, it would be overridden when we define our foreign key relationships.</p>"},{"location":"setup/guide/scenario/batch-and-event/#foreign-keys","title":"Foreign Keys","text":"<p>From the above CSV schema, we note the following against the Kafka schema:</p> <ul> <li><code>account_number</code> in CSV needs to match with the <code>account_id</code> in Kafka<ul> <li>We see that <code>account_id</code> is referred to in the <code>key</code> column as <code>field.name(\"key\").sql(\"content.account_id\")</code></li> </ul> </li> <li><code>year</code> needs to match with <code>content.year</code> in Kafka, which is a nested field<ul> <li>We can only do foreign key relationships with top level fields, not nested fields. 
So we define a new column   called <code>tmp_year</code> which will not appear in the final output for the Kafka messages but is used as an intermediate   step <code>field.name(\"tmp_year\").sql(\"content.year\").omit(true)</code></li> </ul> </li> <li><code>name</code> needs to match with <code>content.details.name</code> in Kafka, also a nested field<ul> <li>Using the same logic as above, we define a temporary column called <code>tmp_name</code> which will take the value of the   nested field but will be omitted <code>field.name(\"tmp_name\").sql(\"content.details.name\").omit(true)</code></li> </ul> </li> <li><code>payload</code> represents the whole JSON message sent to Kafka, which matches to <code>value</code> column</li> </ul> <p>Our foreign keys are therefore defined like below. Order is important when defining the list of columns. The index needs to match with the corresponding column in the other data source.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\nkafkaTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(myPlan, conf, kafkaTask, csvTask);\n</code></pre> <pre><code>val myPlan = plan.addForeignKeyRelationship(\nkafkaTask, List(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList(csvTask -&gt; List(\"account_number\", \"year\", \"name\", \"payload\"))\n)\n\nval conf = configuration.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(myPlan, conf, kafkaTask, csvTask)\n</code></pre>"},{"location":"setup/guide/scenario/batch-and-event/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedBatchEventJavaPlanRun or MyAdvancedBatchEventPlanRun\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic 
--from-beginning\n</code></pre> <p>It should look something like this.</p> <pre><code>{\"account_id\":\"ACC03093143\",\"year\":2023,\"amount\":87990.37196728592,\"details\":{\"name\":\"Nadine Heidenreich Jr.\",\"first_txn_date\":\"2021-11-09\",\"updated_by\":{\"user\":\"YfEyJCe8ohrl0j IfyT\",\"time\":\"2022-09-26T20:47:53.404Z\"}},\"transactions\":[{\"txn_date\":\"2021-11-09\",\"amount\":97073.7914706189}]}\n{\"account_id\":\"ACC08764544\",\"year\":2021,\"amount\":28675.58758765888,\"details\":{\"name\":\"Delila Beer\",\"first_txn_date\":\"2021-05-19\",\"updated_by\":{\"user\":\"IzB5ksXu\",\"time\":\"2023-01-26T20:47:26.389Z\"}},\"transactions\":[{\"txn_date\":\"2021-10-01\",\"amount\":80995.23818711648},{\"txn_date\":\"2021-05-19\",\"amount\":92572.40049217848},{\"txn_date\":\"2021-12-11\",\"amount\":99398.79832225188}]}\n{\"account_id\":\"ACC62505420\",\"year\":2023,\"amount\":96125.3125884202,\"details\":{\"name\":\"Shawn Goodwin\",\"updated_by\":{\"user\":\"F3dqIvYp2pFtena4\",\"time\":\"2023-02-11T04:38:29.832Z\"}},\"transactions\":[]}\n</code></pre> <p>Let's also check if there is a corresponding record in the CSV file.</p> <pre><code>$ cat docker/sample/csv/account/part-0000* | grep ACC03093143\nACC03093143,2023,Nadine Heidenreich Jr.,\"{\\\"account_id\\\":\\\"ACC03093143\\\",\\\"year\\\":2023,\\\"amount\\\":87990.37196728592,\\\"details\\\":{\\\"name\\\":\\\"Nadine Heidenreich Jr.\\\",\\\"first_txn_date\\\":\\\"2021-11-09\\\",\\\"updated_by\\\":{\\\"user\\\":\\\"YfEyJCe8ohrl0j IfyT\\\",\\\"time\\\":\\\"2022-09-26T20:47:53.404Z\\\"}},\\\"transactions\\\":[{\\\"txn_date\\\":\\\"2021-11-09\\\",\\\"amount\\\":97073.7914706189}]}\"\n</code></pre> <p>Great! 
The account, year, name and payload look to all match up.</p>"},{"location":"setup/guide/scenario/batch-and-event/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/batch-and-event/#order-of-execution","title":"Order of execution","text":"<p>You may notice that the events are generated first, then the CSV file. This is because as part of the <code>execute</code> function, we passed in the <code>kafkaTask</code> first, before the <code>csvTask</code>. You can change the order of execution by passing in <code>csvTask</code> before <code>kafkaTask</code> into the <code>execute</code> function.</p>"},{"location":"setup/guide/scenario/data-validation/","title":"Data Validations","text":"<p>Creating a data validator for a JSON file.</p> <p></p>"},{"location":"setup/guide/scenario/data-validation/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/data-validation/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#data-setup","title":"Data Setup","text":"<p>To aid in showing the functionality of data validations, we will first generate some data that our validations will run against. 
Run the command below and it will generate JSON files under the <code>docker/sample/json</code> folder.</p> <pre><code>./run.sh JsonPlan\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyValidationJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyValidationPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyValidationJavaPlan extends PlanRun {\n{\nvar jsonTask = json(\"my_json\", \"/opt/app/data/json\");\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false);\n\nexecute(config, jsonTask);\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyValidationPlan extends PlanRun {\nval jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false)\n\nexecute(config, jsonTask)\n}\n</code></pre> <p>As noted above, we create a JSON task that points to the folder <code>/opt/app/data/json</code>, where the JSON data has been created. 
We also note that <code>enableValidation</code> is set to <code>true</code> and <code>enableGenerateData</code> to <code>false</code> to tell Data Caterer that we only want to validate data.</p>"},{"location":"setup/guide/scenario/data-validation/#validations","title":"Validations","text":"<p>For reference, the schema we will be validating against looks like the below.</p> <pre><code>.schema(\nfield.name(\"account_id\"),\n  field.name(\"year\").`type`(IntegerType),\n  field.name(\"balance\").`type`(DoubleType),\n  field.name(\"date\").`type`(DateType),\n  field.name(\"status\"),\n  field.name(\"update_history\").`type`(ArrayType)\n.schema(\nfield.name(\"updated_time\").`type`(TimestampType),\n      field.name(\"status\").oneOf(\"open\", \"closed\", \"pending\", \"suspended\"),\n    ),\n  field.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\n      field.name(\"age\").`type`(IntegerType),\n      field.name(\"city\").expression(\"#{Address.city}\")\n)\n)\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#basic-validation","title":"Basic Validation","text":"<p>Let's say our goal is to validate the <code>customer_details.name</code> field to ensure it conforms to the regex pattern <code>[A-Z][a-z]+ [A-Z][a-z]+</code>. Given the diversity in naming conventions across cultures and countries, variations such as middle names, suffixes, prefixes, or language-specific differences are tolerated to a certain extent. 
The validation considers an acceptable error threshold before marking it as failed.</p>"},{"location":"setup/guide/scenario/data-validation/#validation-criteria","title":"Validation Criteria","text":"<ul> <li>Field to Validate: <code>customer_details.name</code></li> <li>Regex Pattern: <code>[A-Z][a-z]+ [A-Z][a-z]+</code></li> <li>Error Tolerance: If more than 10% do not match the regex, then fail.</li> </ul>"},{"location":"setup/guide/scenario/data-validation/#considerations","title":"Considerations","text":"<ul> <li>Customisation<ul> <li>Adjust the regex pattern and error threshold based on your specific data schema and validation requirements.</li> <li>For the full list of types of basic validations that can be used, check this page.</li> </ul> </li> <li>Understanding Tolerance<ul> <li>Be mindful of the error threshold, as it directly influences what percentage of deviations from the pattern is acceptable.</li> </ul> </li> </ul> JavaScala <pre><code>validation().col(\"customer_details.name\")\n.matches(\"[A-Z][a-z]+ [A-Z][a-z]+\")\n.errorThreshold(0.1)                                      //&lt;=10% failure rate is acceptable\n.description(\"Names generally follow the same pattern\"),  //description to add context in the report or for other developers\n</code></pre> <pre><code>validation.col(\"customer_details.name\")\n.matches(\"[A-Z][a-z]+ [A-Z][a-z]+\")\n.errorThreshold(0.1)                                      //&lt;=10% failure rate is acceptable\n.description(\"Names generally follow the same pattern\"),  //description to add context in the report or for other developers\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#custom-validation","title":"Custom Validation","text":"<p>There will be situations where you have a complex data setup and require your own custom logic to use for data validation. You can achieve this by setting your own SQL expression that returns a boolean value. 
An example is seen below where we want to check that each entry in the array <code>update_history</code> has <code>updated_time</code> greater than a certain timestamp.</p> JavaScala <pre><code>validation().expr(\"FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP('2022-01-01 00:00:00'))\"),\n</code></pre> <pre><code>validation.expr(\"FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP('2022-01-01 00:00:00'))\"),\n</code></pre> <p>If you want to know what other SQL functions are available for you to use, check this page.</p>"},{"location":"setup/guide/scenario/data-validation/#group-by-validation","title":"Group By Validation","text":"<p>There are scenarios where you want to validate against grouped values or the whole dataset via aggregations. An example would be validating that each customer's transactions sum is greater than 0.</p>"},{"location":"setup/guide/scenario/data-validation/#validation-criteria_1","title":"Validation Criteria","text":"<p>Line 1: <code>validation.groupBy().count().isEqual(100)</code></p> <ul> <li>Method Chaining<ul> <li><code>groupBy()</code>: Groups by the whole dataset.</li> <li><code>count()</code>: Counts the number of dataset elements.</li> <li><code>isEqual(100)</code>: Checks if the count is equal to 100.</li> </ul> </li> <li>Validation Rule<ul> <li>This line ensures that the count of the total dataset is exactly 100.</li> </ul> </li> </ul> <p>Line 2: <code>validation.groupBy(\"account_id\").max(\"balance\").lessThan(900)</code></p> <ul> <li>Method Chaining<ul> <li><code>groupBy(\"account_id\")</code>: Groups the data based on the <code>account_id</code> field.</li> <li><code>max(\"balance\")</code>: Calculates the maximum value of the <code>balance</code> field within each group.</li> <li><code>lessThan(900)</code>: Checks if the maximum balance in each group is less than 900.</li> </ul> </li> <li>Validation Rule<ul> <li>This line ensures that, for each group identified by <code>account_id</code>, the maximum balance is 
less than 900.</li> </ul> </li> </ul>"},{"location":"setup/guide/scenario/data-validation/#considerations_1","title":"Considerations","text":"<ul> <li>Adjust the <code>errorThreshold</code> or validation to your specific scenario. The full list of types of validations can be found here.</li> <li>For the full list of types of group by validations that can be used, check this page.</li> </ul> JavaScala <pre><code>validation().groupBy().count().isEqual(100),\nvalidation().groupBy(\"account_id\").max(\"balance\").lessThan(900)\n</code></pre> <pre><code>validation.groupBy().count().isEqual(100),\nvalidation.groupBy(\"account_id\").max(\"balance\").lessThan(900)\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#sample-validation","title":"Sample Validation","text":"<p>To try to cover the majority of validation cases, the sample below has been created.</p> JavaScala <pre><code>var jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(\nvalidation().col(\"customer_details.name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.1).description(\"Names generally follow the same pattern\"),\nvalidation().col(\"date\").isNotNull().errorThreshold(10),\nvalidation().col(\"balance\").greaterThan(500),\nvalidation().expr(\"YEAR(date) == year\"),\nvalidation().col(\"status\").in(\"open\", \"closed\", \"pending\").errorThreshold(0.2).description(\"Could be new status introduced\"),\nvalidation().col(\"customer_details.age\").greaterThan(18),\nvalidation().expr(\"FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP('2022-01-01 00:00:00'))\"),\nvalidation().col(\"update_history\").greaterThanSize(2),\nvalidation().unique(\"account_id\"),\nvalidation().groupBy().count().isEqual(1000),\nvalidation().groupBy(\"account_id\").max(\"balance\").lessThan(900)\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false);\n\nexecute(config, jsonTask);\n</code></pre> 
<pre><code>val jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(\nvalidation.col(\"customer_details.name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.1).description(\"Names generally follow the same pattern\"),\nvalidation.col(\"date\").isNotNull.errorThreshold(10),\nvalidation.col(\"balance\").greaterThan(500),\nvalidation.expr(\"YEAR(date) == year\"),\nvalidation.col(\"status\").in(\"open\", \"closed\", \"pending\").errorThreshold(0.2).description(\"Could be new status introduced\"),\nvalidation.col(\"customer_details.age\").greaterThan(18),\nvalidation.expr(\"FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP('2022-01-01 00:00:00'))\"),\nvalidation.col(\"update_history\").greaterThanSize(2),\nvalidation.unique(\"account_id\"),\nvalidation.groupBy().count().isEqual(1000),\nvalidation.groupBy(\"account_id\").max(\"balance\").lessThan(900)\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false)\n\nexecute(config, jsonTask)\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>./run.sh\n#input class MyValidationJavaPlan or MyValidationPlan\n#after completing, check report at docker/sample/report/index.html\n</code></pre> <p>It should look something like this.</p> <p>Check the full example at <code>ValidationPlanRun</code> inside the examples repo.</p>"},{"location":"setup/guide/scenario/delete-generated-data/","title":"Delete Generated Data","text":"<p>Info</p> <p>Delete generated data is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator for Postgres and deleting the generated data after using it.</p>"},{"location":"setup/guide/scenario/delete-generated-data/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/delete-generated-data/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/delete-generated-data/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedDeleteJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedDeletePlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedDeleteJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.enableRecordTracking(true)                                                       (3)\n.enableDeleteGeneratedRecords(false)                                              (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\")                         (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedDeletePlanRun extends PlanRun {\n\nval autoRun = 
configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.enableRecordTracking(true)                                                       (3)\n.enableDeleteGeneratedRecords(false)                                              (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\")                         (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n</code></pre> <p>In the above code we note the following:</p> <ol> <li>We have defined a Postgres connection called <code>my_postgres</code></li> <li><code>enableGeneratePlanAndTasks</code> is enabled to auto generate data for all tables under <code>customer</code> database</li> <li><code>enableRecordTracking</code> is enabled to ensure that all generated records are tracked. This will get used when we want    to delete data afterwards</li> <li><code>enableDeleteGeneratedRecords</code> is disabled for now. We want to see the generated data first and delete sometime after</li> <li><code>generatedPlanAndTaskFolderPath</code> is the folder path where we saved the metadata we have gathered from <code>my_postgres</code></li> <li><code>recordTrackingFolderPath</code> is the folder path where record tracking is maintained. 
We need to persist this data to    ensure it is still available when we want to delete data</li> </ol>"},{"location":"setup/guide/scenario/delete-generated-data/#postgres-setup","title":"Postgres Setup","text":"<p>If you don't have your own Postgres up and running, you can set up and run an instance configured in the <code>docker</code> folder via.</p> <pre><code>cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n</code></pre> <p>This will create the tables found under <code>docker/data/sql/postgres/customer.sql</code>. You can change this file to contain your own tables. We can see there are 4 tables created for us, <code>accounts, balances, transactions and mapping</code>.</p>"},{"location":"setup/guide/scenario/delete-generated-data/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\n</code></pre> <p>It should look something like this.</p> <pre><code>   id   | account_number  | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint |   customer_id_decimal    | customer_id_real | customer_id_double | open_date  |     open_timestamp      | last_opened_time |                                                           payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H            | SfA0eZJcTm | CuRw                    |   
           13 |                   42 |               6041 | 76987.745612542900000000 |         91866.78 |  66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736     | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n</code></pre> <p>The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.</p> <p>Check the number of records via:</p> <pre><code>docker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n#open report under docker/sample/report/index.html\n</code></pre>"},{"location":"setup/guide/scenario/delete-generated-data/#delete","title":"Delete","text":"<p>We are now at a stage where we want to delete the data that was generated. All we need to do is flip two flags.</p> <pre><code>.enableDeleteGeneratedRecords(true)\n.enableGenerateData(false)  //we need to explicitly disable generating data\n</code></pre> <p>Enable delete generated records and disable generating data. </p> <p>Before we run again, let us insert a record manually to see if that data will survive after running the job to delete the generated data.</p> <pre><code>docker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"insert into account.accounts (account_number) values ('my_account_number')\"\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"select count(1) from account.accounts\"\n</code></pre> <p>We now should have 1001 records in our <code>account.accounts</code> table. 
Let's delete the generated data now.</p> <pre><code>./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n</code></pre> <p>You should see that only 1 record is left, the one that we manually inserted. Great, now we can generate data reliably  and also be able to clean it up.</p>"},{"location":"setup/guide/scenario/delete-generated-data/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/delete-generated-data/#one-class-for-generating-another-for-deleting","title":"One class for generating, another for deleting?","text":"<p>Yes, this is possible. There are two requirements: - the connection names used need to be the same across both classes - <code>recordTrackingFolderPath</code> needs to be set to the same value</p>"},{"location":"setup/guide/scenario/delete-generated-data/#define-record-count","title":"Define record count","text":"<p>You can control the record count per sub data source via <code>numRecordsPerStep</code>.</p> JavaScala <pre><code>var autoRun = configuration()\n...\n.numRecordsPerStep(100)\n\nexecute(autoRun)\n</code></pre> <pre><code>val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/","title":"First Data Generation","text":"<p>Creating a data generator for a CSV file.</p> <p></p>"},{"location":"setup/guide/scenario/first-data-generation/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/first-data-generation/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the 
base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyCsvJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyCsvPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyCsvJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyCsvPlan extends PlanRun {\n}\n</code></pre> <p>This class is where we define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/scenario/first-data-generation/#connection-configuration","title":"Connection Configuration","text":"<p>When dealing with CSV files, we need to define a path for our generated CSV files to be saved at, along with any other high level configurations.</p> JavaScala <pre><code>csv(\n\"customer_accounts\",              //name\n\"/opt/app/data/customer/account\", //path\nMap.of(\"header\", \"true\")          //optional additional options\n)\n</code></pre> <p>Other additional options for CSV can be found here</p> <pre><code>csv(\n\"customer_accounts\",              //name\n\"/opt/app/data/customer/account\", //path\nMap(\"header\" -&gt; \"true\")           //optional additional options\n)\n</code></pre> <p>Other additional options for CSV can be found here</p>"},{"location":"setup/guide/scenario/first-data-generation/#schema","title":"Schema","text":"<p>Our CSV file that we generate should adhere to a defined schema where we can also define data types.</p> <p>Let's define each field along with its corresponding data type. 
You will notice that the <code>string</code> fields do not have a data type defined. This is because the default data type is <code>StringType</code>.</p> JavaScala <pre><code>var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"balance\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n</code></pre> <pre><code>val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"balance\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#field-metadata","title":"Field Metadata","text":"<p>We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata attributes that add guidelines that the data generator will understand when generating data.</p>"},{"location":"setup/guide/scenario/first-data-generation/#account_id","title":"account_id","text":"<ul> <li><code>account_id</code> follows a particular pattern where it starts with <code>ACC</code> and has 8 digits after it. This can be defined via a regex like below. 
We also mark the field as unique to ensure that unique values are generated.</li> </ul> JavaScala <pre><code>field().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n</code></pre> <pre><code>field.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#balance","title":"balance","text":"<ul> <li><code>balance</code> should not be too large, so we define a min and max to keep the generated numbers between <code>1</code> and <code>1000</code>.</li> </ul> JavaScala <pre><code>field().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\n</code></pre> <pre><code>field.name(\"balance\").`type`(DoubleType).min(1).max(1000),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#name","title":"name","text":"<ul> <li><code>name</code> is a string that also follows a certain pattern, so we could also define a regex, but here we will choose to leverage the DataFaker library and create an <code>expression</code> to generate a real-looking name. All possible faker expressions can be found here</li> </ul> JavaScala <pre><code>field().name(\"name\").expression(\"#{Name.name}\"),\n</code></pre> <pre><code>field.name(\"name\").expression(\"#{Name.name}\"),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#open_time","title":"open_time","text":"<ul> <li><code>open_time</code> is a timestamp that we want to have a value greater than a specific date. 
We can define a min date by using <code>java.sql.Date</code> like below.</li> </ul> JavaScala <pre><code>field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre> <pre><code>field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#status","title":"status","text":"<ul> <li><code>status</code> is a field that can only take one of four values, <code>open, closed, suspended or pending</code>.</li> </ul> JavaScala <pre><code>field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre> <pre><code>field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#created_by","title":"created_by","text":"<ul> <li><code>created_by</code> is a field that is based on the <code>status</code> field where it follows the logic: <code>if status is open or closed, then it is created_by eod else created_by event</code>. 
This can be achieved by defining a SQL expression like below.</li> </ul> JavaScala <pre><code>field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <pre><code>field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <p>Putting all the fields together, our class should now look like this.</p> JavaScala <pre><code>var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n</code></pre> <pre><code>val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#record-count","title":"Record Count","text":"<p>We only want to generate 100 records, so that we can see what the output looks like. This is controlled at the <code>accountTask</code> level like below. 
If you want to generate more records, set it to the value you want.</p> JavaScala <pre><code>var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().records(100));\n</code></pre> <pre><code>val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.records(100))\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure any unique fields will have unique values generated.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>accountTask</code>, we have to call <code>execute</code> . 
So our full plan run will look like this.</p> JavaScala <pre><code>public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n</code></pre> <pre><code>class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#run","title":"Run","text":"<p>Now we can run via the script <code>./run.sh</code> that is in the top level directory of the <code>data-caterer-example</code> to run the class we just created.</p> <pre><code>./run.sh\n#input class MyCsvJavaPlan or 
MyCsvPlan\n#after completing\nhead docker/sample/customer/account/part-00000*\n</code></pre> <p>Your output should look like this.</p> <pre><code>account_id,balance,created_by,name,open_time,status\nACC06192462,853.9843359645766,eod,Hoyt Kertzmann MD,2023-07-22T11:17:01.713Z,closed\nACC15350419,632.5969895326234,eod,Dr. Claude White,2022-12-13T21:57:56.840Z,open\nACC25134369,592.0958847218986,eod,Fabian Rolfson,2023-04-26T04:54:41.068Z,open\nACC48021786,656.6413439322964,eod,Dewayne Stroman,2023-05-17T06:31:27.603Z,open\nACC26705211,447.2850352884595,event,Garrett Funk,2023-07-14T03:50:22.746Z,pending\nACC03150585,750.4568929015996,event,Natisha Reichel,2023-04-11T11:13:10.080Z,suspended\nACC29834210,686.4257811608622,event,Gisele Ondricka,2022-11-15T22:09:41.172Z,suspended\nACC39373863,583.5110618128994,event,Thaddeus Ortiz,2022-09-30T06:33:57.193Z,suspended\nACC39405798,989.2623959059525,eod,Shelby Reinger,2022-10-23T17:29:17.564Z,open\n</code></pre> <p>Also check the HTML report, found at <code>docker/sample/report/index.html</code>, that gets generated to get an overview of what was executed.</p> <p></p>"},{"location":"setup/guide/scenario/first-data-generation/#join-with-another-csv","title":"Join With Another CSV","text":"<p>Now that we have generated some accounts, let's also try to generate a set of transactions for those accounts in CSV format as well. 
The transactions could be in any other format, but to keep this simple, we will continue using CSV.</p> <p>We can define our schema the same way along with any additional metadata.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n);\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#records-per-column","title":"Records Per Column","text":"<p>Usually, for a given <code>account_id, full_name</code>, there should be multiple records for it as we want to simulate a customer having multiple transactions. 
We can achieve this by defining the number of records to generate in the <code>count</code> function.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#random-records-per-column","title":"Random Records Per Column","text":"<p>Above, you will notice that we are generating 5 records per <code>account_id, full_name</code>. This is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate this by defining a random number of records per column.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n</code></pre> <p>Here we set the minimum number of records per column to be 0 and the maximum to 5.</p>"},{"location":"setup/guide/scenario/first-data-generation/#foreign-key","title":"Foreign Key","text":"<p>In this scenario, we want the <code>account_id</code> values in <code>account</code> to match the same column values in <code>transaction</code>. 
We also want to match <code>name</code> in <code>account</code> to <code>full_name</code> in <code>transaction</code>. This can be done via plan configuration like below.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"), //the task and columns we want linked\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\"))) //list of other tasks and their respective column names we want matched\n);\n</code></pre> <pre><code>val myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"),  //the task and columns we want linked\nList(transactionTask -&gt; List(\"account_id\", \"full_name\"))  //list of other tasks and their respective column names we want matched\n)\n</code></pre> <p>Now, stitching it all together for the <code>execute</code> function, our final plan should look like this.</p> JavaScala <pre><code>public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count().records(100));\n\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", 
\"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nvar myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\")))\n);\n\nexecute(myPlan, config, accountTask, transactionTask);\n}\n}\n</code></pre> <pre><code>class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count.records(100))\n\nval transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n\nval config = 
configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nval myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"),\nList(transactionTask -&gt; List(\"account_id\", \"full_name\"))\n)\n\nexecute(myPlan, config, accountTask, transactionTask)\n}\n</code></pre> <p>Let's try running it again.</p> <pre><code>#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyCsvJavaPlan or MyCsvPlan\n#after completing, let's pick an account and check the transactions for that account\naccount=$(tail -1 docker/sample/customer/account/part-00000* | awk -F \",\" '{print $1 \",\" $4}')\n#quote the variable since the name contains a space\necho \"$account\"\ncat docker/sample/customer/transaction/part-00000* | grep \"$account\"\n</code></pre> <p>It should look something like this.</p> <pre><code>ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n</code></pre> <p>Congratulations! You have now made a data generator that has simulated a real world data scenario. You can check the <code>DocumentationJavaPlanRun.java</code> or <code>DocumentationPlanRun.scala</code> files as well to check that your plan is the same.</p> <p>We can now look to consume this CSV data from a job or service. Usually, once we have consumed the data, we would also want to check and validate that our consumer has correctly ingested the data.</p>"},{"location":"setup/guide/scenario/first-data-generation/#validate","title":"Validate","text":"<p>In this scenario, our consumer will read in the CSV file, do some transformations, and then save the data to Postgres. Let's try to configure data validations for the data that gets pushed into Postgres.</p>"},{"location":"setup/guide/scenario/first-data-generation/#postgres-setup","title":"Postgres setup","text":"<p>First, we define our connection properties for Postgres. 
You can check out the full options available here.</p> JavaScala <pre><code>var postgresValidateTask = postgres(\n\"my_postgres\",                                          //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\",                                             //username\n\"password\"                                              //password\n).table(\"account\", \"transactions\");\n</code></pre> <pre><code>val postgresValidateTask = postgres(\n\"my_postgres\",                                          //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\",                                             //username\n\"password\"                                              //password\n).table(\"account\", \"transactions\")\n</code></pre> <p>We can connect and access the data inside the table <code>account.transactions</code>. Now to define our data validations.</p>"},{"location":"setup/guide/scenario/first-data-generation/#validations","title":"Validations","text":"<p>For full information about validation options and configurations, check here. 
Below, we have an example that should give you a good understanding of what validations are possible.</p> JavaScala <pre><code>var postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation().col(\"account_id\").isNotNull(),\nvalidation().col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation().col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation().expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation().unique(\"account_id\", \"name\"),\nvalidation().groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n);\n</code></pre> <pre><code>val postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation.col(\"account_id\").isNotNull,\nvalidation.col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation.col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation.expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation.unique(\"account_id\", \"name\"),\nvalidation.groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#name_1","title":"name","text":"<p>For all values in the <code>name</code> column, we check if they match the regex <code>[A-Z][a-z]+ [A-Z][a-z]+</code>. As we know in the real world, names do not always follow the same pattern, so we allow for an <code>errorThreshold</code> before marking the validation as failed. 
Here, we define the <code>errorThreshold</code> to be <code>0.2</code>, which means, if the error percentage is greater than 20%, then fail the validation. We also append on a helpful description so other developers/users can understand the context of the validation.</p>"},{"location":"setup/guide/scenario/first-data-generation/#balance_1","title":"balance","text":"<p>We check that all <code>balance</code> values are greater than or equal to 0. This time, we have a slightly different <code>errorThreshold</code> as it is set to <code>10</code>, which means, if the number of errors is greater than 10, then fail the validation.</p>"},{"location":"setup/guide/scenario/first-data-generation/#expr","title":"expr","text":"<p>Sometimes, we may need to include the values of multiple columns to validate a certain condition. This is where we can use <code>expr</code> to define a SQL expression that returns a boolean. In this scenario, we are checking if the <code>status</code> column has value <code>closed</code>, then the <code>close_date</code> should be not null, otherwise, <code>close_date</code> is null.</p>"},{"location":"setup/guide/scenario/first-data-generation/#unique","title":"unique","text":"<p>We check whether the combination of <code>account_id</code> and <code>name</code> are unique within the dataset. You can define one or more columns for <code>unique</code> validations.</p>"},{"location":"setup/guide/scenario/first-data-generation/#groupby","title":"groupBy","text":"<p>There may be some business rule that states the number of <code>login_retry</code> should be less than 10 for each account. We can check this via a group by validation where we group by the <code>account_id, name</code>, take the maximum value for <code>login_retry</code> per <code>account_id,name</code> combination, then check if it is less than 10.</p> <p>You can now look to play around with other configurations or data sources to meet your needs. 
Also, make sure to explore the docs further as it can guide you on what can be configured.</p>"},{"location":"setup/guide/scenario/records-per-column/","title":"Multiple Records Per Column","text":"<p>Creating a data generator for a CSV file where there are multiple records per column values.</p>"},{"location":"setup/guide/scenario/records-per-column/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/records-per-column/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/records-per-column/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyMultipleRecordsPerColJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyMultipleRecordsPerColPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyMultipleRecordsPerColJavaPlan extends PlanRun {\n{\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, transactionTask);\n}\n}\n</code></pre> <pre><code>import 
com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyMultipleRecordsPerColPlan extends PlanRun {\n\nval transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"), field.name(\"full_name\").expression(\"#{Name.name}\"), field.name(\"amount\").`type`(DoubleType.instance).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType.instance).min(java.sql.Date.valueOf(\"2022-01-01\")), field.name(\"date\").`type`(DateType.instance).sql(\"DATE(time)\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(config, transactionTask)\n}\n</code></pre>"},{"location":"setup/guide/scenario/records-per-column/#record-count","title":"Record Count","text":"<p>By default, tasks will generate 1000 records. You can alter this value via the <code>count</code> configuration which can be applied to individual tasks. For example, in Scala, <code>csv(...).count(count.records(100))</code> to generate only 100 records.</p>"},{"location":"setup/guide/scenario/records-per-column/#records-per-column","title":"Records Per Column","text":"<p>In this scenario, for a given <code>account_id, full_name</code>, there should be multiple records for it as we want to simulate a customer having multiple transactions. 
We can achieve this through defining the number of records to generate in the <code>count</code> function.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n</code></pre> <p>This will generate <code>1000 * 5 = 5000</code> records as the default number of records is set (1000) and per <code>account_id, full_name</code> from the initial 1000 records, 5 records will be generated.</p>"},{"location":"setup/guide/scenario/records-per-column/#random-records-per-column","title":"Random Records Per Column","text":"<p>Generating 5 records per column is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate for this via defining a random number of records per column.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n</code></pre> <p>Here we set the minimum number of records per column to be 0 and the maximum to 5. This will follow a uniform distribution so the average number of records per account is 2.5. 
We could also define other metadata, just like we did with fields, when defining the generator. For example, we could set <code>standardDeviation</code> and <code>mean</code> for the number of records generated per column to follow a normal distribution.</p>"},{"location":"setup/guide/scenario/records-per-column/#run","title":"Run","text":"<p>Let's try running it.</p> <pre><code>#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyMultipleRecordsPerColJavaPlan or MyMultipleRecordsPerColPlan\n#after completing\nhead docker/sample/customer/transaction/part-00000*\n</code></pre> <p>It should look something like this.</p> <pre><code>ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n</code></pre> <p>You can now play around with other count configurations found here.</p>"},{"location":"setup/validation/basic-validation/","title":"Basic Validations","text":"<p>Run validations on a column to ensure the values adhere to your requirement. You can also define complex validation logic via a SQL expression if needed (see here).</p>"},{"location":"setup/validation/basic-validation/#equal","title":"Equal","text":"<p>Ensure all data in column is equal to certain value. Value can be of any data type. Can use <code>isEqualCol</code> to define SQL expression that can reference other columns.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isEqual(2021),\nvalidation().col(\"year\").isEqualCol(\"YEAR(date)\"),\n</code></pre> <pre><code>validation.col(\"year\").isEqual(2021),\nvalidation.col(\"year\").isEqualCol(\"YEAR(date)\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year == 2021\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-equal","title":"Not Equal","text":"<p>Ensure all data in column is not equal to certain value. 
Value can be of any data type. Can use <code>isNotEqualCol</code> to define SQL expression that can reference other columns.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isNotEqual(2021),\nvalidation().col(\"year\").isNotEqualCol(\"YEAR(date)\"),\n</code></pre> <pre><code>validation.col(\"year\").isNotEqual(2021),\nvalidation.col(\"year\").isNotEqualCol(\"YEAR(date)\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year != 2021\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#null","title":"Null","text":"<p>Ensure all data in column is null.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isNull()\n</code></pre> <pre><code>validation.col(\"year\").isNull\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNULL(year)\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-null","title":"Not Null","text":"<p>Ensure all data in column is not null.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isNotNull()\n</code></pre> <pre><code>validation.col(\"year\").isNotNull\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNOTNULL(year)\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#contains","title":"Contains","text":"<p>Ensure all data in column contains certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"name\").contains(\"peter\")\n</code></pre> <pre><code>validation.col(\"name\").contains(\"peter\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"CONTAINS(name, 'peter')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-contains","title":"Not Contains","text":"<p>Ensure all data in column does not contain certain string. 
Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"name\").notContains(\"peter\")\n</code></pre> <pre><code>validation.col(\"name\").notContains(\"peter\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!CONTAINS(name, 'peter')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#unique","title":"Unique","text":"<p>Ensure all data in column is unique.</p> JavaScalaYAML <pre><code>validation().unique(\"account_id\", \"name\")\n</code></pre> <pre><code>validation.unique(\"account_id\", \"name\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- unique: [\"account_id\", \"name\"]\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than","title":"Less Than","text":"<p>Ensure all data in column is less than certain value. Can use <code>lessThanCol</code> to define SQL expression that can reference  other columns.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").lessThan(100),\nvalidation().col(\"amount\").lessThanCol(\"balance + 1\"),\n</code></pre> <pre><code>validation.col(\"amount\").lessThan(100),\nvalidation.col(\"amount\").lessThanCol(\"balance + 1\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &lt; 100\"\n- expr: \"amount &lt; balance + 1\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than-or-equal","title":"Less Than Or Equal","text":"<p>Ensure all data in column is less than or equal to certain value. 
Can use <code>lessThanOrEqualCol</code> to define SQL expression that can reference other columns.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").lessThanOrEqual(100),\nvalidation().col(\"amount\").lessThanOrEqualCol(\"balance + 1\"),\n</code></pre> <pre><code>validation.col(\"amount\").lessThanOrEqual(100),\nvalidation.col(\"amount\").lessThanOrEqualCol(\"balance + 1\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &lt;= 100\"\n- expr: \"amount &lt;= balance + 1\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than","title":"Greater Than","text":"<p>Ensure all data in column is greater than certain value. Can use <code>greaterThanCol</code> to define SQL expression that can reference other columns.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").greaterThan(100),\nvalidation().col(\"amount\").greaterThanCol(\"balance\"),\n</code></pre> <pre><code>validation.col(\"amount\").greaterThan(100),\nvalidation.col(\"amount\").greaterThanCol(\"balance\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &gt; 100\"\n- expr: \"amount &gt; balance\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than-or-equal","title":"Greater Than Or Equal","text":"<p>Ensure all data in column is greater than or equal to certain value. 
Can use <code>greaterThanOrEqualCol</code> to define SQL  expression that can reference other columns.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").greaterThanOrEqual(100),\nvalidation().col(\"amount\").greaterThanOrEqualCol(\"balance\"),\n</code></pre> <pre><code>validation.col(\"amount\").greaterThanOrEqual(100),\nvalidation.col(\"amount\").greaterThanOrEqualCol(\"balance\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &gt;= 100\"\n- expr: \"amount &gt;= balance\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#between","title":"Between","text":"<p>Ensure all data in column is between two values. Can use <code>betweenCol</code> to define SQL expression that references other  columns.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").between(100, 200),\nvalidation().col(\"amount\").betweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n</code></pre> <pre><code>validation.col(\"amount\").between(100, 200),\nvalidation.col(\"amount\").betweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount BETWEEN 100 AND 200\"\n- expr: \"amount BETWEEN balance * 0.9 AND balance * 1.1\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-between","title":"Not Between","text":"<p>Ensure all data in column is not between two values. 
Can use <code>notBetweenCol</code> to define SQL expression that references other columns.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").notBetween(100, 200),\nvalidation().col(\"amount\").notBetweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n</code></pre> <pre><code>validation.col(\"amount\").notBetween(100, 200),\nvalidation.col(\"amount\").notBetweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount NOT BETWEEN 100 AND 200\"\n- expr: \"amount NOT BETWEEN balance * 0.9 AND balance * 1.1\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#in","title":"In","text":"<p>Ensure all data in column is in set of defined values.</p> JavaScalaYAML <pre><code>validation().col(\"status\").in(\"open\", \"closed\")\n</code></pre> <pre><code>validation.col(\"status\").in(\"open\", \"closed\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"status IN ('open', 'closed')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#matches","title":"Matches","text":"<p>Ensure all data in column matches certain regular expression.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").matches(\"ACC[0-9]{8}\")\n</code></pre> <pre><code>validation.col(\"account_id\").matches(\"ACC[0-9]{8}\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"REGEXP(account_id, 'ACC[0-9]{8}')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-matches","title":"Not Matches","text":"<p>Ensure all data in column does not match certain regular expression.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").notMatches(\"^acc.*\")\n</code></pre> <pre><code>validation.col(\"account_id\").notMatches(\"^acc.*\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!REGEXP(account_id, 
'^acc.*')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#starts-with","title":"Starts With","text":"<p>Ensure all data in column starts with certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").startsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").startsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"STARTSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-starts-with","title":"Not Starts With","text":"<p>Ensure all data in column does not start with certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").notStartsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").notStartsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!STARTSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#ends-with","title":"Ends With","text":"<p>Ensure all data in column ends with certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").endsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").endsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ENDSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-ends-with","title":"Not Ends With","text":"<p>Ensure all data in column does not end with certain string. 
Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").notEndsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").notEndsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!ENDSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#size","title":"Size","text":"<p>Ensure all data in column has certain size. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").size(5)\n</code></pre> <pre><code>validation.col(\"transactions\").size(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) == 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-size","title":"Not Size","text":"<p>Ensure all data in column does not have certain size. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").notSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").notSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) != 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than-size","title":"Less Than Size","text":"<p>Ensure all data in column has size less than certain value. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").lessThanSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").lessThanSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &lt; 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than-or-equal-size","title":"Less Than Or Equal Size","text":"<p>Ensure all data in column has size less than or equal to certain value. 
Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").lessThanOrEqualSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").lessThanOrEqualSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &lt;= 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than-size","title":"Greater Than Size","text":"<p>Ensure all data in column has size greater than certain value. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").greaterThanSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").greaterThanSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &gt; 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than-or-equal-size","title":"Greater Than Or Equal Size","text":"<p>Ensure all data in column has size greater than or equal to certain value. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").greaterThanOrEqualSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").greaterThanOrEqualSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &gt;= 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#luhn-check","title":"Luhn Check","text":"<p>Ensure all data in column passes luhn check. 
Luhn check is used to validate credit card numbers and certain identification numbers (see here for more details).</p> JavaScalaYAML <pre><code>validation().col(\"credit_card\").luhnCheck()\n</code></pre> <pre><code>validation.col(\"credit_card\").luhnCheck\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"LUHN_CHECK(credit_card)\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#has-type","title":"Has Type","text":"<p>Ensure all data in column has certain data type.</p> JavaScalaYAML <pre><code>validation().col(\"id\").hasType(\"string\")\n</code></pre> <pre><code>validation.col(\"id\").hasType(\"string\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"TYPEOF(id) == 'string'\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#expression","title":"Expression","text":"<p>Ensure all data in column adheres to SQL expression defined that returns back a boolean. 
You can define complex logic in here that could combine multiple columns.</p> <p>For example, <code>CASE WHEN status == 'open' THEN balance &gt; 0 ELSE balance == 0 END</code> would check all rows with <code>status</code> open to have <code>balance</code> greater than 0, otherwise, check the <code>balance</code> is 0.</p> JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().expr(\"amount &lt; 100\"),\nvalidation().expr(\"year == 2021\").errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation().expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n);\n\nvar conf = configuration().enableValidation(true);\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validations(\nvalidation.expr(\"amount &lt; 100\"),\nvalidation.expr(\"year == 2021\").errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation.expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n)\n\nval conf = configuration.enableValidation(true)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount &lt; 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1   #equivalent to if error percentage is &gt; 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200   #equivalent to if number of errors is &gt; 200, then fail\ndescription: \"Should be lots of Peters\"\n\n#enableValidation inside application.conf\n</code></pre>"},{"location":"setup/validation/group-by-validation/","title":"Group By Validation","text":"<p>If you want to run aggregations based on a particular set of columns or just the whole dataset, you can do so via group by validations. 
An example would be checking that the sum of <code>amount</code> is less than 1000 per <code>account_id, year</code>. The validations applied can be one of the validations from the basic validation set found here.</p>"},{"location":"setup/validation/group-by-validation/#record-count","title":"Record count","text":"<p>Check the number of records across the whole dataset.</p> JavaScala <pre><code>validation().groupBy().count().lessThan(1000)\n</code></pre> <pre><code>validation.groupBy().count().lessThan(1000)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#record-count-per-group","title":"Record count per group","text":"<p>Check the number of records for each group.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").count().lessThan(10)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").count().lessThan(10)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#sum","title":"Sum","text":"<p>Check the sum of a columns values for each group adheres to validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#count","title":"Count","text":"<p>Check the count for each group adheres to validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#min","title":"Min","text":"<p>Check the min for each group adheres to validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").min(\"amount\").greaterThan(0)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", 
\"year\").min(\"amount\").greaterThan(0)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#max","title":"Max","text":"<p>Check the max for each group adheres to validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#average","title":"Average","text":"<p>Check the average for each group adheres to validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#standard-deviation","title":"Standard deviation","text":"<p>Check the standard deviation for each group adheres to validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").stddev(\"amount\").between(0.5, 0.6)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").stddev(\"amount\").between(0.5, 0.6)\n</code></pre>"},{"location":"setup/validation/upstream-data-source-validation/","title":"Upstream Data Source Validation","text":"<p>If you want to run data validations based on data generated or data from another data source, you can use the upstream data source validations. An example would be generating a Parquet file that gets ingested by a job and inserted into Postgres. The validations can then check for each <code>account_id</code> generated in the Parquet, it exists in <code>account_number</code> column in Postgres. 
The validations can be chained with basic and group by validations, or even other upstream data sources, to cover any complex validations.</p>"},{"location":"setup/validation/upstream-data-source-validation/#basic-join","title":"Basic join","text":"<p>Join across datasets by particular columns. Then run validations on the joined dataset. You will notice that the data source name is prepended to the column names when joined (e.g. <code>my_first_json_customer_details</code>), to ensure column names do not clash and to make it obvious which columns are being validated.</p> <p>In the below example, we check that, for the same <code>account_id</code>, <code>customer_details.name</code> in the <code>my_first_json</code> dataset is equal to the <code>name</code> column in the <code>my_second_json</code> dataset.</p> JavaScala <pre><code>var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(\nvalidation().col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\")\n)\n);\n</code></pre> <pre><code>val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(\nvalidation.col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\")\n)\n)\n</code></pre>"},{"location":"setup/validation/upstream-data-source-validation/#join-expression","title":"Join 
expression","text":"<p>Define join expression to link two datasets together. This can be any SQL expression that returns a boolean value.  Useful in situations where join is based on transformations or complex logic.</p> <p>In the below example, we have to use <code>CONCAT</code> SQL function to combine <code>'ACC'</code> and <code>account_number</code> to join with  <code>account_id</code> column in <code>my_first_json</code> dataset.</p> JavaScala <pre><code>var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask)\n.joinExpr(\"my_first_json_account_id == CONCAT('ACC', account_number)\")\n.withValidation(\nvalidation().col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\")\n)\n);\n</code></pre> <pre><code>val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask)\n.joinExpr(\"my_first_json_account_id == CONCAT('ACC', account_number)\")\n.withValidation(\nvalidation.col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\")\n)\n)\n</code></pre>"},{"location":"setup/validation/upstream-data-source-validation/#different-join-type","title":"Different join type","text":"<p>By default, an outer join is used to gather columns from both datasets together for validation. 
But there may be scenarios where you want to control the join type.</p> <p>Possible join types include:</p> <ul> <li>inner</li> <li>outer, full, fullouter, full_outer</li> <li>leftouter, left, left_outer</li> <li>rightouter, right, right_outer</li> <li>leftsemi, left_semi, semi</li> <li>leftanti, left_anti, anti</li> <li>cross</li> </ul> <p>In the example below, we do an <code>anti</code> join by column <code>account_id</code> and check if there are no records. This essentially checks that all <code>account_id</code>'s from <code>my_second_json</code> exist in <code>my_first_json</code>. The second validation also does something similar but does an <code>outer</code> join (by default) and checks that the joined dataset has 30 records.</p> JavaScala <pre><code>var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.joinType(\"anti\")\n.withValidation(validation().count().isEqual(0)),\nvalidation().upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation().count().isEqual(30))\n);\n</code></pre> <pre><code>val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", 
\"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.joinType(\"anti\")\n.withValidation(validation.count().isEqual(0)),\nvalidation.upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation.count().isEqual(30))\n)\n</code></pre>"},{"location":"setup/validation/upstream-data-source-validation/#join-then-group-by-validation","title":"Join then group by validation","text":"<p>We can apply aggregate or group by validations to the resulting joined dataset as the <code>withValidation</code> method accepts any type of validation.</p> <p>Here we group by <code>account_id, my_first_json_balance</code> to check that when the <code>amount</code> field is summed up per group, it is  between 0.8 and 1.2 times the balance.</p> JavaScala <pre><code>var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"balance\").type(DoubleType.instance()).min(10).max(1000),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask).joinColumns(\"account_id\")\n.withValidation(\nvalidation().groupBy(\"account_id\", \"my_first_json_balance\")\n.sum(\"amount\")\n.betweenCol(\"my_first_json_balance * 0.8\", \"my_first_json_balance * 1.2\")\n)\n);\n</code></pre> <pre><code>val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"balance\").`type`(DoubleType).min(10).max(1000),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", 
\"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask).joinColumns(\"account_id\")\n.withValidation(\nvalidation.groupBy(\"account_id\", \"my_first_json_balance\")\n.sum(\"amount\")\n.betweenCol(\"my_first_json_balance * 0.8\", \"my_first_json_balance * 1.2\")\n)\n)\n</code></pre>"},{"location":"setup/validation/upstream-data-source-validation/#chained-validations","title":"Chained validations","text":"<p>Given that the <code>withValidation</code> method accepts any other type of validation, you can chain other upstream data sources with it. Here we will show a third upstream data source being checked to ensure 30 records exists after joining them  together by <code>account_id</code>.</p> JavaScala <pre><code>var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"balance\").type(DoubleType.instance()).min(10).max(1000),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n)\n.count(count().records(10));\n\nvar thirdJsonTask = json(\"my_third_json\", \"/tmp/data/third_json\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"amount\").type(IntegerType.instance()).min(1).max(100),\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n.count(count().records(10));\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(\nvalidation().upstreamData(thirdJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation().count().isEqual(30))\n)\n);\n</code></pre> <pre><code>val firstJsonTask = json(\"my_first_json\", 
\"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"balance\").`type`(DoubleType).min(10).max(1000),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n.count(count.records(10))\n\nval thirdJsonTask = json(\"my_third_json\", \"/tmp/data/third_json\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"amount\").`type`(IntegerType).min(1).max(100),\nfield.name(\"name\").expression(\"#{Name.name}\"),\n)\n.count(count.records(10))\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask).joinColumns(\"account_id\")\n.withValidation(\nvalidation.groupBy(\"account_id\", \"my_first_json_balance\")\n.sum(\"amount\")\n.betweenCol(\"my_first_json_balance * 0.8\", \"my_first_json_balance * 1.2\")\n),\n)\n</code></pre>"},{"location":"use-case/business-value/","title":"Business Value","text":"<p>Below is a list of the business related benefits from using Data Caterer which may be applicable for your use case.</p> Problem Data Caterer Solution Resources Effects Reliable test data creation - Profile existing data- Create scenarios- Generate data Software Engineers, QA, Testers Cost reduction in labor, more time spent on development, more bugs caught before production Faster development cycles - Generate data in local, test, UAT, pre-prod- Run different scenarios Software Engineers, QA, Testers More defects caught in lower environments, features pushed to production faster, common framework used across all environments Data compliance - Profiling existing data- Generate based on metadata- No complex masking- No production data used in lower environments Audit and compliance No chance for production data breaches Storage costs - Delete generated data- Test specific scenarios Infrastructure Lower data storage costs, less time spent on data management and clean up Schema evolution - Create metadata from data 
sources- Generate data based off fresh metadata Software Engineers, QA, Testers Less time spent altering tests due to schema changes, ease of use between environments and application versions"},{"location":"use-case/comparison/","title":"Comparison to similar tools","text":"<p>I have tried to include all the companies found in the list here from Mostly AI blog post and used information that is publicly available.</p> <p>The companies/products not shown below either have:</p> <ul> <li>a website with insufficient information about the technology side of data generation/validation</li> <li>no/little documentation</li> <li>don't have a free, no sign-up version of their app to use</li> </ul>"},{"location":"use-case/comparison/#data-generation","title":"Data Generation","text":"Tool Description Cost Pros Cons Clearbox AI Python based data generation tool via ML Unclear  Python SDK UI interface Detect private data Report generation  Batch data only No data clean up Limited/no documentation Curiosity Software Platform solution for test data management Unclear  Extensive documentation Generate data based off test cases UI interface Web/API/UI/mobile testing  No quick start No SDK Many components that may not be required No event generation support DataCebo Synthetic Data Vault Python based data generation tool via ML Unclear  Python SDK Report generation Data quality checks Business logic constraints  No data connection support No data clean up No foreign key support Datafaker Realistic data generation library Free  SDK for many languages Simple, easy to use Extensible Open source Generate realistic values  No data connection support No data clean up No validation No foreign key support DBLDatagen Python based data generation tool Free  Python SDK Open source Good documentation Customisable scenarios Customisable column generation Generate from existing data/schemas Plugin third-party libraries  Limited support if issues Code required No data clean up No data validation 
Gatling HTTP API load testing tool Free (Open Source)Gatling Enterprise, usage based, starts from \u20ac89 per month, 1 user, 6.25 hours of testing  Kotlin, Java &amp; Scala SDK Widely used Open source Clear documentation Extensive testing/validation support Customisable scenarios Report generation  Only supports HTTP, JMS and JDBC No data clean up Data feeders not based off metadata Gretel Python based data generation tool via ML Usage based, starts from $295 per month, $2.20 per credit, assumed USD  CLI &amp; Python SDK UI interface Training and re-use of models Detect private data Customisable scenarios  Batch data only No relationships between data sources Only simple foreign key relations defined No data clean up Charge by usage Howso Python based data generation tool via ML Unclear  Python SDK Playground to try Open source library Customisable scenarios  No support for data sources No data validation No data clean up Mostly AI Python based data generation tool via ML Usage based, Enterprise 1 user, 100 columns, 100K rows $3,100 per month, assumed USD  Report generation Non-technical users can use UI Customisable scenarios  Charge by usage Batch data only No data clean up Confusing use of 'smart select' for multiple foreign keys Limited custom column generation logic Multiple deployment components No SDK Octopize Python based data generation tool via ML Unclear  Python &amp; R SDK Report generation API for metadata Customisable scenarios  Input data source is only CSV Multiple manual steps before starting Quickstart is not a quickstart Documentation lacks code examples Synthesized Python based data generation tool via ML Unclear  CLI &amp; Python SDK API for metadata IDE setup Data quality checks  Not sure what is SDK &amp; TDK Charge by usage No report of what was generated No relationships between data sources Tonic Platform solution for generating data Unclear  UI interface Good documentation Detect private data Support for encrypted columns Report 
generation Alerting  Batch data only Multiple deployment components No relationships between data sources No data validation No data clean up No SDK (only API) Difficult to embed complex business logic YData Python based data generation tool via ML. Platform solution as well Unclear  Python SDK Open source Detect private data Compare datasets Report generation  No data connection support Batch data only No data clean up Separate data generation and data validation No foreign key support"},{"location":"use-case/comparison/#use-of-ml-models","title":"Use of ML models","text":"<p>You may notice that the majority of data generators use machine learning (ML) models to learn from your existing datasets to generate new data. Below are some pros and cons to the approach.</p> <p>Pros</p> <ul> <li>Simple setup</li> <li>Ability to reproduce complex logic</li> <li>Flexible to accept all types of data</li> </ul> <p>Cons</p> <ul> <li>Long time for model learning</li> <li>Black box of logic</li> <li>Maintain, store and update of ML models</li> <li>Restriction on input data lengths</li> <li>May not maintain referential integrity</li> <li>Require deeper understanding of ML models for fine-tuning</li> <li>Accuracy may be worse than non-ML models</li> </ul>"},{"location":"use-case/roadmap/","title":"Roadmap","text":"<p>Items below summarise the roadmap of Data Caterer. As each task gets completed, it will be documented and linked.</p> Feature Description Sub Tasks Data source support Batch or real time data sources that can be added to Data Caterer. 
Support data sources that users want - AWS, GCP and Azure related data services ( cloud storage)- Deltalake- RabbitMQ- ActiveMQ- MongoDB- Elasticsearch- Snowflake- Databricks- Pulsar Metadata discovery Allow for schema and data profiling from external metadata sources -  HTTP (OpenAPI spec)- JMS- Read from samples-  OpenLineage metadata (Marquez)-  OpenMetadata- ODCS (Open Data Contract Standard)- Amundsen- Datahub- Solace Event Portal- Airflow- DBT Developer API Scala/Java interface for developers/testers to create data generation and validation tasks -  Scala-  Java Report generation Generate a report that summarises the data generation or validation results -  Report for data generated and validation rules UI portal Allow users to access a UI to input data generation or validation tasks. Also be able to view report results - Metadata stored in database- Store data generation/validation run information in file/database Integration with data validation tools Derive data validation rules from existing data validation tools - Great Expectation- DBT constraints- SodaCL- MonteCarlo Data validation rule suggestions Based on metadata, generate data validation rules appropriate for the dataset -  Suggest basic data validations (yet to document) Wait conditions before data validation Define certain conditions to be met before starting data validations -  Webhook-  File exists-  Data exists via SQL expression-  Pause Validation types Ability to define simple/complex data validations -  Basic validations-  Aggregates (sum of amount per account is &gt; 500)- Ordering (transactions are ordered by date)-  Relationship (at least one account entry in history table per account in accounts table)- Data profile (how close the generated data profile is compared to the expected data profile)- Column name (check column count, column names, ordering) Data generation record count Generate scenarios where there are one to many, many to many situations relating to record count. 
Also ability to cover all edge cases or scenarios - Cover all possible cases (i.e. record for each combination of oneOf values, positive/negative values etc.)- Ability to override edge cases Alerting When tasks have completed, ability to define alerts based on certain conditions - Slack- Email Metadata enhancements Based on data profiling or inference, can add to existing metadata - PII detection (can integrate with Presidio)- Relationship detection across data sources- SQL generation- Ordering information Data cleanup Ability to clean up generated data -  Clean up generated data- Clean up data in consumer data sinks- Clean up data from real time sources (i.e. DELETE HTTP endpoint, delete events in JMS) Trial version Trial version of the full app for users to test out all the features -  Trial app to try out all features Code generation Based on metadata or existing classes, code for data generation and validation could be generated - Code generation- Schema generation from Scala/Java class Real time response data validations Ability to define data validations based on the response from real time data sources (e.g. HTTP response) - HTTP response data validation"},{"location":"use-case/blog/shift-left-data-quality/","title":"Shifting Data Quality Left with Data Catering","text":""},{"location":"use-case/blog/shift-left-data-quality/#empowering-proactive-data-management","title":"Empowering Proactive Data Management","text":"<p>In the ever-evolving landscape of data-driven decision-making, ensuring data quality is non-negotiable. Traditionally, data quality has been a concern addressed late in the development lifecycle, often leading to reactive measures and increased costs. 
However, a paradigm shift is underway with the adoption of a \"shift left\" approach, placing data quality at the forefront of the development process.</p>"},{"location":"use-case/blog/shift-left-data-quality/#today","title":"Today","text":"<pre><code>graph LR\n  subgraph badQualityData[&lt;b&gt;Manually generated data, limited data scenarios&lt;/b&gt;]\n  local[&lt;b&gt;Local&lt;/b&gt;\\nManual test, unit test]\n  dev[&lt;b&gt;Dev&lt;/b&gt;\\nManual test, integration test]\n  stg[&lt;b&gt;Staging&lt;/b&gt;\\nSanity checks]\n  end\n\n  subgraph qualityData[&lt;b&gt;Reliable data, the true test&lt;/b&gt;]\n  prod[&lt;b&gt;Production&lt;/b&gt;\\nData quality checks, monitoring, observability]\n  end\n\n  style badQualityData fill:#d9534f,fill-opacity:0.7\n  style qualityData fill:#5cb85c,fill-opacity:0.7\n\n  local --&gt; dev\n  dev --&gt; stg\n  stg --&gt; prod</code></pre>"},{"location":"use-case/blog/shift-left-data-quality/#with-data-caterer","title":"With Data Caterer","text":"<pre><code>graph LR\n  subgraph qualityData[&lt;b&gt;Reliable data for testing anywhere&lt;br&gt;Common testing tool&lt;/b&gt;]\n  direction LR\n  local[&lt;b&gt;Local&lt;/b&gt;\\nManual test, unit test]\n  dev[&lt;b&gt;Dev&lt;/b&gt;\\nManual test, integration test]\n  stg[&lt;b&gt;Staging&lt;/b&gt;\\nSanity checks]\n  prod[&lt;b&gt;Production&lt;/b&gt;\\nData quality checks, monitoring, observability]\n  end\n\n  style qualityData fill:#5cb85c,fill-opacity:0.7\n\n  local --&gt; dev\n  dev --&gt; stg\n  stg --&gt; prod</code></pre>"},{"location":"use-case/blog/shift-left-data-quality/#understanding-the-shift-left-approach","title":"Understanding the Shift Left Approach","text":"<p>\"Shift left\" is a philosophy that advocates for addressing tasks and concerns earlier in the development lifecycle. Applied to data quality, it means tackling data issues as early as possible, ideally during the development and testing phases. 
This approach aims to catch data anomalies, inaccuracies, or inconsistencies before they propagate through the system, reducing the likelihood of downstream errors.</p>"},{"location":"use-case/blog/shift-left-data-quality/#data-caterer-the-catalyst-for-shifting-left","title":"Data Caterer: The Catalyst for Shifting Left","text":"<p>Enter Data Caterer, a metadata-driven data generation and validation tool designed to empower organizations in shifting data quality left. By incorporating Data Caterer into the early stages of development, teams can proactively test complex data flows, validate data sources, and ensure data quality before it reaches downstream processes.</p>"},{"location":"use-case/blog/shift-left-data-quality/#key-advantages-of-shifting-data-quality-left-with-data-caterer","title":"Key Advantages of Shifting Data Quality Left with Data Caterer","text":"<ol> <li>Early Issue Detection:<ul> <li>Identify data quality issues early in the development process, reducing the risk of errors downstream.</li> </ul> </li> <li>Proactive Validation:<ul> <li>Validate data sources and complex data flows in a simplified manner, promoting a proactive approach to data quality.</li> </ul> </li> <li>Efficient Testing Across Sources:<ul> <li>Seamlessly test data across various sources, including databases, file formats, HTTP, and messaging, all within    your local laptop or development environment.</li> <li>Fast feedback loop to motivate developers to ensure thorough testing of data scenarios.</li> </ul> </li> <li>Integration with Development Pipelines:<ul> <li>Easily integrate Data Caterer as a task in your development pipelines, ensuring that data quality is a continuous    consideration rather than an isolated event.</li> </ul> </li> <li>Integration with Existing Metadata:<ul> <li>By harnessing the power of existing metadata from data catalogs, schema registries, or other data validation tools,   Data Caterer streamlines the process, automating the generation and 
validation of your data effortlessly.</li> </ul> </li> <li>Improved Collaboration:<ul> <li>Facilitate collaboration between developers, testers, and data professionals by providing a common platform for   early data validation.</li> </ul> </li> </ol>"},{"location":"use-case/blog/shift-left-data-quality/#realizing-the-vision-of-proactive-data-quality","title":"Realizing the Vision of Proactive Data Quality","text":"<p>As organizations strive for excellence in their data-driven endeavors, the shift left approach with Data Caterer becomes a strategic imperative. By instilling a proactive data quality culture, teams can minimize the risk of costly errors, enhance the reliability of their data, and streamline the entire development lifecycle.</p> <p>In conclusion, the marriage of the shift left philosophy and Data Caterer brings forth a new era of data management, where data quality is not just a final checkpoint but an integral part of every development milestone. Embrace the shift left approach with Data Caterer and empower your teams to build robust, high-quality data solutions from the very beginning.</p> <p>Shift Left, Validate Early, and Accelerate with Data Caterer.</p>"}]}
\ No newline at end of file
diff --git a/site/sitemap.xml b/site/sitemap.xml
index 644582a2..23afcc7c 100644
--- a/site/sitemap.xml
+++ b/site/sitemap.xml
@@ -2,192 +2,192 @@
 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
     <url>
          <loc>https://data.catering/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/about/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/sponsor/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/use-case/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/get-started/docker/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/legal/privacy-policy/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/legal/terms-of-service/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/advanced/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/configuration/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/connection/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/deployment/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/design/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/foreign-key/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/validation/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/generator/count/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/generator/data-generator/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/generator/report/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/data-source/cassandra/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/data-source/http/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/data-source/kafka/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/data-source/marquez-metadata-source/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/data-source/open-metadata-source/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/data-source/solace/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/scenario/auto-generate-connection/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/scenario/batch-and-event/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/scenario/data-validation/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/scenario/delete-generated-data/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/scenario/first-data-generation/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/scenario/records-per-column/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/validation/basic-validation/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/validation/group-by-validation/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/validation/upstream-data-source-validation/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/use-case/business-value/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/use-case/comparison/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/use-case/roadmap/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/use-case/blog/shift-left-data-quality/</loc>
-         <lastmod>2023-11-29</lastmod>
+         <lastmod>2023-11-30</lastmod>
          <changefreq>daily</changefreq>
     </url>
 </urlset>
\ No newline at end of file
diff --git a/site/sitemap.xml.gz b/site/sitemap.xml.gz
index 66f807f4..3f66b257 100644
Binary files a/site/sitemap.xml.gz and b/site/sitemap.xml.gz differ
diff --git a/site/sponsor/index.html b/site/sponsor/index.html
index 92b21aef..18fc07dc 100644
--- a/site/sponsor/index.html
+++ b/site/sponsor/index.html
@@ -2050,6 +2050,8 @@ <h1 id="sponsor">Sponsor</h1>
 maintaining, documenting and updating it. I hope that it will help with developers and companies with their testing 
 by saving time and effort, allowing you to focus on what is important. If you fall under this boat, please consider
 sponsorship to allow me to further maintain and upgrade the solution. Any contributions are much appreciated.</p>
+<p>If you want to use this project for open source applications, <a href="#contact">please contact me</a>, as I would be 
+happy to contribute.</p>
 <p>This is inspired by the <a href="https://github.com/squidfunk/mkdocs-material">mkdocs-material project</a> that
 <a href="https://squidfunk.github.io/mkdocs-material/insiders/">follows the same model</a>.</p>
 <h2 id="features">Features</h2>
diff --git a/site/use-case/blog/shift-left-data-quality/index.html b/site/use-case/blog/shift-left-data-quality/index.html
index 33026b44..f63de390 100644
--- a/site/use-case/blog/shift-left-data-quality/index.html
+++ b/site/use-case/blog/shift-left-data-quality/index.html
@@ -2150,7 +2150,7 @@ <h2 id="empowering-proactive-data-management">Empowering Proactive Data Manageme
 quality at the forefront of the development process.</p>
 <h3 id="today">Today</h3>
 <pre class="mermaid"><code>graph LR
-  subgraph badQualityData[&lt;b&gt;Manually generated data, data quality always passes&lt;/b&gt;]
+  subgraph badQualityData[&lt;b&gt;Manually generated data, limited data scenarios&lt;/b&gt;]
   local[&lt;b&gt;Local&lt;/b&gt;\nManual test, unit test]
   dev[&lt;b&gt;Dev&lt;/b&gt;\nManual test, integration test]
   stg[&lt;b&gt;Staging&lt;/b&gt;\nSanity checks]
@@ -2168,7 +2168,7 @@ <h3 id="today">Today</h3>
   stg --&gt; prod</code></pre>
 <h3 id="with-data-caterer">With Data Caterer</h3>
 <pre class="mermaid"><code>graph LR
-  subgraph qualityData[&lt;b&gt;Reliable data for testing anywhere&lt;/b&gt;]
+  subgraph qualityData[&lt;b&gt;Reliable data for testing anywhere&lt;br&gt;Common testing tool&lt;/b&gt;]
   direction LR
   local[&lt;b&gt;Local&lt;/b&gt;\nManual test, unit test]
   dev[&lt;b&gt;Dev&lt;/b&gt;\nManual test, integration test]
@@ -2219,7 +2219,6 @@ <h2 id="key-advantages-of-shifting-data-quality-left-with-data-caterer">Key Adva
 <li><strong>Improved Collaboration:</strong><ul>
 <li>Facilitate collaboration between developers, testers, and data professionals by providing a common platform for
   early data validation.</li>
-<li>No need to rely on seeking domain expertise or external teams for data testing.</li>
 </ul>
 </li>
 </ol>
diff --git a/site/use-case/roadmap/index.html b/site/use-case/roadmap/index.html
index b38c9cda..0388fb75 100644
--- a/site/use-case/roadmap/index.html
+++ b/site/use-case/roadmap/index.html
@@ -2058,7 +2058,7 @@ <h1 id="roadmap">Roadmap</h1>
 <tr>
 <td>Validation types</td>
 <td>Ability to define simple/complex data validations</td>
-<td>- <img alt="✅" class="twemoji" src="https://cdn.jsdelivr.net/gh/jdecked/twemoji@14.1.2/assets/svg/2705.svg" title=":white_check_mark:" /> <a href="../../setup/validation/basic-validation/">Basic validations</a><br>- <img alt="✅" class="twemoji" src="https://cdn.jsdelivr.net/gh/jdecked/twemoji@14.1.2/assets/svg/2705.svg" title=":white_check_mark:" /> <a href="../../setup/validation/group-by-validation/">Aggregates</a> (sum of amount per account is &gt; 500)<br>- Ordering (transactions are ordered by date)<br>- <img alt="✅" class="twemoji" src="https://cdn.jsdelivr.net/gh/jdecked/twemoji@14.1.2/assets/svg/2705.svg" title=":white_check_mark:" /> <a href="../../setup/validation/upstream-data-source-validation/">Relationship</a> (at least one account entry in history table per account in accounts table)<br>- Data profile (how close the generated data profile is compared to the expected data profile)</td>
+<td>- <img alt="✅" class="twemoji" src="https://cdn.jsdelivr.net/gh/jdecked/twemoji@14.1.2/assets/svg/2705.svg" title=":white_check_mark:" /> <a href="../../setup/validation/basic-validation/">Basic validations</a><br>- <img alt="✅" class="twemoji" src="https://cdn.jsdelivr.net/gh/jdecked/twemoji@14.1.2/assets/svg/2705.svg" title=":white_check_mark:" /> <a href="../../setup/validation/group-by-validation/">Aggregates</a> (sum of amount per account is &gt; 500)<br>- Ordering (transactions are ordered by date)<br>- <img alt="✅" class="twemoji" src="https://cdn.jsdelivr.net/gh/jdecked/twemoji@14.1.2/assets/svg/2705.svg" title=":white_check_mark:" /> <a href="../../setup/validation/upstream-data-source-validation/">Relationship</a> (at least one account entry in history table per account in accounts table)<br>- Data profile (how close the generated data profile is compared to the expected data profile)<br>- Column name (check column count, column names, ordering)</td>
 </tr>
 <tr>
 <td>Data generation record count</td>