diff --git a/docs/get-started/docker.md b/docs/get-started/docker.md index 67b7e012..5dea1a87 100644 --- a/docs/get-started/docker.md +++ b/docs/get-started/docker.md @@ -8,14 +8,14 @@ image: "https://data.catering/diagrams/logo/data_catering_logo.svg" ## Quick start -1. [Mac download (Coming soon)](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-mac.zip) -2. [Windows download (Coming soon)](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-windows.zip) - 1. After downloaded, go to 'Downloads' folder and 'Extract All' from data-caterer-windows - 2. Double-click 'DataCaterer-1.0.0' to install Data Caterer - 3. Click on 'More info' then at the bottom, click 'Run anyway' - 4. Go to '/Program Files/DataCaterer' folder and run DataCaterer application - 5. If your browser doesn't open, go to [http://localhost:9898](http://localhost:9898) in your preferred browser -3. [Linux download (Coming soon)](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-linux.zip) +1. [Mac download (Coming soon)]() +2. [Windows download (Coming soon)]() + 1. After downloaded, go to 'Downloads' folder and 'Extract All' from data-caterer-windows + 2. Double-click 'DataCaterer-1.0.0' to install Data Caterer + 3. Click on 'More info' then at the bottom, click 'Run anyway' + 4. Go to '/Program Files/DataCaterer' folder and run DataCaterer application + 5. If your browser doesn't open, go to [http://localhost:9898](http://localhost:9898) in your preferred browser +3. [Linux download (Coming soon)]() 4. Docker ```shell docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.7.0 diff --git a/site/get-started/docker/index.html b/site/get-started/docker/index.html index 9438b50c..37331ee8 100644 --- a/site/get-started/docker/index.html +++ b/site/get-started/docker/index.html @@ -2172,14 +2172,16 @@
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.7.0
Try now
Data testing is difficult and fragmented
What you need is a reliable tool that can handle changes to your data landscape
With Data Caterer, you get:
"},{"location":"#tech-summary","title":"Tech Summary","text":"Use the Java or Scala API, or YAML files, for setup and customisation, all of which run via a Docker image. Want to get into the details? Check out the setup pages here for code examples and guides that take you through scenarios and data sources.
Main features include:
Check other run configurations here.
"},{"location":"#what-is-it","title":"What is it","text":"Data generation and testing tool
Generate synthetic production-like data to be consumed and validated.
Designed for any data source
We aim to support pushing data to any data source, in any format.
Low/no code solution
You can use the tool via Scala, Java or YAML. Connect to data or metadata sources to generate data and validate.
Developer productivity tool
Whether you are a new developer or a seasoned veteran, cut down on your feedback loop when developing with data.
Metadata storage/platform
You could store and use metadata within the data generation/validation tasks, but this is not the recommended approach. Rather, this metadata should be gathered from existing services that handle metadata on behalf of Data Caterer.
Data contract
The focus of Data Caterer is on data generation and testing, which can include details about what the data looks like and how it behaves. It does not encompass all the additional metadata that comes with a data contract, such as SLAs, security, etc.
Metrics from load testing
Although millions of records can be generated, there are limited capabilities in terms of metric capturing.
Data Catering vs Other tools vs In-house

| | Data Catering | Other tools | In-house |
|---|---|---|---|
| Data flow | Batch and events generation with validation | Batch generation only or validation only | Depends on architecture and design |
| Time to results | 1 day | 1+ month to integrate, deploy and onboard | 1+ month to build and deploy |
| Solution | Connect with your existing data ecosystem, automatic generation and validation | Manual UI data entry or via SDK | Depends on engineer(s) building it |
"},{"location":"about/","title":"About","text":"Hi, my name is Peter. I am a Software Developer, mainly focussing on data related services. My experience can be found on my LinkedIn.
I have created Data Caterer to help serve individuals and companies with data generation and data testing. It is a complex area with many edge cases and intricacies that are hard to summarise or turn into something actionable and repeatable. Through the use of metadata, Data Caterer can help simplify your data testing, simulate production environment data, aid in data debugging, or support whatever your data use case may be.
Given that it is going to save you and your team time and money, please consider providing financial support. This will help the product grow into a sustainable and feature-full service.
"},{"location":"about/#contact","title":"Contact","text":"Please contact Peter Flook via Slack or via email peter.flook@data.catering
if you have any questions or queries.
To have access to all the features of Data Caterer, you can subscribe according to your situation. You will not be charged by usage. As you continue to subscribe, you will have access to the latest version of Data Caterer as new bug fixes and features get published.
This has been a passion project of mine where I have spent countless hours thinking of the idea, implementing, maintaining, documenting and updating it. I hope that it will help developers and companies with their testing by saving time and effort, allowing you to focus on what is important. If you are in this boat, please consider sponsorship to allow me to further maintain and upgrade the solution. Any contributions are much appreciated.
If you want to use this project for open source applications, please contact me, as I would be happy to contribute.
This is inspired by the mkdocs-material project that follows the same model.
"},{"location":"sponsor/#features","title":"Features","text":"Manage via this link
"},{"location":"sponsor/#contact","title":"Contact","text":"Please contact Peter Flook via Slack or via email peter.flook@data.catering
if you have any questions or queries.
Having a stable and reliable test environment is a challenge for a number of companies, especially where teams are asynchronously deploying and testing changes at faster rates. Data Caterer can help alleviate these issues by doing the following:
Similar to the above, being able to replicate production-like data in your local environment can be key to developing more reliable code, as you can test directly against data on your local computer. This has a number of benefits including:
When working with third-party, external or internal data providers, it can be difficult to have everything set up to produce reliable data that abides by the relationship contracts between each of the systems. You have to rely on these data providers to run your tests, which may not align with their priorities. With Data Caterer, you can generate the same data that they would produce while maintaining referential integrity across the data providers, so that you can run your tests without relying on their systems being up and reliable in their corresponding lower environments.
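As a rough illustration of what maintaining referential integrity across data providers means, the Java sketch below generates accounts first and then builds transactions that only ever reference those accounts. It is not Data Caterer's implementation or API: the class `ReferentialIntegritySketch`, the `Account` and `Transaction` types, and the `ACC%08d` identifier format are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Illustrative sketch only; this is not how Data Caterer is implemented.
final class ReferentialIntegritySketch {

    // Hypothetical record produced by an "accounts" data provider.
    static final class Account {
        final String accountId;

        Account(String accountId) {
            this.accountId = accountId;
        }
    }

    // Hypothetical record produced by a "transactions" data provider.
    static final class Transaction {
        final String accountId;
        final double amount;

        Transaction(String accountId, double amount) {
            this.accountId = accountId;
            this.amount = amount;
        }
    }

    // Generate accounts with unique, predictable identifiers.
    static List<Account> generateAccounts(int count) {
        List<Account> accounts = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            accounts.add(new Account(String.format("ACC%08d", i)));
        }
        return accounts;
    }

    // Every transaction takes its accountId from an already generated account.
    // The accounts list must not be empty.
    static List<Transaction> generateTransactions(List<Account> accounts, int count, long seed) {
        Random random = new Random(seed);
        List<Transaction> transactions = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            Account owner = accounts.get(random.nextInt(accounts.size()));
            transactions.add(new Transaction(owner.accountId, random.nextDouble() * 100.0));
        }
        return transactions;
    }

    // True when every transaction refers to a known account.
    static boolean hasReferentialIntegrity(List<Account> accounts, List<Transaction> transactions) {
        Set<String> knownIds = new HashSet<>();
        for (Account account : accounts) {
            knownIds.add(account.accountId);
        }
        for (Transaction transaction : transactions) {
            if (!knownIds.contains(transaction.accountId)) {
                return false;
            }
        }
        return true;
    }
}
```

Generating the referenced side first and sampling from it is the simplest way to keep the two data sets consistent; a check such as `hasReferentialIntegrity` can then be run against whatever the consuming system stored.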
"},{"location":"use-case/#scenario-testing","title":"Scenario testing","text":"If you want to set up particular data scenarios, you can customise the generated data to fit your scenario. Once the data gets generated and is consumed, you can also run validations to ensure your system has consumed the data correctly. These scenarios can be put together from existing tasks or data sources can be enabled/disabled based on your requirement. Built into Data Caterer and controlled via feature flags, is the ability to test edge cases based on the data type of the fields used for data generation (enableEdgeCases
flag within <field>.generator.options
, see more here).
When data related issues occur in production, it may be difficult to replicate them in a lower or local environment. The cause could be specific fields not containing expected values, the data being too large, or corresponding referenced data being missing. Replicating the exact data scenario is key to resolving the issue, as you can code directly against it and have confidence that your code changes will fix the problem. Data Caterer can be used to generate the appropriate data in whichever environment you want to test your changes against.
"},{"location":"use-case/#data-profiling","title":"Data profiling","text":"When using Data Caterer with the feature flag enableGeneratePlanAndTasks
enabled (see here), metadata relating to all the fields defined in the data sources you have configured will be generated via data profiling. You can run this as a standalone job (you can disable enableGenerateData
) so that you can focus on the profile of the data you are utilising. This can be run against your production data sources to ensure the metadata can be used to accurately generate data in other environments. This is a key feature of Data Caterer as no direct production connections need to be maintained to generate data in other environments (which can lead to serious concerns about data security as seen here).
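To give a feel for the kind of metadata that profiling a field can yield, here is a minimal Java sketch that summarises one numeric column. It is only an illustration under stated assumptions: `ColumnProfileSketch` and its fields are hypothetical and do not reflect the metadata format or the profiling logic that Data Caterer actually uses.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; the real profiling output of Data Caterer may differ.
final class ColumnProfileSketch {
    final long count;         // total number of values, including nulls
    final long nullCount;     // number of null values
    final long distinctCount; // number of distinct non-null values
    final double min;         // smallest non-null value
    final double max;         // largest non-null value

    private ColumnProfileSketch(long count, long nullCount, long distinctCount, double min, double max) {
        this.count = count;
        this.nullCount = nullCount;
        this.distinctCount = distinctCount;
        this.min = min;
        this.max = max;
    }

    // Single pass over the column to collect summary statistics.
    // For a column without non-null values, min stays +Infinity and max stays -Infinity.
    static ColumnProfileSketch profile(List<Double> values) {
        long nulls = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        Set<Double> distinct = new HashSet<>();
        for (Double value : values) {
            if (value == null) {
                nulls++;
                continue;
            }
            distinct.add(value);
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        return new ColumnProfileSketch(values.size(), nulls, distinct.size(), min, max);
    }
}
```

Only the summary leaves the source system, which is the point made above: data in other environments can be generated from the profile without keeping a connection to production.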
When using Data Caterer with the feature flag enableGeneratePlanAndTasks
enabled (see here), all schemas of the data sources defined will be tracked in a common format (as tasks). This data, along with the data profiling metadata, could then feed back into your schema registries to help keep them up to date with your system.
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.7.0\n
Open localhost:9898.
git clone git@github.com:data-catering/data-caterer-example.git\ncd data-caterer-example && ./run.sh\n#check results under docker/sample/report/index.html folder\n
"},{"location":"get-started/docker/#report","title":"Report","text":"Check the report generated under docker/data/custom/report/index.html
.
A sample report can also be seen here.
"},{"location":"get-started/docker/#paid-version-trial","title":"Paid Version Trial","text":"A 30-day trial of the paid version can be accessed via these steps:
/token
in the Slack group (will only be visible to you)
git clone git@github.com:data-catering/data-caterer-example.git\ncd data-caterer-example && export DATA_CATERING_API_KEY=<insert api key>\n./run.sh\n
If you want to check how long your trial has left, you can check back in the Slack group or type /token
again.
Check out the starter guide here that will take you through step by step. You can also check the other guides here to see what else Data Caterer can achieve for you.
"},{"location":"legal/privacy-policy/","title":"Privacy Policy","text":"Last updated September 25, 2023
"},{"location":"legal/privacy-policy/#data-caterer-policy-on-privacy-of-customer-personal-information","title":"Data Caterer Policy on Privacy of Customer Personal Information","text":"Peter John Flook is committed to protecting the privacy and security of your personal information obtained by reason of your use of Data Caterer. This policy explains the types of customer personal information we collect, how it is used, and the steps we take to ensure your personal information is handled appropriately.
"},{"location":"legal/privacy-policy/#who-is-peter-john-flook","title":"Who is Peter John Flook?","text":"For purposes of this Privacy Policy, \u201cPeter John Flook\u201d means Peter John Flook, the company developing and providing Data Caterer and related websites and services.
"},{"location":"legal/privacy-policy/#what-is-personal-information","title":"What is personal information?","text":"Personal information is information that refers to an individual specifically and is recorded in any form. Personal information includes such things as age, income, date of birth, ethnic origin and credit records. Information about individuals contained in the following documents is not considered personal information:
Peter John Flook is responsible for all personal information under its control. Our team is accountable for compliance with these privacy and security principles.
"},{"location":"legal/privacy-policy/#we-let-you-know-why-we-collect-and-use-your-personal-information-and-get-your-consent","title":"We let you know why we collect and use your personal information and get your consent","text":"Peter John Flook identifies the purpose for which your personal information is collected and will be used or disclosed. If that purpose is not listed below we will do this before or at the time the information is actually being collected. You will be deemed to consent to our use of your personal information for the purpose of:
Otherwise, Peter John Flook will obtain your express consent (by verbal, written or electronic agreement) to collect, use or disclose your personal information. You can change your consent preferences at any time by contacting Peter John Flook (please refer to the \u201cHow to contact us\u201d section below).
"},{"location":"legal/privacy-policy/#we-limit-collection-of-your-personal-information","title":"We limit collection of your personal information","text":"Peter John Flook collects only the information required to provide products and services to you. Peter John Flook will collect personal information only by clear, fair and lawful means.
We receive and store any information you enter on our website or give us in any other way. You can choose not to provide certain information, but then you might not be able to take advantage of many of our features.
Peter John Flook does not receive or store personal content saved to your local device while using Data Caterer.
We also receive and store certain types of information whenever you interact with us.
"},{"location":"legal/privacy-policy/#information-provided-to-stripe","title":"Information provided to Stripe","text":"All purchases that are made through this site are processed securely and externally by Stripe. Unless you expressly consent otherwise, we do not see or have access to any personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address).
"},{"location":"legal/privacy-policy/#we-limit-disclosure-and-retention-of-your-personal-information","title":"We limit disclosure and retention of your personal information","text":"Peter John Flook does not disclose personal information to any organization or person for any reason except the following:
We employ other companies and individuals to perform functions on our behalf. Examples include fulfilling orders, delivering packages, sending postal mail and e-mail, removing repetitive information from customer lists, analyzing data, providing marketing assistance, processing credit card payments, and providing customer service. They have access to personal information needed to perform their functions, but may not use it for other purposes. We may use service providers located outside of Australia, and, if applicable, your personal information may be processed and stored in other countries and therefore may be subject to disclosure under the laws of those countries. As we continue to develop our business, we might sell or buy stores, subsidiaries, or business units. In such transactions, customer information generally is one of the transferred business assets but remains subject to the promises made in any pre-existing Privacy Notice (unless, of course, the customer consents otherwise). Also, in the unlikely event that Peter John Flook or substantially all of its assets are acquired, customer information of course will be one of the transferred assets. You are deemed to consent to disclosure of your personal information for those purposes. If your personal information is shared with third parties, those third parties are bound by appropriate agreements with Peter John Flook to secure and protect the confidentiality of your personal information.
Peter John Flook retains your personal information only as long as it is required for our business relationship or as required by federal and provincial laws.
"},{"location":"legal/privacy-policy/#we-keep-your-personal-information-up-to-date-and-accurate","title":"We keep your personal information up to date and accurate","text":"Peter John Flook keeps your personal information up to date, accurate and relevant for its intended use.
You may request access to the personal information we have on record in order to review and amend the information, as appropriate. In circumstances where your personal information has been provided by a third party, we will refer you to that party (e.g. credit bureaus). To access your personal information, refer to the \u201cHow to contact us\u201d section below.
"},{"location":"legal/privacy-policy/#the-security-of-your-personal-information-is-a-priority-for-peter-john-flook","title":"The security of your personal information is a priority for Peter John Flook","text":"We take steps to safeguard your personal information, regardless of the format in which it is held, including:
- physical security measures such as restricted access facilities and locked filing cabinets
- electronic security measures for computerized personal information such as password protection, database encryption and personal identification numbers. We work to protect the security of your information during transmission by using \u201cTransport Layer Security\u201d (TLS) protocol.
- organizational processes such as limiting access to your personal information to a selected group of individuals
- contractual obligations with third parties who need access to your personal information requiring them to protect and secure your personal information

It\u2019s important for you to protect against unauthorized access to your password and your computer. Be sure to sign off when you\u2019ve finished using any shared computer.
"},{"location":"legal/privacy-policy/#what-about-third-party-advertisers-and-links-to-other-websites","title":"What About Third-Party Advertisers and Links to Other Websites?","text":"Our site may include third-party advertising and links to other websites. We do not provide any personally identifiable customer information to these advertisers or third-party websites.
These third-party websites and advertisers, or Internet advertising companies working on their behalf, sometimes use technology to send (or \u201cserve\u201d) the advertisements that appear on our website directly to your browser. They automatically receive your IP address when this happens. They may also use cookies, JavaScript, web beacons (also known as action tags or single-pixel gifs), and other technologies to measure the effectiveness of their ads and to personalize advertising content. We do not have access to or control over cookies or other features that they may use, and the information practices of these advertisers and third-party websites are not covered by this Privacy Notice. Please contact them directly for more information about their privacy practices. In addition, the Network Advertising Initiative offers useful information about Internet advertising companies (also called \u201cad networks\u201d or \u201cnetwork advertisers\u201d), including information about how to opt-out of their information collection. You can access the Network Advertising Initiative at http://www.networkadvertising.org.
"},{"location":"legal/privacy-policy/#redirection-to-stripe","title":"Redirection to Stripe","text":"In particular, when you submit an order to us, you may be automatically redirected to Stripe in order to complete the required payment. The payment page that is provided by Stripe is not part of this site. As noted above, we are not privy to any of the bank account, credit card or other personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address). We recommend that you refer to Stripe\u2019s privacy statement if you would like more information about how Stripe collects and handles your personal information.
"},{"location":"legal/privacy-policy/#we-are-open-about-our-privacy-and-security-policy","title":"We are open about our privacy and security policy","text":"We are committed to providing you with understandable and easily available information about our policy and practices related to management of your personal information. This policy and any related information is available at all times on our website, https://data.catering/about/ under Privacy or on request. To contact us, refer to the \u201cHow to contact us\u201d section below.
"},{"location":"legal/privacy-policy/#we-provide-access-to-your-personal-information-stored-by-peter-john-flook","title":"We provide access to your personal information stored by Peter John Flook","text":"You can request access to your personal information stored by Peter John Flook. To contact us, refer to the \u201cHow to contact us\u201d section below. Upon receiving such a request, Peter John Flook will:
- inform you about what type of personal information we have on record or in our control, how it is used and to whom it may have been disclosed
- provide you with access to your information so you can review and verify the accuracy and completeness and request changes to the information
- make any necessary updates to your personal information

We respond to your questions, concerns and complaints about privacy

Peter John Flook responds in a timely manner to your questions, concerns and complaints about the privacy of your personal information and our privacy policies and procedures.
"},{"location":"legal/privacy-policy/#how-to-contact-us","title":"How to contact us","text":"peter.flook@data.catering
Our business changes constantly, and this privacy notice will change also. We may e-mail periodic reminders of our notices and conditions, unless you have instructed us not to, but you should check our website frequently to see recent changes. We are, however, committed to protecting your information and will never materially change our policies and practices to make them less protective of customer information collected in the past without the consent of affected customers.
"},{"location":"legal/terms-of-service/","title":"Terms and Conditions","text":"Last updated: September 25, 2023
Please read these terms and conditions carefully before using Our Service.
"},{"location":"legal/terms-of-service/#interpretation-and-definitions","title":"Interpretation and Definitions","text":""},{"location":"legal/terms-of-service/#interpretation","title":"Interpretation","text":"The words of which the initial letter is capitalized have meanings defined under the following conditions. The following definitions shall have the same meaning regardless of whether they appear in singular or in plural.
"},{"location":"legal/terms-of-service/#definitions","title":"Definitions","text":"For the purposes of these Terms and Conditions:
These are the Terms and Conditions governing the use of this Service and the agreement that operates between You and the Company. These Terms and Conditions set out the rights and obligations of all users regarding the use of the Service.
Your access to and use of the Service is conditioned on Your acceptance of and compliance with these Terms and Conditions. These Terms and Conditions apply to all visitors, users and others who access or use the Service.
By accessing or using the Service You agree to be bound by these Terms and Conditions. If You disagree with any part of these Terms and Conditions then You may not access the Service.
You represent that you are over the age of 18. The Company does not permit those under 18 to use the Service.
Your access to and use of the Service is also conditioned on Your acceptance of and compliance with the Privacy Policy of the Company. Our Privacy Policy describes Our policies and procedures on the collection, use and disclosure of Your personal information when You use the Application or the Website and tells You about Your privacy rights and how the law protects You. Please read Our Privacy Policy carefully before using Our Service.
"},{"location":"legal/terms-of-service/#links-to-other-websites","title":"Links to Other Websites","text":"Our Service may contain links to third-party websites or services that are not owned or controlled by the Company.
The Company has no control over, and assumes no responsibility for, the content, privacy policies, or practices of any third party websites or services. You further acknowledge and agree that the Company shall not be responsible or liable, directly or indirectly, for any damage or loss caused or alleged to be caused by or in connection with the use of or reliance on any such content, goods or services available on or through any such websites or services.
We strongly advise You to read the terms and conditions and privacy policies of any third-party websites or services that You visit.
"},{"location":"legal/terms-of-service/#termination","title":"Termination","text":"We may terminate or suspend Your access immediately, without prior notice or liability, for any reason whatsoever, including without limitation if You breach these Terms and Conditions.
Upon termination, Your right to use the Service will cease immediately.
"},{"location":"legal/terms-of-service/#limitation-of-liability","title":"Limitation of Liability","text":"Notwithstanding any damages that You might incur, the entire liability of the Company and any of its suppliers under any provision of these Terms and Your exclusive remedy for all the foregoing shall be limited to the amount actually paid by You through the Service or 100 USD if You haven't purchased anything through the Service.
To the maximum extent permitted by applicable law, in no event shall the Company or its suppliers be liable for any special, incidental, indirect, or consequential damages whatsoever (including, but not limited to, damages for loss of profits, loss of data or other information, for business interruption, for personal injury, loss of privacy arising out of or in any way related to the use of or inability to use the Service, third-party software and/or third-party hardware used with the Service, or otherwise in connection with any provision of these Terms), even if the Company or any supplier has been advised of the possibility of such damages and even if the remedy fails of its essential purpose.
Some states do not allow the exclusion of implied warranties or limitation of liability for incidental or consequential damages, which means that some of the above limitations may not apply. In these states, each party's liability will be limited to the greatest extent permitted by law.
"},{"location":"legal/terms-of-service/#as-is-and-as-available-disclaimer","title":"\"AS IS\" and \"AS AVAILABLE\" Disclaimer","text":"The Service is provided to You \"AS IS\" and \"AS AVAILABLE\" and with all faults and defects without warranty of any kind. To the maximum extent permitted under applicable law, the Company, on its own behalf and on behalf of its Affiliates and its and their respective licensors and service providers, expressly disclaims all warranties, whether express, implied, statutory or otherwise, with respect to the Service, including all implied warranties of merchantability, fitness for a particular purpose, title and non-infringement, and warranties that may arise out of course of dealing, course of performance, usage or trade practice. Without limitation to the foregoing, the Company provides no warranty or undertaking, and makes no representation of any kind that the Service will meet Your requirements, achieve any intended results, be compatible or work with any other software, applications, systems or services, operate without interruption, meet any performance or reliability standards or be error free or that any errors or defects can or will be corrected.
Without limiting the foregoing, neither the Company nor any of the company's provider makes any representation or warranty of any kind, express or implied: (i) as to the operation or availability of the Service, or the information, content, and materials or products included thereon; (ii) that the Service will be uninterrupted or error-free; (iii) as to the accuracy, reliability, or currency of any information or content provided through the Service; or (iv) that the Service, its servers, the content, or e-mails sent from or on behalf of the Company are free of viruses, scripts, trojan horses, worms, malware, time-bombs or other harmful components.
Some jurisdictions do not allow the exclusion of certain types of warranties or limitations on applicable statutory rights of a consumer, so some or all of the above exclusions and limitations may not apply to You. But in such a case the exclusions and limitations set forth in this section shall be applied to the greatest extent enforceable under applicable law.
"},{"location":"legal/terms-of-service/#governing-law","title":"Governing Law","text":"The laws of the Country, excluding its conflicts of law rules, shall govern this Terms and Your use of the Service. Your use of the Application may also be subject to other local, state, national, or international laws.
"},{"location":"legal/terms-of-service/#disputes-resolution","title":"Disputes Resolution","text":"If You have any concern or dispute about the Service, You agree to first try to resolve the dispute informally by contacting the Company.
"},{"location":"legal/terms-of-service/#for-european-union-eu-users","title":"For European Union (EU) Users","text":"If You are a European Union consumer, you will benefit from any mandatory provisions of the law of the country in which you are resident in.
"},{"location":"legal/terms-of-service/#united-states-legal-compliance","title":"United States Legal Compliance","text":"You represent and warrant that (i) You are not located in a country that is subject to the United States government embargo, or that has been designated by the United States government as a \"terrorist supporting\" country, and (ii) You are not listed on any United States government list of prohibited or restricted parties.
"},{"location":"legal/terms-of-service/#severability-and-waiver","title":"Severability and Waiver","text":""},{"location":"legal/terms-of-service/#severability","title":"Severability","text":"If any provision of these Terms is held to be unenforceable or invalid, such provision will be changed and interpreted to accomplish the objectives of such provision to the greatest extent possible under applicable law and the remaining provisions will continue in full force and effect.
"},{"location":"legal/terms-of-service/#waiver","title":"Waiver","text":"Except as provided herein, the failure to exercise a right or to require performance of an obligation under these Terms shall not affect a party's ability to exercise such right or require such performance at any time thereafter nor shall the waiver of a breach constitute a waiver of any subsequent breach.
"},{"location":"legal/terms-of-service/#translation-interpretation","title":"Translation Interpretation","text":"These Terms and Conditions may have been translated if We have made them available to You on our Service. You agree that the original English text shall prevail in the case of a dispute.
"},{"location":"legal/terms-of-service/#changes-to-these-terms-and-conditions","title":"Changes to These Terms and Conditions","text":"We reserve the right, at Our sole discretion, to modify or replace these Terms at any time. If a revision is material We will make reasonable efforts to provide at least 30 days' notice prior to any new terms taking effect. What constitutes a material change will be determined at Our sole discretion.
By continuing to access or use Our Service after those revisions become effective, You agree to be bound by the revised terms. If You do not agree to the new terms, in whole or in part, please stop using the website and the Service.
"},{"location":"legal/terms-of-service/#contact-us","title":"Contact Us","text":"If you have any questions about these Terms and Conditions, You can contact us:
All the configuration and customisation options for Data Caterer can be found here.
"},{"location":"setup/#guide","title":"Guide","text":"If you want a guided tour of using the Java or Scala API, you can follow one of the guides found here.
"},{"location":"setup/#specific-configuration","title":"Specific Configuration","text":"There are many options available for scenarios where data has to be in a certain format.
Details for how you can configure foreign keys can be found here.
"},{"location":"setup/advanced/#edge-cases","title":"Edge cases","text":"For each given data type, there are edge cases which can cause issues when your application processes the data. This can be controlled at a column level by including the following flag in the generator options:
JavaScalaYAMLfield()\n.name(\"amount\")\n.type(DoubleType.instance())\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n
field\n.name(\"amount\")\n.`type`(DoubleType)\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n
fields:\n- name: \"amount\"\ntype: \"double\"\ngenerator:\ntype: \"random\"\noptions:\nenableEdgeCases: \"true\"\nedgeCaseProb: 0.1\n
If you want to know all the possible edge cases for each data type, you can check the documentation here.
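As a rough illustration of what `edgeCaseProbability` means, the Java sketch below returns an edge-case value with the given probability and a regular random value otherwise. The class `EdgeCaseSketch`, the method `nextAmount`, the value range and the list of double edge cases are all assumptions made for this example; they are not taken from Data Caterer's implementation.

```java
import java.util.Random;

// Hypothetical sketch: not part of the Data Caterer API.
class EdgeCaseSketch {
    // Assumed edge cases for a double field; the real list is in the linked documentation.
    static final double[] EDGE_CASES = {
        0.0, -0.0, Double.MIN_VALUE, Double.MAX_VALUE,
        Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN
    };

    // Returns an edge case with probability edgeCaseProbability,
    // otherwise a regular value in the range [0, 1000).
    static double nextAmount(Random random, double edgeCaseProbability) {
        if (random.nextDouble() < edgeCaseProbability) {
            return EDGE_CASES[random.nextInt(EDGE_CASES.length)];
        }
        return random.nextDouble() * 1000.0;
    }

    public static void main(String[] args) {
        Random random = new Random(42L);
        // With a probability of 0.1, roughly one in ten values is an edge case.
        for (int i = 0; i < 5; i++) {
            System.out.println(nextAmount(random, 0.1));
        }
    }
}
```

A probability of 0 never produces an edge case and a probability of 1 always does, which is a quick way to check how an application copes with values such as NaN or infinity.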
"},{"location":"setup/advanced/#scenario-testing","title":"Scenario testing","text":"You can create specific scenarios by adjusting the metadata found in the plan and tasks to your liking. For example, say you have two data sources, a Postgres database and a parquet file, and you want to save account data into Postgres and the transactions related to those accounts into a parquet file. You can alter the status
column in the account data to only generate open
accounts and define a foreign key between Postgres and parquet to ensure the same account_id
is being used. Then, in the parquet task, define 1 to 10 transactions per account_id
to be generated.
Postgres account generation example task Parquet transaction generation example task Plan
"},{"location":"setup/advanced/#cloud-storage","title":"Cloud storage","text":""},{"location":"setup/advanced/#data-source","title":"Data source","text":"If you want to save the file types CSV, JSON, Parquet or ORC into cloud storage, you can do so by adding extra configuration. Below is an example for S3.
JavaScalaYAMLvar csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield().name(\"account_id\"),\n...\n);\n\nvar s3Configuration = configuration()\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n\nexecute(s3Configuration, csvTask);\n
val csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield.name(\"account_id\"),\n...\n)\n\nval s3Configuration = configuration\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -> \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -> \"true\",\n\"spark.hadoop.fs.defaultFS\" -> \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -> \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -> \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -> \"secret_key\"\n))\n\nexecute(s3Configuration, csvTask)\n
folders {\ngeneratedPlanAndTaskFolderPath = \"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig {\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n
"},{"location":"setup/advanced/#storing-plantasks","title":"Storing plan/task(s)","text":"You can generate and store the plan/task files inside either AWS S3, Azure Blob Storage or Google GCS. This can be controlled via configuration set in the application.conf
file where you can set something like the below:
configuration()\n.generatedPlanAndTaskFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n
configuration\n.generatedPlanAndTaskFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -> \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -> \"true\",\n\"spark.hadoop.fs.defaultFS\" -> \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -> \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -> \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -> \"secret_key\"\n))\n
folders {\ngeneratedPlanAndTaskFolderPath = \"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig {\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n
"},{"location":"setup/configuration/","title":"Configuration","text":"A number of configurations can be made and customised within Data Caterer to help control what gets run and/or where any metadata gets saved.
These configurations are defined from within your Java or Scala class via configuration
or for YAML file setup, application.conf
file as seen here.
Flags are used to control which processes are executed when you run Data Caterer.
Config Default Paid DescriptionenableGenerateData
true N Enable/disable data generation enableCount
true N Count the number of records generated. Can be disabled to improve performance enableFailOnError
true N Whilst saving generated data, if there is an error, it will stop any further data from being generated enableSaveReports
true N Enable/disable HTML reports summarising data generated, metadata of data generated (if enableSinkMetadata
is enabled) and validation results (if enableValidation
is enabled). Sample here enableSinkMetadata
true N Run data profiling for the generated data. Shown in HTML reports if enableSaveSinkMetadata
is enabled enableValidation
false N Run validations as described in plan. Results can be viewed from logs or from HTML report if enableSaveSinkMetadata
is enabled. Sample here enableUniqueCheck
false N If enabled, for any isUnique
fields, will ensure only unique values are generated enableAlerts
true N Enable/disable alerts to be sent enableGeneratePlanAndTasks
false Y Enable/disable plan and task auto generation based off data source connections enableRecordTracking
false Y Enable/disable which data records have been generated for any data source enableDeleteGeneratedRecords
false Y Delete all generated records based off record tracking (if enableRecordTracking
has been set to true) enableGenerateValidations
false Y If enabled, it will generate validations based on the data sources defined. JavaScalaapplication.conf configuration()\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableUniqueCheck(true)\n.enableAlerts(true)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false);\n
configuration\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableUniqueCheck(true)\n.enableAlerts(true)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false)\n
flags {\n enableCount = false\n enableCount = ${?ENABLE_COUNT}\n enableGenerateData = true\n enableGenerateData = ${?ENABLE_GENERATE_DATA}\n enableFailOnError = true\n enableFailOnError = ${?ENABLE_FAIL_ON_ERROR}\n enableGeneratePlanAndTasks = false\n enableGeneratePlanAndTasks = ${?ENABLE_GENERATE_PLAN_AND_TASKS}\n enableRecordTracking = false\n enableRecordTracking = ${?ENABLE_RECORD_TRACKING}\n enableDeleteGeneratedRecords = false\n enableDeleteGeneratedRecords = ${?ENABLE_DELETE_GENERATED_RECORDS}\n enableUniqueCheck = true\n enableUniqueCheck = ${?ENABLE_UNIQUE_CHECK}\n enableSinkMetadata = true\n enableSinkMetadata = ${?ENABLE_SINK_METADATA}\n enableSaveReports = true\n enableSaveReports = ${?ENABLE_SAVE_REPORTS}\n enableValidation = false\n enableValidation = ${?ENABLE_VALIDATION}\n enableGenerateValidations = false\n enableGenerateValidations = ${?ENABLE_GENERATE_VALIDATIONS}\n enableAlerts = false\n enableAlerts = ${?ENABLE_ALERTS}\n}\n
"},{"location":"setup/configuration/#folders","title":"Folders","text":"Depending on which flags are enabled, there are folders that get used to save metadata, store HTML reports or track the records generated.
These folder pathways can be defined as a cloud storage pathway (i.e. s3a://my-bucket/task
).
planFilePath
/opt/app/plan/customer-create-plan.yaml N Plan file path to use when generating and/or validating data taskFolderPath
/opt/app/task N Task folder path that contains all the task files (can have nested directories) validationFolderPath
/opt/app/validation N Validation folder path that contains all the validation files (can have nested directories) generatedReportsFolderPath
/opt/app/report N Where HTML reports get generated that contain information about data generated along with any validations performed generatedPlanAndTaskFolderPath
/tmp Y Folder path where generated plan and task files will be saved recordTrackingFolderPath
/opt/app/record-tracking Y Where record tracking parquet files get saved recordTrackingForValidationFolderPath
/opt/app/record-tracking-validation Y Where record tracking parquet files get saved for the purpose of validation JavaScalaapplication.conf configuration()\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\")\n.recordTrackingForValidationFolderPath(\"/opt/app/custom/record-tracking-validation\");\n
configuration\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\")\n.recordTrackingForValidationFolderPath(\"/opt/app/custom/record-tracking-validation\")\n
folders {\n planFilePath = \"/opt/app/custom/plan/postgres-plan.yaml\"\n planFilePath = ${?PLAN_FILE_PATH}\n taskFolderPath = \"/opt/app/custom/task\"\n taskFolderPath = ${?TASK_FOLDER_PATH}\n validationFolderPath = \"/opt/app/custom/validation\"\n validationFolderPath = ${?VALIDATION_FOLDER_PATH}\n generatedReportsFolderPath = \"/opt/app/custom/report\"\n generatedReportsFolderPath = ${?GENERATED_REPORTS_FOLDER_PATH}\n generatedPlanAndTaskFolderPath = \"/opt/app/custom/generated\"\n generatedPlanAndTaskFolderPath = ${?GENERATED_PLAN_AND_TASK_FOLDER_PATH}\n recordTrackingFolderPath = \"/opt/app/custom/record-tracking\"\n recordTrackingFolderPath = ${?RECORD_TRACKING_FOLDER_PATH}\n recordTrackingForValidationFolderPath = \"/opt/app/custom/record-tracking-validation\"\n recordTrackingForValidationFolderPath = ${?RECORD_TRACKING_VALIDATION_FOLDER_PATH}\n}\n
"},{"location":"setup/configuration/#metadata","title":"Metadata","text":"When metadata gets generated, there are some configurations that can be altered to help with performance or accuracy related issues. Metadata gets generated from two processes: 1) if enableGeneratePlanAndTasks
or 2) if enableSinkMetadata
are enabled.
During the generation of plan and tasks, data profiling is used to create the metadata for each of the fields defined in the data source. You may face issues if the number of records in the data source is large as data profiling is an expensive task. Similarly, it can be expensive when analysing the generated data if the number of records generated is large.
Config Default Paid DescriptionnumRecordsFromDataSource
10000 Y Number of records read in from the data source that could be used for data profiling numRecordsForAnalysis
10000 Y Number of records used for data profiling from the records gathered in numRecordsFromDataSource
oneOfMinCount
1000 Y Minimum number of records required before considering if a field can be of type oneOf
oneOfDistinctCountVsCountThreshold
0.2 Y Threshold ratio to determine if a field is of type oneOf
(i.e. a field called status
that only contains open
or closed
. Distinct count = 2, total count = 10, ratio = 2 / 10 = 0.2 therefore marked as oneOf
) numGeneratedSamples
10 N Number of sample records from generated data to take. Shown in HTML report JavaScalaapplication.conf configuration()\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(0.2)\n.numGeneratedSamples(10);\n
configuration\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(0.2)\n.numGeneratedSamples(10)\n
metadata {\n numRecordsFromDataSource = 10000\n numRecordsForAnalysis = 10000\n oneOfMinCount = 1000\n oneOfDistinctCountVsCountThreshold = 0.2\n numGeneratedSamples = 10\n}\n
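The interaction between oneOfMinCount and oneOfDistinctCountVsCountThreshold can be sketched in Java as below. This is only an illustration of the ratio described in the table: the class `OneOfSketch` and the method `isOneOf` are hypothetical, and treating a ratio equal to the threshold as a match is an assumption based on the worked example (2 / 10 = 0.2), not confirmed behaviour of Data Caterer.

```java
// Hypothetical sketch: not part of the Data Caterer API.
class OneOfSketch {
    // A field is treated as oneOf when enough records were profiled (minCount)
    // and the ratio of distinct values to total values is at or below the threshold.
    static boolean isOneOf(long distinctCount, long totalCount, long minCount, double threshold) {
        if (totalCount <= 0 || totalCount < minCount) {
            return false;
        }
        double ratio = (double) distinctCount / (double) totalCount;
        return ratio <= threshold;
    }

    public static void main(String[] args) {
        // Example from the text: 2 distinct values in 10 records, with the minimum count lowered to 10.
        System.out.println(isOneOf(2, 10, 10, 0.2));
        // Too many distinct values for the same number of records.
        System.out.println(isOneOf(5, 10, 10, 0.2));
        // Same ratio as the first call, but fewer records than the default minimum of 1000.
        System.out.println(isOneOf(2, 10, 1000, 0.2));
    }
}
```

Raising oneOfMinCount makes the check more conservative for small data sources, while raising the threshold lets fields with more distinct values be treated as oneOf.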
"},{"location":"setup/configuration/#generation","title":"Generation","text":"When generating data, you may have some limitations such as limited CPU or memory, large number of data sources, or data sources prone to failure under load. To help alleviate these issues or speed up performance, you can control the number of records that get generated in each batch.
Config Default Paid DescriptionnumRecordsPerBatch
100000 N Number of records across all data sources to generate per batch numRecordsPerStep
N Overrides the count defined in each step with this value if defined (i.e. if set to 1000, for each step, 1000 records will be generated) JavaScalaapplication.conf configuration()\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000);\n
configuration\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000)\n
generation {\n numRecordsPerBatch = 100000\n numRecordsPerStep = 1000\n}\n
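To get a feel for how numRecordsPerBatch affects a run, the Java sketch below works out how many batches a given total would need and how large the final batch would be. The class `BatchSketch` and its methods are hypothetical, and the simple ceiling division is an assumption; how Data Caterer actually distributes records across data sources within a batch is not described here.

```java
// Hypothetical sketch: not part of the Data Caterer API.
class BatchSketch {
    // Number of batches needed when each batch holds at most numRecordsPerBatch records.
    static long numBatches(long totalRecords, long numRecordsPerBatch) {
        if (numRecordsPerBatch <= 0 || totalRecords < 0) {
            throw new IllegalArgumentException();
        }
        return (totalRecords + numRecordsPerBatch - 1) / numRecordsPerBatch;
    }

    // Size of the final batch, which may be smaller than numRecordsPerBatch.
    static long lastBatchSize(long totalRecords, long numRecordsPerBatch) {
        if (numRecordsPerBatch <= 0 || totalRecords < 0) {
            throw new IllegalArgumentException();
        }
        if (totalRecords == 0) {
            return 0;
        }
        long remainder = totalRecords % numRecordsPerBatch;
        return remainder == 0 ? numRecordsPerBatch : remainder;
    }

    public static void main(String[] args) {
        // 250000 records with the default batch size of 100000.
        System.out.println(numBatches(250000L, 100000L));
        System.out.println(lastBatchSize(250000L, 100000L));
    }
}
```

A smaller batch size means more batches, each putting less pressure on memory and on data sources that are prone to failure under load.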
"},{"location":"setup/configuration/#validation","title":"Validation","text":"Configurations to alter how validations are executed.
Config Default Paid DescriptionnumSampleErrorRecords
5 N Number of error sample records to retrieve and display in generated HTML report. Increase to help debugging data issues enableDeleteRecordTrackingFiles
true Y After validations are complete, delete record tracking files that were used for validation purposes (enabled via enableRecordTracking
) JavaScalaapplication.conf configuration()\n.numSampleErrorRecords(10)\n.enableDeleteRecordTrackingFiles(false);\n
configuration\n.numSampleErrorRecords(10)\n.enableDeleteRecordTrackingFiles(false)\n
validation {\n numSampleErrorRecords = 10\n enableDeleteRecordTrackingFiles = false\n}\n
"},{"location":"setup/configuration/#runtime","title":"Runtime","text":"Given Data Caterer uses Spark as the base framework for data processing, you can configure the Spark job to your specifications via configuration as seen here.
JavaScalaapplication.confconfiguration()\n.master(\"local[*]\")\n.runtimeConfig(Map.of(\"spark.driver.cores\", \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\", \"10g\");\n
configuration\n.master(\"local[*]\")\n.runtimeConfig(Map(\"spark.driver.cores\" -> \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\" -> \"10g\")\n
runtime {\n master = \"local[*]\"\n master = ${?DATA_CATERER_MASTER}\n config {\n \"spark.driver.cores\" = \"5\"\n \"spark.driver.memory\" = \"10g\"\n }\n}\n
"},{"location":"setup/connection/","title":"Data Source Connections","text":"Details of all the connection configuration supported can be found in the below subsections for each type of connection.
These configurations can be done via API or from configuration. Examples of both are shown for each data source below.
"},{"location":"setup/connection/#supported-data-connections","title":"Supported Data Connections","text":"| Data Source Type | Data Source | Sponsor |
| --- | --- | --- |
| Database | Postgres, MySQL, Cassandra | N |
| File | CSV, JSON, ORC, Parquet | N |
| Messaging | Kafka, Solace | Y |
| HTTP | REST API | Y |
| Metadata | Marquez, OpenMetadata, OpenAPI/Swagger | Y |
"},{"location":"setup/connection/#api","title":"API","text":"All connection details require a name. Depending on the data source, you can define additional options which may be used by the driver or connector for connecting to the data source.
"},{"location":"setup/connection/#configuration-file","title":"Configuration file","text":"All connection details follow the same pattern.
<connection format> {\n <connection name> {\n <key> = <value>\n }\n}\n
Overriding configuration
When defining a configuration value that can be defined by a system property or environment variable at runtime, you can define that via the following:
url = \"localhost\"\nurl = ${?POSTGRES_URL}\n
The above defines that if there is a system property or environment variable named POSTGRES_URL
, then that value will be used for the url
, otherwise, it will default to localhost
.
To find examples of a task for each type of data source, please check out this page.
"},{"location":"setup/connection/#file","title":"File","text":"Linked here is a list of generic options that can be included as part of your file data source configuration if required. Links to specific file type configurations can be found below.
"},{"location":"setup/connection/#csv","title":"CSV","text":"JavaScalaapplication.confcsv(\"customer_transactions\", \"/data/customer/transaction\")\n
csv(\"customer_transactions\", \"/data/customer/transaction\")\n
csv {\n customer_transactions {\n path = \"/data/customer/transaction\"\n path = ${?CSV_PATH}\n }\n}\n
Other available configuration for CSV can be found here
"},{"location":"setup/connection/#json","title":"JSON","text":"JavaScalaapplication.confjson(\"customer_transactions\", \"/data/customer/transaction\")\n
json(\"customer_transactions\", \"/data/customer/transaction\")\n
json {\n customer_transactions {\n path = \"/data/customer/transaction\"\n path = ${?JSON_PATH}\n }\n}\n
Other available configuration for JSON can be found here
"},{"location":"setup/connection/#orc","title":"ORC","text":"JavaScalaapplication.conforc(\"customer_transactions\", \"/data/customer/transaction\")\n
orc(\"customer_transactions\", \"/data/customer/transaction\")\n
orc {\n customer_transactions {\n path = \"/data/customer/transaction\"\n path = ${?ORC_PATH}\n }\n}\n
Other available configuration for ORC can be found here
"},{"location":"setup/connection/#parquet","title":"Parquet","text":"JavaScalaapplication.confparquet(\"customer_transactions\", \"/data/customer/transaction\")\n
parquet(\"customer_transactions\", \"/data/customer/transaction\")\n
parquet {\n customer_transactions {\n path = \"/data/customer/transaction\"\n path = ${?PARQUET_PATH}\n }\n}\n
Other available configuration for Parquet can be found here
"},{"location":"setup/connection/#delta-not-supported-yet","title":"Delta (not supported yet)","text":"JavaScalaapplication.confdelta(\"customer_transactions\", \"/data/customer/transaction\")\n
delta(\"customer_transactions\", \"/data/customer/transaction\")\n
delta {\n customer_transactions {\n path = \"/data/customer/transaction\"\n path = ${?DELTA_PATH}\n }\n}\n
"},{"location":"setup/connection/#rmdbs","title":"RDBMS","text":"Follows the same configuration used by Spark as found here. A sample can be found below.
JavaScalaapplication.confpostgres(\n\"customer_postgres\", //name\n\"jdbc:postgresql://localhost:5432/customer\", //url\n\"postgres\", //username\n\"postgres\" //password\n)\n
postgres(\n\"customer_postgres\", //name\n\"jdbc:postgresql://localhost:5432/customer\", //url\n\"postgres\", //username\n\"postgres\" //password\n)\n
jdbc {\n customer_postgres {\n url = \"jdbc:postgresql://localhost:5432/customer\"\n url = ${?POSTGRES_URL}\n user = \"postgres\"\n user = ${?POSTGRES_USERNAME}\n password = \"postgres\"\n password = ${?POSTGRES_PASSWORD}\n driver = \"org.postgresql.Driver\"\n }\n}\n
Ensure that the user has write permission so that it can save data to the target tables.
SQL Permission StatementsGRANT INSERT ON <schema>.<table> TO <user>;\n
"},{"location":"setup/connection/#postgres","title":"Postgres","text":"See the example API or configuration definition for a Postgres connection above.
"},{"location":"setup/connection/#permissions","title":"Permissions","text":"Following permissions are required when generating plan and tasks:
SQL Permission StatementsGRANT SELECT ON information_schema.tables TO <user>;\nGRANT SELECT ON information_schema.columns TO <user>;\nGRANT SELECT ON information_schema.key_column_usage TO <user>;\nGRANT SELECT ON information_schema.table_constraints TO <user>;\nGRANT SELECT ON information_schema.constraint_column_usage TO <user>;\n
"},{"location":"setup/connection/#mysql","title":"MySQL","text":"JavaScalaapplication.conf mysql(\n\"customer_mysql\", //name\n\"jdbc:mysql://localhost:3306/customer\", //url\n\"root\", //username\n\"root\" //password\n)\n
mysql(\n\"customer_mysql\", //name\n\"jdbc:mysql://localhost:3306/customer\", //url\n\"root\", //username\n\"root\" //password\n)\n
jdbc {\n customer_mysql {\n url = \"jdbc:mysql://localhost:3306/customer\"\n user = \"root\"\n password = \"root\"\n driver = \"com.mysql.cj.jdbc.Driver\"\n }\n}\n
"},{"location":"setup/connection/#permissions_1","title":"Permissions","text":"Following permissions are required when generating plan and tasks:
SQL Permission StatementsGRANT SELECT ON information_schema.columns TO <user>;\nGRANT SELECT ON information_schema.statistics TO <user>;\nGRANT SELECT ON information_schema.key_column_usage TO <user>;\n
"},{"location":"setup/connection/#cassandra","title":"Cassandra","text":"Follows same configuration as defined by the Spark Cassandra Connector as found here
JavaScalaapplication.confcassandra(\n\"customer_cassandra\", //name\n\"localhost:9042\", //url\n\"cassandra\", //username\n\"cassandra\", //password\nMap.of() //optional additional connection options\n)\n
cassandra(\n\"customer_cassandra\", //name\n\"localhost:9042\", //url\n\"cassandra\", //username\n\"cassandra\", //password\nMap() //optional additional connection options\n)\n
org.apache.spark.sql.cassandra {\n customer_cassandra {\n spark.cassandra.connection.host = \"localhost\"\n spark.cassandra.connection.host = ${?CASSANDRA_HOST}\n spark.cassandra.connection.port = \"9042\"\n spark.cassandra.connection.port = ${?CASSANDRA_PORT}\n spark.cassandra.auth.username = \"cassandra\"\n spark.cassandra.auth.username = ${?CASSANDRA_USERNAME}\n spark.cassandra.auth.password = \"cassandra\"\n spark.cassandra.auth.password = ${?CASSANDRA_PASSWORD}\n }\n}\n
"},{"location":"setup/connection/#permissions_2","title":"Permissions","text":"Ensure that the user has write permission so that it can save data to the target tables.
CQL Permission StatementsGRANT MODIFY ON <schema>.<table> TO <user>;\n
Following permissions are required when enabling configuration.enableGeneratePlanAndTasks(true)
as it will gather metadata information about tables and columns from the below tables.
GRANT SELECT ON system_schema.tables TO <user>;\nGRANT SELECT ON system_schema.columns TO <user>;\n
"},{"location":"setup/connection/#kafka","title":"Kafka","text":"Define your Kafka bootstrap server to connect and send generated data to corresponding topics. Topic gets set at a step level. Further details can be found here
JavaScalaapplication.confkafka(\n\"customer_kafka\", //name\n\"localhost:9092\" //url\n)\n
kafka(\n\"customer_kafka\", //name\n\"localhost:9092\" //url\n)\n
kafka {\n customer_kafka {\n kafka.bootstrap.servers = \"localhost:9092\"\n kafka.bootstrap.servers = ${?KAFKA_BOOTSTRAP_SERVERS}\n }\n}\n
When defining your schema for pushing data to Kafka, it follows a specific top level schema. An example can be found here. You can define the key, value, headers, partition or topic by following the linked schema.
"},{"location":"setup/connection/#jms","title":"JMS","text":"Uses JNDI lookup to send messages to JMS queue. Ensure that the messaging system you are using has your queue/topic registered via JNDI otherwise a connection cannot be created.
JavaScalaapplication.confsolace(\n\"customer_solace\", //name\n\"smf://localhost:55554\", //url\n\"admin\", //username\n\"admin\", //password\n\"default\", //vpn name\n\"/jms/cf/default\", //connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\" //initial context factory\n)\n
solace(\n\"customer_solace\", //name\n\"smf://localhost:55554\", //url\n\"admin\", //username\n\"admin\", //password\n\"default\", //vpn name\n\"/jms/cf/default\", //connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\" //initial context factory\n)\n
jms {\n customer_solace {\n initialContextFactory = \"com.solacesystems.jndi.SolJNDIInitialContextFactory\"\n connectionFactory = \"/jms/cf/default\"\n url = \"smf://localhost:55555\"\n url = ${?SOLACE_URL}\n user = \"admin\"\n user = ${?SOLACE_USER}\n password = \"admin\"\n password = ${?SOLACE_PASSWORD}\n vpnName = \"default\"\n vpnName = ${?SOLACE_VPN}\n }\n}\n
"},{"location":"setup/connection/#http","title":"HTTP","text":"Define any username and/or password needed for the HTTP requests. The url is defined in the tasks to allow for generated data to be populated in the url.
JavaScalaapplication.confhttp(\n\"customer_api\", //name\n\"admin\", //username\n\"admin\" //password\n)\n
http(\n\"customer_api\", //name\n\"admin\", //username\n\"admin\" //password\n)\n
http {\n customer_api {\n user = \"admin\"\n user = ${?HTTP_USER}\n password = \"admin\"\n password = ${?HTTP_PASSWORD}\n }\n}\n
"},{"location":"setup/deployment/","title":"Deployment","text":"Two main ways to deploy and run Data Caterer:
To package up your class along with the Data Caterer base image, you can follow the Dockerfile that is created for you here.
Then you can run the following:
./gradlew clean build\ndocker build -t <my_image_name>:<my_image_tag> .\n
"},{"location":"setup/deployment/#helm","title":"Helm","text":"Link to sample helm on GitHub here
Update the configuration to use your own data connections and settings, or your own image created from the steps above.
git clone git@github.com:data-catering/data-caterer-example.git\nhelm install data-caterer ./data-caterer-example/helm/data-caterer\n
"},{"location":"setup/design/","title":"Design","text":"This document shows the thought process behind the design of Data Caterer, to give you insight into how and why it became what it is today. It also serves as a reference for future design decisions, which will be recorded here, making this a living document.
"},{"location":"setup/design/#motivation","title":"Motivation","text":"The main difficulties that I faced as a developer and team lead relating to testing were:
These difficulties helped form the basis of the principles that Data Caterer should follow:
graph LR\n subgraph userTasks [User Configuration]\n dataGen[Data Generation]\n dataValid[Data Validation]\n runConf[Runtime Config]\n end\n\n subgraph dataProcessor [Processor]\n dataCaterer[Data Caterer]\n end\n\n subgraph existingMetadata [Metadata]\n metadataService[Metadata Services]\n metadataDataSource[Data Sources]\n end\n\n subgraph output [Output]\n outputDataSource[Data Sources]\n report[Report]\n end\n\n dataGen --> dataCaterer\n dataValid --> dataCaterer\n runConf --> dataCaterer\n direction TB\n dataCaterer -.-> metadataService\n dataCaterer -.-> metadataDataSource\n direction LR\n dataCaterer ---> outputDataSource\n dataCaterer ---> report
Foreign keys can be defined to represent the relationships between datasets where values are required to match for particular columns.
"},{"location":"setup/foreign-key/#single-column","title":"Single column","text":"Define a column in one data source to match against another column. Below example shows a postgres
data source with two tables, accounts
and transactions
that have a foreign key for account_id
.
var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList.of(Map.entry(postgresTxn, \"account_id\"))\n);\n
val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList(postgresTxn -> \"account_id\")\n)\n
---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"full_name\"\n---\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id\":\n- \"my_postgres.transactions.account_id\"\n
"},{"location":"setup/foreign-key/#multiple-columns","title":"Multiple columns","text":"You may have a scenario where multiple columns need to be aligned. From the same example, we want account_id
and name
from accounts
to match with account_id
and full_name
in transactions
respectively.
var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(postgresTxn, List.of(\"account_id\", \"full_name\")))\n);\n
val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(postgresTxn -> List(\"account_id\", \"full_name\"))\n)\n
---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"full_name\"\n---\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_postgres.transactions.account_id,full_name\"\n
"},{"location":"setup/foreign-key/#nested-column","title":"Nested column","text":"Your schema structure can have nested fields which can also be referenced as foreign keys. But to do so, you need to create a proxy field that gets omitted from the final saved data.
In the example below, the nested customer_details.name
field inside the json
task needs to match with name
from postgres
. A new field in the json
called _txn_name
is used as a temporary column to facilitate the foreign key definition.
var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").sql(\"_txn_name\"), //nested field will get value from '_txn_name'\n...\n),\nfield().name(\"_txn_name\").omit(true) //value will not be included in output\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(jsonTask, List.of(\"account_id\", \"_txn_name\")))\n);\n
val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").sql(\"_txn_name\"), //nested field will get value from '_txn_name'\n...\n),\nfield.name(\"_txn_name\").omit(true) //value will not be included in output\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(jsonTask -> List(\"account_id\", \"_txn_name\"))\n)\n
---\n#postgres task yaml\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n---\n#json task yaml\nname: \"json_data\"\nsteps:\n- name: \"transactions\"\ntype: \"json\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"_txn_name\"\ngenerator:\noptions:\nomit: true\n- name: \"customer_details\"\nschema:\nfields:\n- name: \"name\"\ngenerator:\ntype: \"sql\"\noptions:\nsql: \"_txn_name\"\n\n---\n#plan yaml\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n- name: \"json_data\"\ndataSourceName: \"my_json\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_json.transactions.account_id,_txn_name\"\n
"},{"location":"setup/validation/","title":"Validations","text":"Validations can be used to run data checks after you have run the data generator or even as a standalone task. A report summarising the success or failure of the validations is produced and can be examined for further investigation.
A full validation example can be found below. For more details, see each of the subsections further below.
JavaScalaYAMLvar csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().col(\"amount\").lessThan(100),\nvalidation().col(\"year\").isEqual(2021).errorThreshold(0.1), //equivalent to if error percentage is > 10%, then fail\nvalidation().col(\"name\").matches(\"Peter .*\").errorThreshold(200) //equivalent to if number of errors is > 200, then fail\n)\n.validationWait(waitCondition().pause(1));\n\nvar conf = configuration().enableValidation(true);\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validations(\nvalidation.col(\"amount\").lessThan(100),\nvalidation.col(\"year\").isEqual(2021).errorThreshold(0.1), //equivalent to if error percentage is > 10%, then fail\nvalidation.col(\"name\").matches(\"Peter .*\").errorThreshold(200) //equivalent to if number of errors is > 200, then fail\n) .validationWait(waitCondition.pause(1))\n\nval conf = configuration.enableValidation(true)\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount < 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1 #equivalent to if error percentage is > 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200 #equivalent to if number of errors is > 200, then fail\ndescription: \"Should be lots of Peters\"\nwaitCondition:\npauseInSeconds: 1\n
"},{"location":"setup/validation/#wait-condition","title":"Wait Condition","text":"Once data has been generated, you may want to wait for a certain condition to be met before starting the data validations. This can be via:
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().pause(1));\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.pause(1))\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npauseInSeconds: 1\n
"},{"location":"setup/validation/#data-exists","title":"Data exists","text":"JavaScalaYAML var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWaitDataExists(\"updated_date > DATE('2023-01-01')\");\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWaitDataExists(\"updated_date > DATE('2023-01-01')\")\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"transactions\"\noptions:\npath: \"/tmp/csv\"\nexpr: \"updated_date > DATE('2023-01-01')\"\n
"},{"location":"setup/validation/#webhook","title":"Webhook","text":"JavaScalaYAML var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\")); //by default, GET request successful when 200 status code\n\n//or\n\nvar csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202)); //successful if 200 or 202 status code\n\n//or\n\nvar csvTxnsWithExistingHttpConnection = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"my_http\", \"http://localhost:8080/finished\")); //use connection configuration from existing 'my_http' connection definition\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\")) //by default, GET request successful when 200 status code\n\n//or\n\nval csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202)) //successful if 200 or 202 status code\n\n//or\n\nval csvTxnsWithExistingHttpConnection = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.webhook(\"my_http\", \"http://localhost:8080/finished\")) //use connection configuration from existing 'my_http' connection definition\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\" #by default, GET request successful when 200 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\"\nmethod: \"GET\"\nstatusCodes: [200, 202] #successful if 200 or 202 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"my_http\" #use connection configuration from existing 'my_http' connection definition\nurl: \"http://localhost:8080/finished\"\n
"},{"location":"setup/validation/#file-exists","title":"File exists","text":"JavaScalaYAML var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().file(\"/tmp/json\"));\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.file(\"/tmp/json\"))\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npath: \"/tmp/json\"\n
"},{"location":"setup/validation/#report","title":"Report","text":"Once run, it will produce a report like this.
"},{"location":"setup/generator/count/","title":"Record Count","text":"There are options related to controlling the number of records generated that can help in generating the scenarios or data required.
"},{"location":"setup/generator/count/#record-count_1","title":"Record Count","text":"Record count is the simplest as you define the total number of records you require for that particular step. For example, in the below step, it will generate 1000 records for the CSV file
JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\n
"},{"location":"setup/generator/count/#generated-count","title":"Generated Count","text":"As like most things in Data Caterer, the count can be generated based on some metadata. For example, if I wanted to generate between 1000 and 2000 records, I could define that by the below configuration:
JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator().min(1000).max(2000));\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator.min(1000).max(2000))\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\ngenerator:\ntype: \"random\"\noptions:\nmin: 1000\nmax: 2000\n
"},{"location":"setup/generator/count/#per-column-count","title":"Per Column Count","text":"When defining a per column count, this allows you to generate records \"per set of columns\". This means that for a given set of columns, it will generate a particular amount of records per combination of values for those columns.
One example is generating transactions relating to a customer, where a customer may be defined by columns account_id, name
. A number of transactions would be generated per account_id, name
.
You can also use a combination of the above two methods to generate the number of records per column.
"},{"location":"setup/generator/count/#records","title":"Records","text":"When defining a base number of records within the perColumn
configuration, it translates to creating (count.records * count.recordsPerColumn)
records. This is a fixed number of records that will be generated each time, with no variation between runs.
In the example below, we have count.records = 1000
and count.recordsPerColumn = 2
, which means that 1000 * 2 = 2000
records will be generated in total.
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\nrecords: 2\ncolumnNames:\n- \"account_id\"\n- \"name\"\n
"},{"location":"setup/generator/count/#generated","title":"Generated","text":"You can also define a generator for the count per column. This can be used in scenarios where you want a variable number of records per set of columns.
In the example below, it will generate between (count.records * count.perColumnGenerator.generator.min) = (1000 * 1) = 1000
and (count.records * count.perColumnGenerator.generator.max) = (1000 * 2) = 2000
records.
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumnGenerator(generator().min(1).max(2), \"account_id\", \"name\")\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumnGenerator(generator.min(1).max(2), \"account_id\", \"name\")\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\ncolumnNames:\n- \"account_id\"\n- \"name\"\ngenerator:\ntype: \"random\"\noptions:\nmin: 1\nmax: 2\n
"},{"location":"setup/generator/data-generator/","title":"Data Generators","text":""},{"location":"setup/generator/data-generator/#data-types","title":"Data Types","text":"Below is a list of all supported data types for generating data:
| Data Type | Spark Data Type | Options | Description |
|---|---|---|---|
| string | StringType | minLen, maxLen, expression, enableNull | |
| integer | IntegerType | min, max, stddev, mean | |
| long | LongType | min, max, stddev, mean | |
| short | ShortType | min, max, stddev, mean | |
| decimal(precision, scale) | DecimalType(precision, scale) | min, max, stddev, mean | |
| double | DoubleType | min, max, stddev, mean | |
| float | FloatType | min, max, stddev, mean | |
| date | DateType | min, max, enableNull | |
| timestamp | TimestampType | min, max, enableNull | |
| boolean | BooleanType | | |
| binary | BinaryType | minLen, maxLen, enableNull | |
| byte | ByteType | | |
| array | ArrayType | arrayMinLen, arrayMaxLen, arrayType | |
| _ | StructType | | Implicitly supported when a schema is defined for a field |
"},{"location":"setup/generator/data-generator/#options","title":"Options","text":""},{"location":"setup/generator/data-generator/#all-data-types","title":"All data types","text":"Some options are available to use for all types of data generators. Below is the list, along with examples and descriptions:
| Option | Default | Example | Description |
|---|---|---|---|
| `enableEdgeCase` | false | `enableEdgeCase: \"true\"` | Enable/disable generated data to contain edge cases based on the data type. For example, integer data type has edge cases of (Int.MaxValue, Int.MinValue and 0) |
| `edgeCaseProbability` | 0.0 | `edgeCaseProb: \"0.1\"` | Probability of generating a random edge case value if `enableEdgeCase` is true |
| `isUnique` | false | `isUnique: \"true\"` | Enable/disable generated data to be unique for that column. Errors will be thrown when it is unable to generate unique data |
| `seed` | | `seed: \"1\"` | Defines the random seed for generating data for that particular column. It will override any seed defined at a global level |
| `sql` | | `sql: \"CASE WHEN amount < 10 THEN true ELSE false END\"` | Define any SQL statement for generating that column's value. Computation occurs after all non-SQL fields are generated. This means any columns used in the SQL cannot be based on other SQL generated columns. Data type of generated value from SQL needs to match data type defined for the field |
"},{"location":"setup/generator/data-generator/#string","title":"String","text":"
| Option | Default | Example | Description |
|---|---|---|---|
| `minLen` | 1 | `minLen: \"2\"` | Ensures that all generated strings have at least length `minLen` |
| `maxLen` | 10 | `maxLen: \"15\"` | Ensures that all generated strings have at most length `maxLen` |
| `expression` | | `expression: \"#{Name.name}\"` `expression:\"#{Address.city}/#{Demographic.maritalStatus}\"` | Will generate a string based on the faker expression provided. All possible faker expressions can be found here. Expression has to be in format `#{<faker expression name>}` |
| `enableNull` | false | `enableNull: \"true\"` | Enable/disable null values being generated |
| `nullProbability` | 0.0 | `nullProb: \"0.1\"` | Probability to generate null values if `enableNull` is true |

Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", \"\u0130yi g\u00fcnler\", \"\u0421\u043f\u0430\u0441\u0438\u0431\u043e\", \"\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1\", \"\u0635\u0628\u0627\u062d \u0627\u0644\u062e\u064a\u0631\", \" F\u00f6rl\u00e5t\", \"\u4f60\u597d\u5417\", \"Nh\u00e0 v\u1ec7 sinh \u1edf \u0111\u00e2u\", \"\u3053\u3093\u306b\u3061\u306f\", \"\u0928\u092e\u0938\u094d\u0924\u0947\", \"\u0532\u0561\u0580\u0565\u0582\", \"\u0417\u0434\u0440\u0430\u0432\u0435\u0439\u0442\u0435\")
"},{"location":"setup/generator/data-generator/#sample","title":"Sample","text":"JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield()\n.name(\"name\")\n.type(StringType.instance())\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield\n.name(\"name\")\n.`type`(StringType)\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\nschema:\nfields:\n- name: \"name\"\ntype: \"string\"\ngenerator:\noptions:\nexpression: \"#{Name.name}\"\nenableNull: true\nnullProb: 0.1\nminLength: 4\nmaxLength: 20\n
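The options listed under All data types can be combined with type-specific options in the same generator block. The fragment below is a minimal sketch rather than a definitive configuration: the option keys are taken from the Example column of that table, the sql generator form follows the nested column foreign key example, and the indentation, the field names (account_id, amount, is_small) and the chosen data types are assumptions made for illustration.

```yaml
name: 'csv_file'
steps:
  - name: 'transactions'
    type: 'csv'
    options:
      path: 'app/src/test/resources/sample/csv/transactions'
    schema:
      fields:
        # hypothetical field: unique values with a fixed, column-level seed
        - name: 'account_id'
          type: 'string'
          generator:
            options:
              isUnique: 'true'
              seed: '1'
        # hypothetical field: roughly 10% of values drawn from integer edge cases
        - name: 'amount'
          type: 'integer'
          generator:
            options:
              enableEdgeCase: 'true'
              edgeCaseProb: '0.1'
        # hypothetical field: derived via SQL from the non-SQL field 'amount'
        - name: 'is_small'
          type: 'boolean'
          generator:
            type: 'sql'
            options:
              sql: 'CASE WHEN amount < 10 THEN true ELSE false END'
```

Because SQL fields are computed after all non-SQL fields, is_small can refer to amount, but it could not refer to another SQL generated column.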
"},{"location":"setup/generator/data-generator/#numeric","title":"Numeric","text":"For all the numeric data types, there are 4 options to choose from: min, max and maxValue. Generally speaking, you only need to define one of min or minValue, similarly with max or maxValue. The reason why there are 2 options for each is because of when metadata is automatically gathered, we gather the statistics of the observed min and max values. Also, it will attempt to gather any restriction on the min or max value as defined by the data source (i.e. max value as per database type).
"},{"location":"setup/generator/data-generator/#integerlongshort","title":"Integer/Long/Short","text":"Option Default Example Descriptionmin
0 min: \"2\"
Ensures that all generated values are greater than or equal to min
max
1000 max: \"25\"
Ensures that all generated values are less than or equal to max
stddev
1.0 stddev: \"2.0\"
Standard deviation for normal distributed data mean
max - min
mean: \"5.0\"
Mean for normal distributed data Edge cases Integer: (2147483647, -2147483648, 0) Edge cases Long: (9223372036854775807, -9223372036854775808, 0) Edge cases Short: (32767, -32768, 0)
"},{"location":"setup/generator/data-generator/#sample_1","title":"Sample","text":"JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"year\").type(IntegerType.instance()).min(2020).max(2023),\nfield().name(\"customer_id\").type(LongType.instance()),\nfield().name(\"customer_group\").type(ShortType.instance())\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"year\").`type`(IntegerType).min(2020).max(2023),\nfield.name(\"customer_id\").`type`(LongType),\nfield.name(\"customer_group\").`type`(ShortType)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"year\"\ntype: \"integer\"\ngenerator:\noptions:\nmin: 2020\nmax: 2023\n- name: \"customer_id\"\ntype: \"long\"\n- name: \"customer_group\"\ntype: \"short\"\n
"},{"location":"setup/generator/data-generator/#decimal","title":"Decimal","text":"Option Default Example Description min
0 min: \"2\"
Ensures that all generated values are greater than or equal to min
max
1000 max: \"25\"
Ensures that all generated values are less than or equal to max
stddev
1.0 stddev: \"2.0\"
Standard deviation for normal distributed data mean
max - min
mean: \"5.0\"
Mean for normal distributed data numericPrecision
10 precision: \"25\"
The maximum number of digits numericScale
0 scale: \"25\"
The number of digits on the right side of the decimal point (has to be less than or equal to precision) Edge cases Decimal: (9223372036854775807, -9223372036854775808, 0)
"},{"location":"setup/generator/data-generator/#sample_2","title":"Sample","text":"JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"balance\").type(DecimalType.instance()).numericPrecision(10).numericScale(5)\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"balance\").`type`(DecimalType).numericPrecision(10).numericScale(5)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"balance\"\ntype: \"decimal\"\ngenerator:\noptions:\nprecision: 10\nscale: 5\n
"},{"location":"setup/generator/data-generator/#doublefloat","title":"Double/Float","text":"Option Default Example Description min
0.0 min: \"2.1\"
Ensures that all generated values are greater than or equal to min
max
1000.0 max: \"25.9\"
Ensures that all generated values are less than or equal to max
stddev
1.0 stddev: \"2.0\"
Standard deviation for normal distributed data mean
max - min
mean: \"5.0\"
Mean for normal distributed data Edge cases Double: (+infinity, 1.7976931348623157e+308, 4.9e-324, 0.0, -0.0, -1.7976931348623157e+308, -infinity, NaN) Edge cases Float: (+infinity, 3.4028235e+38, 1.4e-45, 0.0, -0.0, -3.4028235e+38, -infinity, NaN)
"},{"location":"setup/generator/data-generator/#sample_3","title":"Sample","text":"JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"amount\").type(DoubleType.instance())\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"amount\").`type`(DoubleType)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"amount\"\ntype: \"double\"\n
"},{"location":"setup/generator/data-generator/#date","title":"Date","text":"Option Default Example Description min
now() - 365 days min: \"2023-01-31\"
Ensures that all generated values are greater than or equal to min
max
now() max: \"2023-12-31\"
Ensures that all generated values are less than or equal to max
enableNull
false enableNull: \"true\"
Enable/disable null values being generated nullProbability
0.0 nullProb: \"0.1\"
Probability to generate null values if enableNull
is true Edge cases: (0001-01-01, 1582-10-15, 1970-01-01, 9999-12-31) (reference)
"},{"location":"setup/generator/data-generator/#sample_4","title":"Sample","text":"JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_date\").type(DateType.instance()).min(java.sql.Date.valueOf(\"2020-01-01\"))\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_date\").`type`(DateType).min(java.sql.Date.valueOf(\"2020-01-01\"))\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"created_date\"\ntype: \"date\"\ngenerator:\noptions:\nmin: \"2020-01-01\"\n
"},{"location":"setup/generator/data-generator/#timestamp","title":"Timestamp","text":"Option Default Example Description min
now() - 365 days min: \"2023-01-31 23:10:10\"
Ensures that all generated values are greater than or equal to min
max
now() max: \"2023-12-31 23:10:10\"
Ensures that all generated values are less than or equal to max
enableNull
false enableNull: \"true\"
Enable/disable null values being generated nullProbability
0.0 nullProb: \"0.1\"
Probability to generate null values if enableNull
is true Edge cases: (0001-01-01 00:00:00, 1582-10-15 23:59:59, 1970-01-01 00:00:00, 9999-12-31 23:59:59)
"},{"location":"setup/generator/data-generator/#sample_5","title":"Sample","text":"JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_time\").type(TimestampType.instance()).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_time\").`type`(TimestampType).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"created_time\"\ntype: \"timestamp\"\ngenerator:\noptions:\nmin: \"2020-01-01 00:00:00\"\n
"},{"location":"setup/generator/data-generator/#binary","title":"Binary","text":"Option Default Example Description minLen
1 minLen: \"2\"
Ensures that all generated array of bytes have at least length minLen
maxLen
20 maxLen: \"15\"
Ensures that all generated array of bytes have at most length maxLen
enableNull
false enableNull: \"true\"
Enable/disable null values being generated nullProbability
0.0 nullProb: \"0.1\"
Probability to generate null values if enableNull
is true Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", -128, 127)
"},{"location":"setup/generator/data-generator/#sample_6","title":"Sample","text":"JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"payload\").type(BinaryType.instance())\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"payload\").`type`(BinaryType)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"payload\"\ntype: \"binary\"\n
"},{"location":"setup/generator/data-generator/#array","title":"Array","text":"Option Default Example Description arrayMinLen
0 arrayMinLen: \"2\"
Ensures that all generated arrays have at least length arrayMinLen
arrayMaxLen
5 arrayMaxLen: \"15\"
Ensures that all generated arrays have at most length arrayMaxLen
arrayType
arrayType: \"double\"
Inner data type of the array. Optional when using Java/Scala API. Allows for nested data types to be defined like struct enableNull
false enableNull: \"true\"
Enable/disable null values being generated nullProbability
0.0 nullProb: \"0.1\"
Probability to generate null values if enableNull
is true"},{"location":"setup/generator/data-generator/#sample_7","title":"Sample","text":"JavaScalaYAML csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"last_5_amounts\").type(ArrayType.instance()).arrayType(\"double\")\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"last_5_amounts\").`type`(ArrayType).arrayType(\"double\")\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"last_5_amounts\"\ntype: \"array<double>\"\n
"},{"location":"setup/guide/","title":"Guides","text":"Below are a list of guides you can follow to create your data generation for your use case.
For any of the paid tier guides, you can use the trial version of the app to try it out. Details on how to get the trial can be found here.
"},{"location":"setup/guide/#scenarios","title":"Scenarios","text":"The execution of the data generator is based on the concept of plans and tasks. A plan represents the set of tasks that need to be executed, along with other information that spans across tasks, such as foreign keys between data sources. A task represents the component(s) of a data source and its associated metadata so that it understands what the data should look like and how many steps (sub data sources) there are (i.e. tables in a database, topics in Kafka). Tasks can define one or more steps.
"},{"location":"setup/guide/#plan","title":"Plan","text":""},{"location":"setup/guide/#foreign-keys","title":"Foreign Keys","text":"Define foreign keys across data sources in your plan to ensure generated data can match Link to associated task 1 Link to associated task 2
"},{"location":"setup/guide/#task","title":"Task","text":"Data Source Type Data Source Sample Task Notes Database Postgres Sample Database MySQL Sample Database Cassandra Sample File CSV Sample File JSON Sample Contains nested schemas and use of SQL for generated values File Parquet Sample Partition by year column Kafka Kafka Sample Specific base schema to be used, define headers, key, value, etc. JMS Solace Sample JSON formatted message HTTP PUT Sample JSON formatted PUT body"},{"location":"setup/guide/#configuration","title":"Configuration","text":"Basic configuration
"},{"location":"setup/guide/#docker-compose","title":"Docker-compose","text":"To see how it runs against different data sources, you can run using docker-compose
and set DATA_SOURCE
like below
./gradlew build\ncd docker\nDATA_SOURCE=postgres docker-compose up -d datacaterer\n
Can set it to one of the following:
Info
Writing data to Cassandra is a paid feature. Try the free trial here.
Creating a data generator for Cassandra. You will build a Docker image that will be able to populate data in Cassandra for the tables you configure.
"},{"location":"setup/guide/data-source/cassandra/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
If you already have a Cassandra instance running, you can skip to this step.
"},{"location":"setup/guide/data-source/cassandra/#cassandra-setup","title":"Cassandra Setup","text":"Next, let's make sure you have an instance of Cassandra up and running in your local environment. This will make it easy for us to iterate and check our changes.
cd docker\ndocker-compose up -d cassandra\n
"},{"location":"setup/guide/data-source/cassandra/#permissions","title":"Permissions","text":"Let's make a new user that has the required permissions needed to push data into the Cassandra tables we want.
CQL Permission Statements
GRANT INSERT ON <schema>.<table> TO data_caterer_user;\n
The following permissions are required when enabling configuration.enableGeneratePlanAndTasks(true)
as it will gather metadata information about tables and columns from the below tables.
GRANT SELECT ON system_schema.tables TO data_caterer_user;\nGRANT SELECT ON system_schema.columns TO data_caterer_user;\n
"},{"location":"setup/guide/data-source/cassandra/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedCassandraJavaPlan.java
src/main/scala/io/github/datacatering/plan/MyAdvancedCassandraPlan.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedCassandraJavaPlan extends PlanRun {\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n\nclass MyAdvancedCassandraPlan extends PlanRun {\n}\n
This class is where we define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.
"},{"location":"setup/guide/data-source/cassandra/#connection-configuration","title":"Connection Configuration","text":"Within our class, we can start by defining the connection properties to connect to Cassandra.
var accountTask = cassandra(\n\"customer_cassandra\", //name\n\"localhost:9042\", //url\n\"cassandra\", //username\n\"cassandra\", //password\nMap.of() //optional additional connection options\n);\n
val accountTask = cassandra(\n\"customer_cassandra\", //name\n\"localhost:9042\", //url\n\"cassandra\", //username\n\"cassandra\", //password\nMap() //optional additional connection options\n)\n
Additional options, such as SSL configuration, can be found here.
"},{"location":"setup/guide/data-source/cassandra/#schema","title":"Schema","text":"Let's create a task for inserting data into the account.accounts
and account.account_status_history
tables as defined under docker/data/cql/customer.cql
. These tables should already be set up for you if you followed this step. We can check whether they are set up via the following command:
docker exec docker-cassandraserver-1 cqlsh -e 'describe account.accounts; describe account.account_status_history;'\n
Here we should see output that looks like the below. This tells us what schema we need to follow when generating data. We need to define that schema, alongside any metadata that constrains the values the generated data can contain.
CREATE TABLE account.accounts (\naccount_id text PRIMARY KEY,\n amount double,\n created_by text,\n name text,\n open_time timestamp,\n status text\n)...\n\nCREATE TABLE account.account_status_history (\naccount_id text,\n eod_date date,\n status text,\n updated_by text,\n updated_time timestamp,\n PRIMARY KEY (account_id, eod_date)\n)...\n
Trimming the connection details to work with the docker-compose Cassandra, we have a base Cassandra connection to define the table and schema required. Let's define each field along with their corresponding data type. You will notice that the text
fields do not have a data type defined. This is because the default data type is StringType
which corresponds to text
in Cassandra.
{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n}\n
val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n
"},{"location":"setup/guide/data-source/cassandra/#field-metadata","title":"Field Metadata","text":"We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata that add guidelines that the data generator will understand when generating data.
"},{"location":"setup/guide/data-source/cassandra/#account_id","title":"account_id","text":"account_id
follows a particular pattern where it starts with ACC
and has 8 digits after it. This can be defined via a regex like below. We also mark it as the primary key to ensure that unique values are generated.
field().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n
field.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n
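To make explicit which values the ACC[0-9]{8} pattern accepts, here is a minimal plain-Java sketch of an equivalent hand-written check. The class and method names are hypothetical and are not part of the Data Caterer API; it only illustrates the shape of a valid account_id.

```java
// Hypothetical helper, for illustration only: a hand-written equivalent of ACC[0-9]{8}.
final class AccountIdCheck {
    // True when the value is the letters ACC followed by exactly 8 ASCII digits.
    static boolean matches(CharSequence value) {
        if (value == null || value.length() != 11) {
            return false;
        }
        if (value.charAt(0) != 'A' || value.charAt(1) != 'C' || value.charAt(2) != 'C') {
            return false;
        }
        for (int i = 3; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }
}
```

A value such as ACC13554145 passes this check, while a value with fewer digits or a different prefix does not.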
"},{"location":"setup/guide/data-source/cassandra/#amount","title":"amount","text":"amount
should not be too large, so we can define a min and max so that the generated numbers fall between 1
and 1000
.
field().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\n
field.name(\"amount\").`type`(DoubleType).min(1).max(1000),\n
"},{"location":"setup/guide/data-source/cassandra/#name","title":"name","text":"name
is a string that also follows a certain pattern, so we could also define a regex but here we will choose to leverage the DataFaker library and create an expression
to generate a realistic-looking name. All possible faker expressions can be found here.
field().name(\"name\").expression(\"#{Name.name}\"),\n
field.name(\"name\").expression(\"#{Name.name}\"),\n
"},{"location":"setup/guide/data-source/cassandra/#open_time","title":"open_time","text":"open_time
is a timestamp that we want to have a value greater than a specific date. We can define a min date by using java.sql.Date
like below.
field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
"},{"location":"setup/guide/data-source/cassandra/#status","title":"status","text":"status
is a field that can only take one of four values: open, closed, suspended or pending
.
field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
"},{"location":"setup/guide/data-source/cassandra/#created_by","title":"created_by","text":"created_by
is a field that is based on the status
field where it follows the logic: if status is open or closed, then it is created_by eod else created_by event
. This can be achieved by defining a SQL expression like below.
field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n
field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n
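The CASE WHEN rule above can be read as a simple mapping from status to created_by. The following plain-Java sketch mirrors that mapping; the enum and class names are hypothetical and exist only for illustration, since the actual value is computed by the SQL expression at generation time.

```java
// Hypothetical illustration of the created_by rule; not part of the Data Caterer API.
final class CreatedByRule {
    enum Status { OPEN, CLOSED, SUSPENDED, PENDING }
    enum CreatedBy { EOD, EVENT }

    // Mirrors: CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END
    static CreatedBy derive(Status status) {
        switch (status) {
            case OPEN:
            case CLOSED:
                return CreatedBy.EOD;
            default:
                return CreatedBy.EVENT;
        }
    }
}
```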
Putting all the fields together, our class should now look like this.
var accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n
val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n
"},{"location":"setup/guide/data-source/cassandra/#additional-configurations","title":"Additional Configurations","text":"At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure any unique fields will have unique values generated.
var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n
"},{"location":"setup/guide/data-source/cassandra/#execute","title":"Execute","text":"To tell Data Caterer that we want to run with the configurations along with the accountTask
, we have to call execute
. So our full plan run will look like this.
public class MyAdvancedCassandraJavaPlan extends PlanRun {\n{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n
class MyAdvancedCassandraPlan extends PlanRun {\nval accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n
"},{"location":"setup/guide/data-source/cassandra/#run","title":"Run","text":"Now we can run via the script ./run.sh
that is in the top level directory of the data-caterer-example
to run the class we just created.
./run.sh\n#input class MyAdvancedCassandraJavaPlan or MyAdvancedCassandraPlan\n#after completing\ndocker exec docker-cassandraserver-1 cqlsh -e 'select count(1) from account.accounts;select * from account.accounts limit 10;'\n
Your output should look like this.
count\n-------\n 1000\n\n(1 rows)\n\nWarnings :\nAggregation query used without partition key\n\n\n account_id | amount | created_by | name | open_time | status\n-------------+-----------+--------------------+------------------------+---------------------------------+-----------\n ACC13554145 | 917.00418 | zb CVvbBTTzitjo5fK | Jan Sanford I | 2023-06-21 21:50:10.463000+0000 | suspended\n ACC19154140 | 46.99177 | VH88H9 | Clyde Bailey PhD | 2023-07-18 11:33:03.675000+0000 | open\n ACC50587836 | 774.9872 | GENANwPm t | Sang Monahan | 2023-03-21 00:16:53.308000+0000 | closed\n ACC67619387 | 452.86706 | 5msTpcBLStTH | Jewell Gerlach | 2022-10-18 19:13:07.606000+0000 | suspended\n ACC69889784 | 14.69298 | WDmOh7NT | Dale Schulist | 2022-10-25 12:10:52.239000+0000 | suspended\n ACC41977254 | 51.26492 | J8jAKzvj2 | Norma Nienow | 2023-08-19 18:54:39.195000+0000 | suspended\n ACC40932912 | 349.68067 | SLcJgKZdLp5ALMyg | Vincenzo Considine III | 2023-05-16 00:22:45.991000+0000 | closed\n ACC20642011 | 658.40713 | clyZRD4fI | Lannie McLaughlin DDS | 2023-05-11 23:14:30.249000+0000 | open\n ACC74962085 | 970.98218 | ZLETTSnj4NpD | Ima Jerde DVM | 2023-05-07 10:01:56.218000+0000 | pending\n ACC72848439 | 481.64267 | cc | Kyla Deckow DDS | 2023-08-16 13:28:23.362000+0000 | suspended\n\n(10 rows)\n
Also check the HTML report, found at docker/sample/report/index.html
, that gets generated to get an overview of what was executed.
Info
Generating data based on OpenAPI/Swagger document and pushing to HTTP endpoint is a paid feature. Try the free trial here.
Creating a data generator based on an OpenAPI/Swagger document.
"},{"location":"setup/guide/data-source/http/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/data-source/http/#http-setup","title":"HTTP Setup","text":"We will be using the http-bin docker image to help simulate a service with HTTP endpoints.
Start it via:
cd docker\ndocker-compose up -d http\ndocker ps\n
"},{"location":"setup/guide/data-source/http/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedHttpJavaPlanRun.java
src/main/scala/io/github/datacatering/plan/MyAdvancedHttpPlanRun.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedHttpJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedHttpPlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n
We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.
"},{"location":"setup/guide/data-source/http/#schema","title":"Schema","text":"We can point the schema of a data source to a OpenAPI/Swagger document or URL. For this example, we will use the OpenAPI document found under docker/mount/http/petstore.json
in the data-caterer-example repo. This is a simplified version of the original OpenAPI spec that can be found here.
We have kept the following endpoints to test out:
var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count().records(2));\n
val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count.records(2))\n
The above defines that the schema will come from an OpenAPI document found at the path defined. It will then generate 2 requests per request method and endpoint combination.
"},{"location":"setup/guide/data-source/http/#run","title":"Run","text":"Let's try run and see what happens.
cd ..\n./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\n#after completing\ndocker logs -f docker-http-1\n
It should look something like this.
172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DeXQxFUHVja+EYm%26limit%3D33895 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DSXaFvAqwYGF%26tags%3DjdNRFONA%26limit%3D40975 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/kbH8D7rDuq HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/REsa0tnu7dvekGDvxR HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/EqrOr1dHFfKUjWb HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/7WG7JHPaNxP HTTP/1.1 200 Host: host.docker.internal}\n
Looks like we have some data now. But we can do better and add some enhancements to it.
"},{"location":"setup/guide/data-source/http/#foreign-keys","title":"Foreign keys","text":"The four different requests that get sent could have the same id
passed across to each of them if we define a foreign key relationship. This makes the flow closer to a real-life scenario, as pets get created and queried by a particular id
value. We note that the id
value is first used when a pet is created in the body of the POST request. Then it gets used as a path parameter in the DELETE and GET requests.
To link them all together, we must follow a particular pattern when referring to request body, query parameter or path parameter columns.
HTTP Type | Column Prefix | Example
Request Body | bodyContent | bodyContent.id
Path Parameter | pathParam | pathParamid
Query Parameter | queryParam | queryParamid
Header | header | headerContent_Type
Also note that when creating a foreign field definition for an HTTP data source, to refer to a specific endpoint and method, we have to follow the pattern of {http method}{http path}
. For example, POST/pets
. Let's apply this knowledge to link all the id
values together.
var myPlan = plan().addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"), //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n);\n\nexecute(myPlan, conf, httpTask);\n
val myPlan = plan.addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"), //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n)\n\nexecute(myPlan, conf, httpTask)\n
Let's test it out by running it again.
./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n
172.21.0.1 [06/Nov/2023:01:33:59 +0000] GET /anything/pets?limit%3D45971 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:00 +0000] GET /anything/pets?limit%3D62015 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:04 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:05 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n
Now we have the same id
values being produced across the POST, DELETE and GET requests! What if we knew that the id
values should follow a particular pattern?
So given that we have defined a foreign key where the root of the foreign key values is from the POST request, we can update the metadata of the id
column for the POST request and it will proliferate to the other endpoints as well. Given the id
column is a nested column as noted in the foreign key, we can alter its metadata via the following:
var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field().name(\"bodyContent\").schema(field().name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count().records(2));\n
val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field.name(\"bodyContent\").schema(field.name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count.records(2))\n
We first get the column bodyContent
, then get the nested schema and get the column id
and add metadata stating that id
should follow the pattern ID[0-9]{8}
.
Let's run it again; we should now see some proper ID values.
./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n
172.21.0.1 [06/Nov/2023:01:45:45 +0000] GET /anything/pets?tags%3D10fWnNoDz%26limit%3D66804 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:46 +0000] GET /anything/pets?tags%3DhyO6mI8LZUUpS HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:50 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:51 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n
Great! Now we have replicated a production-like flow of HTTP requests.
"},{"location":"setup/guide/data-source/http/#ordering","title":"Ordering","text":"If you wanted to change the ordering of the requests, you can alter the order from within the OpenAPI/Swagger document. This is particularly useful when you want to simulate the same flow that users would take when utilising your application (i.e. create account, query account, update account).
"},{"location":"setup/guide/data-source/http/#rows-per-second","title":"Rows per second","text":"By default, Data Caterer will push requests per method and endpoint at a rate of around 5 requests per second. If you want to alter this value, you can do so via the below configuration. The lowest supported requests per second is 1.
import io.github.datacatering.datacaterer.api.model.Constants;\n\n...\nvar httpTask = http(\"my_http\", Map.of(Constants.ROWS_PER_SECOND(), \"1\"))\n...\n
import io.github.datacatering.datacaterer.api.model.Constants.ROWS_PER_SECOND\n\n...\nval httpTask = http(\"my_http\", options = Map(ROWS_PER_SECOND -> \"1\"))\n...\n
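As a rough guide to how the rows per second setting affects run time, the sketch below estimates how long one method and endpoint combination takes to send its records at a fixed rate. This is plain arithmetic under the assumption of a steady rate; the class and method names are hypothetical, and actual timings will vary since the rate is only approximate.

```java
// Hypothetical estimate, for illustration only; not part of the Data Caterer API.
final class RequestPacing {
    // Approximate seconds needed to send the given number of records at a steady rate.
    static double secondsToSend(int records, int rowsPerSecond) {
        if (rowsPerSecond < 1) {
            // The lowest supported rate is 1 request per second.
            throw new IllegalArgumentException();
        }
        return (double) records / rowsPerSecond;
    }
}
```

For example, 2 records at 1 request per second take roughly 2 seconds per combination, while 10 records at the default of around 5 per second also take roughly 2 seconds.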
Check out the full example under AdvancedHttpPlanRun
in the example repo.
Info
Writing data to Kafka is a paid feature. Try the free trial here.
Creating a data generator for Kafka. You will build a Docker image that will be able to populate data in Kafka for the topics you configure.
"},{"location":"setup/guide/data-source/kafka/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
If you already have a Kafka instance running, you can skip to this step.
"},{"location":"setup/guide/data-source/kafka/#kafka-setup","title":"Kafka Setup","text":"Next, let's make sure you have an instance of Kafka up and running in your local environment. This will make it easy for us to iterate and check our changes.
cd docker\ndocker-compose up -d kafka\n
"},{"location":"setup/guide/data-source/kafka/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedKafkaJavaPlan.java
src/main/scala/io/github/datacatering/plan/MyAdvancedKafkaPlan.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedKafkaJavaPlan extends PlanRun {\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n\nclass MyAdvancedKafkaPlan extends PlanRun {\n}\n
This class is where we define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.
"},{"location":"setup/guide/data-source/kafka/#connection-configuration","title":"Connection Configuration","text":"Within our class, we can start by defining the connection properties to connect to Kafka.
var accountTask = kafka(\n\"my_kafka\", //name\n\"localhost:9092\", //url\nMap.of() //optional additional connection options\n);\n
val accountTask = kafka(\n\"my_kafka\", //name\n\"localhost:9092\", //url\nMap() //optional additional connection options\n)\n
Additional options can be found here.
"},{"location":"setup/guide/data-source/kafka/#schema","title":"Schema","text":"Let's create a task for inserting data into the account-topic
that is already defined under docker/data/kafka/setup_kafka.sh
. This topic should already be set up for you if you followed this step. We can check whether the topic is set up via the following command:
docker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n
Trimming the connection details to work with the docker-compose Kafka, we have a base Kafka connection to define the topic we will publish to. Let's define each field along with their corresponding data type. You will notice that the text
fields do not have a data type defined. This is because the default data type is StringType
.
{\nvar kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield().name(\"key\").sql(\"content.account_id\"),\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()), can define partition here\nfield().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n),\nfield().name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield().name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n);\n}\n
val kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield.name(\"key\").sql(\"content.account_id\"),\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").type(IntegerType), can define partition here\nfield.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n | NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n | NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\nfield.name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield.name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n)\n
"},{"location":"setup/guide/data-source/kafka/#fields","title":"Fields","text":"The schema defined for Kafka has a format that needs to be followed as noted above. Specifically, the required fields are: - value
The other fields are optional:
headers
follows a particular pattern where it is of type array<struct<key: string,value: binary>>
. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that in the value
part, it refers to content.account_id
where content
is another field defined at the top level of the schema. This allows you to reference other values that have already been generated.
field().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n
field.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n | NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n | NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n |)\"\"\".stripMargin\n)\n
"},{"location":"setup/guide/data-source/kafka/#transactions","title":"transactions","text":"transactions
is an array that contains an inner structure of txn_date
and amount
. The size of the array generated can be controlled via arrayMinLength
and arrayMaxLength
.
field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n
field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n
"},{"location":"setup/guide/data-source/kafka/#details","title":"details","text":"details
is another example of a nested schema structure where it also has a nested structure itself in updated_by
. One thing to note here is the first_txn_date
field has a reference to the content.transactions
array where it will sort the array by txn_date
and get the first element.
field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n
field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n
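The first_txn_date expression sorts the transaction dates in ascending order and takes the first element, which is the earliest date. A minimal plain-Java sketch of the same idea follows; the class and method names are hypothetical, and an empty result stands in for the case where there are no transactions and no first_txn_date is produced.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of the sort-then-take-first logic; not part of the Data Caterer API.
final class FirstTxnDate {
    // Sorts the dates ascending and returns the first one, or empty when there are none.
    static Optional<LocalDate> earliest(List<LocalDate> txnDates) {
        return txnDates.stream().sorted().findFirst();
    }
}
```

For transaction dates of 2021-10-29 and 2021-01-22, the result is 2021-01-22.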
"},{"location":"setup/guide/data-source/kafka/#additional-configurations","title":"Additional Configurations","text":"At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations.
JavaScalavar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n
"},{"location":"setup/guide/data-source/kafka/#execute","title":"Execute","text":"To tell Data Caterer that we want to run with the configurations along with the kafkaTask
, we have to call execute
.
Now we can run via the script ./run.sh
that is in the top level directory of the data-caterer-example
to run the class we just created.
./run.sh\n#input class AdvancedKafkaJavaPlanRun or AdvancedKafkaPlanRun\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic --from-beginning\n
Your output should look like this.
{\"account_id\":\"ACC56292178\",\"year\":2022,\"amount\":18338.627721151555,\"details\":{\"name\":\"Isaias Reilly\",\"first_txn_date\":\"2021-01-22\",\"updated_by\":{\"user\":\"FgYXbKDWdhHVc3\",\"time\":\"2022-12-30T13:49:07.309Z\"}},\"transactions\":[{\"txn_date\":\"2021-01-22\",\"amount\":30556.52125487579},{\"txn_date\":\"2021-10-29\",\"amount\":39372.302259554635},{\"txn_date\":\"2021-10-29\",\"amount\":61887.31389495968}]}\n{\"account_id\":\"ACC37729457\",\"year\":2022,\"amount\":96885.31758764731,\"details\":{\"name\":\"Randell Witting\",\"first_txn_date\":\"2021-06-30\",\"updated_by\":{\"user\":\"HCKYEBHN8AJ3TB\",\"time\":\"2022-12-02T02:05:01.144Z\"}},\"transactions\":[{\"txn_date\":\"2021-06-30\",\"amount\":98042.09647765031},{\"txn_date\":\"2021-10-06\",\"amount\":41191.43564742036},{\"txn_date\":\"2021-11-16\",\"amount\":78852.08184809204},{\"txn_date\":\"2021-10-09\",\"amount\":13747.157653571106}]}\n{\"account_id\":\"ACC23127317\",\"year\":2023,\"amount\":81164.49304198896,\"details\":{\"name\":\"Jed Wisozk\",\"updated_by\":{\"user\":\"9MBFZZ\",\"time\":\"2023-07-12T05:56:52.397Z\"}},\"transactions\":[]}\n
Also check the HTML report, found at docker/sample/report/index.html
, that gets generated to get an overview of what was executed.
Info
Generating data based on an external metadata source is a paid feature. Try the free trial here.
Creating a data generator for Postgres tables and a CSV file based on metadata stored in Marquez (which follows the OpenLineage API).
"},{"location":"setup/guide/data-source/marquez-metadata-source/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/data-source/marquez-metadata-source/#marquez-setup","title":"Marquez Setup","text":"You can follow the README found here to help with setting up Marquez in your local environment. This comes with an instance of Postgres which we will also be using as a data store for generated data.
The command that was run for this example to help with setup of dummy data was ./docker/up.sh -a 5001 -m 5002 --seed
.
Check that the following url shows some data like below once you click on food_delivery
from the ns
drop down in the top right corner.
Since we will also be using the Marquez Postgres instance as a data source, we will set up a separate database to store the generated data in via:
docker exec marquez-db psql -Upostgres -c 'CREATE DATABASE food_delivery'\n
"},{"location":"setup/guide/data-source/marquez-metadata-source/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedMetadataSourceJavaPlanRun.java
src/main/scala/io/github/datacatering/plan/MyAdvancedMetadataSourcePlanRun.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n
We enable plan and task generation so that Data Caterer can read metadata from external sources, and we save the reports under a folder we can easily access.
"},{"location":"setup/guide/data-source/marquez-metadata-source/#schema","title":"Schema","text":"We can point the schema of a data source to our Marquez instance. For the Postgres data source, we will point to a namespace
, which in Marquez or OpenLineage, represents a set of datasets. For the CSV data source, we will point to a specific namespace
and dataset
.
var csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map.of(\"saveMode\", \"overwrite\", \"header\", \"true\"))\n.schema(metadataSource().marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count().records(10));\n
val csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map(\"saveMode\" -> \"overwrite\", \"header\" -> \"true\"))\n.schema(metadataSource.marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count.records(10))\n
The above defines that the schema will come from Marquez, which is a type of metadata source that contains information about schemas. Specifically, it points to the food_delivery
namespace and public.delivery_7_days
dataset to retrieve the schema information from.
var postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\", Map.of())\n.schema(metadataSource().marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count().records(10));\n
val postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\")\n.schema(metadataSource.marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count.records(10))\n
We have now configured this Postgres data source to generate data for the multiple schemas defined under the food_delivery
namespace. Also note that we are using database food_delivery
in Postgres to push our generated data to, and we have set the number of records per sub data source (in this case, per table) to be 10.
Let's try running it and see what happens.
cd ..\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\n#after completing\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n
It should look something like this.
order_id | order_placed_on | order_dispatched_on | order_delivered_on | customer_email | customer_address | menu_id | restaurant_id | restaurant_address\n | menu_item_id | category_id | discount_id | city_id | driver_id\n----------+-------------------------+-------------------------+-------------------------+--------------------------------+----------------------------------------------------------+---------+---------------+---------------------------------------------------------------\n---+--------------+-------------+-------------+---------+-----------\n 38736 | 2023-02-05 06:05:23.755 | 2023-09-08 04:29:10.878 | 2023-09-03 23:58:34.285 | april.skiles@hotmail.com | 5018 Lang Dam, Gaylordfurt, MO 35172 | 59841 | 30971 | Suite 439 51366 Bartoletti Plains, West Lashawndamouth, CA 242\n42 | 55697 | 36370 | 21574 | 88022 | 16569\n4376 | 2022-12-19 14:39:53.442 | 2023-08-30 07:40:06.948 | 2023-03-15 20:38:26.11 | adelina.balistreri@hotmail.com | Apt. 340 9146 Novella Motorway, East Troyhaven, UT 34773 | 66195 | 42765 | Suite 670 8956 Rob Fork, Rennershire, CA 04524\n| 26516 | 81335 | 87615 | 27433 | 45649\n11083 | 2022-10-30 12:46:38.692 | 2023-06-02 13:05:52.493 | 2022-11-27 18:38:07.873 | johnny.gleason@gmail.com | Apt. 385 99701 Lemke Place, New Irvin, RI 73305 | 66427 | 44438 | 1309 Danny Cape, Weimanntown, AL 15865\n| 41686 | 36508 | 34498 | 24191 | 92405\n58759 | 2023-07-26 14:32:30.883 | 2022-12-25 11:04:08.561 | 2023-04-21 17:43:05.86 | isabelle.ohara@hotmail.com | 2225 Evie Lane, South Ardella, SD 90805 | 27106 | 25287 | Suite 678 3731 Dovie Park, Port Luigi, ID 08250\n| 94205 | 66207 | 81051 | 52553 | 27483\n
You can also try querying some other tables. Let's also check what is in the CSV file.
$ head docker/sample/csv/part-0000*\nmenu_item_id,category_id,discount_id,city_id,driver_id,order_id,order_placed_on,order_dispatched_on,order_delivered_on,customer_email,customer_address,menu_id,restaurant_id,restaurant_address\n72248,37098,80135,45888,5036,11090,2023-09-20T05:33:08.036+08:00,2023-05-16T23:10:57.119+08:00,2023-05-01T22:02:23.272+08:00,demetrice.rohan@hotmail.com,\"406 Harmony Rue, Wisozkburgh, MD 12282\",33762,9042,\"Apt. 751 0796 Ellan Flats, Lake Chetville, WI 81957\"\n41644,40029,48565,83373,89919,58359,2023-04-18T06:28:26.194+08:00,2022-10-15T18:17:48.998+08:00,2023-02-06T17:02:04.104+08:00,joannie.okuneva@yahoo.com,\"Suite 889 022 Susan Lane, Zemlakport, OR 56996\",27467,6216,\"Suite 016 286 Derick Grove, Dooleytown, NY 14664\"\n49299,53699,79675,40821,61764,72234,2023-07-16T21:33:48.739+08:00,2023-02-14T21:23:10.265+08:00,2023-09-18T02:08:51.433+08:00,ina.heller@yahoo.com,\"Suite 600 86844 Heller Island, New Celestinestad, DE 42622\",48002,12462,\"5418 Okuneva Mountain, East Blairchester, MN 04060\"\n83197,86141,11085,29944,81164,65382,2023-01-20T06:08:25.981+08:00,2023-01-11T13:24:32.968+08:00,2023-09-09T02:30:16.890+08:00,lakisha.bashirian@yahoo.com,\"Suite 938 534 Theodore Lock, Port Caitlynland, LA 67308\",69109,47727,\"4464 Stewart Tunnel, Marguritemouth, AR 56791\"\n
Looks like we have some data now. But we can do better and add some enhancements to it.
What if we wanted the same records in Postgres public.delivery_7_days
to also show up in the CSV file? That's where we can use a foreign key definition.
We can take a look at the report (under docker/sample/report/index.html
) to see what we need to do to create the foreign key. From the overview, you should see under Tasks
there is a my_postgres
task which has food_delivery_public.delivery_7_days
as a step. Click on the link for food_delivery_public.delivery_7_days
and it will take us to a page where we can find out about the columns used in this table. Click on the Fields
button on the far right to see.
We can copy all or a subset of the fields that we want matched across the CSV file and Postgres. For this example, we will take all the fields.
var myPlan = plan().addForeignKeyRelationship(\npostgresTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask);\n
val foreignCols = List(\"order_id\", \"order_placed_on\", \"order_dispatched_on\", \"order_delivered_on\", \"customer_email\",\n\"customer_address\", \"menu_id\", \"restaurant_id\", \"restaurant_address\", \"menu_item_id\", \"category_id\", \"discount_id\",\n\"city_id\", \"driver_id\")\n\nval myPlan = plan.addForeignKeyRelationships(\ncsvTask, foreignCols,\nList(foreignField(postgresTask, \"food_delivery_public.delivery_7_days\", foreignCols))\n)\n\nval conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask)\n
Notice how we have defined the csvTask
and foreignCols
as the main foreign key but for postgresTask
, we had to define it as a foreignField
. This is because postgresTask
has multiple tables within it, and we only want to define our foreign key with respect to the public.delivery_7_days
table. We use the step name (can be seen from the report) to specify the table to target.
To test this out, we will truncate the public.delivery_7_days
table in Postgres first, and then try running it again.
docker exec marquez-db psql -Upostgres -d food_delivery -c 'TRUNCATE public.delivery_7_days'\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n
order_id | order_placed_on | order_dispatched_on | order_delivered_on | customer_email |\ncustomer_address | menu_id | restaurant_id | restaurant_address | menu\n_item_id | category_id | discount_id | city_id | driver_id\n----------+-------------------------+-------------------------+-------------------------+------------------------------+-------------\n--------------------------------------------+---------+---------------+--------------------------------------------------------+-----\n---------+-------------+-------------+---------+-----------\n 53333 | 2022-10-15 08:40:23.394 | 2023-01-23 09:42:48.397 | 2023-08-12 08:50:52.397 | normand.aufderhar@gmail.com | Apt. 036 449\n27 Wilderman Forge, Marvinchester, CT 15952 | 40412 | 70130 | Suite 146 98176 Schaden Village, Grahammouth, SD 12354 |\n90141 | 44210 | 83966 | 78614 | 77449\n
Let's grab the first email from the Postgres table and check whether the same record exists in the CSV file.
$ cat docker/sample/csv/part-0000* | grep normand.aufderhar\n90141,44210,83966,78614,77449,53333,2022-10-15T08:40:23.394+08:00,2023-01-23T09:42:48.397+08:00,2023-08-12T08:50:52.397+08:00,normand.aufderhar@gmail.com,\"Apt. 036 44927 Wilderman Forge, Marvinchester, CT 15952\",40412,70130,\"Suite 146 98176 Schaden Village, Grahammouth, SD 12354\"\n
Great! Now we have the ability to get schema information from an external source, add our own foreign keys and generate data.
Check out the full example under AdvancedMetadataSourcePlanRun
in the example repo.
Info
Generating data based on an external metadata source is a paid feature. Try the free trial here.
Creating a data generator for a JSON file based on metadata stored in OpenMetadata.
"},{"location":"setup/guide/data-source/open-metadata-source/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/data-source/open-metadata-source/#openmetadata-setup","title":"OpenMetadata Setup","text":"You can follow the local docker setup found here to help with setting up OpenMetadata in your local environment.
If that page becomes outdated or the link doesn't work, below are the commands I used to run it:
mkdir openmetadata-docker && cd openmetadata-docker\ncurl -sL https://github.com/open-metadata/OpenMetadata/releases/download/1.2.0-release/docker-compose.yml > docker-compose.yml\ndocker compose -f docker-compose.yml up --detach\n
Check that the following URL works and log in with admin:admin
. Then you should see some data like below:
Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedOpenMetadataSourceJavaPlanRun.java
src/main/scala/io/github/datacatering/plan/MyAdvancedOpenMetadataSourcePlanRun.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedOpenMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedOpenMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n
We enable plan and task generation so that Data Caterer can read metadata from external sources, and we save the reports under a folder we can easily access.
"},{"location":"setup/guide/data-source/open-metadata-source/#schema","title":"Schema","text":"We can point the schema of a data source to our OpenMetadata instance. We will use a JSON data source so that we can show how nested data types are handled and how we could customise it.
"},{"location":"setup/guide/data-source/open-metadata-source/#single-schema","title":"Single Schema","text":"JavaScalaimport io.github.datacatering.datacaterer.api.model.Constants;\n...\n\nvar jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(metadataSource().openMetadataJava(\n\"http://localhost:8585/api\", //url\nConstants.OPEN_METADATA_AUTH_TYPE_OPEN_METADATA(), //auth type\nMap.of( //additional options (including auth options)\nConstants.OPEN_METADATA_JWT_TOKEN(), \"abc123\", //get from settings/bots/ingestion-bot\nConstants.OPEN_METADATA_TABLE_FQN(), \"sample_data.ecommerce_db.shopify.raw_customer\" //table fully qualified name\n)\n))\n.count(count().records(10));\n
import io.github.datacatering.datacaterer.api.model.Constants.{OPEN_METADATA_AUTH_TYPE_OPEN_METADATA, OPEN_METADATA_JWT_TOKEN, OPEN_METADATA_TABLE_FQN, SAVE_MODE}\n...\n\nval jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -> \"overwrite\"))\n.schema(metadataSource.openMetadata(\n\"http://localhost:8585/api\", //url\nOPEN_METADATA_AUTH_TYPE_OPEN_METADATA, //auth type\nMap( //additional options (including auth options)\nOPEN_METADATA_JWT_TOKEN -> \"abc123\", //get from settings/bots/ingestion-bot\nOPEN_METADATA_TABLE_FQN -> \"sample_data.ecommerce_db.shopify.raw_customer\" //table fully qualified name\n)\n))\n.count(count.records(10))\n
The above defines that the schema will come from OpenMetadata, which is a type of metadata source that contains information about schemas. Specifically, it points to the sample_data.ecommerce_db.shopify.raw_customer
table. You can check out the schema here to see what it looks like.
Let's try running it and see what happens.
cd ..\n./run.sh\n#input class MyAdvancedOpenMetadataSourceJavaPlanRun or MyAdvancedOpenMetadataSourcePlanRun\n#after completing\ncat docker/sample/json/part-00000-*\n
It should look something like this.
{\n\"comments\": \"Mh6jqpD5e4M\",\n\"creditcard\": \"6771839575926717\",\n\"membership\": \"Za3wCQUl9E EJj712\",\n\"orders\": [\n{\n\"product_id\": \"Aa6NG0hxfHVq\",\n\"price\": 16139,\n\"onsale\": false,\n\"tax\": 58134,\n\"weight\": 40734,\n\"others\": 45813,\n\"vendor\": \"Kh\"\n},\n{\n\"product_id\": \"zbHBY \",\n\"price\": 17903,\n\"onsale\": false,\n\"tax\": 39526,\n\"weight\": 9346,\n\"others\": 52035,\n\"vendor\": \"jbkbnXAa\"\n},\n{\n\"product_id\": \"5qs3gakppd7Nw5\",\n\"price\": 48731,\n\"onsale\": true,\n\"tax\": 81105,\n\"weight\": 2004,\n\"others\": 20465,\n\"vendor\": \"nozCDMSXRPH Ev\"\n},\n{\n\"product_id\": \"CA6h17ANRwvb\",\n\"price\": 62102,\n\"onsale\": true,\n\"tax\": 96601,\n\"weight\": 78849,\n\"others\": 79453,\n\"vendor\": \" ihVXEJz7E2EFS\"\n}\n],\n\"platform\": \"GLt9\",\n\"preference\": {\n\"key\": \"nmPmsPjg C\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Loren Bechtelar\",\n\"street_address\": \"Suite 526 293 Rohan Road, Wunschshire, NE 25532\",\n\"city\": \"South Norrisland\",\n\"postcode\": \"56863\"\n}\n],\n\"shipping_date\": \"2022-11-03\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"lance.murphy\",\n\"name\": \"Zane Brakus DVM\",\n\"sex\": \"7HcAaPiO\",\n\"address\": \"594 Loida Haven, Gilland, MA 26071\",\n\"mail\": \"Un3fhbvK2rEbenIYdnq\",\n\"birthdate\": \"2023-01-31\"\n}\n}\n
Looks like we have some data now. But we can do better and add some enhancements to it.
"},{"location":"setup/guide/data-source/open-metadata-source/#custom-metadata","title":"Custom metadata","text":"We can see from the data generated, that it isn't quite what we want. The metadata is not sufficient for us to produce production-like data yet. Let's try to add some enhancements to it.
Let's make the platform
field a choice field that can only take one of a fixed set of values, and do the same for the nested field customer.sex
, so that it also comes from a predefined set of values.
var jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield().name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield().name(\"customer\").schema(field().name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count().records(10));\n
val jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -> \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield.name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield.name(\"customer\").schema(field.name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count.records(10))\n
Let's test it out by running it again.
./run.sh\n#input class MyAdvancedOpenMetadataSourceJavaPlanRun or MyAdvancedOpenMetadataSourcePlanRun\ncat docker/sample/json/part-00000-*\n
{\n\"comments\": \"vqbPUm\",\n\"creditcard\": \"6304867705548636\",\n\"membership\": \"GZ1xOnpZSUOKN\",\n\"orders\": [\n{\n\"product_id\": \"rgOokDAv\",\n\"price\": 77367,\n\"onsale\": false,\n\"tax\": 61742,\n\"weight\": 87855,\n\"others\": 26857,\n\"vendor\": \"04XHR64ImMr9T\"\n}\n],\n\"platform\": \"mobile\",\n\"preference\": {\n\"key\": \"IB5vNdWka\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Isiah Bins\",\n\"street_address\": \"36512 Ross Spurs, Hillhaven, IA 18760\",\n\"city\": \"Averymouth\",\n\"postcode\": \"75818\"\n},\n{\n\"name\": \"Scott Prohaska\",\n\"street_address\": \"26573 Haley Ports, Dariusland, MS 90642\",\n\"city\": \"Ashantimouth\",\n\"postcode\": \"31792\"\n},\n{\n\"name\": \"Rudolf Stamm\",\n\"street_address\": \"Suite 878 0516 Danica Path, New Christiaport, ID 10525\",\n\"city\": \"Doreathaport\",\n\"postcode\": \"62497\"\n}\n],\n\"shipping_date\": \"2023-08-24\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"jolie.cremin\",\n\"name\": \"Fay Klein\",\n\"sex\": \"O\",\n\"address\": \"Apt. 174 5084 Volkman Creek, Hillborough, PA 61959\",\n\"mail\": \"BiTmzb7\",\n\"birthdate\": \"2023-04-07\"\n}\n}\n
Great! Now we have the ability to get schema information from an external source, add our own metadata and generate data.
"},{"location":"setup/guide/data-source/open-metadata-source/#data-validation","title":"Data validation","text":"Another aspect of OpenMetadata that can be leveraged is the definition of data quality rules. These rules can be incorporated into your Data Caterer job as well by enabling data validations via enableGenerateValidations
in configuration
.
var conf = configuration().enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(conf, jsonTask);\n
val conf = configuration.enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(conf, jsonTask)\n
Check out the full example under AdvancedOpenMetadataSourcePlanRun
in the example repo.
Info
Writing data to Solace is a paid feature. Try the free trial here.
Creating a data generator for Solace. You will build a Docker image that will be able to populate data in Solace for the queues/topics you configure.
"},{"location":"setup/guide/data-source/solace/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
If you already have a Solace instance running, you can skip to this step.
"},{"location":"setup/guide/data-source/solace/#solace-setup","title":"Solace Setup","text":"Next, let's make sure you have an instance of Solace up and running in your local environment. This will make it easy for us to iterate and check our changes.
cd docker\ndocker-compose up -d solace\n
Open up localhost:8080 and log in with admin:admin
and check that there is the default
VPN like below. Notice that there are 2 queues/topics created. If you do not see both, try running the script found under docker/data/solace/setup_solace.sh
after changing the host
to localhost
.
Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedSolaceJavaPlan.java
src/main/scala/io/github/datacatering/plan/MyAdvancedSolacePlan.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedSolaceJavaPlan extends PlanRun {\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n\nclass MyAdvancedSolacePlan extends PlanRun {\n}\n
This class is where we define all of our configurations for generating data. Helper variables and methods are provided to make this simple and easy to use.
"},{"location":"setup/guide/data-source/solace/#connection-configuration","title":"Connection Configuration","text":"Within our class, we can start by defining the connection properties to connect to Solace.
var accountTask = solace(\n\"my_solace\", //name\n\"smf://host.docker.internal:55554\", //url\nMap.of() //optional additional connection options\n);\n
val accountTask = solace(\n\"my_solace\", //name\n\"smf://host.docker.internal:55554\", //url\nMap() //optional additional connection options\n)\n
Additional connection options can be found here.
"},{"location":"setup/guide/data-source/solace/#schema","title":"Schema","text":"Let's create a task for inserting data into the rest_test_queue
or rest_test_topic
that is already created for us from this step.
After trimming the connection details to work with the docker-compose Solace instance, we have a base Solace connection on which to define the JNDI destination we will publish to. Let's define each field along with its corresponding data type. You will notice that the text
fields do not have a data type defined. This is because the default data type is StringType
.
{\nvar solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()), //can define message JMS priority here\nfield().name(\"headers\") //set message properties via headers field\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()).min(2021).max(2023),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n)\n)\n.count(count().records(10));\n}\n
val solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").`type`(IntegerType), //can define message JMS priority here\nfield.name(\"headers\") //set message properties via headers field\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n | NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n | NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\n).count(count.records(10))\n
"},{"location":"setup/guide/data-source/solace/#fields","title":"Fields","text":"The schema defined for Solace has a format that needs to be followed as noted above. Specifically, the required fields are:
Whilst the other fields are optional:
headers
follows a particular pattern: it is of type HeaderType.getType
, which behind the scenes translates to array<struct<key: string,value: binary>>
. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that in the value
part, it refers to content.account_id
where content
is another field defined at the top level of the schema. This allows you to reference other values that have already been generated.
field().name(\"headers\")\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n
field.name(\"headers\")\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n | NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n | NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n |)\"\"\".stripMargin\n)\n
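As a rough illustration of the header shape described above, the following plain-Java sketch builds the same two key/value pairs with UTF-8 encoded binary values. It is not Data Caterer or Solace code: the class `HeaderSketch` and method `buildHeaders` are hypothetical names, a `Map` stands in for the array of key/value structs, and the sample values are made up to match the field patterns in the schema.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderSketch {

    // Hypothetical helper: each header is a string key with a binary value,
    // where the value is the UTF-8 encoding of another generated field.
    static Map<String, byte[]> buildHeaders(String accountId, String updatedTime) {
        Map<String, byte[]> headers = new LinkedHashMap<>();
        headers.put("account-id", accountId.getBytes(StandardCharsets.UTF_8));
        headers.put("updated", updatedTime.getBytes(StandardCharsets.UTF_8));
        return headers;
    }

    public static void main(String[] args) {
        // Sample values shaped like content.account_id and content.details.updated_by.time
        Map<String, byte[]> headers = buildHeaders("ACC56292178", "2022-12-30T13:49:07.309Z");

        headers.forEach((key, value) ->
                System.out.println(key + " -> " + value.length + " bytes"));
    }
}
```

The point of the sketch is only that header values are bytes derived from fields that were already generated, which is why the SQL expression can refer to `content.account_id`.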
"},{"location":"setup/guide/data-source/solace/#transactions","title":"transactions","text":"transactions
is an array that contains an inner structure of txn_date
and amount
. The size of the array generated can be controlled via arrayMinLength
and arrayMaxLength
.
field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n
field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n
"},{"location":"setup/guide/data-source/solace/#details","title":"details","text":"details
is another example of a nested schema structure where it also has a nested structure itself in updated_by
. One thing to note here is the first_txn_date
field has a reference to the content.transactions
array where it will sort the array by txn_date
and get the first element.
field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n
field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n
"},{"location":"setup/guide/data-source/solace/#additional-configurations","title":"Additional Configurations","text":"At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations.
var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n
"},{"location":"setup/guide/data-source/solace/#execute","title":"Execute","text":"To tell Data Caterer that we want to run with the configurations along with the kafkaTask
, we have to call execute
.
Now we can run via the script ./run.sh
that is in the top level directory of the data-caterer-example
to run the class we just created.
./run.sh\n#input class AdvancedSolaceJavaPlanRun or AdvancedSolacePlanRun\n#after completing, check http://localhost:8080 from browser\n
Your output should look like this.
Unfortunately, there is no easy way to see the message content. You can check the message content from your application or service that consumes these messages.
Also check the HTML report, found at docker/sample/report/index.html
, that gets generated to get an overview of what was executed. Or view the sample report found here.
Info
Auto data generation from data connection is a paid feature. Try the free trial here.
Creating a data generator based on only a data connection to Postgres.
"},{"location":"setup/guide/scenario/auto-generate-connection/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/auto-generate-connection/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedAutomatedJavaPlanRun.java
src/main/scala/io/github/datacatering/plan/MyAdvancedAutomatedPlanRun.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedAutomatedJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\") (1)\n.enableGeneratePlanAndTasks(true) (2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\") (3)\n.enableUniqueCheck(true) (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedAutomatedPlanRun extends PlanRun {\n\nval autoRun = configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\") (1)\n.enableGeneratePlanAndTasks(true) (2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\") (3)\n.enableUniqueCheck(true) (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n
In the above code, we note the following:
1. A Postgres connection named my_postgres is defined.
2. enableGeneratePlanAndTasks tells Data Caterer to go to my_postgres and generate data for all the tables found under the database customer (which is defined in the connection string).
3. generatedPlanAndTaskFolderPath defines where the metadata gathered from my_postgres is saved, so that it can be reused later.
4. enableUniqueCheck is set to true to ensure that generated data is unique based on primary key or foreign key definitions.
Note
Unique check will only ensure generated data is unique. Any existing data in your data source is not taken into account, so generated data may fail to insert depending on the data source restrictions.
"},{"location":"setup/guide/scenario/auto-generate-connection/#postgres-setup","title":"Postgres Setup","text":"If you don't have your own Postgres up and running, you can set up and run an instance configured in the docker
folder via:
cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n
This will create the tables found under docker/data/sql/postgres/customer.sql. You can change this file to contain your own tables. We can see there are 4 tables created for us: accounts, balances, transactions and mapping.
Let's try running it.
cd ..\n./run.sh\n#input class MyAdvancedAutomatedJavaPlanRun or MyAdvancedAutomatedPlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1;'\n
It should look something like this.
id | account_number | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint | customer_id_decimal | customer_id_real | customer_id_double | open_date | open_timestamp | last_opened_time | payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H | SfA0eZJcTm | CuRw | 13 | 42 | 6041 | 76987.745612542900000000 | 91866.78 | 66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736 | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n
The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.
Also check the HTML report that gets generated under docker/sample/report/index.html
. You can see a summary of what was generated along with other metadata.
You can now look to play around with other tables or data sources and auto generate for them.
"},{"location":"setup/guide/scenario/auto-generate-connection/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/auto-generate-connection/#learn-from-existing-data","title":"Learn From Existing Data","text":"If you have any existing data within your data source, Data Caterer will gather metadata about the existing data to help guide it when generating new data. There are configurations that can help tune the metadata analysis found here.
"},{"location":"setup/guide/scenario/auto-generate-connection/#filter-out-schematables","title":"Filter Out Schema/Tables","text":"As part of your connection definition, you can define any schemas and/or tables you don't want to generate data for. In the example below, no data is generated for any tables under the history and audit schemas. In addition, any table named balances or transactions, in any schema, will not have data generated.
var autoRun = configuration()\n.postgres(\n\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap.of(\n\"filterOutSchema\", \"history, audit\",\n\"filterOutTable\", \"balances, transactions\")\n);\n
val autoRun = configuration\n.postgres(\n\"my_postgres\",\n\"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap(\n\"filterOutSchema\" -> \"history, audit\",\n\"filterOutTable\" -> \"balances, transactions\")\n)\n
"},{"location":"setup/guide/scenario/auto-generate-connection/#define-record-count","title":"Define record count","text":"You can control the record count per sub data source via numRecordsPerStep
.
var autoRun = configuration()\n...\n.numRecordsPerStep(100);\n\nexecute(autoRun);\n
val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n
"},{"location":"setup/guide/scenario/batch-and-event/","title":"Generate Batch and Event Data","text":"Info
Generating event data is a paid feature. Try the free trial here.
Creating a data generator for a Kafka topic with matching records in a CSV file.
"},{"location":"setup/guide/scenario/batch-and-event/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/batch-and-event/#kafka-setup","title":"Kafka Setup","text":"If you don't have your own Kafka up and running, you can set up and run an instance configured in the docker
folder via:
cd docker\ndocker-compose up -d kafka\ndocker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n
Let's create a task for inserting data into the account-topic that is already defined under docker/data/kafka/setup_kafka.sh.
Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedBatchEventJavaPlanRun.java
src/main/scala/io/github/datacatering/plan/MyAdvancedBatchEventPlanRun.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedBatchEventJavaPlanRun extends PlanRun {\n{\nvar kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedBatchEventPlanRun extends PlanRun {\nval kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n}\n
We will borrow the Kafka task that is already defined under the class AdvancedKafkaPlanRun
or AdvancedKafkaJavaPlanRun
. You can go through the Kafka guide here for more details.
Let us set up the corresponding schema for the CSV file where we want to match the values that are generated for the Kafka messages.
var kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n\nvar csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield().name(\"account_number\"),\nfield().name(\"year\"),\nfield().name(\"name\"),\nfield().name(\"payload\")\n);\n
val kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n\nval csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield.name(\"account_number\"),\nfield.name(\"year\"),\nfield.name(\"name\"),\nfield.name(\"payload\")\n)\n
This is a simple schema where we want to use the values and metadata that are already defined in the kafkaTask to determine what the data will look like for the CSV file. Even if we defined some metadata here, it would be overridden when we define our foreign key relationships.
From the above CSV schema, we note the following against the Kafka schema:
- account_number in CSV needs to match with the account_id in Kafka
  - account_id is referred to in the key column as field.name(\"key\").sql(\"content.account_id\")
- year needs to match with content.year in Kafka, which is a nested field
  - tmp_year will not appear in the final output for the Kafka messages but is used as an intermediate step: field.name(\"tmp_year\").sql(\"content.year\").omit(true)
- name needs to match with content.details.name in Kafka, also a nested field
  - tmp_name will take the value of the nested field but will be omitted: field.name(\"tmp_name\").sql(\"content.details.name\").omit(true)
- payload represents the whole JSON message sent to Kafka, which matches the value column
Our foreign keys are therefore defined like below. Order is important when defining the list of columns. The index needs to match with the corresponding column in the other data source.
var myPlan = plan().addForeignKeyRelationship(\nkafkaTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(myPlan, conf, kafkaTask, csvTask);\n
val myPlan = plan.addForeignKeyRelationship(\nkafkaTask, List(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList(csvTask -> List(\"account_number\", \"year\", \"name\", \"payload\"))\n)\n\nval conf = configuration.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(myPlan, conf, kafkaTask, csvTask)\n
"},{"location":"setup/guide/scenario/batch-and-event/#run","title":"Run","text":"Let's try running it.
cd ..\n./run.sh\n#input class MyAdvancedBatchEventJavaPlanRun or MyAdvancedBatchEventPlanRun\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic --from-beginning\n
It should look something like this.
{\"account_id\":\"ACC03093143\",\"year\":2023,\"amount\":87990.37196728592,\"details\":{\"name\":\"Nadine Heidenreich Jr.\",\"first_txn_date\":\"2021-11-09\",\"updated_by\":{\"user\":\"YfEyJCe8ohrl0j IfyT\",\"time\":\"2022-09-26T20:47:53.404Z\"}},\"transactions\":[{\"txn_date\":\"2021-11-09\",\"amount\":97073.7914706189}]}\n{\"account_id\":\"ACC08764544\",\"year\":2021,\"amount\":28675.58758765888,\"details\":{\"name\":\"Delila Beer\",\"first_txn_date\":\"2021-05-19\",\"updated_by\":{\"user\":\"IzB5ksXu\",\"time\":\"2023-01-26T20:47:26.389Z\"}},\"transactions\":[{\"txn_date\":\"2021-10-01\",\"amount\":80995.23818711648},{\"txn_date\":\"2021-05-19\",\"amount\":92572.40049217848},{\"txn_date\":\"2021-12-11\",\"amount\":99398.79832225188}]}\n{\"account_id\":\"ACC62505420\",\"year\":2023,\"amount\":96125.3125884202,\"details\":{\"name\":\"Shawn Goodwin\",\"updated_by\":{\"user\":\"F3dqIvYp2pFtena4\",\"time\":\"2023-02-11T04:38:29.832Z\"}},\"transactions\":[]}\n
Let's also check if there is a corresponding record in the CSV file.
$ cat docker/sample/csv/account/part-0000* | grep ACC03093143\nACC03093143,2023,Nadine Heidenreich Jr.,\"{\\\"account_id\\\":\\\"ACC03093143\\\",\\\"year\\\":2023,\\\"amount\\\":87990.37196728592,\\\"details\\\":{\\\"name\\\":\\\"Nadine Heidenreich Jr.\\\",\\\"first_txn_date\\\":\\\"2021-11-09\\\",\\\"updated_by\\\":{\\\"user\\\":\\\"YfEyJCe8ohrl0j IfyT\\\",\\\"time\\\":\\\"2022-09-26T20:47:53.404Z\\\"}},\\\"transactions\\\":[{\\\"txn_date\\\":\\\"2021-11-09\\\",\\\"amount\\\":97073.7914706189}]}\"\n
Great! The account, year, name and payload look to all match up.
"},{"location":"setup/guide/scenario/batch-and-event/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/batch-and-event/#order-of-execution","title":"Order of execution","text":"You may notice that the events are generated first, then the CSV file. This is because as part of the execute
function, we passed in the kafkaTask first, before the csvTask. You can change the order of execution by passing in csvTask before kafkaTask into the execute function.
Creating a data validator for a JSON file.
"},{"location":"setup/guide/scenario/data-validation/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/data-validation/#data-setup","title":"Data Setup","text":"To aid in showing the functionality of data validations, we will first generate some data that our validations will run against. Run the below command and it will generate JSON files under docker/sample/json
folder.
./run.sh JsonPlan\n
"},{"location":"setup/guide/scenario/data-validation/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyValidationJavaPlan.java
src/main/scala/io/github/datacatering/plan/MyValidationPlan.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyValidationJavaPlan extends PlanRun {\n{\nvar jsonTask = json(\"my_json\", \"/opt/app/data/json\");\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false);\n\nexecute(config, jsonTask);\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyValidationPlan extends PlanRun {\nval jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false)\n\nexecute(config, jsonTask)\n}\n
As noted above, we create a JSON task that points to the folder where the JSON data has been created, /opt/app/data/json. We also note that enableValidation is set to true and enableGenerateData to false to tell Data Caterer that we only want to validate data.
For reference, the schema that we will be validating against looks like the below.
.schema(\nfield.name(\"account_id\"),\n field.name(\"year\").`type`(IntegerType),\n field.name(\"balance\").`type`(DoubleType),\n field.name(\"date\").`type`(DateType),\n field.name(\"status\"),\n field.name(\"update_history\").`type`(ArrayType)\n.schema(\nfield.name(\"updated_time\").`type`(TimestampType),\n field.name(\"status\").oneOf(\"open\", \"closed\", \"pending\", \"suspended\"),\n ),\n field.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\n field.name(\"age\").`type`(IntegerType),\n field.name(\"city\").expression(\"#{Address.city}\")\n)\n)\n
"},{"location":"setup/guide/scenario/data-validation/#basic-validation","title":"Basic Validation","text":"Let's say our goal is to validate the customer_details.name
field to ensure it conforms to the regex pattern [A-Z][a-z]+ [A-Z][a-z]+
. Given the diversity in naming conventions across cultures and countries, variations such as middle names, suffixes, prefixes, or language-specific differences are tolerated to a certain extent. The validation considers an acceptable error threshold before marking it as failed.
Field: customer_details.name
Regex pattern: [A-Z][a-z]+ [A-Z][a-z]+
validation().col(\"customer_details.name\")\n.matches(\"[A-Z][a-z]+ [A-Z][a-z]+\")\n.errorThreshold(0.1) //<=10% failure rate is acceptable\n.description(\"Names generally follow the same pattern\"), //description to add context in report or other developers\n
validation.col(\"customer_details.name\")\n.matches(\"[A-Z][a-z]+ [A-Z][a-z]+\")\n.errorThreshold(0.1) //<=10% failure rate is acceptable\n.description(\"Names generally follow the same pattern\"), //description to add context in report or other developers\n
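The error threshold described above can be sketched in plain Java. This is only an illustration under a stated assumption: a threshold below 1 is treated as an acceptable failure rate, and a threshold of 1 or more as an acceptable number of failed records. The class and method names are hypothetical and do not come from the Data Caterer API.

```java
// Hypothetical helper for illustration only; not part of the Data Caterer API.
final class ErrorThresholdCheck {
    // Assumption: a threshold below 1 is a failure-rate fraction,
    // while a threshold of 1 or more is an absolute count of failed records.
    static boolean passes(long failedRecords, long totalRecords, double threshold) {
        if (threshold < 1.0) {
            double failureRate = totalRecords == 0 ? 0.0 : (double) failedRecords / totalRecords;
            return failureRate <= threshold;
        }
        return failedRecords <= threshold;
    }

    public static void main(String[] args) {
        System.out.println(passes(80, 1000, 0.1));  // true: 8% of names fail the regex
        System.out.println(passes(150, 1000, 0.1)); // false: 15% exceeds the 10% threshold
        System.out.println(passes(7, 1000, 10));    // true: 7 failures, up to 10 allowed
    }
}
```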
"},{"location":"setup/guide/scenario/data-validation/#custom-validation","title":"Custom Validation","text":"There will be situations where you have a complex data setup and require your own custom logic for data validation. You can achieve this by setting your own SQL expression that returns a boolean value. In the example below, we check that each entry in the array update_history has an updated_time greater than a certain timestamp.
validation().expr(\"FORALL(update_history, x -> x.updated_time > TIMESTAMP('2022-01-01 00:00:00'))\"),\n
validation.expr(\"FORALL(update_history, x -> x.updated_time > TIMESTAMP('2022-01-01 00:00:00'))\"),\n
If you want to know what other SQL functions are available for you to use, check this page.
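To make the FORALL expression above concrete, here is a rough plain-Java equivalent: the check passes only when every entry of the array satisfies the predicate. It assumes Spark SQL semantics, where an empty array passes. The class and method names are hypothetical.

```java
import java.time.LocalDateTime;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Data Caterer API.
final class ForAllCheck {
    // Equivalent in spirit to FORALL(update_history, x -> x.updated_time > cutoff).
    // allMatch returns true for an empty list, so a record with no history passes.
    static boolean allUpdatedAfter(List<LocalDateTime> updatedTimes, LocalDateTime cutoff) {
        return updatedTimes.stream().allMatch(time -> time.isAfter(cutoff));
    }

    public static void main(String[] args) {
        LocalDateTime cutoff = LocalDateTime.of(2022, 1, 1, 0, 0, 0);
        List<LocalDateTime> recent = List.of(
            LocalDateTime.of(2022, 9, 26, 20, 47),
            LocalDateTime.of(2023, 1, 26, 20, 47));
        List<LocalDateTime> mixed = List.of(
            LocalDateTime.of(2021, 12, 31, 23, 59),
            LocalDateTime.of(2023, 2, 11, 4, 38));
        System.out.println(allUpdatedAfter(recent, cutoff)); // true
        System.out.println(allUpdatedAfter(mixed, cutoff));  // false
    }
}
```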
"},{"location":"setup/guide/scenario/data-validation/#group-by-validation","title":"Group By Validation","text":"There are scenarios where you want to validate against grouped values or the whole dataset via aggregations. An example would be validating that each customer's transactions sum is greater than 0.
"},{"location":"setup/guide/scenario/data-validation/#validation-criteria_1","title":"Validation Criteria","text":"Line 1: validation.groupBy().count().isEqual(100)
- groupBy(): Groups by the whole dataset.
- count(): Counts the number of dataset elements.
- isEqual(100): Checks if the count is equal to 100.
Line 2: validation.groupBy(\"account_id\").max(\"balance\").lessThan(900)
- groupBy(\"account_id\"): Groups the data based on the account_id field.
- max(\"balance\"): Calculates the maximum value of the balance field within each group.
- lessThan(900): Checks if the maximum balance in each group is less than 900.
In other words, for each account_id the maximum balance is less than 900.
You can adjust errorThreshold or the validation to suit your specific scenario. The full list of types of validations can be found here.
validation().groupBy().count().isEqual(100),\nvalidation().groupBy(\"account_id\").max(\"balance\").lessThan(900)\n
validation.groupBy().count().isEqual(100),\nvalidation.groupBy(\"account_id\").max(\"balance\").lessThan(900)\n
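The two grouped validations above can be sketched in plain Java to show what is being aggregated. This is a simplified stand-in, not the Data Caterer implementation: the Row type is hypothetical, and a numeric account id is used only to keep the sketch short.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for the grouped validations; not the Data Caterer API.
final class GroupByValidationSketch {
    // Minimal row type; a numeric accountId is used here only to keep the sketch short.
    static final class Row {
        final long accountId;
        final double balance;

        Row(long accountId, double balance) {
            this.accountId = accountId;
            this.balance = balance;
        }
    }

    // Like groupBy().count().isEqual(expected): one group covering the whole dataset.
    static boolean countIsEqual(List<Row> rows, long expected) {
        return rows.size() == expected;
    }

    // Like groupBy(account_id).max(balance).lessThan(limit):
    // every group's maximum balance must be below the limit.
    static boolean maxBalanceLessThan(List<Row> rows, double limit) {
        Map<Long, Double> maxPerAccount = rows.stream()
            .collect(Collectors.toMap(row -> row.accountId, row -> row.balance, Double::max));
        return maxPerAccount.values().stream().allMatch(max -> max < limit);
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row(1L, 250.0), new Row(1L, 899.5), new Row(2L, 120.0));
        System.out.println(countIsEqual(rows, 3));         // true
        System.out.println(maxBalanceLessThan(rows, 900)); // true
        System.out.println(maxBalanceLessThan(rows, 500)); // false: account 1 peaks at 899.5
    }
}
```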
"},{"location":"setup/guide/scenario/data-validation/#sample-validation","title":"Sample Validation","text":"To try to cover the majority of validation cases, the example below has been created.
var jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(\nvalidation().col(\"customer_details.name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.1).description(\"Names generally follow the same pattern\"),\nvalidation().col(\"date\").isNotNull().errorThreshold(10),\nvalidation().col(\"balance\").greaterThan(500),\nvalidation().expr(\"YEAR(date) == year\"),\nvalidation().col(\"status\").in(\"open\", \"closed\", \"pending\").errorThreshold(0.2).description(\"Could be new status introduced\"),\nvalidation().col(\"customer_details.age\").greaterThan(18),\nvalidation().expr(\"FORALL(update_history, x -> x.updated_time > TIMESTAMP('2022-01-01 00:00:00'))\"),\nvalidation().col(\"update_history\").greaterThanSize(2),\nvalidation().unique(\"account_id\"),\nvalidation().groupBy().count().isEqual(1000),\nvalidation().groupBy(\"account_id\").max(\"balance\").lessThan(900)\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false);\n\nexecute(config, jsonTask);\n
val jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(\nvalidation.col(\"customer_details.name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.1).description(\"Names generally follow the same pattern\"),\nvalidation.col(\"date\").isNotNull.errorThreshold(10),\nvalidation.col(\"balance\").greaterThan(500),\nvalidation.expr(\"YEAR(date) == year\"),\nvalidation.col(\"status\").in(\"open\", \"closed\", \"pending\").errorThreshold(0.2).description(\"Could be new status introduced\"),\nvalidation.col(\"customer_details.age\").greaterThan(18),\nvalidation.expr(\"FORALL(update_history, x -> x.updated_time > TIMESTAMP('2022-01-01 00:00:00'))\"),\nvalidation.col(\"update_history\").greaterThanSize(2),\nvalidation.unique(\"account_id\"),\nvalidation.groupBy().count().isEqual(1000),\nvalidation.groupBy(\"account_id\").max(\"balance\").lessThan(900)\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false)\n\nexecute(config, jsonTask)\n
"},{"location":"setup/guide/scenario/data-validation/#run","title":"Run","text":"Let's try running it.
./run.sh\n#input class MyValidationJavaPlan or MyValidationPlan\n#after completing, check report at docker/sample/report/index.html\n
It should look something like this.
Check the full example at ValidationPlanRun
inside the examples repo.
Info
Delete generated data is a paid feature. Try the free trial here.
Creating a data generator for Postgres and deleting the generated data after using it.
"},{"location":"setup/guide/scenario/delete-generated-data/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/delete-generated-data/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedDeleteJavaPlanRun.java
src/main/scala/io/github/datacatering/plan/MyAdvancedDeletePlanRun.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedDeleteJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\") (1)\n.enableGeneratePlanAndTasks(true) (2)\n.enableRecordTracking(true) (3)\n.enableDeleteGeneratedRecords(false) (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\") (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\") (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedDeletePlanRun extends PlanRun {\n\nval autoRun = configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\") (1)\n.enableGeneratePlanAndTasks(true) (2)\n.enableRecordTracking(true) (3)\n.enableDeleteGeneratedRecords(false) (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\") (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\") (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n
In the above code we note the following:
1. A Postgres connection named my_postgres is defined.
2. enableGeneratePlanAndTasks is enabled to auto generate data for all tables under the customer database.
3. enableRecordTracking is enabled to ensure that all generated records are tracked. This will get used when we want to delete data afterwards.
4. enableDeleteGeneratedRecords is disabled for now. We want to see the generated data first and delete it sometime after.
5. generatedPlanAndTaskFolderPath is the folder path where we save the metadata we have gathered from my_postgres.
6. recordTrackingFolderPath is the folder path where record tracking is maintained. We need to persist this data to ensure it is still available when we want to delete data.
If you don't have your own Postgres up and running, you can set up and run an instance configured in the docker folder via:
cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n
This will create the tables found under docker/data/sql/postgres/customer.sql. You can change this file to contain your own tables. We can see there are 4 tables created for us: accounts, balances, transactions and mapping.
Let's try running it.
cd ..\n./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\n
It should look something like this.
id | account_number | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint | customer_id_decimal | customer_id_real | customer_id_double | open_date | open_timestamp | last_opened_time | payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H | SfA0eZJcTm | CuRw | 13 | 42 | 6041 | 76987.745612542900000000 | 91866.78 | 66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736 | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n
The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.
Check the number of records via:
docker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n#open report under docker/sample/report/index.html\n
"},{"location":"setup/guide/scenario/delete-generated-data/#delete","title":"Delete","text":"We are now at a stage where we want to delete the data that was generated. All we need to do is flip two flags.
.enableDeleteGeneratedRecords(true)\n.enableGenerateData(false) //we need to explicitly disable generating data\n
Enable delete generated records and disable generating data.
Before we run again, let us insert a record manually to see if that data will survive after running the job to delete the generated data.
docker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"insert into account.accounts (account_number) values ('my_account_number')\"\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"select count(1) from account.accounts\"\n
We now should have 1001 records in our account.accounts
table. Let's delete the generated data now.
./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n
You should see that only 1 record is left, the one that we manually inserted. Great, now we can generate data reliably and also be able to clean it up.
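The idea behind record tracking can be illustrated with a small plain-Java sketch: only the keys that were tracked during generation are removed, so the manually inserted record survives. This is a conceptual model under stated assumptions (numeric ids, in-memory sets) and not how Data Caterer stores or deletes records. All names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Conceptual model for illustration only; not the Data Caterer implementation.
final class RecordTrackingSketch {
    // Removes only the tracked (generated) ids from the table and returns how many rows remain.
    static int deleteGenerated(Set<Long> tableIds, Set<Long> trackedIds) {
        tableIds.removeAll(trackedIds);
        return tableIds.size();
    }

    public static void main(String[] args) {
        Set<Long> tableIds = new HashSet<>();
        Set<Long> trackedIds = new HashSet<>();
        // Assumption: 1000 generated records, each one tracked at generation time.
        for (long id = 1; id <= 1000; id++) {
            tableIds.add(id);
            trackedIds.add(id);
        }
        tableIds.add(1001L); // manual insert, never tracked
        System.out.println(tableIds.size());                       // 1001
        System.out.println(deleteGenerated(tableIds, trackedIds)); // 1
    }
}
```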
"},{"location":"setup/guide/scenario/delete-generated-data/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/delete-generated-data/#one-class-for-generating-another-for-deleting","title":"One class for generating, another for deleting?","text":"Yes, this is possible. There are two requirements:
- the connection names used need to be the same across both classes
- recordTrackingFolderPath needs to be set to the same value
You can control the record count per sub data source via numRecordsPerStep
.
var autoRun = configuration()\n...\n.numRecordsPerStep(100);\n\nexecute(autoRun);\n
val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n
"},{"location":"setup/guide/scenario/first-data-generation/","title":"First Data Generation","text":"Creating a data generator for a CSV file.
"},{"location":"setup/guide/scenario/first-data-generation/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/first-data-generation/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyCsvJavaPlan.java
src/main/scala/io/github/datacatering/plan/MyCsvPlan.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n\npublic class MyCsvJavaPlan extends PlanRun {\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n\nclass MyCsvPlan extends PlanRun {\n}\n
This class is where we define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.
"},{"location":"setup/guide/scenario/first-data-generation/#connection-configuration","title":"Connection Configuration","text":"When dealing with CSV files, we need to define a path for our generated CSV files to be saved at, along with any other high level configurations.
csv(\n\"customer_accounts\", //name\n\"/opt/app/data/customer/account\", //path\nMap.of(\"header\", \"true\") //optional additional options\n)\n
Other additional options for CSV can be found here
csv(\n\"customer_accounts\", //name\n\"/opt/app/data/customer/account\", //path\nMap(\"header\" -> \"true\") //optional additional options\n)\n
Other additional options for CSV can be found here
"},{"location":"setup/guide/scenario/first-data-generation/#schema","title":"Schema","text":"Our CSV file that we generate should adhere to a defined schema where we can also define data types.
Let's define each field along with their corresponding data type. You will notice that the string
fields do not have a data type defined. This is because the default data type is StringType
.
var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"balance\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n
val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"balance\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#field-metadata","title":"Field Metadata","text":"We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata attributes that add guidelines that the data generator will understand when generating data.
"},{"location":"setup/guide/scenario/first-data-generation/#account_id","title":"account_id","text":"account_id
follows a particular pattern where it starts with ACC
and has 8 digits after it. This can be defined via a regex like below. We also mark the field as unique to ensure that no duplicate values are generated. field().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n
field.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n
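To see which values the pattern above admits, here is a minimal standalone sketch using only the Java standard library. It does not call Data Caterer; the class and method names (`AccountIdCheck`, `isValid`, `allUnique`) are hypothetical and exist only for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class AccountIdCheck {
    // Same pattern as the one passed to regex(...) in the field definition.
    private static final Pattern ACCOUNT_ID = Pattern.compile("ACC[0-9]{8}");

    /** True when the whole value is "ACC" followed by exactly 8 digits. */
    public static boolean isValid(String value) {
        return ACCOUNT_ID.matcher(value).matches();
    }

    /** True when no value appears more than once (what a unique field should satisfy). */
    public static boolean allUnique(List<String> values) {
        Set<String> seen = new HashSet<>();
        for (String value : values) {
            if (!seen.add(value)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("ACC06192462")); // 8 digits after ACC
        System.out.println(isValid("ACC123"));      // too few digits
        System.out.println(allUnique(List.of("ACC06192462", "ACC15350419")));
    }
}
```

A check like this can be handy for spot-checking the generated CSV after a run.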
"},{"location":"setup/guide/scenario/first-data-generation/#balance","title":"balance","text":"balance
should not be too large, so we define a min and max to keep the generated numbers between 1
and 1000
.field().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\n
field.name(\"balance\").`type`(DoubleType).min(1).max(1000),\n
"},{"location":"setup/guide/scenario/first-data-generation/#name","title":"name","text":"name
is a string that also follows a certain pattern, so we could also define a regex but here we will choose to leverage the DataFaker library and create an expression
to generate a realistic-looking name. All possible faker expressions can be found here. field().name(\"name\").expression(\"#{Name.name}\"),\n
field.name(\"name\").expression(\"#{Name.name}\"),\n
"},{"location":"setup/guide/scenario/first-data-generation/#open_time","title":"open_time","text":"open_time
is a timestamp that we want to have a value greater than a specific date. We can define a min date by using java.sql.Date
like below.field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
"},{"location":"setup/guide/scenario/first-data-generation/#status","title":"status","text":"status
is a field that can only take one of four values: open, closed, suspended or pending
.field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
"},{"location":"setup/guide/scenario/first-data-generation/#created_by","title":"created_by","text":"created_by
is a field that is based on the status
field where it follows the logic: if status is open or closed, then it is created_by eod else created_by event
. This can be achieved by defining a SQL expression like below.field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n
field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n
Putting all the fields together, our class should now look like this.
var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n
val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#record-count","title":"Record Count","text":"We only want to generate 100 records, so that we can see what the output looks like. This is controlled at the accountTask
level like below. If you want to generate more records, set it to the value you want.
var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().records(100));\n
val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.records(100))\n
"},{"location":"setup/guide/scenario/first-data-generation/#additional-configurations","title":"Additional Configurations","text":"At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure any unique fields will have unique values generated.
var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n
"},{"location":"setup/guide/scenario/first-data-generation/#execute","title":"Execute","text":"To tell Data Caterer that we want to run with the configurations along with the accountTask
, we have to call execute
. So our full plan run will look like this.
public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n
class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n
"},{"location":"setup/guide/scenario/first-data-generation/#run","title":"Run","text":"Now we can run via the script ./run.sh
that is in the top level directory of the data-caterer-example
to run the class we just created.
./run.sh\n#input class MyCsvJavaPlan or MyCsvPlan\n#after completing\nhead docker/sample/customer/account/part-00000*\n
Your output should look like this.
account_id,balance,created_by,name,open_time,status\nACC06192462,853.9843359645766,eod,Hoyt Kertzmann MD,2023-07-22T11:17:01.713Z,closed\nACC15350419,632.5969895326234,eod,Dr. Claude White,2022-12-13T21:57:56.840Z,open\nACC25134369,592.0958847218986,eod,Fabian Rolfson,2023-04-26T04:54:41.068Z,open\nACC48021786,656.6413439322964,eod,Dewayne Stroman,2023-05-17T06:31:27.603Z,open\nACC26705211,447.2850352884595,event,Garrett Funk,2023-07-14T03:50:22.746Z,pending\nACC03150585,750.4568929015996,event,Natisha Reichel,2023-04-11T11:13:10.080Z,suspended\nACC29834210,686.4257811608622,event,Gisele Ondricka,2022-11-15T22:09:41.172Z,suspended\nACC39373863,583.5110618128994,event,Thaddeus Ortiz,2022-09-30T06:33:57.193Z,suspended\nACC39405798,989.2623959059525,eod,Shelby Reinger,2022-10-23T17:29:17.564Z,open\n
Also check the HTML report, found at docker/sample/report/index.html
, that gets generated to get an overview of what was executed.
Now that we have generated some accounts, let's also try to generate a set of transactions for those accounts in CSV format as well. The transactions could be in any other format, but to keep this simple, we will continue using CSV.
We can define our schema the same way along with any additional metadata.
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n);\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#records-per-column","title":"Records Per Column","text":"Usually, for a given account_id, full_name
, there should be multiple records for it as we want to simulate a customer having multiple transactions. We can achieve this through defining the number of records to generate in the count
function.
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n
"},{"location":"setup/guide/scenario/first-data-generation/#random-records-per-column","title":"Random Records Per Column","text":"Above, you will notice that we are generating 5 records per account_id, full_name
. This is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate for this via defining a random number of records per column.
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n
Here we set the minimum number of records per column to be 0 and the maximum to 5.
"},{"location":"setup/guide/scenario/first-data-generation/#foreign-key","title":"Foreign Key","text":"In this scenario, we want to match the account_id
in account
to the same column values in transaction
. We also want to match name
in account
to full_name
in transaction
. This can be done via plan configuration like below.
var myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"), //the task and columns we want linked\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\"))) //list of other tasks and their respective column names we want matched\n);\n
val myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"), //the task and columns we want linked\nList(transactionTask -> List(\"account_id\", \"full_name\")) //list of other tasks and their respective column names we want matched\n)\n
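The relationship above implies that every (account_id, full_name) pair in transaction should also exist as an (account_id, name) pair in account. A minimal sketch of that check in plain Java follows, assuming each key has already been read from the CSV files as an `account_id,name` string. `ForeignKeyCheck` and `orphans` are hypothetical names and are not part of the Data Caterer API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ForeignKeyCheck {
    /**
     * Returns the transaction keys that have no matching account key.
     * An empty result means the foreign key relationship holds.
     */
    public static List<String> orphans(List<String> accountKeys, List<String> transactionKeys) {
        Set<String> known = new HashSet<>(accountKeys);
        return transactionKeys.stream()
                .filter(key -> !known.contains(key))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Illustrative keys only, in the form "account_id,name".
        List<String> accounts = List.of("ACC29117767,Willodean Sauer", "ACC06192462,Hoyt Kertzmann MD");
        List<String> transactions = List.of("ACC29117767,Willodean Sauer", "ACC29117767,Willodean Sauer");
        System.out.println(orphans(accounts, transactions).isEmpty());
    }
}
```

Accounts with no transactions are fine here; only transactions without an account count as a violation.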
Now, stitching it all together for the execute
function, our final plan should look like this.
public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count().records(100));\n\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nvar myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\")))\n);\n\nexecute(myPlan, config, accountTask, transactionTask);\n}\n}\n
class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count.records(100))\n\nval transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nval myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"),\nList(transactionTask -> List(\"account_id\", \"full_name\"))\n)\n\nexecute(myPlan, config, accountTask, transactionTask)\n}\n
Let's try running it again.
#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyCsvJavaPlan or MyCsvPlan\n#after completing, let's pick an account and check the transactions for that account\naccount=$(tail -1 docker/sample/customer/account/part-00000* | awk -F \",\" '{print $1 \",\" $4}')\necho \"$account\"\ncat docker/sample/customer/transaction/part-00000* | grep \"$account\"\n
It should look something like this.
ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n
Congratulations! You have now made a data generator that has simulated a real world data scenario. You can check the DocumentationJavaPlanRun.java
or DocumentationPlanRun.scala
files as well to check that your plan is the same.
We can now look to consume this CSV data from a job or service. Usually, once we have consumed the data, we would also want to check and validate that our consumer has correctly ingested the data.
"},{"location":"setup/guide/scenario/first-data-generation/#validate","title":"Validate","text":"In this scenario, our consumer will read in the CSV file, do some transformations, and then save the data to Postgres. Let's try to configure data validations for the data that gets pushed into Postgres.
"},{"location":"setup/guide/scenario/first-data-generation/#postgres-setup","title":"Postgres setup","text":"First, we define our connection properties for Postgres. You can check out the full options available here.
var postgresValidateTask = postgres(\n\"my_postgres\", //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\", //username\n\"password\" //password\n).table(\"account\", \"transactions\");\n
val postgresValidateTask = postgres(\n\"my_postgres\", //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\", //username\n\"password\" //password\n).table(\"account\", \"transactions\")\n
We can connect and access the data inside the table account.transactions
. Now to define our data validations.
For full information about validation options and configurations, check here. Below, we have an example that should give you a good understanding of what validations are possible.
var postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation().col(\"account_id\").isNotNull(),\nvalidation().col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation().col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation().expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation().unique(\"account_id\", \"name\"),\nvalidation().groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n);\n
val postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation.col(\"account_id\").isNotNull,\nvalidation.col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation.col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation.expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation.unique(\"account_id\", \"name\"),\nvalidation.groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#name_1","title":"name","text":"For all values in the name
column, we check if they match the regex [A-Z][a-z]+ [A-Z][a-z]+
. As we know in the real world, names do not always follow the same pattern, so we allow for an errorThreshold
before marking the validation as failed. Here, we define the errorThreshold
to be 0.2
, which means, if the error percentage is greater than 20%, then fail the validation. We also append on a helpful description so other developers/users can understand the context of the validation.
We check that all balance
values are greater than or equal to 0. This time, we have a slightly different errorThreshold
as it is set to 10
, which means, if the number of errors is greater than 10, then fail the validation.
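The two threshold styles described so far (0.2 as a proportion, 10 as a count) can be sketched in plain Java as below. This is an illustration of the behaviour described in the text, not Data Caterer's implementation: the rule that values below 1 are treated as a proportion is an assumption, and `ErrorThresholdSketch` and `passes` are hypothetical names.

```java
public class ErrorThresholdSketch {
    /**
     * Assumed rule: a threshold below 1 is a proportion of all records,
     * a threshold of 1 or more is an absolute number of failing records.
     * The validation fails only when the errors exceed the threshold.
     */
    public static boolean passes(long errorCount, long totalRecords, double errorThreshold) {
        if (errorThreshold < 1) {
            double errorRate = totalRecords == 0 ? 0.0 : (double) errorCount / totalRecords;
            return errorRate <= errorThreshold;
        }
        return errorCount <= errorThreshold;
    }

    public static void main(String[] args) {
        System.out.println(passes(15, 100, 0.2)); // 15% of names fail the regex, under 20%
        System.out.println(passes(25, 100, 0.2)); // 25% fail, over 20%
        System.out.println(passes(10, 100, 10));  // exactly 10 failing balances
        System.out.println(passes(11, 100, 10));  // 11 failing balances
    }
}
```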
Sometimes, we may need to include the values of multiple columns to validate a certain condition. This is where we can use expr
to define a SQL expression that returns a boolean. In this scenario, we check that if the status
column has value closed
, then the close_date
is not null; otherwise, close_date
is null.
We check whether the combination of account_id
and name
is unique within the dataset. You can define one or more columns for unique
validations.
There may be some business rule that states the number of login_retry
should be less than 10 for each account. We can check this via a group by validation where we group by the account_id, name
, take the maximum value for login_retry
per account_id,name
combination, then check if it is less than 10.
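The group by validation above can be pictured as the following standalone Java sketch: group rows by (account_id, name), take the maximum login_retry per group, then check that every maximum is below the limit. The `Login` row class and the `maxRetriesBelow` method are hypothetical and only mirror the columns named in the text.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByMaxCheck {
    /** Hypothetical row holding only the columns the validation needs. */
    public static final class Login {
        final String accountId;
        final String name;
        final int loginRetry;

        public Login(String accountId, String name, int loginRetry) {
            this.accountId = accountId;
            this.name = name;
            this.loginRetry = loginRetry;
        }
    }

    /** True when the max loginRetry of every (accountId, name) group is below the limit. */
    public static boolean maxRetriesBelow(List<Login> rows, int limit) {
        Map<List<String>, Integer> maxPerGroup = rows.stream()
                .collect(Collectors.toMap(
                        row -> List.of(row.accountId, row.name), // group key
                        row -> row.loginRetry,
                        Integer::max));                          // keep the larger value per group
        return maxPerGroup.values().stream().allMatch(max -> max < limit);
    }

    public static void main(String[] args) {
        List<Login> rows = List.of(
                new Login("ACC29117767", "Willodean Sauer", 2),
                new Login("ACC29117767", "Willodean Sauer", 7));
        System.out.println(maxRetriesBelow(rows, 10));
    }
}
```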
You can now look to play around with other configurations or data sources to meet your needs. Also, make sure to explore the docs further as it can guide you on what can be configured.
"},{"location":"setup/guide/scenario/records-per-column/","title":"Multiple Records Per Column","text":"Creating a data generator for a CSV file where there are multiple records per column values.
"},{"location":"setup/guide/scenario/records-per-column/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/records-per-column/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyMultipleRecordsPerColJavaPlan.java
src/main/scala/io/github/datacatering/plan/MyMultipleRecordsPerColPlan.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyMultipleRecordsPerColJavaPlan extends PlanRun {\n{\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, transactionTask);\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyMultipleRecordsPerColPlan extends PlanRun {\n\nval transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"), field.name(\"full_name\").expression(\"#{Name.name}\"), field.name(\"amount\").`type`(DoubleType.instance).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType.instance).min(java.sql.Date.valueOf(\"2022-01-01\")), field.name(\"date\").`type`(DateType.instance).sql(\"DATE(time)\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(config, transactionTask)\n}\n
"},{"location":"setup/guide/scenario/records-per-column/#record-count","title":"Record Count","text":"By default, tasks will generate 1000 records. You can alter this value via the count
configuration which can be applied to individual tasks. For example, in Scala, csv(...).count(count.records(100))
to generate only 100 records.
In this scenario, for a given account_id, full_name
, there should be multiple records for it as we want to simulate a customer having multiple transactions. We can achieve this through defining the number of records to generate in the count
function.
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n
This will generate 1000 * 5 = 5000
records: the default number of records (1000) is used, and for each account_id, full_name
combination from the initial 1000 records, 5 records will be generated.
Generating 5 records per column is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate for this via defining a random number of records per column.
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n
Here we set the minimum number of records per column to be 0 and the maximum to 5. This will follow a uniform distribution so the average number of records per account is 2.5. We could also define other metadata, just like we did with fields, when defining the generator. For example, we could set standardDeviation
and mean
for the number of records generated per column to follow a normal distribution.
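The record counts discussed above can be worked through with a small standalone Java sketch. It only reproduces the arithmetic (1000 base records, 5 per column, or a uniform 0 to 5 per column) and does not call Data Caterer. The class and method names are hypothetical, and treating both bounds as inclusive integers is an assumption.

```java
import java.util.Random;

public class RecordsPerColumnSketch {
    /** Fixed count: every base record is expanded into perColumn records. */
    public static long fixedTotal(int baseRecords, int perColumn) {
        return (long) baseRecords * perColumn;
    }

    /** Expected total when each base record gets a uniform count in [min, max]. */
    public static double expectedTotal(int baseRecords, int min, int max) {
        return baseRecords * (min + max) / 2.0;
    }

    /** One simulated run: draws a uniform integer count in [min, max] per base record. */
    public static long simulatedTotal(int baseRecords, int min, int max, long seed) {
        Random random = new Random(seed);
        long total = 0;
        for (int i = 0; i < baseRecords; i++) {
            total += min + random.nextInt(max - min + 1);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(fixedTotal(1000, 5));       // 1000 * 5
        System.out.println(expectedTotal(1000, 0, 5)); // 1000 * 2.5
        System.out.println(simulatedTotal(1000, 0, 5, 42L));
    }
}
```

A single simulated run will land near the expected total rather than exactly on it, which is also what you should expect from the generated file.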
Let's try running it.
#clean up old data\nrm -rf docker/sample/customer/transaction\n./run.sh\n#input class MyMultipleRecordsPerColJavaPlan or MyMultipleRecordsPerColPlan\n#after completing\nhead docker/sample/customer/transaction/part-00000*\n
It should look something like this.
ACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n
You can now look to play around with other count configurations found here.
"},{"location":"setup/report/alert/","title":"Alert","text":"Alerts can be configured to help users receive feedback from their data testing results. Currently, Data Caterer supports Slack for alerts.
"},{"location":"setup/report/alert/#slack","title":"Slack","text":"Define a Slack token and one or more Slack channels that will receive an alert like the below.
var conf = configuration()\n.slackAlertToken(\"abc123\") //use appropriate Slack token (usually bot token)\n.slackAlertChannels(\"#test-alerts\", \"#pre-prod-testing\"); //define Slack channel(s) to receive alerts on\n\nexecute(conf, ...);\n
val conf = configuration\n.slackAlertToken(\"abc123\") //use appropriate Slack token (usually bot token)\n.slackAlertChannels(\"#test-alerts\", \"#pre-prod-testing\") //define Slack channel(s) to receive alerts on\n\nexecute(conf, ...)\n
"},{"location":"setup/report/html-report/","title":"Report","text":"Data Caterer can be configured to produce a report of the data generated to help users understand what was run, how much data was generated, where it was generated, validation results and any associated metadata.
"},{"location":"setup/report/html-report/#sample","title":"Sample","text":"Once run, it will produce a report like this.
"},{"location":"setup/validation/basic-validation/","title":"Basic Validations","text":"Run validations on a column to ensure the values adhere to your requirement. Can be set to complex validation logic via SQL expression as well if needed (see here).
"},{"location":"setup/validation/basic-validation/#equal","title":"Equal","text":"Ensure all data in column is equal to certain value. Value can be of any data type. Can use isEqualCol
to define SQL expression that can reference other columns.
validation().col(\"year\").isEqual(2021),\nvalidation().col(\"year\").isEqualCol(\"YEAR(date)\"),\n
validation.col(\"year\").isEqual(2021),\nvalidation.col(\"year\").isEqualCol(\"YEAR(date)\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year == 2021\"\n
"},{"location":"setup/validation/basic-validation/#not-equal","title":"Not Equal","text":"Ensure all data in column is not equal to certain value. Value can be of any data type. Can use isNotEqualCol
to define SQL expression that can reference other columns.
validation().col(\"year\").isNotEqual(2021),\nvalidation().col(\"year\").isNotEqualCol(\"YEAR(date)\"),\n
validation.col(\"year\").isNotEqual(2021),\nvalidation.col(\"year\").isNotEqualCol(\"YEAR(date)\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year != 2021\"\n
"},{"location":"setup/validation/basic-validation/#null","title":"Null","text":"Ensure all data in column is null.
validation().col(\"year\").isNull()\n
validation.col(\"year\").isNull\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNULL(year)\"\n
"},{"location":"setup/validation/basic-validation/#not-null","title":"Not Null","text":"Ensure all data in column is not null.
validation().col(\"year\").isNotNull()\n
validation.col(\"year\").isNotNull\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNOTNULL(year)\"\n
"},{"location":"setup/validation/basic-validation/#contains","title":"Contains","text":"Ensure all data in column is contains certain string. Column has to have type string.
validation().col(\"name\").contains(\"peter\")\n
validation.col(\"name\").contains(\"peter\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"CONTAINS(name, 'peter')\"\n
"},{"location":"setup/validation/basic-validation/#not-contains","title":"Not Contains","text":"Ensure all data in column does not contain certain string. Column has to have type string.
validation().col(\"name\").notContains(\"peter\")\n
validation.col(\"name\").notContains(\"peter\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!CONTAINS(name, 'peter')\"\n
"},{"location":"setup/validation/basic-validation/#unique","title":"Unique","text":"Ensure all data in column is unique.
validation().unique(\"account_id\", \"name\")\n
validation.unique(\"account_id\", \"name\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- unique: [\"account_id\", \"name\"]\n
"},{"location":"setup/validation/basic-validation/#less-than","title":"Less Than","text":"Ensure all data in column is less than certain value. Can use lessThanCol
to define SQL expression that can reference other columns.
validation().col(\"amount\").lessThan(100),\nvalidation().col(\"amount\").lessThanCol(\"balance + 1\"),\n
validation.col(\"amount\").lessThan(100),\nvalidation.col(\"amount\").lessThanCol(\"balance + 1\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount < 100\"\n- expr: \"amount < balance + 1\"\n
"},{"location":"setup/validation/basic-validation/#less-than-or-equal","title":"Less Than Or Equal","text":"Ensure all data in column is less than or equal to certain value. Can use lessThanOrEqualCol
to define SQL expression that can reference other columns.
validation().col(\"amount\").lessThanOrEqual(100),\nvalidation().col(\"amount\").lessThanOrEqualCol(\"balance + 1\"),\n
validation.col(\"amount\").lessThanOrEqual(100),\nvalidation.col(\"amount\").lessThanOrEqualCol(\"balance + 1\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount <= 100\"\n- expr: \"amount <= balance + 1\"\n
"},{"location":"setup/validation/basic-validation/#greater-than","title":"Greater Than","text":"Ensure all data in column is greater than certain value. Can use greaterThanCol
to define SQL expression that can reference other columns.
validation().col(\"amount\").greaterThan(100),\nvalidation().col(\"amount\").greaterThanCol(\"balance\"),\n
validation.col(\"amount\").greaterThan(100),\nvalidation.col(\"amount\").greaterThanCol(\"balance\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount > 100\"\n- expr: \"amount > balance\"\n
"},{"location":"setup/validation/basic-validation/#greater-than-or-equal","title":"Greater Than Or Equal","text":"Ensure all data in column is greater than or equal to certain value. Can use greaterThanOrEqualCol
to define SQL expression that can reference other columns.
validation().col(\"amount\").greaterThanOrEqual(100),\nvalidation().col(\"amount\").greaterThanOrEqualCol(\"balance\"),\n
validation.col(\"amount\").greaterThanOrEqual(100),\nvalidation.col(\"amount\").greaterThanOrEqualCol(\"balance\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount >= 100\"\n- expr: \"amount >= balance\"\n
"},{"location":"setup/validation/basic-validation/#between","title":"Between","text":"Ensure all data in column is between two values. Can use betweenCol
to define SQL expression that references other columns.
validation().col(\"amount\").between(100, 200),\nvalidation().col(\"amount\").betweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n
validation.col(\"amount\").between(100, 200),\nvalidation.col(\"amount\").betweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount BETWEEN 100 AND 200\"\n- expr: \"amount BETWEEN balance * 0.9 AND balance * 1.1\"\n
"},{"location":"setup/validation/basic-validation/#not-between","title":"Not Between","text":"Ensure all data in column is not between two values. Can use notBetweenCol
to define SQL expression that references other columns.
validation().col(\"amount\").notBetween(100, 200),\nvalidation().col(\"amount\").notBetweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n
validation.col(\"amount\").notBetween(100, 200),\nvalidation.col(\"amount\").notBetweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount NOT BETWEEN 100 AND 200\"\n- expr: \"amount NOT BETWEEN balance * 0.9 AND balance * 1.1\"\n
"},{"location":"setup/validation/basic-validation/#in","title":"In","text":"Ensure all data in column is in set of defined values.
validation().col(\"status\").in(\"open\", \"closed\")\n
validation.col(\"status\").in(\"open\", \"closed\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"status IN ('open', 'closed')\"\n
"},{"location":"setup/validation/basic-validation/#matches","title":"Matches","text":"Ensure all data in column matches certain regex expression.
validation().col(\"account_id\").matches(\"ACC[0-9]{8}\")\n
validation.col(\"account_id\").matches(\"ACC[0-9]{8}\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"REGEXP(account_id, 'ACC[0-9]{8}')\"\n
"},{"location":"setup/validation/basic-validation/#not-matches","title":"Not Matches","text":"Ensure all data in column does not match certain regex expression.
validation().col(\"account_id\").notMatches(\"^acc.*\")\n
validation.col(\"account_id\").notMatches(\"^acc.*\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!REGEXP(account_id, '^acc.*')\"\n
"},{"location":"setup/validation/basic-validation/#starts-with","title":"Starts With","text":"Ensure all data in column starts with certain string. Column has to have type string.
validation().col(\"account_id\").startsWith(\"ACC\")\n
validation.col(\"account_id\").startsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"STARTSWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#not-starts-with","title":"Not Starts With","text":"Ensure all data in column does not start with certain string. Column has to have type string.
validation().col(\"account_id\").notStartsWith(\"ACC\")\n
validation.col(\"account_id\").notStartsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!STARTSWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#ends-with","title":"Ends With","text":"Ensure all data in column ends with certain string. Column has to have type string.
validation().col(\"account_id\").endsWith(\"ACC\")\n
validation.col(\"account_id\").endsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ENDSWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#not-ends-with","title":"Not Ends With","text":"Ensure all data in column does not end with certain string. Column has to have type string.
validation().col(\"account_id\").notEndsWith(\"ACC\")\n
validation.col(\"account_id\").notEndsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!ENDSWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#size","title":"Size","text":"Ensure all data in column has certain size. Column has to have type array or map.
validation().col(\"transactions\").size(5)\n
validation.col(\"transactions\").size(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) == 5\"\n
"},{"location":"setup/validation/basic-validation/#not-size","title":"Not Size","text":"Ensure all data in column does not have certain size. Column has to have type array or map.
validation().col(\"transactions\").notSize(5)\n
validation.col(\"transactions\").notSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) != 5\"\n
"},{"location":"setup/validation/basic-validation/#less-than-size","title":"Less Than Size","text":"Ensure all data in column has size less than certain value. Column has to have type array or map.
validation().col(\"transactions\").lessThanSize(5)\n
validation.col(\"transactions\").lessThanSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) < 5\"\n
"},{"location":"setup/validation/basic-validation/#less-than-or-equal-size","title":"Less Than Or Equal Size","text":"Ensure all data in column has size less than or equal to certain value. Column has to have type array or map.
validation().col(\"transactions\").lessThanOrEqualSize(5)\n
validation.col(\"transactions\").lessThanOrEqualSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) <= 5\"\n
"},{"location":"setup/validation/basic-validation/#greater-than-size","title":"Greater Than Size","text":"Ensure all data in column has size greater than certain value. Column has to have type array or map.
validation().col(\"transactions\").greaterThanSize(5)\n
validation.col(\"transactions\").greaterThanSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) > 5\"\n
"},{"location":"setup/validation/basic-validation/#greater-than-or-equal-size","title":"Greater Than Or Equal Size","text":"Ensure all data in column has size greater than or equal to certain value. Column has to have type array or map.
validation().col(\"transactions\").greaterThanOrEqualSize(5)\n
validation.col(\"transactions\").greaterThanOrEqualSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) >= 5\"\n
"},{"location":"setup/validation/basic-validation/#luhn-check","title":"Luhn Check","text":"Ensure all data in column passes luhn check. Luhn check is used to validate credit card numbers and certain identification numbers (see here for more details).
validation().col(\"credit_card\").luhnCheck()\n
validation.col(\"credit_card\").luhnCheck\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"LUHN_CHECK(credit_card)\"\n
"},{"location":"setup/validation/basic-validation/#has-type","title":"Has Type","text":"Ensure all data in column has certain data type.
validation().col(\"id\").hasType(\"string\")\n
validation.col(\"id\").hasType(\"string\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"TYPEOF(id) == 'string'\"\n
"},{"location":"setup/validation/basic-validation/#expression","title":"Expression","text":"Ensure all data in column adheres to SQL expression defined that returns back a boolean. You can define complex logic in here that could combine multiple columns.
For example, CASE WHEN status == 'open' THEN balance > 0 ELSE balance == 0 END
checks that all rows with status
open have a balance
greater than 0, and that all other rows have a balance
of 0.
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().expr(\"amount < 100\"),\nvalidation().expr(\"year == 2021\").errorThreshold(0.1), //equivalent to if error percentage is > 10%, then fail\nvalidation().expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200) //equivalent to if number of errors is > 200, then fail\n);\n\nvar conf = configuration().enableValidation(true);\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validations(\nvalidation.expr(\"amount < 100\"),\nvalidation.expr(\"year == 2021\").errorThreshold(0.1), //equivalent to if error percentage is > 10%, then fail\nvalidation.expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200) //equivalent to if number of errors is > 200, then fail\n)\n\nval conf = configuration.enableValidation(true)\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount < 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1 #equivalent to if error percentage is > 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200 #equivalent to if number of errors is > 200, then fail\ndescription: \"Should be lots of Peters\"\n\n#enableValidation inside application.conf\n
"},{"location":"setup/validation/column-name-validation/","title":"Column Name Validations","text":"Run validations on the column names to check the column name count or the existence of column names.
"},{"location":"setup/validation/column-name-validation/#count-equal","title":"Count Equal","text":"Ensure column name count is equal to certain number.
validation().columnNames().countEqual(3)\n
validation.columnNames.countEqual(3)\n
"},{"location":"setup/validation/column-name-validation/#not-equal","title":"Count Between","text":"Ensure column name count is between two numbers.
validation().columnNames().countBetween(10, 12)\n
validation.columnNames.countBetween(10, 12)\n
"},{"location":"setup/validation/column-name-validation/#match-order","title":"Match Order","text":"Ensure all column names match a particular ordering and are complete.
validation().columnNames().matchOrder(\"account_id\", \"amount\", \"name\")\n
validation.columnNames.matchOrder(\"account_id\", \"amount\", \"name\")\n
"},{"location":"setup/validation/column-name-validation/#match-set","title":"Match Set","text":"Ensure column names contain the set of expected names. Order is not checked.
validation().columnNames().matchSet(\"account_id\", \"first_name\")\n
validation.columnNames.matchSet(\"account_id\", \"first_name\")\n
"},{"location":"setup/validation/external-source-validation/","title":"External Source Validations","text":"Use validations that are defined in external sources such as Great Expectations or OpenMetadata. This allows you to generate data for your upstream data sources and validate your pipelines based on the same rules that would be applied in production.
Info
Retrieving data validations from an external source is a paid feature. Try the free trial here.
"},{"location":"setup/validation/external-source-validation/#supported-sources","title":"Supported Sources","text":"Source Support OpenMetadata Great Expectations DBT Constraints SodaCL MonteCarlo"},{"location":"setup/validation/external-source-validation/#openmetadata","title":"OpenMetadata","text":"Use data quality rules defined from OpenMetadata to execute over dataset.
var jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(metadataSource().openMetadata(\n\"http://host.docker.internal:8585/api\",\nConstants.OPEN_METADATA_AUTH_TYPE_OPEN_METADATA(),\nMap.of(\nConstants.OPEN_METADATA_JWT_TOKEN(), \"abc123\",\nConstants.OPEN_METADATA_TABLE_FQN(), \"sample_data.ecommerce_db.shopify.raw_customer\"\n)\n));\n\nvar conf = configuration().enableGenerateValidations(true);\n
val jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(metadataSource.openMetadata(\n\"http://host.docker.internal:8585/api\",\nOPEN_METADATA_AUTH_TYPE_OPEN_METADATA,\nMap(\nOPEN_METADATA_JWT_TOKEN -> \"abc123\", //find under settings/bots/ingestion-bot/token\nOPEN_METADATA_TABLE_FQN -> \"sample_data.ecommerce_db.shopify.raw_customer\"\n)\n))\n\nval conf = configuration.enableGenerateValidations(true)\n
"},{"location":"setup/validation/external-source-validation/#great-expectations","title":"Great Expectations","text":"Use data quality rules defined in Great Expectations to execute over dataset.
var jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(metadataSource().greatExpectations(\"great-expectations/taxi-expectations.json\"));\n\nvar conf = configuration().enableGenerateValidations(true);\n
val jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(metadataSource.greatExpectations(\"great-expectations/taxi-expectations.json\"))\n\nval conf = configuration.enableGenerateValidations(true)\n
"},{"location":"setup/validation/group-by-validation/","title":"Group By Validation","text":"If you want to run aggregations based on a particular set of columns or just the whole dataset, you can do so via group by validations. An example would be checking that the sum of amount
is less than 1000 per account_id, year
. The validations applied can be one of the validations from the basic validation set found here.
Check the number of records across the whole dataset.
validation().groupBy().count().lessThan(1000)\n
validation.groupBy().count().lessThan(1000)\n
"},{"location":"setup/validation/group-by-validation/#record-count-per-group","title":"Record count per group","text":"Check the number of records for each group.
validation().groupBy(\"account_id\", \"year\").count().lessThan(10)\n
validation.groupBy(\"account_id\", \"year\").count().lessThan(10)\n
"},{"location":"setup/validation/group-by-validation/#sum","title":"Sum","text":"Check the sum of a column's values for each group adheres to validation.
validation().groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n
validation.groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n
"},{"location":"setup/validation/group-by-validation/#count","title":"Count","text":"Check the count for each group adheres to validation.
validation().groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n
validation.groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n
"},{"location":"setup/validation/group-by-validation/#min","title":"Min","text":"Check the min for each group adheres to validation.
validation().groupBy(\"account_id\", \"year\").min(\"amount\").greaterThan(0)\n
validation.groupBy(\"account_id\", \"year\").min(\"amount\").greaterThan(0)\n
"},{"location":"setup/validation/group-by-validation/#max","title":"Max","text":"Check the max for each group adheres to validation.
validation().groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n
validation.groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n
"},{"location":"setup/validation/group-by-validation/#average","title":"Average","text":"Check the average for each group adheres to validation.
validation().groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n
validation.groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n
"},{"location":"setup/validation/group-by-validation/#standard-deviation","title":"Standard deviation","text":"Check the standard deviation for each group adheres to validation.
validation().groupBy(\"account_id\", \"year\").stddev(\"amount\").between(0.5, 0.6)\n
validation.groupBy(\"account_id\", \"year\").stddev(\"amount\").between(0.5, 0.6)\n
"},{"location":"setup/validation/upstream-data-source-validation/","title":"Upstream Data Source Validation","text":"If you want to run data validations based on data generated or data from another data source, you can use the upstream data source validations. An example would be generating a Parquet file that gets ingested by a job and inserted into Postgres. The validations can then check that each account_id
generated in the Parquet file exists in the account_number
column in Postgres. The validations can be chained with basic and group by validations, or even other upstream data sources, to cover complex validations.
Join across datasets by particular columns. Then run validations on the joined dataset. You will notice that the data source name is appended onto the column names when joined (i.e. my_first_json_customer_details
), to ensure column names do not clash and make it obvious which columns are being validated.
In the example below, we check that, for the same account_id, customer_details.name in the my_first_json dataset equals the name column in the my_second_json dataset.
var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask) //upstream data generation task is `firstJsonTask`\n.joinColumns(\"account_id\") //use `account_id` column in both datasets to join corresponding records (outer join by default)\n.withValidation(\nvalidation().col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\") //validate the name in `my_second_json` is equal to `customer_details.name` in `my_first_json` when the `account_id` matches\n)\n);\n
val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask) //upstream data generation task is `firstJsonTask`\n.joinColumns(\"account_id\") //use `account_id` column in both datasets to join corresponding records (outer join by default)\n.withValidation(\nvalidation.col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\") //validate the name in `my_second_json` is equal to `customer_details.name` in `my_first_json` when the `account_id` matches\n)\n)\n
"},{"location":"setup/validation/upstream-data-source-validation/#join-expression","title":"Join expression","text":"Define join expression to link two datasets together. This can be any SQL expression that returns a boolean value. Useful in situations where join is based on transformations or complex logic.
In the below example, we have to use CONCAT
SQL function to combine 'ACC'
and account_number
to join with account_id
column in my_first_json
dataset.
var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask)\n.joinExpr(\"my_first_json_account_id == CONCAT('ACC', account_number)\") //generic SQL expression that returns a boolean\n.withValidation(\nvalidation().col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\")\n)\n);\n
val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask)\n.joinExpr(\"my_first_json_account_id == CONCAT('ACC', account_number)\") //generic SQL expression that returns a boolean\n.withValidation(\nvalidation.col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\")\n)\n)\n
"},{"location":"setup/validation/upstream-data-source-validation/#different-join-type","title":"Different join type","text":"By default, an outer join is used to gather columns from both datasets together for validation. But there may be scenarios where you want to control the join type.
Possible join types include:
In the example below, we do an anti
join by column account_id
and check if there are no records. This essentially checks that all account_id
's from my_second_json
exist in my_first_json
. The second validation also does something similar but does an outer
join (by default) and checks that the joined dataset has 30 records.
var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.joinType(\"anti\")\n.withValidation(validation().count().isEqual(0)),\nvalidation().upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation().count().isEqual(30))\n);\n
val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.joinType(\"anti\")\n.withValidation(validation.count().isEqual(0)),\nvalidation.upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation.count().isEqual(30))\n)\n
"},{"location":"setup/validation/upstream-data-source-validation/#join-then-group-by-validation","title":"Join then group by validation","text":"We can apply aggregate or group by validations to the resulting joined dataset as the withValidation
method accepts any type of validation.
Here we group by account_id, my_first_json_balance
to check that when the amount
field is summed up per group, it is between 0.8 and 1.2 times the balance.
var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"balance\").type(DoubleType.instance()).min(10).max(1000),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask).joinColumns(\"account_id\")\n.withValidation(\nvalidation().groupBy(\"account_id\", \"my_first_json_balance\")\n.sum(\"amount\")\n.betweenCol(\"my_first_json_balance * 0.8\", \"my_first_json_balance * 1.2\")\n)\n);\n
val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"balance\").`type`(DoubleType).min(10).max(1000),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask).joinColumns(\"account_id\")\n.withValidation(\nvalidation.groupBy(\"account_id\", \"my_first_json_balance\")\n.sum(\"amount\")\n.betweenCol(\"my_first_json_balance * 0.8\", \"my_first_json_balance * 1.2\")\n)\n)\n
"},{"location":"setup/validation/upstream-data-source-validation/#chained-validations","title":"Chained validations","text":"Given that the withValidation
method accepts any other type of validation, you can chain other upstream data sources with it. Here we show a third upstream data source being checked to ensure 30 records exist after joining them together by account_id
.
var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"balance\").type(DoubleType.instance()).min(10).max(1000),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n)\n.count(count().records(10));\n\nvar thirdJsonTask = json(\"my_third_json\", \"/tmp/data/third_json\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"amount\").type(IntegerType.instance()).min(1).max(100),\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n.count(count().records(10));\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(\nvalidation().upstreamData(thirdJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation().count().isEqual(30))\n)\n);\n
val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"balance\").`type`(DoubleType).min(10).max(1000),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n.count(count.records(10))\n\nval thirdJsonTask = json(\"my_third_json\", \"/tmp/data/third_json\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"amount\").`type`(IntegerType).min(1).max(100),\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n.count(count.records(10))\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(\nvalidation.upstreamData(thirdJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation.count().isEqual(30))\n)\n)\n
Check out a full example here for more details.
"},{"location":"use-case/business-value/","title":"Business Value","text":"Below is a list of the business related benefits from using Data Caterer which may be applicable for your use case.
Problem Data Caterer Solution Resources Effects Reliable test data creation - Profile existing data- Create scenarios- Generate data Software Engineers, QA, Testers Cost reduction in labor, more time spent on development, more bugs caught before production Faster development cycles - Generate data in local, test, UAT, pre-prod- Run different scenarios Software Engineers, QA, Testers More defects caught in lower environments, features pushed to production faster, common framework used across all environments Data compliance - Profiling existing data- Generate based on metadata- No complex masking- No production data used in lower environments Audit and compliance No chance for production data breaches Storage costs - Delete generated data- Test specific scenarios Infrastructure Lower data storage costs, less time spent on data management and clean up Schema evolution - Create metadata from data sources- Generate data based off fresh metadata Software Engineers, QA, Testers Less time spent altering tests due to schema changes, ease of use between environments and application versions"},{"location":"use-case/comparison/","title":"Comparison to similar tools","text":"I have tried to include all the companies found in the list here from Mostly AI blog post and used information that is publicly available.
The companies/products not shown below either have:
You may notice that the majority of data generators use machine learning (ML) models to learn from your existing datasets to generate new data. Below are some pros and cons to the approach.
Pros
Cons
Items below summarise the roadmap of Data Caterer. As each task gets completed, it will be documented and linked.
Feature Description Sub Tasks Data source support Batch or real time data sources that can be added to Data Caterer. Support data sources that users want - AWS, GCP and Azure related data services ( cloud storage)- Deltalake- RabbitMQ- ActiveMQ- MongoDB- Elasticsearch- Snowflake- Databricks- Pulsar Metadata discovery Allow for schema and data profiling from external metadata sources - HTTP (OpenAPI spec)- JMS- Read from samples- OpenLineage metadata (Marquez)- OpenMetadata- ODCS (Open Data Contract Standard)- Amundsen- Datahub- Solace Event Portal- Airflow- DBT Developer API Scala/Java interface for developers/testers to create data generation and validation tasks - Scala- Java Report generation Generate a report that summarises the data generation or validation results - Report for data generated and validation rules UI portal Allow users to access a UI to input data generation or validation tasks. Also be able to view report results - Metadata stored in database- Store data generation/validation run information in file/database Integration with data validation tools Derive data validation rules from existing data validation tools - Great Expectation- DBT constraints- SodaCL- MonteCarlo- OpenMetadata Data validation rule suggestions Based on metadata, generate data validation rules appropriate for the dataset - Suggest basic data validations (yet to document) Wait conditions before data validation Define certain conditions to be met before starting data validations - Webhook- File exists- Data exists via SQL expression- Pause Validation types Ability to define simple/complex data validations - Basic validations- Aggregates (sum of amount per account is > 500)- Ordering (transactions are ordered by date)- Relationship (at least one account entry in history table per account in accounts table)- Data profile (how close the generated data profile is compared to the expected data profile)- Column name (check column count, column names, ordering) Data generation record 
count Generate scenarios where there are one to many, many to many situations relating to record count. Also ability to cover all edge cases or scenarios - Cover all possible cases (i.e. record for each combination of oneOf values, positive/negative values etc.)- Ability to override edge cases Alerting When tasks have completed, ability to define alerts based on certain conditions - Slack- Email Metadata enhancements Based on data profiling or inference, can add to existing metadata - PII detection (can integrate with Presidio)- Relationship detection across data sources- SQL generation- Ordering information Data cleanup Ability to clean up generated data - Clean up generated data- Clean up data in consumer data sinks- Clean up data from real time sources (i.e. DELETE HTTP endpoint, delete events in JMS) Trial version Trial version of the full app for users to test out all the features - Trial app to try out all features Code generation Based on metadata or existing classes, code for data generation and validation could be generated - Code generation- Schema generation from Scala/Java class Real time response data validations Ability to define data validations based on the response from real time data sources (e.g. HTTP response) - HTTP response data validation"},{"location":"use-case/blog/shift-left-data-quality/","title":"Shifting Data Quality Left with Data Catering","text":""},{"location":"use-case/blog/shift-left-data-quality/#empowering-proactive-data-management","title":"Empowering Proactive Data Management","text":"In the ever-evolving landscape of data-driven decision-making, ensuring data quality is non-negotiable. Traditionally, data quality has been a concern addressed late in the development lifecycle, often leading to reactive measures and increased costs. However, a paradigm shift is underway with the adoption of a \"shift left\" approach, placing data quality at the forefront of the development process.
"},{"location":"use-case/blog/shift-left-data-quality/#today","title":"Today","text":"graph LR\n subgraph badQualityData[<b>Manually generated data, limited data scenarios, fragmented testing tools</b>]\n local[<b>Local</b>\\nManual test, unit test]\n dev[<b>Dev</b>\\nManual test, integration test]\n stg[<b>Staging</b>\\nSanity checks]\n end\n\n subgraph qualityData[<b>Reliable data, the true test</b>]\n prod[<b>Production</b>\\nData quality checks, monitoring, observability]\n end\n\n style badQualityData fill:#d9534f,fill-opacity:0.7\n style qualityData fill:#5cb85c,fill-opacity:0.7\n\n local --> dev\n dev --> stg\n stg --> prod
"},{"location":"use-case/blog/shift-left-data-quality/#with-data-caterer","title":"With Data Caterer","text":"graph LR\n subgraph qualityData[<b>Reliable data anywhere, common testing tool across all data sources</b>]\n direction LR\n local[<b>Local</b>\\nManual test, unit test]\n dev[<b>Dev</b>\\nManual test, integration test]\n stg[<b>Staging</b>\\nSanity checks]\n prod[<b>Production</b>\\nData quality checks, monitoring, observability]\n end\n\n style qualityData fill:#5cb85c,fill-opacity:0.7\n\n local --> dev\n dev --> stg\n stg --> prod
"},{"location":"use-case/blog/shift-left-data-quality/#understanding-the-shift-left-approach","title":"Understanding the Shift Left Approach","text":"\"Shift left\" is a philosophy that advocates for addressing tasks and concerns earlier in the development lifecycle. Applied to data quality, it means tackling data issues as early as possible, ideally during the development and testing phases. This approach aims to catch data anomalies, inaccuracies, or inconsistencies before they propagate through the system, reducing the likelihood of downstream errors.
"},{"location":"use-case/blog/shift-left-data-quality/#data-caterer-the-catalyst-for-shifting-left","title":"Data Caterer: The Catalyst for Shifting Left","text":"Enter Data Caterer, a metadata-driven data generation and validation tool designed to empower organizations in shifting data quality left. By incorporating Data Caterer into the early stages of development, teams can proactively test complex data flows, validate data sources, and ensure data quality before it reaches downstream processes.
"},{"location":"use-case/blog/shift-left-data-quality/#key-advantages-of-shifting-data-quality-left-with-data-caterer","title":"Key Advantages of Shifting Data Quality Left with Data Caterer","text":"As organizations strive for excellence in their data-driven endeavors, the shift left approach with Data Caterer becomes a strategic imperative. By instilling a proactive data quality culture, teams can minimize the risk of costly errors, enhance the reliability of their data, and streamline the entire development lifecycle.
In conclusion, the marriage of the shift left philosophy and Data Caterer brings forth a new era of data management, where data quality is not just a final checkpoint but an integral part of every development milestone. Embrace the shift left approach with Data Caterer and empower your teams to build robust, high-quality data solutions from the very beginning.
Shift Left, Validate Early, and Accelerate with Data Caterer.
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Automated data generation and validation tool","text":"Data Caterer is a metadata-driven data generation and validation tool that aids in creating production-like data across both batch and event data systems. Run data validations to ensure your systems have ingested it as expected, then clean up the data afterwards. Simplify your data testing Take away the pain and complexity of your data landscape and let Data Caterer handle itTry now
Data testing is difficult and fragmentedTry now
What you need is a reliable tool that can handle changes to your data landscape
With Data Caterer, you get:
Try now
"},{"location":"#tech-summary","title":"Tech Summary","text":"Use the Java, Scala API, or YAML files to help with setup or customisation that are all run via a Docker image. Want to get into details? Checkout the setup pages here to get code examples and guides that will take you through scenarios and data sources.
Main features include:
Check other run configurations here.
"},{"location":"#what-is-it","title":"What is it","text":"Data generation and testing tool
Generate synthetic production-like data to be consumed and validated.
Designed for any data source
We aim to support pushing data to any data source, in any format.
Low/no code solution
Use the tool via Scala, Java or YAML. Connect to data or metadata sources to generate and validate data.
Developer productivity tool
Whether you are a new developer or a seasoned veteran, cut down your feedback loop when developing with data.
Metadata storage/platform
You could store and use metadata within the data generation/validation tasks, but this is not the recommended approach. Rather, this metadata should be gathered from existing services that handle metadata on behalf of Data Caterer.
Data contract
The focus of Data Caterer is on data generation and testing, which can include details about what the data looks like and how it behaves. But it does not encompass all the additional metadata that comes with a data contract, such as SLAs, security, etc.
Metrics from load testing
Although millions of records can be generated, there are limited capabilities in terms of metric capturing.
Try now
Data Catering vs Other tools vs In-house
Data flow. Data Catering: Batch and events generation with validation. Other tools: Batch generation only or validation only. In-house: Depends on architecture and design.
Time to results. Data Catering: 1 day. Other tools: 1+ month to integrate, deploy and onboard. In-house: 1+ month to build and deploy.
Solution. Data Catering: Connect with your existing data ecosystem, automatic generation and validation. Other tools: Manual UI data entry or via SDK. In-house: Depends on engineer(s) building it.
"},{"location":"about/","title":"About","text":"Hi, my name is Peter. I am a Software Developer, mainly focussing on data related services. My experience can be found on my LinkedIn.
I created Data Caterer to help individuals and companies with data generation and data testing. It is a complex area with many edge cases and intricacies that are hard to summarise or turn into something actionable and repeatable. Through the use of metadata, Data Caterer can help simplify your data testing, simulate production environment data, aid in data debugging, or support whatever your data use case may be.
Given that it is going to save you and your team time and money, please consider providing financial support. This will help the product grow into a sustainable and feature-rich service.
"},{"location":"about/#contact","title":"Contact","text":"Please contact Peter Flook via Slack or via email peter.flook@data.catering
if you have any questions or queries.
To have access to all the features of Data Caterer, you can subscribe according to your situation. You will not be charged by usage. As you continue to subscribe, you will have access to the latest version of Data Caterer as new bug fixes and features get published.
This has been a passion project of mine, where I have spent countless hours thinking of the idea, implementing, maintaining, documenting and updating it. I hope that it will help developers and companies with their testing by saving time and effort, allowing you to focus on what is important. If you are in this boat, please consider sponsorship to allow me to further maintain and upgrade the solution. Any contributions are much appreciated.
If you want to use this project for open source applications, please contact me, as I would be happy to contribute.
This is inspired by the mkdocs-material project that follows the same model.
"},{"location":"sponsor/#features","title":"Features","text":"Manage via this link
"},{"location":"sponsor/#contact","title":"Contact","text":"Please contact Peter Flook via Slack or via email peter.flook@data.catering
if you have any questions or queries.
Having a stable and reliable test environment is a challenge for a number of companies, especially where teams are asynchronously deploying and testing changes at faster rates. Data Caterer can help alleviate these issues by doing the following:
Similar to the above, being able to replicate production like data in your local environment can be key to developing more reliable code as you can test directly against data in your local computer. This has a number of benefits including:
When working with third-party, external or internal data providers, it can be difficult to have everything set up to produce reliable data that abides by the relationship contracts between each of the systems. You have to rely on these data providers to run your tests, which may not align with their priorities. With Data Caterer, you can generate the same data that they would produce, while maintaining referential integrity across the data providers, so that you can run your tests without relying on their systems being up and reliable in their corresponding lower environments.
"},{"location":"use-case/#scenario-testing","title":"Scenario testing","text":"If you want to set up particular data scenarios, you can customise the generated data to fit your scenario. Once the data gets generated and is consumed, you can also run validations to ensure your system has consumed the data correctly. These scenarios can be put together from existing tasks or data sources can be enabled/disabled based on your requirement. Built into Data Caterer and controlled via feature flags, is the ability to test edge cases based on the data type of the fields used for data generation (enableEdgeCases
flag within <field>.generator.options
, see more here).
When data-related issues occur in production, they may be difficult to replicate in a lower or local environment. The issue could be related to specific fields not containing expected results, the size of the data being too large, or corresponding referenced data being missing. Replicating the data becomes key to resolving the issue, as you can code directly against the exact data scenario and have confidence that your code changes will fix the problem. Data Caterer can be used to generate the appropriate data in whichever environment you want to test your changes against.
"},{"location":"use-case/#data-profiling","title":"Data profiling","text":"When using Data Caterer with the feature flag enableGeneratePlanAndTasks
enabled (see here), metadata relating to all the fields defined in the data sources you have configured will be generated via data profiling. You can run this as a standalone job (can disable enableGenerateData
) so that you can focus on the profile of the data you are utilising. This can be run against your production data sources to ensure the metadata can be used to accurately generate data in other environments. This is a key feature of Data Caterer as no direct production connections need to be maintained to generate data in other environments (which can lead to serious concerns about data security as seen here).
When using Data Caterer with the feature flag enableGeneratePlanAndTasks
enabled (see here), all schemas of the data sources defined will be tracked in a common format (as tasks). This data, along with the data profiling metadata, could then feed back into your schema registries to help keep them up to date with your system.
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.7.0\n
Open localhost:9898.git clone git@github.com:data-catering/data-caterer-example.git\ncd data-caterer-example && ./run.sh\n#check results under docker/sample/report/index.html folder\n
"},{"location":"get-started/docker/#report","title":"Report","text":"Check the report generated under docker/data/custom/report/index.html
.
Sample report can also be seen here
"},{"location":"get-started/docker/#paid-version-trial","title":"Paid Version Trial","text":"30 day trial of the paid version can be accessed via these steps:
/token
in the Slack group (will only be visible to you)git clone git@github.com:data-catering/data-caterer-example.git\ncd data-caterer-example && export DATA_CATERING_API_KEY=<insert api key>\n./run.sh\n
If you want to check how long your trial has left, you can check back in the Slack group or type /token
again.
Check out the starter guide here that will take you through step by step. You can also check the other guides here to see what else Data Caterer can achieve for you.
"},{"location":"legal/privacy-policy/","title":"Privacy Policy","text":"Last updated September 25, 2023
"},{"location":"legal/privacy-policy/#data-caterer-policy-on-privacy-of-customer-personal-information","title":"Data Caterer Policy on Privacy of Customer Personal Information","text":"Peter John Flook is committed to protecting the privacy and security of your personal information obtained by reason of your use of Data Caterer. This policy explains the types of customer personal information we collect, how it is used, and the steps we take to ensure your personal information is handled appropriately.
"},{"location":"legal/privacy-policy/#who-is-peter-john-flook","title":"Who is Peter John Flook?","text":"For purposes of this Privacy Policy, \u201cPeter John Flook\u201d means Peter John Flook, the company developing and providing Data Caterer and related websites and services.
"},{"location":"legal/privacy-policy/#what-is-personal-information","title":"What is personal information?","text":"Personal information is information that refers to an individual specifically and is recorded in any form. Personal information includes such things as age, income, date of birth, ethnic origin and credit records. Information about individuals contained in the following documents is not considered personal information:
Peter John Flook is responsible for all personal information under its control. Our team is accountable for compliance with these privacy and security principles.
"},{"location":"legal/privacy-policy/#we-let-you-know-why-we-collect-and-use-your-personal-information-and-get-your-consent","title":"We let you know why we collect and use your personal information and get your consent","text":"Peter John Flook identifies the purpose for which your personal information is collected and will be used or disclosed. If that purpose is not listed below we will do this before or at the time the information is actually being collected. You will be deemed to consent to our use of your personal information for the purpose of:
Otherwise, Peter John Flook will obtain your express consent (by verbal, written or electronic agreement) to collect, use or disclose your personal information. You can change your consent preferences at any time by contacting Peter John Flook (please refer to the \u201cHow to contact us\u201d section below).
"},{"location":"legal/privacy-policy/#we-limit-collection-of-your-personal-information","title":"We limit collection of your personal information","text":"Peter John Flook collects only the information required to provide products and services to you. Peter John Flook will collect personal information only by clear, fair and lawful means.
We receive and store any information you enter on our website or give us in any other way. You can choose not to provide certain information, but then you might not be able to take advantage of many of our features.
Peter John Flook does not receive or store personal content saved to your local device while using Data Caterer.
We also receive and store certain types of information whenever you interact with us.
"},{"location":"legal/privacy-policy/#information-provided-to-stripe","title":"Information provided to Stripe","text":"All purchases that are made through this site are processed securely and externally by Stripe. Unless you expressly consent otherwise, we do not see or have access to any personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address).
"},{"location":"legal/privacy-policy/#we-limit-disclosure-and-retention-of-your-personal-information","title":"We limit disclosure and retention of your personal information","text":"Peter John Flook does not disclose personal information to any organization or person for any reason except the following:
We employ other companies and individuals to perform functions on our behalf. Examples include fulfilling orders, delivering packages, sending postal mail and e-mail, removing repetitive information from customer lists, analyzing data, providing marketing assistance, processing credit card payments, and providing customer service. They have access to personal information needed to perform their functions, but may not use it for other purposes. We may use service providers located outside of Australia, and, if applicable, your personal information may be processed and stored in other countries and therefore may be subject to disclosure under the laws of those countries. As we continue to develop our business, we might sell or buy stores, subsidiaries, or business units. In such transactions, customer information generally is one of the transferred business assets but remains subject to the promises made in any pre-existing Privacy Notice (unless, of course, the customer consents otherwise). Also, in the unlikely event that Peter John Flook or substantially all of its assets are acquired, customer information of course will be one of the transferred assets. You are deemed to consent to disclosure of your personal information for those purposes. If your personal information is shared with third parties, those third parties are bound by appropriate agreements with Peter John Flook to secure and protect the confidentiality of your personal information.
Peter John Flook retains your personal information only as long as it is required for our business relationship or as required by federal and provincial laws.
"},{"location":"legal/privacy-policy/#we-keep-your-personal-information-up-to-date-and-accurate","title":"We keep your personal information up to date and accurate","text":"Peter John Flook keeps your personal information up to date, accurate and relevant for its intended use.
You may request access to the personal information we have on record in order to review and amend the information, as appropriate. In circumstances where your personal information has been provided by a third party, we will refer you to that party (e.g. credit bureaus). To access your personal information, refer to the \u201cHow to contact us\u201d section below.
"},{"location":"legal/privacy-policy/#the-security-of-your-personal-information-is-a-priority-for-peter-john-flook","title":"The security of your personal information is a priority for Peter John Flook","text":"We take steps to safeguard your personal information, regardless of the format in which it is held, including:
physical security measures such as restricted access facilities and locked filing cabinets
electronic security measures for computerized personal information such as password protection, database encryption and personal identification numbers. We work to protect the security of your information during transmission by using \u201cTransport Layer Security\u201d (TLS) protocol.
organizational processes such as limiting access to your personal information to a selected group of individuals
contractual obligations with third parties who need access to your personal information requiring them to protect and secure your personal information
It\u2019s important for you to protect against unauthorized access to your password and your computer. Be sure to sign off when you\u2019ve finished using any shared computer.
"},{"location":"legal/privacy-policy/#what-about-third-party-advertisers-and-links-to-other-websites","title":"What About Third-Party Advertisers and Links to Other Websites?","text":"Our site may include third-party advertising and links to other websites. We do not provide any personally identifiable customer information to these advertisers or third-party websites.
These third-party websites and advertisers, or Internet advertising companies working on their behalf, sometimes use technology to send (or \u201cserve\u201d) the advertisements that appear on our website directly to your browser. They automatically receive your IP address when this happens. They may also use cookies, JavaScript, web beacons (also known as action tags or single-pixel gifs), and other technologies to measure the effectiveness of their ads and to personalize advertising content. We do not have access to or control over cookies or other features that they may use, and the information practices of these advertisers and third-party websites are not covered by this Privacy Notice. Please contact them directly for more information about their privacy practices. In addition, the Network Advertising Initiative offers useful information about Internet advertising companies (also called \u201cad networks\u201d or \u201cnetwork advertisers\u201d), including information about how to opt-out of their information collection. You can access the Network Advertising Initiative at http://www.networkadvertising.org.
"},{"location":"legal/privacy-policy/#redirection-to-stripe","title":"Redirection to Stripe","text":"In particular, when you submit an order to us, you may be automatically redirected to Stripe in order to complete the required payment. The payment page that is provided by Stripe is not part of this site. As noted above, we are not privy to any of the bank account, credit card or other personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address). We recommend that you refer to Stripe\u2019s privacy statement if you would like more information about how Stripe collects and handles your personal information.
"},{"location":"legal/privacy-policy/#we-are-open-about-our-privacy-and-security-policy","title":"We are open about our privacy and security policy","text":"We are committed to providing you with understandable and easily available information about our policy and practices related to management of your personal information. This policy and any related information is available at all times on our website, https://data.catering/about/ under Privacy or on request. To contact us, refer to the \u201cHow to contact us\u201d section below.
"},{"location":"legal/privacy-policy/#we-provide-access-to-your-personal-information-stored-by-peter-john-flook","title":"We provide access to your personal information stored by Peter John Flook","text":"You can request access to your personal information stored by Peter John Flook. To contact us, refer to the \u201cHow to contact us\u201d section below. Upon receiving such a request, Peter John Flook will:
inform you about what type of personal information we have on record or in our control, how it is used and to whom it may have been disclosed
provide you with access to your information so you can review and verify the accuracy and completeness and request changes to the information
make any necessary updates to your personal information
We respond to your questions, concerns and complaints about privacy
Peter John Flook responds in a timely manner to your questions, concerns and complaints about the privacy of your personal information and our privacy policies and procedures.
"},{"location":"legal/privacy-policy/#how-to-contact-us","title":"How to contact us","text":"peter.flook@data.catering
Our business changes constantly, and this privacy notice will change also. We may e-mail periodic reminders of our notices and conditions, unless you have instructed us not to, but you should check our website frequently to see recent changes. We are, however, committed to protecting your information and will never materially change our policies and practices to make them less protective of customer information collected in the past without the consent of affected customers.
"},{"location":"legal/terms-of-service/","title":"Terms and Conditions","text":"Last updated: September 25, 2023
Please read these terms and conditions carefully before using Our Service.
"},{"location":"legal/terms-of-service/#interpretation-and-definitions","title":"Interpretation and Definitions","text":""},{"location":"legal/terms-of-service/#interpretation","title":"Interpretation","text":"The words of which the initial letter is capitalized have meanings defined under the following conditions. The following definitions shall have the same meaning regardless of whether they appear in singular or in plural.
"},{"location":"legal/terms-of-service/#definitions","title":"Definitions","text":"For the purposes of these Terms and Conditions:
These are the Terms and Conditions governing the use of this Service and the agreement that operates between You and the Company. These Terms and Conditions set out the rights and obligations of all users regarding the use of the Service.
Your access to and use of the Service is conditioned on Your acceptance of and compliance with these Terms and Conditions. These Terms and Conditions apply to all visitors, users and others who access or use the Service.
By accessing or using the Service You agree to be bound by these Terms and Conditions. If You disagree with any part of these Terms and Conditions then You may not access the Service.
You represent that you are over the age of 18. The Company does not permit those under 18 to use the Service.
Your access to and use of the Service is also conditioned on Your acceptance of and compliance with the Privacy Policy of the Company. Our Privacy Policy describes Our policies and procedures on the collection, use and disclosure of Your personal information when You use the Application or the Website and tells You about Your privacy rights and how the law protects You. Please read Our Privacy Policy carefully before using Our Service.
"},{"location":"legal/terms-of-service/#links-to-other-websites","title":"Links to Other Websites","text":"Our Service may contain links to third-party websites or services that are not owned or controlled by the Company.
The Company has no control over, and assumes no responsibility for, the content, privacy policies, or practices of any third party websites or services. You further acknowledge and agree that the Company shall not be responsible or liable, directly or indirectly, for any damage or loss caused or alleged to be caused by or in connection with the use of or reliance on any such content, goods or services available on or through any such websites or services.
We strongly advise You to read the terms and conditions and privacy policies of any third-party websites or services that You visit.
"},{"location":"legal/terms-of-service/#termination","title":"Termination","text":"We may terminate or suspend Your access immediately, without prior notice or liability, for any reason whatsoever, including without limitation if You breach these Terms and Conditions.
Upon termination, Your right to use the Service will cease immediately.
"},{"location":"legal/terms-of-service/#limitation-of-liability","title":"Limitation of Liability","text":"Notwithstanding any damages that You might incur, the entire liability of the Company and any of its suppliers under any provision of these Terms and Your exclusive remedy for all the foregoing shall be limited to the amount actually paid by You through the Service or 100 USD if You haven't purchased anything through the Service.
To the maximum extent permitted by applicable law, in no event shall the Company or its suppliers be liable for any special, incidental, indirect, or consequential damages whatsoever (including, but not limited to, damages for loss of profits, loss of data or other information, for business interruption, for personal injury, loss of privacy arising out of or in any way related to the use of or inability to use the Service, third-party software and/or third-party hardware used with the Service, or otherwise in connection with any provision of these Terms), even if the Company or any supplier has been advised of the possibility of such damages and even if the remedy fails of its essential purpose.
Some states do not allow the exclusion of implied warranties or limitation of liability for incidental or consequential damages, which means that some of the above limitations may not apply. In these states, each party's liability will be limited to the greatest extent permitted by law.
"},{"location":"legal/terms-of-service/#as-is-and-as-available-disclaimer","title":"\"AS IS\" and \"AS AVAILABLE\" Disclaimer","text":"The Service is provided to You \"AS IS\" and \"AS AVAILABLE\" and with all faults and defects without warranty of any kind. To the maximum extent permitted under applicable law, the Company, on its own behalf and on behalf of its Affiliates and its and their respective licensors and service providers, expressly disclaims all warranties, whether express, implied, statutory or otherwise, with respect to the Service, including all implied warranties of merchantability, fitness for a particular purpose, title and non-infringement, and warranties that may arise out of course of dealing, course of performance, usage or trade practice. Without limitation to the foregoing, the Company provides no warranty or undertaking, and makes no representation of any kind that the Service will meet Your requirements, achieve any intended results, be compatible or work with any other software, applications, systems or services, operate without interruption, meet any performance or reliability standards or be error free or that any errors or defects can or will be corrected.
Without limiting the foregoing, neither the Company nor any of the Company's providers makes any representation or warranty of any kind, express or implied: (i) as to the operation or availability of the Service, or the information, content, and materials or products included thereon; (ii) that the Service will be uninterrupted or error-free; (iii) as to the accuracy, reliability, or currency of any information or content provided through the Service; or (iv) that the Service, its servers, the content, or e-mails sent from or on behalf of the Company are free of viruses, scripts, trojan horses, worms, malware, time-bombs or other harmful components.
Some jurisdictions do not allow the exclusion of certain types of warranties or limitations on applicable statutory rights of a consumer, so some or all of the above exclusions and limitations may not apply to You. But in such a case the exclusions and limitations set forth in this section shall be applied to the greatest extent enforceable under applicable law.
"},{"location":"legal/terms-of-service/#governing-law","title":"Governing Law","text":"The laws of the Country, excluding its conflicts of law rules, shall govern this Terms and Your use of the Service. Your use of the Application may also be subject to other local, state, national, or international laws.
"},{"location":"legal/terms-of-service/#disputes-resolution","title":"Disputes Resolution","text":"If You have any concern or dispute about the Service, You agree to first try to resolve the dispute informally by contacting the Company.
"},{"location":"legal/terms-of-service/#for-european-union-eu-users","title":"For European Union (EU) Users","text":"If You are a European Union consumer, you will benefit from any mandatory provisions of the law of the country in which you are resident in.
"},{"location":"legal/terms-of-service/#united-states-legal-compliance","title":"United States Legal Compliance","text":"You represent and warrant that (i) You are not located in a country that is subject to the United States government embargo, or that has been designated by the United States government as a \"terrorist supporting\" country, and (ii) You are not listed on any United States government list of prohibited or restricted parties.
"},{"location":"legal/terms-of-service/#severability-and-waiver","title":"Severability and Waiver","text":""},{"location":"legal/terms-of-service/#severability","title":"Severability","text":"If any provision of these Terms is held to be unenforceable or invalid, such provision will be changed and interpreted to accomplish the objectives of such provision to the greatest extent possible under applicable law and the remaining provisions will continue in full force and effect.
"},{"location":"legal/terms-of-service/#waiver","title":"Waiver","text":"Except as provided herein, the failure to exercise a right or to require performance of an obligation under these Terms shall not affect a party's ability to exercise such right or require such performance at any time thereafter nor shall the waiver of a breach constitute a waiver of any subsequent breach.
"},{"location":"legal/terms-of-service/#translation-interpretation","title":"Translation Interpretation","text":"These Terms and Conditions may have been translated if We have made them available to You on our Service. You agree that the original English text shall prevail in the case of a dispute.
"},{"location":"legal/terms-of-service/#changes-to-these-terms-and-conditions","title":"Changes to These Terms and Conditions","text":"We reserve the right, at Our sole discretion, to modify or replace these Terms at any time. If a revision is material We will make reasonable efforts to provide at least 30 days' notice prior to any new terms taking effect. What constitutes a material change will be determined at Our sole discretion.
By continuing to access or use Our Service after those revisions become effective, You agree to be bound by the revised terms. If You do not agree to the new terms, in whole or in part, please stop using the website and the Service.
"},{"location":"legal/terms-of-service/#contact-us","title":"Contact Us","text":"If you have any questions about these Terms and Conditions, You can contact us:
All the configurations and customisation related to Data Caterer can be found here.
"},{"location":"setup/#guide","title":"Guide","text":"If you want a guided tour of using the Java or Scala API, you can follow one of the guides found here.
"},{"location":"setup/#specific-configuration","title":"Specific Configuration","text":"There are many options available for scenarios where data has to be in a certain format.
Details for how you can configure foreign keys can be found here.
"},{"location":"setup/advanced/#edge-cases","title":"Edge cases","text":"For each given data type, there are edge cases which can cause issues when your application processes the data. This can be controlled at a column level by including the following flag in the generator options:
JavaScalaYAMLfield()\n.name(\"amount\")\n.type(DoubleType.instance())\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n
field\n.name(\"amount\")\n.`type`(DoubleType)\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n
fields:\n- name: \"amount\"\ntype: \"double\"\ngenerator:\ntype: \"random\"\noptions:\nenableEdgeCases: \"true\"\nedgeCaseProb: 0.1\n
If you want to know all the possible edge cases for each data type, you can check the documentation here.
"},{"location":"setup/advanced/#scenario-testing","title":"Scenario testing","text":"You can create specific scenarios by adjusting the metadata found in the plan and tasks to your liking. For example, suppose you have two data sources, a Postgres database and a parquet file, and you want to save account data into Postgres and transactions related to those accounts into a parquet file. You can alter the status
column in the account data to only generate open
accounts and define a foreign key between Postgres and parquet to ensure the same account_id
is being used. Then in the parquet task, define 1 to 10 transactions per account_id
to be generated.
Postgres account generation example task Parquet transaction generation example task Plan
"},{"location":"setup/advanced/#cloud-storage","title":"Cloud storage","text":""},{"location":"setup/advanced/#data-source","title":"Data source","text":"If you want to save the file types CSV, JSON, Parquet or ORC into cloud storage, you can do so via adding extra configurations. Below is an example for S3.
JavaScalaYAMLvar csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield().name(\"account_id\"),\n...\n);\n\nvar s3Configuration = configuration()\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n\nexecute(s3Configuration, csvTask);\n
val csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield.name(\"account_id\"),\n...\n)\n\nval s3Configuration = configuration\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -> \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -> \"true\",\n\"spark.hadoop.fs.defaultFS\" -> \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -> \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -> \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -> \"secret_key\"\n))\n\nexecute(s3Configuration, csvTask)\n
folders {\ngeneratedPlanAndTaskFolderPath = \"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig {\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n
"},{"location":"setup/advanced/#storing-plantasks","title":"Storing plan/task(s)","text":"You can generate and store the plan/task files inside either AWS S3, Azure Blob Storage or Google GCS. This can be controlled via configuration set in the application.conf
file where you can set something like the below:
configuration()\n.generatedPlanAndTaskFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n
configuration\n.generatedPlanAndTaskFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -> \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -> \"true\",\n\"spark.hadoop.fs.defaultFS\" -> \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -> \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -> \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -> \"secret_key\"\n))\n
folders {\ngeneratedPlanAndTaskFolderPath = \"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig {\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n
"},{"location":"setup/configuration/","title":"Configuration","text":"A number of configurations can be made and customised within Data Caterer to help control what gets run and/or where any metadata gets saved.
These configurations are defined from within your Java or Scala class via configuration
or for YAML file setup, application.conf
file as seen here.
Flags are used to control which processes are executed when you run Data Caterer.
Config Default Paid Description enableGenerateData
true N Enable/disable data generation enableCount
true N Count the number of records generated. Can be disabled to improve performance enableFailOnError
true N Whilst saving generated data, if there is an error, it will stop any further data from being generated enableSaveReports
true N Enable/disable HTML reports summarising data generated, metadata of data generated (if enableSinkMetadata
is enabled) and validation results (if enableValidation
is enabled). Sample here enableSinkMetadata
true N Run data profiling for the generated data. Shown in HTML reports if enableSaveReports
is enabled enableValidation
false N Run validations as described in plan. Results can be viewed from logs or from HTML report if enableSaveReports
is enabled. Sample here enableUniqueCheck
false N If enabled, for any isUnique
fields, will ensure only unique values are generated enableAlerts
true N Enable/disable alerts to be sent enableGeneratePlanAndTasks
false Y Enable/disable plan and task auto generation based off data source connections enableRecordTracking
false Y Enable/disable tracking of which data records have been generated for any data source enableDeleteGeneratedRecords
false Y Delete all generated records based off record tracking (if enableRecordTracking
has been set to true) enableGenerateValidations
false Y If enabled, it will generate validations based on the data sources defined. JavaScalaapplication.conf configuration()\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableUniqueCheck(true)\n.enableAlerts(true)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false);\n
configuration\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableUniqueCheck(true)\n.enableAlerts(true)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false)\n
flags {\n enableCount = false\n enableCount = ${?ENABLE_COUNT}\n enableGenerateData = true\n enableGenerateData = ${?ENABLE_GENERATE_DATA}\n enableFailOnError = true\n enableFailOnError = ${?ENABLE_FAIL_ON_ERROR}\n enableGeneratePlanAndTasks = false\n enableGeneratePlanAndTasks = ${?ENABLE_GENERATE_PLAN_AND_TASKS}\n enableRecordTracking = false\n enableRecordTracking = ${?ENABLE_RECORD_TRACKING}\n enableDeleteGeneratedRecords = false\n enableDeleteGeneratedRecords = ${?ENABLE_DELETE_GENERATED_RECORDS}\n enableUniqueCheck = true\n enableUniqueCheck = ${?ENABLE_UNIQUE_CHECK}\n enableSinkMetadata = true\n enableSinkMetadata = ${?ENABLE_SINK_METADATA}\n enableSaveReports = true\n enableSaveReports = ${?ENABLE_SAVE_REPORTS}\n enableValidation = false\n enableValidation = ${?ENABLE_VALIDATION}\n enableGenerateValidations = false\n enableGenerateValidations = ${?ENABLE_GENERATE_VALIDATIONS}\n enableAlerts = false\n enableAlerts = ${?ENABLE_ALERTS}\n}\n
"},{"location":"setup/configuration/#folders","title":"Folders","text":"Depending on which flags are enabled, there are folders that get used to save metadata, store HTML reports or track the records generated.
These folder paths can also point to cloud storage (e.g. s3a://my-bucket/task
).
planFilePath
/opt/app/plan/customer-create-plan.yaml N Plan file path to use when generating and/or validating data taskFolderPath
/opt/app/task N Task folder path that contains all the task files (can have nested directories) validationFolderPath
/opt/app/validation N Validation folder path that contains all the validation files (can have nested directories) generatedReportsFolderPath
/opt/app/report N Where HTML reports get generated that contain information about data generated along with any validations performed generatedPlanAndTaskFolderPath
/tmp Y Folder path where generated plan and task files will be saved recordTrackingFolderPath
/opt/app/record-tracking Y Where record tracking parquet files get saved recordTrackingForValidationFolderPath
/opt/app/record-tracking-validation Y Where record tracking parquet files get saved for the purpose of validation JavaScalaapplication.conf configuration()\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\")\n.recordTrackingForValidationFolderPath(\"/opt/app/custom/record-tracking-validation\");\n
configuration\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\")\n.recordTrackingForValidationFolderPath(\"/opt/app/custom/record-tracking-validation\")\n
folders {\n planFilePath = \"/opt/app/custom/plan/postgres-plan.yaml\"\n planFilePath = ${?PLAN_FILE_PATH}\n taskFolderPath = \"/opt/app/custom/task\"\n taskFolderPath = ${?TASK_FOLDER_PATH}\n validationFolderPath = \"/opt/app/custom/validation\"\n validationFolderPath = ${?VALIDATION_FOLDER_PATH}\n generatedReportsFolderPath = \"/opt/app/custom/report\"\n generatedReportsFolderPath = ${?GENERATED_REPORTS_FOLDER_PATH}\n generatedPlanAndTaskFolderPath = \"/opt/app/custom/generated\"\n generatedPlanAndTaskFolderPath = ${?GENERATED_PLAN_AND_TASK_FOLDER_PATH}\n recordTrackingFolderPath = \"/opt/app/custom/record-tracking\"\n recordTrackingFolderPath = ${?RECORD_TRACKING_FOLDER_PATH}\n recordTrackingForValidationFolderPath = \"/opt/app/custom/record-tracking-validation\"\n recordTrackingForValidationFolderPath = ${?RECORD_TRACKING_VALIDATION_FOLDER_PATH}\n}\n
"},{"location":"setup/configuration/#metadata","title":"Metadata","text":"When metadata gets generated, there are some configurations that can be altered to help with performance or accuracy related issues. Metadata gets generated from two processes: 1) if enableGeneratePlanAndTasks
or 2) if enableSinkMetadata
are enabled.
During the generation of plan and tasks, data profiling is used to create the metadata for each of the fields defined in the data source. You may face issues if the number of records in the data source is large as data profiling is an expensive task. Similarly, it can be expensive when analysing the generated data if the number of records generated is large.
Config Default Paid Description numRecordsFromDataSource
10000 Y Number of records read in from the data source that could be used for data profiling numRecordsForAnalysis
10000 Y Number of records used for data profiling from the records gathered in numRecordsFromDataSource
oneOfMinCount
1000 Y Minimum number of records required before considering if a field can be of type oneOf
oneOfDistinctCountVsCountThreshold
0.2 Y Threshold ratio to determine if a field is of type oneOf
(e.g. a field called status
that only contains open
or closed
. Distinct count = 2, total count = 10, ratio = 2 / 10 = 0.2 therefore marked as oneOf
) numGeneratedSamples
10 N Number of sample records from generated data to take. Shown in HTML report JavaScalaapplication.conf configuration()\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(0.2)\n.numGeneratedSamples(10);\n
configuration\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(0.2)\n.numGeneratedSamples(10)\n
metadata {\n numRecordsFromDataSource = 10000\n numRecordsForAnalysis = 10000\n oneOfMinCount = 1000\n oneOfDistinctCountVsCountThreshold = 0.2\n numGeneratedSamples = 10\n}\n
"},{"location":"setup/configuration/#generation","title":"Generation","text":"When generating data, you may have some limitations such as limited CPU or memory, large number of data sources, or data sources prone to failure under load. To help alleviate these issues or speed up performance, you can control the number of records that get generated in each batch.
Config Default Paid Description numRecordsPerBatch
100000 N Number of records across all data sources to generate per batch numRecordsPerStep
N Overrides the count defined in each step with this value if defined (i.e. if set to 1000, for each step, 1000 records will be generated) JavaScalaapplication.conf configuration()\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000);\n
configuration\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000)\n
generation {\n numRecordsPerBatch = 100000\n numRecordsPerStep = 1000\n}\n
"},{"location":"setup/configuration/#validation","title":"Validation","text":"Configurations to alter how validations are executed.
Config Default Paid Description numSampleErrorRecords
5 N Number of error sample records to retrieve and display in generated HTML report. Increase to help debugging data issues enableDeleteRecordTrackingFiles
true Y After validations are complete, delete record tracking files that were used for validation purposes (enabled via enableRecordTracking
) JavaScalaapplication.conf configuration()\n.numSampleErrorRecords(10)\n.enableDeleteRecordTrackingFiles(false);\n
configuration\n.numSampleErrorRecords(10)\n.enableDeleteRecordTrackingFiles(false)\n
validation {\n numSampleErrorRecords = 10\n enableDeleteRecordTrackingFiles = false\n}\n
"},{"location":"setup/configuration/#runtime","title":"Runtime","text":"Given Data Caterer uses Spark as the base framework for data processing, you can configure the job to your specifications via configuration, as seen here.
JavaScalaapplication.confconfiguration()\n.master(\"local[*]\")\n.runtimeConfig(Map.of(\"spark.driver.cores\", \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\", \"10g\");\n
configuration\n.master(\"local[*]\")\n.runtimeConfig(Map(\"spark.driver.cores\" -> \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\" -> \"10g\")\n
runtime {\n master = \"local[*]\"\n master = ${?DATA_CATERER_MASTER}\n config {\n \"spark.driver.cores\" = \"5\"\n \"spark.driver.memory\" = \"10g\"\n }\n}\n
"},{"location":"setup/connection/","title":"Data Source Connections","text":"Details of all the connection configuration supported can be found in the below subsections for each type of connection.
These configurations can be done via API or from configuration. Examples of both are shown for each data source below.
"},{"location":"setup/connection/#supported-data-connections","title":"Supported Data Connections","text":"Data Source Type Data Source Sponsor Database Postgres, MySQL, Cassandra N File CSV, JSON, ORC, Parquet N Messaging Kafka, Solace Y HTTP REST API Y Metadata Marquez, OpenMetadata, OpenAPI/Swagger Y"},{"location":"setup/connection/#api","title":"API","text":"All connection details require a name. Depending on the data source, you can define additional options which may be used by the driver or connector for connecting to the data source.
"},{"location":"setup/connection/#configuration-file","title":"Configuration file","text":"All connection details follow the same pattern.
<connection format> {\n <connection name> {\n <key> = <value>\n }\n}\n
Overriding configuration
When defining a configuration value that can be defined by a system property or environment variable at runtime, you can define that via the following:
url = \"localhost\"\nurl = ${?POSTGRES_URL}\n
The above defines that if there is a system property or environment variable named POSTGRES_URL
, then that value will be used for the url
, otherwise, it will default to localhost
.
To find examples of a task for each type of data source, please check out this page.
"},{"location":"setup/connection/#file","title":"File","text":"Linked here is a list of generic options that can be included as part of your file data source configuration if required. Links to specific file type configurations can be found below.
"},{"location":"setup/connection/#csv","title":"CSV","text":"JavaScalaapplication.confcsv(\"customer_transactions\", \"/data/customer/transaction\")\n
csv(\"customer_transactions\", \"/data/customer/transaction\")\n
csv {\n customer_transactions {\n path = \"/data/customer/transaction\"\n path = ${?CSV_PATH}\n }\n}\n
Other available configuration for CSV can be found here
"},{"location":"setup/connection/#json","title":"JSON","text":"JavaScalaapplication.confjson(\"customer_transactions\", \"/data/customer/transaction\")\n
json(\"customer_transactions\", \"/data/customer/transaction\")\n
json {\n customer_transactions {\n path = \"/data/customer/transaction\"\n path = ${?JSON_PATH}\n }\n}\n
Other available configuration for JSON can be found here
"},{"location":"setup/connection/#orc","title":"ORC","text":"JavaScalaapplication.conforc(\"customer_transactions\", \"/data/customer/transaction\")\n
orc(\"customer_transactions\", \"/data/customer/transaction\")\n
orc {\n customer_transactions {\n path = \"/data/customer/transaction\"\n path = ${?ORC_PATH}\n }\n}\n
Other available configuration for ORC can be found here
"},{"location":"setup/connection/#parquet","title":"Parquet","text":"JavaScalaapplication.confparquet(\"customer_transactions\", \"/data/customer/transaction\")\n
parquet(\"customer_transactions\", \"/data/customer/transaction\")\n
parquet {\n customer_transactions {\n path = \"/data/customer/transaction\"\n path = ${?PARQUET_PATH}\n }\n}\n
Other available configuration for Parquet can be found here
"},{"location":"setup/connection/#delta-not-supported-yet","title":"Delta (not supported yet)","text":"JavaScalaapplication.confdelta(\"customer_transactions\", \"/data/customer/transaction\")\n
delta(\"customer_transactions\", \"/data/customer/transaction\")\n
delta {\n customer_transactions {\n path = \"/data/customer/transaction\"\n path = ${?DELTA_PATH}\n }\n}\n
"},{"location":"setup/connection/#rmdbs","title":"RDBMS","text":"Follows the same configuration used by Spark as found here. A sample can be found below.
JavaScalaapplication.confpostgres(\n\"customer_postgres\", //name\n\"jdbc:postgresql://localhost:5432/customer\", //url\n\"postgres\", //username\n\"postgres\" //password\n)\n
postgres(\n\"customer_postgres\", //name\n\"jdbc:postgresql://localhost:5432/customer\", //url\n\"postgres\", //username\n\"postgres\" //password\n)\n
jdbc {\n customer_postgres {\n url = \"jdbc:postgresql://localhost:5432/customer\"\n url = ${?POSTGRES_URL}\n user = \"postgres\"\n user = ${?POSTGRES_USERNAME}\n password = \"postgres\"\n password = ${?POSTGRES_PASSWORD}\n driver = \"org.postgresql.Driver\"\n }\n}\n
Ensure that the user has write permission so that it is able to save data to the target tables.
SQL Permission StatementsGRANT INSERT ON <schema>.<table> TO <user>;\n
"},{"location":"setup/connection/#postgres","title":"Postgres","text":"See the example API or configuration definition for the Postgres connection above.
"},{"location":"setup/connection/#permissions","title":"Permissions","text":"The following permissions are required when generating plan and tasks:
SQL Permission StatementsGRANT SELECT ON information_schema.tables TO < user >;\nGRANT SELECT ON information_schema.columns TO < user >;\nGRANT SELECT ON information_schema.key_column_usage TO < user >;\nGRANT SELECT ON information_schema.table_constraints TO < user >;\nGRANT SELECT ON information_schema.constraint_column_usage TO < user >;\n
"},{"location":"setup/connection/#mysql","title":"MySQL","text":"JavaScalaapplication.conf mysql(\n\"customer_mysql\", //name\n\"jdbc:mysql://localhost:3306/customer\", //url\n\"root\", //username\n\"root\" //password\n)\n
mysql(\n\"customer_mysql\", //name\n\"jdbc:mysql://localhost:3306/customer\", //url\n\"root\", //username\n\"root\" //password\n)\n
jdbc {\n customer_mysql {\n url = \"jdbc:mysql://localhost:3306/customer\"\n user = \"root\"\n password = \"root\"\n driver = \"com.mysql.cj.jdbc.Driver\"\n }\n}\n
"},{"location":"setup/connection/#permissions_1","title":"Permissions","text":"The following permissions are required when generating plan and tasks:
SQL Permission StatementsGRANT SELECT ON information_schema.columns TO < user >;\nGRANT SELECT ON information_schema.statistics TO < user >;\nGRANT SELECT ON information_schema.key_column_usage TO < user >;\n
"},{"location":"setup/connection/#cassandra","title":"Cassandra","text":"Follows same configuration as defined by the Spark Cassandra Connector as found here
JavaScalaapplication.confcassandra(\n\"customer_cassandra\", //name\n\"localhost:9042\", //url\n\"cassandra\", //username\n\"cassandra\", //password\nMap.of() //optional additional connection options\n)\n
cassandra(\n\"customer_cassandra\", //name\n\"localhost:9042\", //url\n\"cassandra\", //username\n\"cassandra\", //password\nMap() //optional additional connection options\n)\n
org.apache.spark.sql.cassandra {\n customer_cassandra {\n spark.cassandra.connection.host = \"localhost\"\n spark.cassandra.connection.host = ${?CASSANDRA_HOST}\n spark.cassandra.connection.port = \"9042\"\n spark.cassandra.connection.port = ${?CASSANDRA_PORT}\n spark.cassandra.auth.username = \"cassandra\"\n spark.cassandra.auth.username = ${?CASSANDRA_USERNAME}\n spark.cassandra.auth.password = \"cassandra\"\n spark.cassandra.auth.password = ${?CASSANDRA_PASSWORD}\n }\n}\n
"},{"location":"setup/connection/#permissions_2","title":"Permissions","text":"Ensure that the user has write permission so that it is able to save data to the target tables.
CQL Permission StatementsGRANT INSERT ON <schema>.<table> TO <user>;\n
The following permissions are required when enabling configuration.enableGeneratePlanAndTasks(true)
as it will gather metadata information about tables and columns from the below tables.
GRANT SELECT ON system_schema.tables TO <user>;\nGRANT SELECT ON system_schema.columns TO <user>;\n
"},{"location":"setup/connection/#kafka","title":"Kafka","text":"Define your Kafka bootstrap server to connect and send generated data to corresponding topics. Topic gets set at a step level. Further details can be found here
JavaScalaapplication.confkafka(\n\"customer_kafka\", //name\n\"localhost:9092\" //url\n)\n
kafka(\n\"customer_kafka\", //name\n\"localhost:9092\" //url\n)\n
kafka {\n customer_kafka {\n kafka.bootstrap.servers = \"localhost:9092\"\n kafka.bootstrap.servers = ${?KAFKA_BOOTSTRAP_SERVERS}\n }\n}\n
The schema you define for pushing data to Kafka follows a specific top-level schema. An example can be found here. You can define the key, value, headers, partition or topic by following the linked schema.
"},{"location":"setup/connection/#jms","title":"JMS","text":"Uses JNDI lookup to send messages to JMS queue. Ensure that the messaging system you are using has your queue/topic registered via JNDI otherwise a connection cannot be created.
JavaScalaapplication.confsolace(\n\"customer_solace\", //name\n\"smf://localhost:55554\", //url\n\"admin\", //username\n\"admin\", //password\n\"default\", //vpn name\n\"/jms/cf/default\", //connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\" //initial context factory\n)\n
solace(\n\"customer_solace\", //name\n\"smf://localhost:55554\", //url\n\"admin\", //username\n\"admin\", //password\n\"default\", //vpn name\n\"/jms/cf/default\", //connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\" //initial context factory\n)\n
jms {\n customer_solace {\n initialContextFactory = \"com.solacesystems.jndi.SolJNDIInitialContextFactory\"\n connectionFactory = \"/jms/cf/default\"\n url = \"smf://localhost:55555\"\n url = ${?SOLACE_URL}\n user = \"admin\"\n user = ${?SOLACE_USER}\n password = \"admin\"\n password = ${?SOLACE_PASSWORD}\n vpnName = \"default\"\n vpnName = ${?SOLACE_VPN}\n }\n}\n
"},{"location":"setup/connection/#http","title":"HTTP","text":"Define any username and/or password needed for the HTTP requests. The url is defined in the tasks to allow for generated data to be populated in the url.
JavaScalaapplication.confhttp(\n\"customer_api\", //name\n\"admin\", //username\n\"admin\" //password\n)\n
http(\n\"customer_api\", //name\n\"admin\", //username\n\"admin\" //password\n)\n
http {\n customer_api {\n user = \"admin\"\n user = ${?HTTP_USER}\n password = \"admin\"\n password = ${?HTTP_PASSWORD}\n }\n}\n
"},{"location":"setup/deployment/","title":"Deployment","text":"Two main ways to deploy and run Data Caterer:
To package up your class along with the Data Caterer base image, you can follow the Dockerfile that is created for you here.
Then you can run the following:
./gradlew clean build\ndocker build -t <my_image_name>:<my_image_tag> .\n
"},{"location":"setup/deployment/#helm","title":"Helm","text":"Link to sample helm on GitHub here
Update the configuration to use your own data connections and settings, or your own image created above.
git clone git@github.com:data-catering/data-caterer-example.git\nhelm install data-caterer ./data-caterer-example/helm/data-caterer\n
"},{"location":"setup/design/","title":"Design","text":"This document explains the thought process behind the design of Data Caterer, to give you insight into how and why it became what it is today. It also serves as a reference for future design decisions and will be updated as they are made, so treat it as a living document.
"},{"location":"setup/design/#motivation","title":"Motivation","text":"The main difficulties that I faced as a developer and team lead relating to testing were:
These difficulties helped form the basis of the principles that Data Caterer should follow:
graph LR\n subgraph userTasks [User Configuration]\n dataGen[Data Generation]\n dataValid[Data Validation]\n runConf[Runtime Config]\n end\n\n subgraph dataProcessor [Processor]\n dataCaterer[Data Caterer]\n end\n\n subgraph existingMetadata [Metadata]\n metadataService[Metadata Services]\n metadataDataSource[Data Sources]\n end\n\n subgraph output [Output]\n outputDataSource[Data Sources]\n report[Report]\n end\n\n dataGen --> dataCaterer\n dataValid --> dataCaterer\n runConf --> dataCaterer\n direction TB\n dataCaterer -.-> metadataService\n dataCaterer -.-> metadataDataSource\n direction LR\n dataCaterer ---> outputDataSource\n dataCaterer ---> report
Foreign keys can be defined to represent the relationships between datasets where values are required to match for particular columns.
"},{"location":"setup/foreign-key/#single-column","title":"Single column","text":"Define a column in one data source to match against another column. Below example shows a postgres
data source with two tables, accounts
and transactions
that have a foreign key for account_id
.
var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList.of(Map.entry(postgresTxn, \"account_id\"))\n);\n
val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList(postgresTxn -> \"account_id\")\n)\n
---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"full_name\"\n---\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id\":\n- \"my_postgres.transactions.account_id\"\n
"},{"location":"setup/foreign-key/#multiple-columns","title":"Multiple columns","text":"You may have a scenario where multiple columns need to be aligned. From the same example, we want account_id
and name
from accounts
to match with account_id
and full_name
in transactions
respectively.
var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(postgresTxn, List.of(\"account_id\", \"full_name\")))\n);\n
val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(postgresTxn -> List(\"account_id\", \"full_name\"))\n)\n
---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"full_name\"\n---\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_postgres.transactions.account_id,full_name\"\n
"},{"location":"setup/foreign-key/#nested-column","title":"Nested column","text":"Your schema structure can have nested fields which can also be referenced as foreign keys. But to do so, you need to create a proxy field that gets omitted from the final saved data.
In the example below, the nested customer_details.name
field inside the json
task needs to match with name
from postgres
. A new field in the json
called _txn_name
is used as a temporary column to facilitate the foreign key definition.
var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").sql(\"_txn_name\"), //nested field will get value from '_txn_name'\n...\n),\nfield().name(\"_txn_name\").omit(true) //value will not be included in output\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(jsonTask, List.of(\"account_id\", \"_txn_name\")))\n);\n
val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").sql(\"_txn_name\"), //nested field will get value from '_txn_name'\n...\n),\nfield.name(\"_txn_name\").omit(true) //value will not be included in output\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(jsonTask -> List(\"account_id\", \"_txn_name\"))\n)\n
---\n#postgres task yaml\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n---\n#json task yaml\nname: \"json_data\"\nsteps:\n- name: \"transactions\"\ntype: \"json\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"_txn_name\"\ngenerator:\noptions:\nomit: true\n- name: \"customer_details\"\nschema:\nfields:\n- name: \"name\"\ngenerator:\ntype: \"sql\"\noptions:\nsql: \"_txn_name\"\n\n---\n#plan yaml\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n- name: \"json_data\"\ndataSourceName: \"my_json\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_json.transactions.account_id,_txn_name\"\n
"},{"location":"setup/validation/","title":"Validations","text":"Validations can be used to run data checks after you have run the data generator or even as a standalone task. A report summarising the success or failure of the validations is produced and can be examined for further investigation.
Full example validation can be found below. For more details, check out each of the subsections defined further below.
JavaScalaYAMLvar csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().col(\"amount\").lessThan(100),\nvalidation().col(\"year\").isEqual(2021).errorThreshold(0.1), //equivalent to if error percentage is > 10%, then fail\nvalidation().col(\"name\").matches(\"Peter .*\").errorThreshold(200) //equivalent to if number of errors is > 200, then fail\n)\n.validationWait(waitCondition().pause(1));\n\nvar conf = configuration().enableValidation(true);\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validations(\nvalidation.col(\"amount\").lessThan(100),\nvalidation.col(\"year\").isEqual(2021).errorThreshold(0.1), //equivalent to if error percentage is > 10%, then fail\nvalidation.col(\"name\").matches(\"Peter .*\").errorThreshold(200) //equivalent to if number of errors is > 200, then fail\n) .validationWait(waitCondition.pause(1))\n\nval conf = configuration.enableValidation(true)\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount < 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1 #equivalent to if error percentage is > 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200 #equivalent to if number of errors is > 200, then fail\ndescription: \"Should be lots of Peters\"\nwaitCondition:\npauseInSeconds: 1\n
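The two styles of errorThreshold in the example above can be sketched in plain Java. This is only an illustration and not Data Caterer's implementation: the class and method names (ErrorThresholdSketch, fails) are hypothetical, and the cut-off at 1.0 between a fraction and an absolute count is an assumption inferred from the comments in the example.

```java
public class ErrorThresholdSketch {
    // Hypothetical helper. Assumption: a threshold below 1.0 is read as a
    // fraction of all records, anything else as an absolute error count.
    static boolean fails(long errors, long total, double threshold) {
        if (threshold < 1.0) {
            // Fractional threshold: compare the error rate against it.
            return total > 0 && ((double) errors / total) > threshold;
        }
        // Absolute threshold: compare the raw number of errors against it.
        return errors > threshold;
    }

    public static void main(String[] args) {
        // 150 errors out of 1000 records is 15%, above a 10% threshold.
        System.out.println(fails(150, 1000, 0.1));
        // 150 errors is below an absolute threshold of 200.
        System.out.println(fails(150, 1000, 200));
    }
}
```

Under these assumptions the first call reports a failure and the second does not.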
"},{"location":"setup/validation/#wait-condition","title":"Wait Condition","text":"Once data has been generated, you may want to wait for a certain condition to be met before starting the data validations. This can be via:
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().pause(1));\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.pause(1))\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npauseInSeconds: 1\n
"},{"location":"setup/validation/#data-exists","title":"Data exists","text":"JavaScalaYAML var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWaitDataExists(\"updated_date > DATE('2023-01-01')\");\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWaitDataExists(\"updated_date > DATE('2023-01-01')\")\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"transactions\"\noptions:\npath: \"/tmp/csv\"\nexpr: \"updated_date > DATE('2023-01-01')\"\n
"},{"location":"setup/validation/#webhook","title":"Webhook","text":"JavaScalaYAML var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\")); //by default, GET request successful when 200 status code\n\n//or\n\nvar csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202)); //successful if 200 or 202 status code\n\n//or\n\nvar csvTxnsWithExistingHttpConnection = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"my_http\", \"http://localhost:8080/finished\")); //use connection configuration from existing 'my_http' connection definition\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\")) //by default, GET request successful when 200 status code\n\n//or\n\nval csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202)) //successful if 200 or 202 status code\n\n//or\n\nval csvTxnsWithExistingHttpConnection = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.webhook(\"my_http\", \"http://localhost:8080/finished\")) //use connection configuration from existing 'my_http' connection definition\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\" #by default, GET request successful when 200 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\"\nmethod: \"GET\"\nstatusCodes: [200, 202] #successful if 200 or 202 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"my_http\" #use connection configuration from existing 'my_http' connection definition\nurl: \"http://localhost:8080/finished\"\n
"},{"location":"setup/validation/#file-exists","title":"File exists","text":"JavaScalaYAML var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().file(\"/tmp/json\"));\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validationWait(waitCondition.file(\"/tmp/json\"))\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npath: \"/tmp/json\"\n
"},{"location":"setup/validation/#report","title":"Report","text":"Once run, it will produce a report like this.
"},{"location":"setup/generator/count/","title":"Record Count","text":"There are options related to controlling the number of records generated that can help in generating the scenarios or data required.
"},{"location":"setup/generator/count/#record-count_1","title":"Record Count","text":"Record count is the simplest as you define the total number of records you require for that particular step. For example, in the below step, it will generate 1000 records for the CSV file
JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\n
"},{"location":"setup/generator/count/#generated-count","title":"Generated Count","text":"As like most things in Data Caterer, the count can be generated based on some metadata. For example, if I wanted to generate between 1000 and 2000 records, I could define that by the below configuration:
JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator().min(1000).max(2000));\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator.min(1000).max(2000))\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\ngenerator:\ntype: \"random\"\noptions:\nmin: 1000\nmax: 2000\n
"},{"location":"setup/generator/count/#per-column-count","title":"Per Column Count","text":"When defining a per column count, this allows you to generate records \"per set of columns\". This means that for a given set of columns, it will generate a particular amount of records per combination of values for those columns.
One example of this would be when generating transactions relating to a customer, a customer may be defined by columns account_id, name
. A number of transactions would be generated per account_id, name
.
You can also use a combination of the above two methods to generate the number of records per column.
"},{"location":"setup/generator/count/#records","title":"Records","text":"When defining a base number of records within the perColumn
configuration, it translates to creating (count.records * count.recordsPerColumn)
records. This is a fixed number of records that will be generated each time, with no variation between runs.
In the example below, we have count.records = 1000
and count.recordsPerColumn = 2
, which means that 1000 * 2 = 2000
records will be generated in total.
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\nrecords: 2\ncolumnNames:\n- \"account_id\"\n- \"name\"\n
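The arithmetic described in this section can be sketched in plain Java. This only illustrates the multiplication and is not Data Caterer's implementation; the class and method names (RecordCountSketch, totalRecords) are hypothetical.

```java
public class RecordCountSketch {
    // Hypothetical helper: total = base records * fixed records per column set.
    static long totalRecords(long records, long recordsPerColumn) {
        return records * recordsPerColumn;
    }

    public static void main(String[] args) {
        // count.records = 1000 and count.recordsPerColumn = 2
        System.out.println(totalRecords(1000, 2)); // prints 2000
    }
}
```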
"},{"location":"setup/generator/count/#generated","title":"Generated","text":"You can also define a generator for the count per column. This can be used in scenarios where you want a variable number of records per set of columns.
In the example below, it will generate between (count.records * count.perColumnGenerator.generator.min) = (1000 * 1) = 1000
and (count.records * count.perColumnGenerator.generator.max) = (1000 * 2) = 2000
records.
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumnGenerator(generator().min(1).max(2), \"account_id\", \"name\")\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumnGenerator(generator.min(1).max(2), \"account_id\", \"name\")\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\ncolumnNames:\n- \"account_id\"\n- \"name\"\ngenerator:\ntype: \"random\"\noptions:\nmin: 1\nmax: 2\n
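The range described in this section can be worked through in plain Java. This is a sketch of the bounds only, not Data Caterer's implementation; the class and method names (GeneratedCountSketch, minTotal, maxTotal) are hypothetical.

```java
public class GeneratedCountSketch {
    // Hypothetical helper: lower bound when every column set gets the minimum.
    static long minTotal(long records, long generatorMin) {
        return records * generatorMin;
    }

    // Hypothetical helper: upper bound when every column set gets the maximum.
    static long maxTotal(long records, long generatorMax) {
        return records * generatorMax;
    }

    public static void main(String[] args) {
        // count.records = 1000 with a per column generator of min 1 and max 2
        System.out.println(minTotal(1000, 1)); // prints 1000
        System.out.println(maxTotal(1000, 2)); // prints 2000
    }
}
```

The actual total for a run falls somewhere between the two bounds, since each set of column values draws its own count.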
"},{"location":"setup/generator/data-generator/","title":"Data Generators","text":""},{"location":"setup/generator/data-generator/#data-types","title":"Data Types","text":"Below is a list of all supported data types for generating data:
Data Type Spark Data Type Options Description string StringTypeminLen, maxLen, expression, enableNull
integer IntegerType min, max, stddev, mean
long LongType min, max, stddev, mean
short ShortType min, max, stddev, mean
decimal(precision, scale) DecimalType(precision, scale) min, max, stddev, mean
double DoubleType min, max, stddev, mean
float FloatType min, max, stddev, mean
date DateType min, max, enableNull
timestamp TimestampType min, max, enableNull
boolean BooleanType binary BinaryType minLen, maxLen, enableNull
byte ByteType array ArrayType arrayMinLen, arrayMaxLen, arrayType
_ StructType Implicitly supported when a schema is defined for a field"},{"location":"setup/generator/data-generator/#options","title":"Options","text":""},{"location":"setup/generator/data-generator/#all-data-types","title":"All data types","text":"Some options are available to use for all types of data generators. Below is the list along with example and descriptions:
Option Default Example DescriptionenableEdgeCase
false enableEdgeCase: \"true\"
Enable/disable generated data to contain edge cases based on the data type. For example, integer data type has edge cases of (Int.MaxValue, Int.MinValue and 0) edgeCaseProbability
0.0 edgeCaseProb: \"0.1\"
Probability of generating a random edge case value if enableEdgeCase
is true isUnique
false isUnique: \"true\"
Enable/disable generated data to be unique for that column. Errors will be thrown when it is unable to generate unique data seed
seed: \"1\"
Defines the random seed for generating data for that particular column. It will override any seed defined at a global level sql
sql: \"CASE WHEN amount < 10 THEN true ELSE false END\"
Define any SQL statement for generating that column's value. Computation occurs after all non-SQL fields are generated. This means any columns used in the SQL cannot be based on other SQL generated columns. The data type of the value generated by the SQL needs to match the data type defined for the field"},{"location":"setup/generator/data-generator/#string","title":"String","text":"Option Default Example Description minLen
1 minLen: \"2\"
Ensures that all generated strings have at least length minLen
maxLen
10 maxLen: \"15\"
Ensures that all generated strings have at most length maxLen
expression
expression: \"#{Name.name}\"
expression:\"#{Address.city}/#{Demographic.maritalStatus}\"
Will generate a string based on the faker expression provided. All possible faker expressions can be found here. The expression has to be in the format #{<faker expression name>}
enableNull
false enableNull: \"true\"
Enable/disable null values being generated nullProbability
0.0 nullProb: \"0.1\"
Probability to generate null values if enableNull
is true Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", \"\u0130yi g\u00fcnler\", \"\u0421\u043f\u0430\u0441\u0438\u0431\u043e\", \"\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1\", \"\u0635\u0628\u0627\u062d \u0627\u0644\u062e\u064a\u0631\", \" F\u00f6rl\u00e5t\", \"\u4f60\u597d\u5417\", \"Nh\u00e0 v\u1ec7 sinh \u1edf \u0111\u00e2u\", \"\u3053\u3093\u306b\u3061\u306f\", \"\u0928\u092e\u0938\u094d\u0924\u0947\", \"\u0532\u0561\u0580\u0565\u0582\", \"\u0417\u0434\u0440\u0430\u0432\u0435\u0439\u0442\u0435\")
"},{"location":"setup/generator/data-generator/#sample","title":"Sample","text":"JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield()\n.name(\"name\")\n.type(StringType.instance())\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield\n.name(\"name\")\n.`type`(StringType)\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\nschema:\nfields:\n- name: \"name\"\ntype: \"string\"\ngenerator:\noptions:\nexpression: \"#{Name.name}\"\nenableNull: true\nnullProb: 0.1\nminLen: 4\nmaxLen: 20\n
"},{"location":"setup/generator/data-generator/#numeric","title":"Numeric","text":"For all the numeric data types, there are 4 options to choose from: min, max and maxValue. Generally speaking, you only need to define one of min or minValue, similarly with max or maxValue. The reason why there are 2 options for each is because of when metadata is automatically gathered, we gather the statistics of the observed min and max values. Also, it will attempt to gather any restriction on the min or max value as defined by the data source (i.e. max value as per database type).
"},{"location":"setup/generator/data-generator/#integerlongshort","title":"Integer/Long/Short","text":"Option Default Example Descriptionmin
0 min: \"2\"
Ensures that all generated values are greater than or equal to min
max
1000 max: \"25\"
Ensures that all generated values are less than or equal to max
stddev
1.0 stddev: \"2.0\"
Standard deviation for normal distributed data mean
max - min
mean: \"5.0\"
Mean for normal distributed data Edge cases Integer: (2147483647, -2147483648, 0) Edge cases Long: (9223372036854775807, -9223372036854775808, 0) Edge cases Short: (32767, -32768, 0)
"},{"location":"setup/generator/data-generator/#sample_1","title":"Sample","text":"JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"year\").type(IntegerType.instance()).min(2020).max(2023),\nfield().name(\"customer_id\").type(LongType.instance()),\nfield().name(\"customer_group\").type(ShortType.instance())\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"year\").`type`(IntegerType).min(2020).max(2023),\nfield.name(\"customer_id\").`type`(LongType),\nfield.name(\"customer_group\").`type`(ShortType)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"year\"\ntype: \"integer\"\ngenerator:\noptions:\nmin: 2020\nmax: 2023\n- name: \"customer_id\"\ntype: \"long\"\n- name: \"customer_group\"\ntype: \"short\"\n
"},{"location":"setup/generator/data-generator/#decimal","title":"Decimal","text":"Option Default Example Description min
0 min: \"2\"
Ensures that all generated values are greater than or equal to min
max
1000 max: \"25\"
Ensures that all generated values are less than or equal to max
stddev
1.0 stddev: \"2.0\"
Standard deviation for normal distributed data mean
max - min
mean: \"5.0\"
Mean for normal distributed data numericPrecision
10 precision: \"25\"
The maximum number of digits numericScale
0 scale: \"25\"
The number of digits on the right side of the decimal point (has to be less than or equal to precision) Edge cases Decimal: (9223372036854775807, -9223372036854775808, 0)
"},{"location":"setup/generator/data-generator/#sample_2","title":"Sample","text":"JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"balance\").type(DecimalType.instance()).numericPrecision(10).numericScale(5)\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"balance\").`type`(DecimalType).numericPrecision(10).numericScale(5)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"balance\"\ntype: \"decimal\"\ngenerator:\noptions:\nprecision: 10\nscale: 5\n
"},{"location":"setup/generator/data-generator/#doublefloat","title":"Double/Float","text":"Option Default Example Description min
0.0 min: \"2.1\"
Ensures that all generated values are greater than or equal to min
max
1000.0 max: \"25.9\"
Ensures that all generated values are less than or equal to max
stddev
1.0 stddev: \"2.0\"
Standard deviation for normal distributed data mean
max - min
mean: \"5.0\"
Mean for normal distributed data Edge cases Double: (+infinity, 1.7976931348623157e+308, 4.9e-324, 0.0, -0.0, -1.7976931348623157e+308, -infinity, NaN) Edge cases Float: (+infinity, 3.4028235e+38, 1.4e-45, 0.0, -0.0, -3.4028235e+38, -infinity, NaN)
"},{"location":"setup/generator/data-generator/#sample_3","title":"Sample","text":"JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"amount\").type(DoubleType.instance())\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"amount\").`type`(DoubleType)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"amount\"\ntype: \"double\"\n
"},{"location":"setup/generator/data-generator/#date","title":"Date","text":"Option Default Example Description min
now() - 365 days min: \"2023-01-31\"
Ensures that all generated values are greater than or equal to min
max
now() max: \"2023-12-31\"
Ensures that all generated values are less than or equal to max
enableNull
false enableNull: \"true\"
Enable/disable null values being generated nullProbability
0.0 nullProb: \"0.1\"
Probability to generate null values if enableNull
is true Edge cases: (0001-01-01, 1582-10-15, 1970-01-01, 9999-12-31) (reference)
"},{"location":"setup/generator/data-generator/#sample_4","title":"Sample","text":"JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_date\").type(DateType.instance()).min(java.sql.Date.valueOf(\"2020-01-01\"))\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_date\").`type`(DateType).min(java.sql.Date.valueOf(\"2020-01-01\"))\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"created_date\"\ntype: \"date\"\ngenerator:\noptions:\nmin: \"2020-01-01\"\n
"},{"location":"setup/generator/data-generator/#timestamp","title":"Timestamp","text":"Option Default Example Description min
now() - 365 days min: \"2023-01-31 23:10:10\"
Ensures that all generated values are greater than or equal to min
max
now() max: \"2023-12-31 23:10:10\"
Ensures that all generated values are less than or equal to max
enableNull
false enableNull: \"true\"
Enable/disable null values being generated nullProbability
0.0 nullProb: \"0.1\"
Probability to generate null values if enableNull
is true Edge cases: (0001-01-01 00:00:00, 1582-10-15 23:59:59, 1970-01-01 00:00:00, 9999-12-31 23:59:59)
"},{"location":"setup/generator/data-generator/#sample_5","title":"Sample","text":"JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_time\").type(TimestampType.instance()).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_time\").`type`(TimestampType).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"created_time\"\ntype: \"timestamp\"\ngenerator:\noptions:\nmin: \"2020-01-01 00:00:00\"\n
"},{"location":"setup/generator/data-generator/#binary","title":"Binary","text":"Option Default Example Description minLen
1 minLen: \"2\"
Ensures that all generated array of bytes have at least length minLen
maxLen
20 maxLen: \"15\"
Ensures that all generated array of bytes have at most length maxLen
enableNull
false enableNull: \"true\"
Enable/disable null values being generated nullProbability
0.0 nullProb: \"0.1\"
Probability to generate null values if enableNull
is true Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", -128, 127)
"},{"location":"setup/generator/data-generator/#sample_6","title":"Sample","text":"JavaScalaYAMLcsv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"payload\").type(BinaryType.instance())\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"payload\").`type`(BinaryType)\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"payload\"\ntype: \"binary\"\n
"},{"location":"setup/generator/data-generator/#array","title":"Array","text":"Option Default Example Description arrayMinLen
0 arrayMinLen: \"2\"
Ensures that all generated arrays have at least length arrayMinLen
arrayMaxLen
5 arrayMaxLen: \"15\"
Ensures that all generated arrays have at most length arrayMaxLen
arrayType
arrayType: \"double\"
Inner data type of the array. Optional when using Java/Scala API. Allows for nested data types to be defined like struct enableNull
false enableNull: \"true\"
Enable/disable null values being generated nullProbability
0.0 nullProb: \"0.1\"
Probability to generate null values if enableNull
is true"},{"location":"setup/generator/data-generator/#sample_7","title":"Sample","text":"JavaScalaYAML csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"last_5_amounts\").type(ArrayType.instance()).arrayType(\"double\")\n);\n
csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"last_5_amounts\").`type`(ArrayType).arrayType(\"double\")\n)\n
name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"last_5_amounts\"\ntype: \"array<double>\"\n
"},{"location":"setup/guide/","title":"Guides","text":"Below are a list of guides you can follow to create your data generation for your use case.
For any of the paid tier guides, you can use the trial version of the app to try it out. Details on how to get the trial can be found here.
"},{"location":"setup/guide/#scenarios","title":"Scenarios","text":"The execution of the data generator is based on the concept of plans and tasks. A plan represent the set of tasks that need to be executed, along with other information that spans across tasks, such as foreign keys between data sources. A task represent the component(s) of a data source and its associated metadata so that it understands what the data should look like and how many steps (sub data sources) there are (i.e. tables in a database, topics in Kafka). Tasks can define one or more steps.
"},{"location":"setup/guide/#plan","title":"Plan","text":""},{"location":"setup/guide/#foreign-keys","title":"Foreign Keys","text":"Define foreign keys across data sources in your plan to ensure generated data can match Link to associated task 1 Link to associated task 2
"},{"location":"setup/guide/#task","title":"Task","text":"Data Source Type Data Source Sample Task Notes Database Postgres Sample Database MySQL Sample Database Cassandra Sample File CSV Sample File JSON Sample Contains nested schemas and use of SQL for generated values File Parquet Sample Partition by year column Kafka Kafka Sample Specific base schema to be used, define headers, key, value, etc. JMS Solace Sample JSON formatted message HTTP PUT Sample JSON formatted PUT body"},{"location":"setup/guide/#configuration","title":"Configuration","text":"Basic configuration
"},{"location":"setup/guide/#docker-compose","title":"Docker-compose","text":"To see how it runs against different data sources, you can run using docker-compose
and set DATA_SOURCE
like below
./gradlew build\ncd docker\nDATA_SOURCE=postgres docker-compose up -d datacaterer\n
You can set it to one of the following:
Info
Writing data to Cassandra is a paid feature. Try the free trial here.
Creating a data generator for Cassandra. You will build a Docker image that will be able to populate data in Cassandra for the tables you configure.
"},{"location":"setup/guide/data-source/cassandra/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
If you already have a Cassandra instance running, you can skip to this step.
"},{"location":"setup/guide/data-source/cassandra/#cassandra-setup","title":"Cassandra Setup","text":"Next, let's make sure you have an instance of Cassandra up and running in your local environment. This will make it easy for us to iterate and check our changes.
cd docker\ndocker-compose up -d cassandra\n
"},{"location":"setup/guide/data-source/cassandra/#permissions","title":"Permissions","text":"Let's make a new user that has the required permissions needed to push data into the Cassandra tables we want.
CQL Permission StatementsGRANT INSERT ON <schema>.<table> TO data_caterer_user;\n
The following permissions are required when enabling configuration.enableGeneratePlanAndTasks(true)
as it will gather metadata information about tables and columns from the below tables.
GRANT SELECT ON system_schema.tables TO data_caterer_user;\nGRANT SELECT ON system_schema.columns TO data_caterer_user;\n
"},{"location":"setup/guide/data-source/cassandra/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedCassandraJavaPlan.java
src/main/scala/io/github/datacatering/plan/MyAdvancedCassandraPlan.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedCassandraJavaPlan extends PlanRun {\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n\nclass MyAdvancedCassandraPlan extends PlanRun {\n}\n
This class is where we define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.
"},{"location":"setup/guide/data-source/cassandra/#connection-configuration","title":"Connection Configuration","text":"Within our class, we can start by defining the connection properties to connect to Cassandra.
JavaScalavar accountTask = cassandra(\n\"customer_cassandra\", //name\n\"localhost:9042\", //url\n\"cassandra\", //username\n\"cassandra\", //password\nMap.of() //optional additional connection options\n)\n
Additional options such as SSL configuration, etc can be found here.
val accountTask = cassandra(\n\"customer_cassandra\", //name\n\"localhost:9042\", //url\n\"cassandra\", //username\n\"cassandra\", //password\nMap() //optional additional connection options\n)\n
Additional options such as SSL configuration, etc can be found here.
"},{"location":"setup/guide/data-source/cassandra/#schema","title":"Schema","text":"Let's create a task for inserting data into the account.accounts
and account.account_status_history
tables as defined under docker/data/cql/customer.cql
. These tables should already be set up for you if you followed this step. We can check if the tables are set up already via the following command:
docker exec docker-cassandraserver-1 cqlsh -e 'describe account.accounts; describe account.account_status_history;'\n
Here we should see some output that looks like the below. This tells us what schema we need to follow when generating data. We define that schema alongside any metadata that constrains the possible values of the generated data.
CREATE TABLE account.accounts (\naccount_id text PRIMARY KEY,\n amount double,\n created_by text,\n name text,\n open_time timestamp,\n status text\n)...\n\nCREATE TABLE account.account_status_history (\naccount_id text,\n eod_date date,\n status text,\n updated_by text,\n updated_time timestamp,\n PRIMARY KEY (account_id, eod_date)\n)...\n
Trimming the connection details to work with the docker-compose Cassandra, we have a base Cassandra connection to define the table and schema required. Let's define each field along with its corresponding data type. You will notice that the text
fields do not have a data type defined. This is because the default data type is StringType
which corresponds to text
in Cassandra.
{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n}\n
val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n
"},{"location":"setup/guide/data-source/cassandra/#field-metadata","title":"Field Metadata","text":"We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata that add guidelines that the data generator will understand when generating data.
"},{"location":"setup/guide/data-source/cassandra/#account_id","title":"account_id","text":"account_id
follows a particular pattern where it starts with ACC
and has 8 digits after it. This can be defined via a regex like below. We also mark it as the primary key to ensure that unique values are generated.
field().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n
field.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n
"},{"location":"setup/guide/data-source/cassandra/#amount","title":"amount","text":"amount
values shouldn't be too large, so we can define a min and max to keep the generated numbers between 1
and 1000
.
field().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\n
field.name(\"amount\").`type`(DoubleType).min(1).max(1000),\n
"},{"location":"setup/guide/data-source/cassandra/#name","title":"name","text":"name
is a string that also follows a certain pattern, so we could define a regex, but here we will leverage the DataFaker library and create an expression
to generate a realistic-looking name. All possible faker expressions can be found here.
field().name(\"name\").expression(\"#{Name.name}\"),\n
field.name(\"name\").expression(\"#{Name.name}\"),\n
"},{"location":"setup/guide/data-source/cassandra/#open_time","title":"open_time","text":"open_time
is a timestamp that we want to have a value greater than a specific date. We can define a min date by using java.sql.Date
like below.
field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
"},{"location":"setup/guide/data-source/cassandra/#status","title":"status","text":"status
is a field that can only take one of four values: open, closed, suspended or pending
.
field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
"},{"location":"setup/guide/data-source/cassandra/#created_by","title":"created_by","text":"created_by
is a field that is based on the status
field where it follows the logic: if status is open or closed, then it is created_by eod else created_by event
. This can be achieved by defining a SQL expression like below.
field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n
field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n
Putting all the fields together, our class should now look like this.
JavaScalavar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n
val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n
"},{"location":"setup/guide/data-source/cassandra/#additional-configurations","title":"Additional Configurations","text":"At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure any unique fields will have unique values generated.
JavaScalavar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n
"},{"location":"setup/guide/data-source/cassandra/#execute","title":"Execute","text":"To tell Data Caterer that we want to run with the configurations along with the accountTask
, we have to call execute
. So our full plan run will look like this.
public class MyAdvancedCassandraJavaPlan extends PlanRun {\n{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n
class MyAdvancedCassandraPlan extends PlanRun {\nval accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n
"},{"location":"setup/guide/data-source/cassandra/#run","title":"Run","text":"Now we can run via the script ./run.sh
that is in the top level directory of the data-caterer-example
to run the class we just created.
./run.sh\n#input class MyAdvancedCassandraJavaPlan or MyAdvancedCassandraPlan\n#after completing\ndocker exec docker-cassandraserver-1 cqlsh -e 'select count(1) from account.accounts;select * from account.accounts limit 10;'\n
Your output should look like this.
count\n-------\n 1000\n\n(1 rows)\n\nWarnings :\nAggregation query used without partition key\n\n\n account_id | amount | created_by | name | open_time | status\n-------------+-----------+--------------------+------------------------+---------------------------------+-----------\n ACC13554145 | 917.00418 | zb CVvbBTTzitjo5fK | Jan Sanford I | 2023-06-21 21:50:10.463000+0000 | suspended\n ACC19154140 | 46.99177 | VH88H9 | Clyde Bailey PhD | 2023-07-18 11:33:03.675000+0000 | open\n ACC50587836 | 774.9872 | GENANwPm t | Sang Monahan | 2023-03-21 00:16:53.308000+0000 | closed\n ACC67619387 | 452.86706 | 5msTpcBLStTH | Jewell Gerlach | 2022-10-18 19:13:07.606000+0000 | suspended\n ACC69889784 | 14.69298 | WDmOh7NT | Dale Schulist | 2022-10-25 12:10:52.239000+0000 | suspended\n ACC41977254 | 51.26492 | J8jAKzvj2 | Norma Nienow | 2023-08-19 18:54:39.195000+0000 | suspended\n ACC40932912 | 349.68067 | SLcJgKZdLp5ALMyg | Vincenzo Considine III | 2023-05-16 00:22:45.991000+0000 | closed\n ACC20642011 | 658.40713 | clyZRD4fI | Lannie McLaughlin DDS | 2023-05-11 23:14:30.249000+0000 | open\n ACC74962085 | 970.98218 | ZLETTSnj4NpD | Ima Jerde DVM | 2023-05-07 10:01:56.218000+0000 | pending\n ACC72848439 | 481.64267 | cc | Kyla Deckow DDS | 2023-08-16 13:28:23.362000+0000 | suspended\n\n(10 rows)\n
Also check the HTML report, found at docker/sample/report/index.html
, that gets generated to get an overview of what was executed.
Info
Generating data based on OpenAPI/Swagger document and pushing to HTTP endpoint is a paid feature. Try the free trial here.
Creating a data generator based on an OpenAPI/Swagger document.
"},{"location":"setup/guide/data-source/http/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/data-source/http/#http-setup","title":"HTTP Setup","text":"We will be using the http-bin docker image to help simulate a service with HTTP endpoints.
Start it via:
cd docker\ndocker-compose up -d http\ndocker ps\n
"},{"location":"setup/guide/data-source/http/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedHttpJavaPlanRun.java
src/main/scala/io/github/datacatering/plan/MyAdvancedHttpPlanRun.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedHttpJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedHttpPlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n
We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.
"},{"location":"setup/guide/data-source/http/#schema","title":"Schema","text":"We can point the schema of a data source to a OpenAPI/Swagger document or URL. For this example, we will use the OpenAPI document found under docker/mount/http/petstore.json
in the data-caterer-example repo. This is a simplified version of the original OpenAPI spec that can be found here.
We have kept the following endpoints to test out:
var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count().records(2));\n
val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count.records(2))\n
The above defines that the schema will come from an OpenAPI document found at the path defined. It will then generate 2 requests per request method and endpoint combination.
"},{"location":"setup/guide/data-source/http/#run","title":"Run","text":"Let's try run and see what happens.
cd ..\n./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\n#after completing\ndocker logs -f docker-http-1\n
It should look something like this.
172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DeXQxFUHVja+EYm%26limit%3D33895 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DSXaFvAqwYGF%26tags%3DjdNRFONA%26limit%3D40975 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/kbH8D7rDuq HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/REsa0tnu7dvekGDvxR HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/EqrOr1dHFfKUjWb HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/7WG7JHPaNxP HTTP/1.1 200 Host: host.docker.internal}\n
Looks like we have some data now. But we can do better and add some enhancements to it.
"},{"location":"setup/guide/data-source/http/#foreign-keys","title":"Foreign keys","text":"The four different requests that get sent could have the same id
passed across to each of them if we define a foreign key relationship. This brings the flow closer to a real-life scenario, as pets get created and queried by a particular id
value. We note that the id
value is first used when a pet is created in the body of the POST request. Then it gets used as a path parameter in the DELETE and GET requests.
To link them all together, we must follow a particular pattern when referring to request body, query parameter or path parameter columns.
HTTP Type | Column Prefix | Example
Request Body | bodyContent | bodyContent.id
Path Parameter | pathParam | pathParamid
Query Parameter | queryParam | queryParamid
Header | header | headerContent_Type
Also note that when creating a foreign field definition for an HTTP data source, to refer to a specific endpoint and method, we have to follow the pattern {http method}{http path}
. For example, POST/pets
. Let's apply this knowledge to link all the id
values together.
var myPlan = plan().addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"), //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n);\n\nexecute(myPlan, conf, httpTask);\n
val myPlan = plan.addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"), //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n)\n\nexecute(myPlan, conf, httpTask)\n
Let's test it out by running it again.
./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n
172.21.0.1 [06/Nov/2023:01:33:59 +0000] GET /anything/pets?limit%3D45971 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:00 +0000] GET /anything/pets?limit%3D62015 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:04 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:05 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n
Now we have the same id
values being produced across the POST, DELETE and GET requests! What if we knew that the id
values should follow a particular pattern?
So given that we have defined a foreign key where the root of the foreign key values is from the POST request, we can update the metadata of the id
column for the POST request and it will propagate to the other endpoints as well. Given the id
column is a nested column as noted in the foreign key, we can alter its metadata via the following:
var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field().name(\"bodyContent\").schema(field().name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count().records(2));\n
val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field.name(\"bodyContent\").schema(field.name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count.records(2))\n
We first get the column bodyContent
, then get the nested schema and get the column id
and add metadata stating that id
should follow the pattern ID[0-9]{8}
.
Let's try running it again, and hopefully we should see some proper ID values.
./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n
172.21.0.1 [06/Nov/2023:01:45:45 +0000] GET /anything/pets?tags%3D10fWnNoDz%26limit%3D66804 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:46 +0000] GET /anything/pets?tags%3DhyO6mI8LZUUpS HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:50 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:51 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n
Great! Now we have replicated a production-like flow of HTTP requests.
"},{"location":"setup/guide/data-source/http/#ordering","title":"Ordering","text":"If you wanted to change the ordering of the requests, you can alter the order from within the OpenAPI/Swagger document. This is particularly useful when you want to simulate the same flow that users would take when utilising your application (i.e. create account, query account, update account).
"},{"location":"setup/guide/data-source/http/#rows-per-second","title":"Rows per second","text":"By default, Data Caterer will push requests per method and endpoint at a rate of around 5 requests per second. If you want to alter this value, you can do so via the below configuration. The lowest supported requests per second is 1.
JavaScalaimport io.github.datacatering.datacaterer.api.model.Constants;\n\n...\nvar httpTask = http(\"my_http\", Map.of(Constants.ROWS_PER_SECOND(), \"1\"))\n...\n
import io.github.datacatering.datacaterer.api.model.Constants.ROWS_PER_SECOND\n\n...\nval httpTask = http(\"my_http\", options = Map(ROWS_PER_SECOND -> \"1\"))\n...\n
Check out the full example under AdvancedHttpPlanRun
in the example repo.
Info
Writing data to Kafka is a paid feature. Try the free trial here.
Creating a data generator for Kafka. You will build a Docker image that will be able to populate data in Kafka for the topics you configure.
"},{"location":"setup/guide/data-source/kafka/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
If you already have a Kafka instance running, you can skip to this step.
"},{"location":"setup/guide/data-source/kafka/#kafka-setup","title":"Kafka Setup","text":"Next, let's make sure you have an instance of Kafka up and running in your local environment. This will make it easy for us to iterate and check our changes.
cd docker\ndocker-compose up -d kafka\n
"},{"location":"setup/guide/data-source/kafka/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedKafkaJavaPlan.java
src/main/scala/io/github/datacatering/plan/MyAdvancedKafkaPlan.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedKafkaJavaPlan extends PlanRun {\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n\nclass MyAdvancedKafkaPlan extends PlanRun {\n}\n
This class is where we define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.
"},{"location":"setup/guide/data-source/kafka/#connection-configuration","title":"Connection Configuration","text":"Within our class, we can start by defining the connection properties to connect to Kafka.
JavaScalavar accountTask = kafka(\n\"my_kafka\", //name\n\"localhost:9092\", //url\nMap.of() //optional additional connection options\n);\n
Additional options can be found here.
val accountTask = kafka(\n\"my_kafka\", //name\n\"localhost:9092\", //url\nMap() //optional additional connection options\n)\n
Additional options can be found here.
"},{"location":"setup/guide/data-source/kafka/#schema","title":"Schema","text":"Let's create a task for inserting data into the account-topic
that is already defined under docker/data/kafka/setup_kafka.sh
. This topic should already be set up for you if you followed this step. We can check if the topic is set up already via the following command:
docker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n
Trimming the connection details to work with the docker-compose Kafka, we have a base Kafka connection to define the topic we will publish to. Let's define each field along with its corresponding data type. You will notice that the text
fields do not have a data type defined. This is because the default data type is StringType
.
{\nvar kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield().name(\"key\").sql(\"content.account_id\"),\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()), can define partition here\nfield().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n),\nfield().name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield().name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n)\n}\n
val kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield.name(\"key\").sql(\"content.account_id\"),\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").type(IntegerType), can define partition here\nfield.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n | NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n | NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\nfield.name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield.name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n)\n
"},{"location":"setup/guide/data-source/kafka/#fields","title":"Fields","text":"The schema defined for Kafka has a format that needs to be followed as noted above. Specifically, the required fields are: - value
Whilst the other fields are optional:
headers
follows a particular pattern where it is of type array<struct<key: string,value: binary>>
. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that in the value
part, it refers to content.account_id
where content
is another field defined at the top level of the schema. This allows you to reference other values that have already been generated.
field().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n
field.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n | NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n | NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n |)\"\"\".stripMargin\n)\n
"},{"location":"setup/guide/data-source/kafka/#transactions","title":"transactions","text":"transactions
is an array that contains an inner structure of txn_date
and amount
. The size of the array generated can be controlled via arrayMinLength
and arrayMaxLength
.
field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n
field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n
"},{"location":"setup/guide/data-source/kafka/#details","title":"details","text":"details
is another example of a nested schema structure where it also has a nested structure itself in updated_by
. One thing to note here is the first_txn_date
field has a reference to the content.transactions
array where it will sort the array by txn_date
and get the first element.
field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n
field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n
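As a rough illustration of what ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1) computes, the standalone Java sketch below sorts a list of dates and takes the first one. It is independent of Data Caterer and Spark; FirstTxnDateSketch and firstTxnDate are hypothetical names, and returning an empty Optional for an empty list is an assumption made for this sketch rather than a statement about how the SQL engine behaves.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Optional;

class FirstTxnDateSketch {
    // Sort ascending, then take the first element (SQL arrays are 1-indexed, hence the "1").
    static Optional<LocalDate> firstTxnDate(List<LocalDate> txnDates) {
        return txnDates.stream().sorted().findFirst();
    }

    public static void main(String[] args) {
        List<LocalDate> txnDates = List.of(
            LocalDate.parse("2021-10-29"),
            LocalDate.parse("2021-01-22"),
            LocalDate.parse("2021-10-29")
        );
        System.out.println(firstTxnDate(txnDates));
    }
}
```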
"},{"location":"setup/guide/data-source/kafka/#additional-configurations","title":"Additional Configurations","text":"At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations.
var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n
"},{"location":"setup/guide/data-source/kafka/#execute","title":"Execute","text":"To tell Data Caterer that we want to run with the configurations along with the kafkaTask
, we have to call execute
.
Now we can run via the script ./run.sh
that is in the top level directory of the data-caterer-example
to run the class we just created.
./run.sh\n#input class AdvancedKafkaJavaPlanRun or AdvancedKafkaPlanRun\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic --from-beginning\n
Your output should look like this.
{\"account_id\":\"ACC56292178\",\"year\":2022,\"amount\":18338.627721151555,\"details\":{\"name\":\"Isaias Reilly\",\"first_txn_date\":\"2021-01-22\",\"updated_by\":{\"user\":\"FgYXbKDWdhHVc3\",\"time\":\"2022-12-30T13:49:07.309Z\"}},\"transactions\":[{\"txn_date\":\"2021-01-22\",\"amount\":30556.52125487579},{\"txn_date\":\"2021-10-29\",\"amount\":39372.302259554635},{\"txn_date\":\"2021-10-29\",\"amount\":61887.31389495968}]}\n{\"account_id\":\"ACC37729457\",\"year\":2022,\"amount\":96885.31758764731,\"details\":{\"name\":\"Randell Witting\",\"first_txn_date\":\"2021-06-30\",\"updated_by\":{\"user\":\"HCKYEBHN8AJ3TB\",\"time\":\"2022-12-02T02:05:01.144Z\"}},\"transactions\":[{\"txn_date\":\"2021-06-30\",\"amount\":98042.09647765031},{\"txn_date\":\"2021-10-06\",\"amount\":41191.43564742036},{\"txn_date\":\"2021-11-16\",\"amount\":78852.08184809204},{\"txn_date\":\"2021-10-09\",\"amount\":13747.157653571106}]}\n{\"account_id\":\"ACC23127317\",\"year\":2023,\"amount\":81164.49304198896,\"details\":{\"name\":\"Jed Wisozk\",\"updated_by\":{\"user\":\"9MBFZZ\",\"time\":\"2023-07-12T05:56:52.397Z\"}},\"transactions\":[]}\n
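If you want to spot-check consumed messages against the constraints in the schema, a small standalone check like the one below may help. It only re-applies two rules from the schema, the account_id regex ACC[0-9]{8} and the year range 2021 to 2023, to values you pass in. The class and method names are hypothetical, and JSON parsing is deliberately left out to keep the sketch dependency-free.

```java
import java.util.regex.Pattern;

class KafkaOutputCheck {
    // Same pattern as the regex defined on the account_id field.
    private static final Pattern ACCOUNT_ID = Pattern.compile("ACC[0-9]{8}");

    static boolean validAccountId(String accountId) {
        return ACCOUNT_ID.matcher(accountId).matches();
    }

    // Mirrors .min(2021).max(2023) on the year field (both bounds inclusive here).
    static boolean validYear(int year) {
        return year >= 2021 && year <= 2023;
    }

    public static void main(String[] args) {
        // Values copied from the first sample record above.
        System.out.println(validAccountId("ACC56292178") && validYear(2022));
    }
}
```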
Also check the HTML report, found at docker/sample/report/index.html
, that gets generated to get an overview of what was executed.
Info
Generating data based on an external metadata source is a paid feature. Try the free trial here.
Creating a data generator for Postgres tables and a CSV file based on metadata stored in Marquez (which follows the OpenLineage API).
"},{"location":"setup/guide/data-source/marquez-metadata-source/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/data-source/marquez-metadata-source/#marquez-setup","title":"Marquez Setup","text":"You can follow the README found here to help with setting up Marquez in your local environment. This comes with an instance of Postgres which we will also be using as a data store for generated data.
The command that was run for this example to help with setup of dummy data was ./docker/up.sh -a 5001 -m 5002 --seed
.
Check that the following url shows some data like below once you click on food_delivery
from the ns
drop down in the top right corner.
Since we will also be using the Marquez Postgres instance as a data source, we will set up a separate database to store the generated data in via:
docker exec marquez-db psql -Upostgres -c 'CREATE DATABASE food_delivery'\n
"},{"location":"setup/guide/data-source/marquez-metadata-source/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedMetadataSourceJavaPlanRun.java
src/main/scala/io/github/datacatering/plan/MyAdvancedMetadataSourcePlanRun.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n
We enable plan and task generation so that we can read metadata from external sources, and we save the reports under a folder we can easily access.
"},{"location":"setup/guide/data-source/marquez-metadata-source/#schema","title":"Schema","text":"We can point the schema of a data source to our Marquez instance. For the Postgres data source, we will point to a namespace
, which in Marquez or OpenLineage, represents a set of datasets. For the CSV data source, we will point to a specific namespace
and dataset
.
var csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map.of(\"saveMode\", \"overwrite\", \"header\", \"true\"))\n.schema(metadataSource().marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count().records(10));\n
val csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map(\"saveMode\" -> \"overwrite\", \"header\" -> \"true\"))\n.schema(metadataSource.marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count.records(10))\n
The above defines that the schema will come from Marquez, which is a type of metadata source that contains information about schemas. Specifically, it points to the food_delivery
namespace and public.delivery_7_days
dataset to retrieve the schema information from.
var postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\", Map.of())\n.schema(metadataSource().marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count().records(10));\n
val postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\")\n.schema(metadataSource.marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count.records(10))\n
We have now configured this Postgres data source to generate data for the multiple schemas defined under the food_delivery
namespace. Also note that we are using database food_delivery
in Postgres to push our generated data to, and we have set the number of records per sub data source (in this case, per table) to be 10.
Let's try running it and see what happens.
cd ..\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\n#after completing\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n
It should look something like this.
order_id | order_placed_on | order_dispatched_on | order_delivered_on | customer_email | customer_address | menu_id | restaurant_id | restaurant_address\n | menu_item_id | category_id | discount_id | city_id | driver_id\n----------+-------------------------+-------------------------+-------------------------+--------------------------------+----------------------------------------------------------+---------+---------------+---------------------------------------------------------------\n---+--------------+-------------+-------------+---------+-----------\n 38736 | 2023-02-05 06:05:23.755 | 2023-09-08 04:29:10.878 | 2023-09-03 23:58:34.285 | april.skiles@hotmail.com | 5018 Lang Dam, Gaylordfurt, MO 35172 | 59841 | 30971 | Suite 439 51366 Bartoletti Plains, West Lashawndamouth, CA 242\n42 | 55697 | 36370 | 21574 | 88022 | 16569\n4376 | 2022-12-19 14:39:53.442 | 2023-08-30 07:40:06.948 | 2023-03-15 20:38:26.11 | adelina.balistreri@hotmail.com | Apt. 340 9146 Novella Motorway, East Troyhaven, UT 34773 | 66195 | 42765 | Suite 670 8956 Rob Fork, Rennershire, CA 04524\n| 26516 | 81335 | 87615 | 27433 | 45649\n11083 | 2022-10-30 12:46:38.692 | 2023-06-02 13:05:52.493 | 2022-11-27 18:38:07.873 | johnny.gleason@gmail.com | Apt. 385 99701 Lemke Place, New Irvin, RI 73305 | 66427 | 44438 | 1309 Danny Cape, Weimanntown, AL 15865\n| 41686 | 36508 | 34498 | 24191 | 92405\n58759 | 2023-07-26 14:32:30.883 | 2022-12-25 11:04:08.561 | 2023-04-21 17:43:05.86 | isabelle.ohara@hotmail.com | 2225 Evie Lane, South Ardella, SD 90805 | 27106 | 25287 | Suite 678 3731 Dovie Park, Port Luigi, ID 08250\n| 94205 | 66207 | 81051 | 52553 | 27483\n
You can also try querying some other tables. Let's also check what is in the CSV file.
$ head docker/sample/csv/part-0000*\nmenu_item_id,category_id,discount_id,city_id,driver_id,order_id,order_placed_on,order_dispatched_on,order_delivered_on,customer_email,customer_address,menu_id,restaurant_id,restaurant_address\n72248,37098,80135,45888,5036,11090,2023-09-20T05:33:08.036+08:00,2023-05-16T23:10:57.119+08:00,2023-05-01T22:02:23.272+08:00,demetrice.rohan@hotmail.com,\"406 Harmony Rue, Wisozkburgh, MD 12282\",33762,9042,\"Apt. 751 0796 Ellan Flats, Lake Chetville, WI 81957\"\n41644,40029,48565,83373,89919,58359,2023-04-18T06:28:26.194+08:00,2022-10-15T18:17:48.998+08:00,2023-02-06T17:02:04.104+08:00,joannie.okuneva@yahoo.com,\"Suite 889 022 Susan Lane, Zemlakport, OR 56996\",27467,6216,\"Suite 016 286 Derick Grove, Dooleytown, NY 14664\"\n49299,53699,79675,40821,61764,72234,2023-07-16T21:33:48.739+08:00,2023-02-14T21:23:10.265+08:00,2023-09-18T02:08:51.433+08:00,ina.heller@yahoo.com,\"Suite 600 86844 Heller Island, New Celestinestad, DE 42622\",48002,12462,\"5418 Okuneva Mountain, East Blairchester, MN 04060\"\n83197,86141,11085,29944,81164,65382,2023-01-20T06:08:25.981+08:00,2023-01-11T13:24:32.968+08:00,2023-09-09T02:30:16.890+08:00,lakisha.bashirian@yahoo.com,\"Suite 938 534 Theodore Lock, Port Caitlynland, LA 67308\",69109,47727,\"4464 Stewart Tunnel, Marguritemouth, AR 56791\"\n
Looks like we have some data now. But we can do better and add some enhancements to it.
What if we wanted the same records in Postgres public.delivery_7_days
to also show up in the CSV file? That's where we can use a foreign key definition.
We can take a look at the report (under docker/sample/report/index.html
) to see what we need to do to create the foreign key. From the overview, you should see under Tasks
there is a my_postgres
task which has food_delivery_public.delivery_7_days
as a step. Click on the link for food_delivery_public.delivery_7_days
and it will take us to a page where we can find out about the columns used in this table. Click on the Fields
button on the far right to see.
We can copy all or a subset of the fields that we want matched across the CSV file and Postgres. For this example, we will take all the fields.
var myPlan = plan().addForeignKeyRelationship(\npostgresTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask);\n
val foreignCols = List(\"order_id\", \"order_placed_on\", \"order_dispatched_on\", \"order_delivered_on\", \"customer_email\",\n\"customer_address\", \"menu_id\", \"restaurant_id\", \"restaurant_address\", \"menu_item_id\", \"category_id\", \"discount_id\",\n\"city_id\", \"driver_id\")\n\nval myPlan = plan.addForeignKeyRelationships(\ncsvTask, foreignCols,\nList(foreignField(postgresTask, \"food_delivery_public.delivery_7_days\", foreignCols))\n)\n\nval conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask)\n
Notice how we have defined the csvTask
and foreignCols
as the main foreign key but for postgresTask
, we had to define it as a foreignField
. This is because postgresTask
has multiple tables within it, and we only want to define our foreign key with respect to the public.delivery_7_days
table. We use the step name (can be seen from the report) to specify the table to target.
To test this out, we will truncate the public.delivery_7_days
table in Postgres first, and then try running again.
docker exec marquez-db psql -Upostgres -d food_delivery -c 'TRUNCATE public.delivery_7_days'\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n
order_id | order_placed_on | order_dispatched_on | order_delivered_on | customer_email |\ncustomer_address | menu_id | restaurant_id | restaurant_address | menu\n_item_id | category_id | discount_id | city_id | driver_id\n----------+-------------------------+-------------------------+-------------------------+------------------------------+-------------\n--------------------------------------------+---------+---------------+--------------------------------------------------------+-----\n---------+-------------+-------------+---------+-----------\n 53333 | 2022-10-15 08:40:23.394 | 2023-01-23 09:42:48.397 | 2023-08-12 08:50:52.397 | normand.aufderhar@gmail.com | Apt. 036 449\n27 Wilderman Forge, Marvinchester, CT 15952 | 40412 | 70130 | Suite 146 98176 Schaden Village, Grahammouth, SD 12354 |\n90141 | 44210 | 83966 | 78614 | 77449\n
Let's grab the first email from the Postgres table and check whether the same record exists in the CSV file.
$ cat docker/sample/csv/part-0000* | grep normand.aufderhar\n90141,44210,83966,78614,77449,53333,2022-10-15T08:40:23.394+08:00,2023-01-23T09:42:48.397+08:00,2023-08-12T08:50:52.397+08:00,normand.aufderhar@gmail.com,\"Apt. 036 44927 Wilderman Forge, Marvinchester, CT 15952\",40412,70130,\"Suite 146 98176 Schaden Village, Grahammouth, SD 12354\"\n
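Because the CSV columns are written in a different order from the Postgres table, comparing rows by eye is error-prone. The standalone Java sketch below looks up a CSV value by column name so it can be compared with the Postgres row. It is a minimal illustration only: CsvForeignKeyCheck, splitCsv and valueOf are hypothetical names, and the splitter handles double-quoted fields containing commas but not escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;

class CsvForeignKeyCheck {
    // Minimal CSV splitter: commas inside double quotes are kept, quote characters are dropped.
    static List<String> splitCsv(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // Returns the value of the named column in a row, using the header line to find its position.
    static String valueOf(String headerLine, String rowLine, String column) {
        int index = splitCsv(headerLine).indexOf(column);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown column: " + column);
        }
        return splitCsv(rowLine).get(index);
    }

    public static void main(String[] args) {
        // Shortened header and row, based on the CSV output shown in this guide.
        String header = "menu_item_id,order_id,customer_email,customer_address";
        String row = "90141,53333,normand.aufderhar@gmail.com,"
            + "\"Apt. 036 44927 Wilderman Forge, Marvinchester, CT 15952\"";
        System.out.println(valueOf(header, row, "order_id"));
    }
}
```

Comparing the looked-up order_id and customer_email with the values returned by the SELECT query gives a quick confirmation that the foreign key relationship was applied.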
Great! Now we have the ability to get schema information from an external source, add our own foreign keys and generate data.
Check out the full example under AdvancedMetadataSourcePlanRun
in the example repo.
Info
Generating data based on an external metadata source is a paid feature. Try the free trial here.
Creating a data generator for a JSON file based on metadata stored in OpenMetadata.
"},{"location":"setup/guide/data-source/open-metadata-source/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/data-source/open-metadata-source/#openmetadata-setup","title":"OpenMetadata Setup","text":"You can follow the local docker setup found here to help with setting up OpenMetadata in your local environment.
If that page becomes outdated or the link doesn't work, below are the commands I used to run it:
mkdir openmetadata-docker && cd openmetadata-docker\ncurl -sL https://github.com/open-metadata/OpenMetadata/releases/download/1.2.0-release/docker-compose.yml > docker-compose.yml\ndocker compose -f docker-compose.yml up --detach\n
Check that the following URL works and log in with admin:admin
. Then you should see some data like below:
Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedOpenMetadataSourceJavaPlanRun.java
src/main/scala/io/github/datacatering/plan/MyAdvancedOpenMetadataSourcePlanRun.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedOpenMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedOpenMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n
We enable plan and task generation so that we can read metadata from external sources, and we save the reports under a folder we can easily access.
"},{"location":"setup/guide/data-source/open-metadata-source/#schema","title":"Schema","text":"We can point the schema of a data source to our OpenMetadata instance. We will use a JSON data source so that we can show how nested data types are handled and how we could customise it.
"},{"location":"setup/guide/data-source/open-metadata-source/#single-schema","title":"Single Schema","text":"JavaScalaimport io.github.datacatering.datacaterer.api.model.Constants;\n...\n\nvar jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(metadataSource().openMetadataJava(\n\"http://localhost:8585/api\", //url\nConstants.OPEN_METADATA_AUTH_TYPE_OPEN_METADATA(), //auth type\nMap.of( //additional options (including auth options)\nConstants.OPEN_METADATA_JWT_TOKEN(), \"abc123\", //get from settings/bots/ingestion-bot\nConstants.OPEN_METADATA_TABLE_FQN(), \"sample_data.ecommerce_db.shopify.raw_customer\" //table fully qualified name\n)\n))\n.count(count().records(10));\n
import io.github.datacatering.datacaterer.api.model.Constants.{OPEN_METADATA_AUTH_TYPE_OPEN_METADATA, OPEN_METADATA_JWT_TOKEN, OPEN_METADATA_TABLE_FQN, SAVE_MODE}\n...\n\nval jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -> \"overwrite\"))\n.schema(metadataSource.openMetadata(\n\"http://localhost:8585/api\", //url\nOPEN_METADATA_AUTH_TYPE_OPEN_METADATA, //auth type\nMap( //additional options (including auth options)\nOPEN_METADATA_JWT_TOKEN -> \"abc123\", //get from settings/bots/ingestion-bot\nOPEN_METADATA_TABLE_FQN -> \"sample_data.ecommerce_db.shopify.raw_customer\" //table fully qualified name\n)\n))\n.count(count.records(10))\n
The above defines that the schema will come from OpenMetadata, which is a type of metadata source that contains information about schemas. Specifically, it points to the sample_data.ecommerce_db.shopify.raw_customer
table. You can check out the schema here to see what it looks like.
Let's try running it and see what happens.
cd ..\n./run.sh\n#input class MyAdvancedOpenMetadataSourceJavaPlanRun or MyAdvancedOpenMetadataSourcePlanRun\n#after completing\ncat docker/sample/json/part-00000-*\n
It should look something like this.
{\n\"comments\": \"Mh6jqpD5e4M\",\n\"creditcard\": \"6771839575926717\",\n\"membership\": \"Za3wCQUl9E EJj712\",\n\"orders\": [\n{\n\"product_id\": \"Aa6NG0hxfHVq\",\n\"price\": 16139,\n\"onsale\": false,\n\"tax\": 58134,\n\"weight\": 40734,\n\"others\": 45813,\n\"vendor\": \"Kh\"\n},\n{\n\"product_id\": \"zbHBY \",\n\"price\": 17903,\n\"onsale\": false,\n\"tax\": 39526,\n\"weight\": 9346,\n\"others\": 52035,\n\"vendor\": \"jbkbnXAa\"\n},\n{\n\"product_id\": \"5qs3gakppd7Nw5\",\n\"price\": 48731,\n\"onsale\": true,\n\"tax\": 81105,\n\"weight\": 2004,\n\"others\": 20465,\n\"vendor\": \"nozCDMSXRPH Ev\"\n},\n{\n\"product_id\": \"CA6h17ANRwvb\",\n\"price\": 62102,\n\"onsale\": true,\n\"tax\": 96601,\n\"weight\": 78849,\n\"others\": 79453,\n\"vendor\": \" ihVXEJz7E2EFS\"\n}\n],\n\"platform\": \"GLt9\",\n\"preference\": {\n\"key\": \"nmPmsPjg C\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Loren Bechtelar\",\n\"street_address\": \"Suite 526 293 Rohan Road, Wunschshire, NE 25532\",\n\"city\": \"South Norrisland\",\n\"postcode\": \"56863\"\n}\n],\n\"shipping_date\": \"2022-11-03\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"lance.murphy\",\n\"name\": \"Zane Brakus DVM\",\n\"sex\": \"7HcAaPiO\",\n\"address\": \"594 Loida Haven, Gilland, MA 26071\",\n\"mail\": \"Un3fhbvK2rEbenIYdnq\",\n\"birthdate\": \"2023-01-31\"\n}\n}\n
Looks like we have some data now. But we can do better and add some enhancements to it.
"},{"location":"setup/guide/data-source/open-metadata-source/#custom-metadata","title":"Custom metadata","text":"We can see from the data generated, that it isn't quite what we want. The metadata is not sufficient for us to produce production-like data yet. Let's try to add some enhancements to it.
Let's make the platform
field a choice field that is restricted to a fixed set of values, and restrict the nested field customer.sex
to a predefined set of values as well.
var jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield().name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield().name(\"customer\").schema(field().name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count().records(10));\n
val jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -> \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield.name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield.name(\"customer\").schema(field.name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count.records(10))\n
Let's test it out by running it again.
./run.sh\n#input class MyAdvancedOpenMetadataSourceJavaPlanRun or MyAdvancedOpenMetadataSourcePlanRun\ncat docker/sample/json/part-00000-*\n
{\n\"comments\": \"vqbPUm\",\n\"creditcard\": \"6304867705548636\",\n\"membership\": \"GZ1xOnpZSUOKN\",\n\"orders\": [\n{\n\"product_id\": \"rgOokDAv\",\n\"price\": 77367,\n\"onsale\": false,\n\"tax\": 61742,\n\"weight\": 87855,\n\"others\": 26857,\n\"vendor\": \"04XHR64ImMr9T\"\n}\n],\n\"platform\": \"mobile\",\n\"preference\": {\n\"key\": \"IB5vNdWka\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Isiah Bins\",\n\"street_address\": \"36512 Ross Spurs, Hillhaven, IA 18760\",\n\"city\": \"Averymouth\",\n\"postcode\": \"75818\"\n},\n{\n\"name\": \"Scott Prohaska\",\n\"street_address\": \"26573 Haley Ports, Dariusland, MS 90642\",\n\"city\": \"Ashantimouth\",\n\"postcode\": \"31792\"\n},\n{\n\"name\": \"Rudolf Stamm\",\n\"street_address\": \"Suite 878 0516 Danica Path, New Christiaport, ID 10525\",\n\"city\": \"Doreathaport\",\n\"postcode\": \"62497\"\n}\n],\n\"shipping_date\": \"2023-08-24\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"jolie.cremin\",\n\"name\": \"Fay Klein\",\n\"sex\": \"O\",\n\"address\": \"Apt. 174 5084 Volkman Creek, Hillborough, PA 61959\",\n\"mail\": \"BiTmzb7\",\n\"birthdate\": \"2023-04-07\"\n}\n}\n
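To confirm that the custom metadata took effect, you can check that the generated values fall within the sets passed to oneOf. The standalone Java sketch below does only that membership check. It is not part of the Data Caterer API, and OneOfCheck and its members are hypothetical names used for illustration.

```java
import java.util.Set;

class OneOfCheck {
    // The same value sets that were passed to oneOf in the schema above.
    static final Set<String> PLATFORMS = Set.of("website", "mobile");
    static final Set<String> SEXES = Set.of("M", "F", "O");

    static boolean valid(String platform, String sex) {
        return PLATFORMS.contains(platform) && SEXES.contains(sex);
    }

    public static void main(String[] args) {
        // Values from the second generated record, then from the first (pre-customisation) record.
        System.out.println(valid("mobile", "O"));
        System.out.println(valid("GLt9", "7HcAaPiO"));
    }
}
```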
Great! Now we have the ability to get schema information from an external source, add our own metadata and generate data.
"},{"location":"setup/guide/data-source/open-metadata-source/#data-validation","title":"Data validation","text":"Another aspect of OpenMetadata that can be leveraged is the definition of data quality rules. These rules can be incorporated into your Data Caterer job as well by enabling data validations via enableGenerateValidations
in configuration
.
var conf = configuration().enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(conf, jsonTask);\n
val conf = configuration.enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(conf, jsonTask)\n
Check out the full example under AdvancedOpenMetadataSourcePlanRun
in the example repo.
Info
Writing data to Solace is a paid feature. Try the free trial here.
Creating a data generator for Solace. You will build a Docker image that will be able to populate data in Solace for the queues/topics you configure.
"},{"location":"setup/guide/data-source/solace/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
If you already have a Solace instance running, you can skip to this step.
"},{"location":"setup/guide/data-source/solace/#solace-setup","title":"Solace Setup","text":"Next, let's make sure you have an instance of Solace up and running in your local environment. This will make it easy for us to iterate and check our changes.
cd docker\ndocker-compose up -d solace\n
Open up localhost:8080 and log in with admin:admin
and check there is the default
VPN like below. Notice there are 2 queues/topics created. If you do not see both, try running the script found under docker/data/solace/setup_solace.sh
and change the host
to localhost
.
Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedSolaceJavaPlan.java
src/main/scala/io/github/datacatering/plan/MyAdvancedSolacePlan.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedSolaceJavaPlan extends PlanRun {\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n\nclass MyAdvancedSolacePlan extends PlanRun {\n}\n
This class is where we define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.
"},{"location":"setup/guide/data-source/solace/#connection-configuration","title":"Connection Configuration","text":"Within our class, we can start by defining the connection properties to connect to Solace.
var accountTask = solace(\n\"my_solace\", //name\n\"smf://host.docker.internal:55554\", //url\nMap.of() //optional additional connection options\n);\n
Additional connection options can be found here.
val accountTask = solace(\n\"my_solace\", //name\n\"smf://host.docker.internal:55554\", //url\nMap() //optional additional connection options\n)\n
"},{"location":"setup/guide/data-source/solace/#schema","title":"Schema","text":"Let's create a task for inserting data into the rest_test_queue
or rest_test_topic
that is already created for us from this step.
Trimming the connection details to work with the docker-compose Solace, we have a base Solace connection to define the JNDI destination we will publish to. Let's define each field along with their corresponding data type. You will notice that the text
fields do not have a data type defined. This is because the default data type is StringType
.
{\nvar solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()), //can define message JMS priority here\nfield().name(\"headers\") //set message properties via headers field\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()).min(2021).max(2023),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n)\n)\n.count(count().records(10));\n}\n
val solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").`type`(IntegerType), //can define message JMS priority here\nfield.name(\"headers\") //set message properties via headers field\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n | NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n | NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\n).count(count.records(10))\n
"},{"location":"setup/guide/data-source/solace/#fields","title":"Fields","text":"The schema defined for Solace has a format that needs to be followed as noted above. Specifically, the required fields are:
The other fields are optional:
headers
follows a particular pattern: it is of type HeaderType.getType
, which behind the scenes translates to array<struct<key: string,value: binary>>
. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that in the value
part, it refers to content.account_id
where content
is another field defined at the top level of the schema. This allows you to reference other values that have already been generated.
field().name(\"headers\")\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n
field.name(\"headers\")\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n | NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n | NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n |)\"\"\".stripMargin\n)\n
"},{"location":"setup/guide/data-source/solace/#transactions","title":"transactions","text":"transactions
is an array that contains an inner structure of txn_date
and amount
. The size of the array generated can be controlled via arrayMinLength
and arrayMaxLength
.
field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n
field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n
"},{"location":"setup/guide/data-source/solace/#details","title":"details","text":"details
is another example of a nested schema structure where it also has a nested structure itself in updated_by
. One thing to note here is the first_txn_date
field has a reference to the content.transactions
array where it will sort the array by txn_date
and get the first element.
field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n
field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n
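To make the first_txn_date expression more concrete, the following is a minimal plain-Java sketch of what ELEMENT_AT(SORT_ARRAY(...), 1) computes: sort the dates ascending, then take the element at the 1-based index 1. The class and method names are hypothetical and the dates are made up for illustration. This is not Data Caterer code, and returning null for an empty list is an assumption of the sketch.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Data Caterer: mirrors the SQL expression
// ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1) in plain Java.
public class FirstTxnDateSketch {

    // Returns the earliest date, or null when there are no transactions
    // (the null result is an assumption made for this sketch).
    public static LocalDate firstTxnDate(List<LocalDate> txnDates) {
        if (txnDates.isEmpty()) {
            return null;
        }
        List<LocalDate> sorted = new ArrayList<>(txnDates);
        Collections.sort(sorted); // SORT_ARRAY sorts ascending by default
        return sorted.get(0);     // ELEMENT_AT is 1-based, so index 1 is the first element
    }

    public static void main(String[] args) {
        List<LocalDate> dates = List.of(
            LocalDate.of(2021, 10, 1),
            LocalDate.of(2021, 5, 19),
            LocalDate.of(2021, 12, 11));
        System.out.println(firstTxnDate(dates)); // prints 2021-05-19
    }
}
```

Because the expression reads from content.transactions, the transactions array has to be generated before first_txn_date can be derived from it.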
"},{"location":"setup/guide/data-source/solace/#additional-configurations","title":"Additional Configurations","text":"At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations.
var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n
"},{"location":"setup/guide/data-source/solace/#execute","title":"Execute","text":"To tell Data Caterer that we want to run with the configurations along with the kafkaTask
, we have to call execute
.
Now we can run via the script ./run.sh
that is in the top level directory of the data-caterer-example
to run the class we just created.
./run.sh\n#input class AdvancedSolaceJavaPlanRun or AdvancedSolacePlanRun\n#after completing, check http://localhost:8080 from browser\n
Your output should look like this.
Unfortunately, there is no easy way to see the message content. You can check the message content from your application or service that consumes these messages.
Also check the HTML report, found at docker/sample/report/index.html
, that gets generated to get an overview of what was executed. Or view the sample report found here.
Info
Auto data generation from data connection is a paid feature. Try the free trial here.
Creating a data generator based on only a data connection to Postgres.
"},{"location":"setup/guide/scenario/auto-generate-connection/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/auto-generate-connection/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedAutomatedJavaPlanRun.java
src/main/scala/io/github/datacatering/plan/MyAdvancedAutomatedPlanRun.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedAutomatedJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\") (1)\n.enableGeneratePlanAndTasks(true) (2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\") (3)\n.enableUniqueCheck(true) (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedAutomatedPlanRun extends PlanRun {\n\nval autoRun = configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\") (1)\n.enableGeneratePlanAndTasks(true) (2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\") (3)\n.enableUniqueCheck(true) (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n
In the above code, we note the following:
1. my_postgres is the name given to the Postgres connection
2. enableGeneratePlanAndTasks tells Data Caterer to go to my_postgres and generate data for all the tables found under the database customer (which is defined in the connection string)
3. generatedPlanAndTaskFolderPath defines where the metadata gathered from my_postgres is saved so that it can be reused later
4. enableUniqueCheck is set to true to ensure that generated data is unique based on primary key or foreign key definitions
Note
Unique check will only ensure generated data is unique. Any existing data in your data source is not taken into account, so generated data may fail to insert depending on the data source restrictions.
"},{"location":"setup/guide/scenario/auto-generate-connection/#postgres-setup","title":"Postgres Setup","text":"If you don't have your own Postgres up and running, you can set up and run an instance configured in the docker
folder via:
cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n
This will create the tables found under docker/data/sql/postgres/customer.sql
. You can change this file to contain your own tables. We can see there are 4 tables created for us, accounts, balances, transactions and mapping
.
Let's try running it.
cd ..\n./run.sh\n#input class MyAdvancedAutomatedJavaPlanRun or MyAdvancedAutomatedPlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1;'\n
It should look something like this.
id | account_number | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint | customer_id_decimal | customer_id_real | customer_id_double | open_date | open_timestamp | last_opened_time | payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H | SfA0eZJcTm | CuRw | 13 | 42 | 6041 | 76987.745612542900000000 | 91866.78 | 66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736 | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n
The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.
Also check the HTML report that gets generated under docker/sample/report/index.html
. You can see a summary of what was generated along with other metadata.
You can now look to play around with other tables or data sources and auto generate for them.
"},{"location":"setup/guide/scenario/auto-generate-connection/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/auto-generate-connection/#learn-from-existing-data","title":"Learn From Existing Data","text":"If you have any existing data within your data source, Data Caterer will gather metadata about the existing data to help guide it when generating new data. There are configurations that can help tune the metadata analysis found here.
"},{"location":"setup/guide/scenario/auto-generate-connection/#filter-out-schematables","title":"Filter Out Schema/Tables","text":"As part of your connection definition, you can define any schemas and/or tables you don't want to generate data for. In the example below, no data is generated for any tables under the history and audit schemas. In addition, any table named balances or transactions, in any schema, will not have data generated.
var autoRun = configuration()\n.postgres(\n\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap.of(\n\"filterOutSchema\", \"history, audit\",\n\"filterOutTable\", \"balances, transactions\")\n);\n
val autoRun = configuration\n.postgres(\n\"my_postgres\",\n\"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap(\n\"filterOutSchema\" -> \"history, audit\",\n\"filterOutTable\" -> \"balances, transactions\")\n)\n
"},{"location":"setup/guide/scenario/auto-generate-connection/#define-record-count","title":"Define record count","text":"You can control the record count per sub data source via numRecordsPerStep
.
var autoRun = configuration()\n...\n.numRecordsPerStep(100);\n\nexecute(autoRun);\n
val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n
"},{"location":"setup/guide/scenario/batch-and-event/","title":"Generate Batch and Event Data","text":"Info
Generating event data is a paid feature. Try the free trial here.
Creating a data generator for a Kafka topic with matching records in a CSV file.
"},{"location":"setup/guide/scenario/batch-and-event/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/batch-and-event/#kafka-setup","title":"Kafka Setup","text":"If you don't have your own Kafka up and running, you can set up and run an instance configured in the docker
folder via:
cd docker\ndocker-compose up -d kafka\ndocker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n
Let's create a task for inserting data into the account-topic
that is already defined under docker/data/kafka/setup_kafka.sh
.
Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedBatchEventJavaPlanRun.java
src/main/scala/io/github/datacatering/plan/MyAdvancedBatchEventPlanRun.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedBatchEventJavaPlanRun extends PlanRun {\n{\nvar kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedBatchEventPlanRun extends PlanRun {\nval kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n}\n
We will borrow the Kafka task that is already defined under the class AdvancedKafkaPlanRun
or AdvancedKafkaJavaPlanRun
. You can go through the Kafka guide here for more details.
Let us set up the corresponding schema for the CSV file where we want to match the values that are generated for the Kafka messages.
var kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n\nvar csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield().name(\"account_number\"),\nfield().name(\"year\"),\nfield().name(\"name\"),\nfield().name(\"payload\")\n);\n
val kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n\nval csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield.name(\"account_number\"),\nfield.name(\"year\"),\nfield.name(\"name\"),\nfield.name(\"payload\")\n)\n
This is a simple schema where we want to use the values and metadata that are already defined in the kafkaTask
to determine what the data will look like for the CSV file. Even if we defined some metadata here, it would be overridden when we define our foreign key relationships.
From the above CSV schema, we note the following against the Kafka schema:
- account_number in CSV needs to match with the account_id in Kafka
  - account_id is referred to in the key column as field.name(\"key\").sql(\"content.account_id\")
- year needs to match with content.year in Kafka, which is a nested field
  - tmp_year is used as an intermediate step and will not appear in the final output for the Kafka messages: field.name(\"tmp_year\").sql(\"content.year\").omit(true)
- name needs to match with content.details.name in Kafka, also a nested field
  - tmp_name takes the value of the nested field but is omitted from the output: field.name(\"tmp_name\").sql(\"content.details.name\").omit(true)
- payload represents the whole JSON message sent to Kafka, which matches the value column
Our foreign keys are therefore defined like below. Order is important when defining the list of columns. The index needs to match with the corresponding column in the other data source.
var myPlan = plan().addForeignKeyRelationship(\nkafkaTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(myPlan, conf, kafkaTask, csvTask);\n
val myPlan = plan.addForeignKeyRelationship(\nkafkaTask, List(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList(csvTask -> List(\"account_number\", \"year\", \"name\", \"payload\"))\n)\n\nval conf = configuration.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(myPlan, conf, kafkaTask, csvTask)\n
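Since the column lists in the foreign key definition are matched by position, it can help to picture the pairing explicitly. The sketch below is plain Java with hypothetical class and method names. It is not how Data Caterer implements foreign keys, only an illustration of index-based pairing using the column names from this guide.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Data Caterer code: pairs source and target
// columns by position, which is why the order of both lists matters.
public class ForeignKeyOrderSketch {

    public static Map<String, String> pairByIndex(List<String> source, List<String> target) {
        if (source.size() != target.size()) {
            throw new IllegalArgumentException("Column lists must have the same length");
        }
        // LinkedHashMap keeps the pairs in the order they were defined
        Map<String, String> pairs = new LinkedHashMap<>();
        for (int i = 0; i < source.size(); i++) {
            pairs.put(source.get(i), target.get(i));
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, String> pairs = pairByIndex(
            List.of("key", "tmp_year", "tmp_name", "value"),
            List.of("account_number", "year", "name", "payload"));
        // Prints one "kafka column -> csv column" line per pair
        pairs.forEach((kafkaCol, csvCol) -> System.out.println(kafkaCol + " -> " + csvCol));
    }
}
```

Swapping two entries in only one of the lists would pair, for example, tmp_year with name, so both lists should be reordered together.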
"},{"location":"setup/guide/scenario/batch-and-event/#run","title":"Run","text":"Let's try running it.
cd ..\n./run.sh\n#input class MyAdvancedBatchEventJavaPlanRun or MyAdvancedBatchEventPlanRun\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic --from-beginning\n
It should look something like this.
{\"account_id\":\"ACC03093143\",\"year\":2023,\"amount\":87990.37196728592,\"details\":{\"name\":\"Nadine Heidenreich Jr.\",\"first_txn_date\":\"2021-11-09\",\"updated_by\":{\"user\":\"YfEyJCe8ohrl0j IfyT\",\"time\":\"2022-09-26T20:47:53.404Z\"}},\"transactions\":[{\"txn_date\":\"2021-11-09\",\"amount\":97073.7914706189}]}\n{\"account_id\":\"ACC08764544\",\"year\":2021,\"amount\":28675.58758765888,\"details\":{\"name\":\"Delila Beer\",\"first_txn_date\":\"2021-05-19\",\"updated_by\":{\"user\":\"IzB5ksXu\",\"time\":\"2023-01-26T20:47:26.389Z\"}},\"transactions\":[{\"txn_date\":\"2021-10-01\",\"amount\":80995.23818711648},{\"txn_date\":\"2021-05-19\",\"amount\":92572.40049217848},{\"txn_date\":\"2021-12-11\",\"amount\":99398.79832225188}]}\n{\"account_id\":\"ACC62505420\",\"year\":2023,\"amount\":96125.3125884202,\"details\":{\"name\":\"Shawn Goodwin\",\"updated_by\":{\"user\":\"F3dqIvYp2pFtena4\",\"time\":\"2023-02-11T04:38:29.832Z\"}},\"transactions\":[]}\n
Let's also check if there is a corresponding record in the CSV file.
$ cat docker/sample/csv/account/part-0000* | grep ACC03093143\nACC03093143,2023,Nadine Heidenreich Jr.,\"{\\\"account_id\\\":\\\"ACC03093143\\\",\\\"year\\\":2023,\\\"amount\\\":87990.37196728592,\\\"details\\\":{\\\"name\\\":\\\"Nadine Heidenreich Jr.\\\",\\\"first_txn_date\\\":\\\"2021-11-09\\\",\\\"updated_by\\\":{\\\"user\\\":\\\"YfEyJCe8ohrl0j IfyT\\\",\\\"time\\\":\\\"2022-09-26T20:47:53.404Z\\\"}},\\\"transactions\\\":[{\\\"txn_date\\\":\\\"2021-11-09\\\",\\\"amount\\\":97073.7914706189}]}\"\n
Great! The account, year, name and payload all appear to match up.
"},{"location":"setup/guide/scenario/batch-and-event/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/batch-and-event/#order-of-execution","title":"Order of execution","text":"You may notice that the events are generated first, then the CSV file. This is because as part of the execute
function, we passed in the kafkaTask
first, before the csvTask
. You can change the order of execution by passing in csvTask
before kafkaTask
into the execute
function.
Creating a data validator for a JSON file.
"},{"location":"setup/guide/scenario/data-validation/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/data-validation/#data-setup","title":"Data Setup","text":"To aid in showing the functionality of data validations, we will first generate some data that our validations will run against. Run the below command and it will generate JSON files under docker/sample/json
folder.
./run.sh JsonPlan\n
"},{"location":"setup/guide/scenario/data-validation/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyValidationJavaPlan.java
src/main/scala/io/github/datacatering/plan/MyValidationPlan.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyValidationJavaPlan extends PlanRun {\n{\nvar jsonTask = json(\"my_json\", \"/opt/app/data/json\");\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false);\n\nexecute(config, jsonTask);\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyValidationPlan extends PlanRun {\nval jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false)\n\nexecute(config, jsonTask)\n}\n
As noted above, we create a JSON task that points to where the JSON data has been created at folder /opt/app/data/json
. We also note that enableValidation
is set to true
and enableGenerateData
to false
to tell Data Caterer that we only want to validate data.
For reference, the schema we will be validating against looks like the below.
.schema(\nfield.name(\"account_id\"),\n field.name(\"year\").`type`(IntegerType),\n field.name(\"balance\").`type`(DoubleType),\n field.name(\"date\").`type`(DateType),\n field.name(\"status\"),\n field.name(\"update_history\").`type`(ArrayType)\n.schema(\nfield.name(\"updated_time\").`type`(TimestampType),\n field.name(\"status\").oneOf(\"open\", \"closed\", \"pending\", \"suspended\"),\n ),\n field.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\n field.name(\"age\").`type`(IntegerType),\n field.name(\"city\").expression(\"#{Address.city}\")\n)\n)\n
"},{"location":"setup/guide/scenario/data-validation/#basic-validation","title":"Basic Validation","text":"Let's say our goal is to validate the customer_details.name
field to ensure it conforms to the regex pattern [A-Z][a-z]+ [A-Z][a-z]+
. Given the diversity in naming conventions across cultures and countries, variations such as middle names, suffixes, prefixes, or language-specific differences are tolerated to a certain extent. The validation considers an acceptable error threshold before marking it as failed.
- Field: customer_details.name
- Regex pattern: [A-Z][a-z]+ [A-Z][a-z]+
validation().col(\"customer_details.name\")\n.matches(\"[A-Z][a-z]+ [A-Z][a-z]+\")\n.errorThreshold(0.1) //<=10% failure rate is acceptable\n.description(\"Names generally follow the same pattern\"), //description to add context in report or other developers\n
validation.col(\"customer_details.name\")\n.matches(\"[A-Z][a-z]+ [A-Z][a-z]+\")\n.errorThreshold(0.1) //<=10% failure rate is acceptable\n.description(\"Names generally follow the same pattern\"), //description to add context in report or other developers\n
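One way to picture the error threshold is as a cap on the fraction of records that fail the check. The plain-Java sketch below uses hypothetical names and is not Data Caterer's implementation. It assumes the pattern has to match the whole value and that a fractional threshold is compared against failures divided by total records.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, not Data Caterer code: a regex validation that
// tolerates a fraction of failing records.
public class ErrorThresholdSketch {

    private static final Pattern NAME = Pattern.compile("[A-Z][a-z]+ [A-Z][a-z]+");

    // Passes when the fraction of non-matching names is at most errorThreshold.
    // Full-value matching is an assumption of this sketch.
    public static boolean passes(List<String> names, double errorThreshold) {
        if (names.isEmpty()) {
            return true; // nothing to fail
        }
        long failures = names.stream()
            .filter(name -> !NAME.matcher(name).matches())
            .count();
        return (double) failures / names.size() <= errorThreshold;
    }

    public static void main(String[] args) {
        // The last name carries a suffix, so it does not match the pattern
        List<String> names = List.of("Delila Beer", "Shawn Goodwin", "Nadine Heidenreich Jr.");
        System.out.println(passes(names, 0.1)); // false: 1 of 3 records fail
        System.out.println(passes(names, 0.5)); // true: 1 of 3 is within 50%
    }
}
```

With a threshold of 0.1, a single failure only becomes acceptable once there are at least ten records in total.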
"},{"location":"setup/guide/scenario/data-validation/#custom-validation","title":"Custom Validation","text":"There will be situations where you have a complex data setup and require your own custom logic for data validation. You can achieve this by setting your own SQL expression that returns a boolean value. In the example below, we check that each entry in the array update_history has an updated_time greater than a certain timestamp.
validation().expr(\"FORALL(update_history, x -> x.updated_time > TIMESTAMP('2022-01-01 00:00:00'))\"),\n
validation.expr(\"FORALL(update_history, x -> x.updated_time > TIMESTAMP('2022-01-01 00:00:00'))\"),\n
If you want to know what other SQL functions are available for you to use, check this page.
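For readers less familiar with FORALL, the expression above checks a predicate against every element of the array. A rough plain-Java equivalent, with hypothetical names and made-up timestamps, is sketched here. It is only an analogy for the SQL expression, not Data Caterer code.

```java
import java.time.LocalDateTime;
import java.util.List;

// Hypothetical illustration, not Data Caterer code: the Java analogue of
// FORALL(update_history, x -> x.updated_time > TIMESTAMP('2022-01-01 00:00:00')).
public class ForAllSketch {

    // True when every timestamp is strictly after the cutoff.
    // An empty list also yields true, as nothing violates the predicate.
    public static boolean allAfter(List<LocalDateTime> updatedTimes, LocalDateTime cutoff) {
        return updatedTimes.stream().allMatch(time -> time.isAfter(cutoff));
    }

    public static void main(String[] args) {
        LocalDateTime cutoff = LocalDateTime.of(2022, 1, 1, 0, 0);
        List<LocalDateTime> history = List.of(
            LocalDateTime.of(2022, 3, 1, 10, 0),
            LocalDateTime.of(2023, 1, 15, 8, 30));
        System.out.println(allAfter(history, cutoff)); // true
    }
}
```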
"},{"location":"setup/guide/scenario/data-validation/#group-by-validation","title":"Group By Validation","text":"There are scenarios where you want to validate against grouped values or the whole dataset via aggregations. An example would be validating that each customer's transactions sum is greater than 0.
"},{"location":"setup/guide/scenario/data-validation/#validation-criteria_1","title":"Validation Criteria","text":"Line 1: validation.groupBy().count().isEqual(100)
- groupBy(): Groups by the whole dataset.
- count(): Counts the number of elements in the dataset.
- isEqual(100): Checks if the count is equal to 100.
Line 2: validation.groupBy(\"account_id\").max(\"balance\").lessThan(900)
- groupBy(\"account_id\"): Groups the data based on the account_id field.
- max(\"balance\"): Calculates the maximum value of the balance field within each group.
- lessThan(900): Checks if the maximum balance in each group is less than 900.
The second validation passes when, for each account_id, the maximum balance is less than 900.
You can adjust the errorThreshold or the validation to suit your specific scenario. The full list of types of validations can be found here.
validation().groupBy().count().isEqual(100),\nvalidation().groupBy(\"account_id\").max(\"balance\").lessThan(900)\n
validation.groupBy().count().isEqual(100),\nvalidation.groupBy(\"account_id\").max(\"balance\").lessThan(900)\n
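The second validation can be read as: group the rows, aggregate each group, then check every aggregate. A minimal plain-Java sketch of that idea follows. The class name, method name and sample rows are hypothetical, and this is an illustration of the logic rather than Data Caterer's implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration, not Data Caterer code: the logic behind
// groupBy("account_id").max("balance").lessThan(900).
public class GroupByValidationSketch {

    // Each row is an (account_id, balance) pair.
    public static boolean maxBalanceBelow(List<Map.Entry<String, Double>> rows, double limit) {
        // Keep the largest balance seen for each account_id
        Map<String, Double> maxPerAccount = rows.stream()
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Double::max));
        // The validation holds only if every group's maximum is under the limit
        return maxPerAccount.values().stream().allMatch(max -> max < limit);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> rows = List.of(
            Map.entry("ACC00000001", 100.0),
            Map.entry("ACC00000001", 850.0),
            Map.entry("ACC00000002", 950.0));
        System.out.println(maxBalanceBelow(rows, 900)); // false: ACC00000002 reaches 950.0
    }
}
```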
"},{"location":"setup/guide/scenario/data-validation/#sample-validation","title":"Sample Validation","text":"To try to cover the majority of validation cases, the example below has been created.
var jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(\nvalidation().col(\"customer_details.name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.1).description(\"Names generally follow the same pattern\"),\nvalidation().col(\"date\").isNotNull().errorThreshold(10),\nvalidation().col(\"balance\").greaterThan(500),\nvalidation().expr(\"YEAR(date) == year\"),\nvalidation().col(\"status\").in(\"open\", \"closed\", \"pending\").errorThreshold(0.2).description(\"Could be new status introduced\"),\nvalidation().col(\"customer_details.age\").greaterThan(18),\nvalidation().expr(\"FORALL(update_history, x -> x.updated_time > TIMESTAMP('2022-01-01 00:00:00'))\"),\nvalidation().col(\"update_history\").greaterThanSize(2),\nvalidation().unique(\"account_id\"),\nvalidation().groupBy().count().isEqual(1000),\nvalidation().groupBy(\"account_id\").max(\"balance\").lessThan(900)\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false);\n\nexecute(config, jsonTask);\n
val jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(\nvalidation.col(\"customer_details.name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.1).description(\"Names generally follow the same pattern\"),\nvalidation.col(\"date\").isNotNull.errorThreshold(10),\nvalidation.col(\"balance\").greaterThan(500),\nvalidation.expr(\"YEAR(date) == year\"),\nvalidation.col(\"status\").in(\"open\", \"closed\", \"pending\").errorThreshold(0.2).description(\"Could be new status introduced\"),\nvalidation.col(\"customer_details.age\").greaterThan(18),\nvalidation.expr(\"FORALL(update_history, x -> x.updated_time > TIMESTAMP('2022-01-01 00:00:00'))\"),\nvalidation.col(\"update_history\").greaterThanSize(2),\nvalidation.unique(\"account_id\"),\nvalidation.groupBy().count().isEqual(1000),\nvalidation.groupBy(\"account_id\").max(\"balance\").lessThan(900)\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false)\n\nexecute(config, jsonTask)\n
"},{"location":"setup/guide/scenario/data-validation/#run","title":"Run","text":"Let's try running it.
./run.sh\n#input class MyValidationJavaPlan or MyValidationPlan\n#after completing, check report at docker/sample/report/index.html\n
It should look something like this.
Check the full example at ValidationPlanRun
inside the examples repo.
Info
Delete generated data is a paid feature. Try the free trial here.
Creating a data generator for Postgres and deleting the generated data after using it.
"},{"location":"setup/guide/scenario/delete-generated-data/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/delete-generated-data/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyAdvancedDeleteJavaPlanRun.java
src/main/scala/io/github/datacatering/plan/MyAdvancedDeletePlanRun.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedDeleteJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\") (1)\n.enableGeneratePlanAndTasks(true) (2)\n.enableRecordTracking(true) (3)\n.enableDeleteGeneratedRecords(false) (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\") (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\") (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedDeletePlanRun extends PlanRun {\n\nval autoRun = configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\") (1)\n.enableGeneratePlanAndTasks(true) (2)\n.enableRecordTracking(true) (3)\n.enableDeleteGeneratedRecords(false) (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\") (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\") (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n
In the above code we note the following:
1. my_postgres is the name given to the Postgres connection
2. enableGeneratePlanAndTasks is enabled to auto generate data for all tables under the customer database
3. enableRecordTracking is enabled to ensure that all generated records are tracked. This will get used when we want to delete data afterwards
4. enableDeleteGeneratedRecords is disabled for now. We want to see the generated data first and delete it sometime after
5. generatedPlanAndTaskFolderPath is the folder path where we save the metadata we have gathered from my_postgres
6. recordTrackingFolderPath is the folder path where record tracking is maintained. We need to persist this data to ensure it is still available when we want to delete data
If you don't have your own Postgres up and running, you can set up and run an instance configured in the docker folder via:
cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n
This will create the tables found under docker/data/sql/postgres/customer.sql
. You can change this file to contain your own tables. We can see there are 4 tables created for us, accounts, balances, transactions and mapping
.
Let's try running it.
cd ..\n./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\n
It should look something like this.
id | account_number | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint | customer_id_decimal | customer_id_real | customer_id_double | open_date | open_timestamp | last_opened_time | payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H | SfA0eZJcTm | CuRw | 13 | 42 | 6041 | 76987.745612542900000000 | 91866.78 | 66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736 | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n
The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.
Check the number of records via:
docker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n#open report under docker/sample/report/index.html\n
"},{"location":"setup/guide/scenario/delete-generated-data/#delete","title":"Delete","text":"We are now at a stage where we want to delete the data that was generated. All we need to do is flip two flags.
.enableDeleteGeneratedRecords(true)\n.enableGenerateData(false) //we need to explicitly disable generating data\n
Enable delete generated records and disable generating data.
Before we run again, let us insert a record manually to see if that data will survive after running the job to delete the generated data.
docker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"insert into account.accounts (account_number) values ('my_account_number')\"\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"select count(1) from account.accounts\"\n
We now should have 1001 records in our account.accounts
table. Let's delete the generated data now.
./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n
You should see that only 1 record is left, the one that we manually inserted. Great, now we can generate data reliably and also be able to clean it up.
"},{"location":"setup/guide/scenario/delete-generated-data/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/delete-generated-data/#one-class-for-generating-another-for-deleting","title":"One class for generating, another for deleting?","text":"Yes, this is possible. There are two requirements:
- the connection names used need to be the same across both classes
- recordTrackingFolderPath needs to be set to the same value in both classes
You can control the record count per sub data source via numRecordsPerStep
.
var autoRun = configuration()\n...\n.numRecordsPerStep(100);\n\nexecute(autoRun);\n
val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n
"},{"location":"setup/guide/scenario/first-data-generation/","title":"First Data Generation","text":"Creating a data generator for a CSV file.
"},{"location":"setup/guide/scenario/first-data-generation/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/first-data-generation/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyCsvJavaPlan.java
src/main/scala/io/github/datacatering/plan/MyCsvPlan.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n\npublic class MyCsvJavaPlan extends PlanRun {\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n\nclass MyCsvPlan extends PlanRun {\n}\n
This class is where we define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.
"},{"location":"setup/guide/scenario/first-data-generation/#connection-configuration","title":"Connection Configuration","text":"When dealing with CSV files, we need to define a path for our generated CSV files to be saved at, along with any other high level configurations.
csv(\n\"customer_accounts\", //name\n\"/opt/app/data/customer/account\", //path\nMap.of(\"header\", \"true\") //optional additional options\n)\n
Other additional options for CSV can be found here
csv(\n\"customer_accounts\", //name\n\"/opt/app/data/customer/account\", //path\nMap(\"header\" -> \"true\") //optional additional options\n)\n
Other additional options for CSV can be found here
"},{"location":"setup/guide/scenario/first-data-generation/#schema","title":"Schema","text":"Our CSV file that we generate should adhere to a defined schema where we can also define data types.
Let's define each field along with their corresponding data type. You will notice that the string
fields do not have a data type defined. This is because the default data type is StringType
.
var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"balance\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n
val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"balance\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#field-metadata","title":"Field Metadata","text":"We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata attributes that add guidelines that the data generator will understand when generating data.
"},{"location":"setup/guide/scenario/first-data-generation/#account_id","title":"account_id","text":"account_id
follows a particular pattern where it starts with ACC
and has 8 digits after it. This can be defined via a regex like below. We also mark the field as unique to ensure that unique values are generated.
field().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n
field.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n
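To see what the pattern accepts, here is a minimal plain-Java sketch that checks candidate values against the same regex. It does not use Data Caterer; the class name `AccountIdCheck` and the method `isValidAccountId` are hypothetical names chosen for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

public class AccountIdCheck {
    // Same pattern as the regex passed to the account_id field above.
    static final Pattern ACCOUNT_ID = Pattern.compile("ACC[0-9]{8}");

    // True only when the whole value matches the pattern.
    static boolean isValidAccountId(String value) {
        return ACCOUNT_ID.matcher(value).matches();
    }

    public static void main(String[] args) {
        // One value taken from the sample output in this guide, two that break the pattern.
        for (String id : List.of("ACC06192462", "ACC123", "XYZ06192462")) {
            System.out.println(id + " -> " + isValidAccountId(id));
        }
    }
}
```

A check like this can be handy when reviewing generated files by hand.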
"},{"location":"setup/guide/scenario/first-data-generation/#balance","title":"balance","text":"balance
should not be too large, so we define a min and max to keep the generated numbers between 1
and 1000
.
field().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\n
field.name(\"balance\").`type`(DoubleType).min(1).max(1000),\n
"},{"location":"setup/guide/scenario/first-data-generation/#name","title":"name","text":"name
is a string that also follows a certain pattern, so we could define a regex, but here we will leverage the DataFaker library and create an expression
to generate realistic-looking names. All possible faker expressions can be found here.
field().name(\"name\").expression(\"#{Name.name}\"),\n
field.name(\"name\").expression(\"#{Name.name}\"),\n
"},{"location":"setup/guide/scenario/first-data-generation/#open_time","title":"open_time","text":"open_time
is a timestamp that we want to have a value greater than a specific date. We can define a min date by using java.sql.Date
like below.
field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n
"},{"location":"setup/guide/scenario/first-data-generation/#status","title":"status","text":"status
is a field that can only take one of four values: open, closed, suspended or pending
.
field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n
"},{"location":"setup/guide/scenario/first-data-generation/#created_by","title":"created_by","text":"created_by
is a field that is based on the status
field where it follows the logic: if status is open or closed, then it is created_by eod else created_by event
. This can be achieved by defining a SQL expression like below.
field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n
field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n
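The SQL expression is evaluated by Data Caterer, but the rule itself is simple enough to restate in plain Java. The sketch below mirrors the CASE expression so the expected mapping is easy to see; `CreatedByRule` and `createdBy` are hypothetical names used only for this illustration.

```java
import java.util.List;
import java.util.Set;

public class CreatedByRule {
    // Statuses that map to 'eod' in the CASE expression above.
    static final Set<String> EOD_STATUSES = Set.of("open", "closed");

    // Plain-Java equivalent of:
    // CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END
    static String createdBy(String status) {
        return EOD_STATUSES.contains(status) ? "eod" : "event";
    }

    public static void main(String[] args) {
        for (String status : List.of("open", "closed", "suspended", "pending")) {
            System.out.println(status + " -> " + createdBy(status));
        }
    }
}
```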
Putting all the fields together, our class should now look like this.
var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n
val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#record-count","title":"Record Count","text":"We only want to generate 100 records, so that we can see what the output looks like. This is controlled at the accountTask
level like below. If you want to generate more records, set it to the value you want.
var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().records(100));\n
val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.records(100))\n
"},{"location":"setup/guide/scenario/first-data-generation/#additional-configurations","title":"Additional Configurations","text":"At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure any unique fields will have unique values generated.
var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n
val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n
"},{"location":"setup/guide/scenario/first-data-generation/#execute","title":"Execute","text":"To tell Data Caterer that we want to run with the configurations along with the accountTask
, we have to call execute
. So our full plan run will look like this.
public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n
class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n
"},{"location":"setup/guide/scenario/first-data-generation/#run","title":"Run","text":"Now we can run via the script ./run.sh
that is in the top level directory of the data-caterer-example
to run the class we just created.
./run.sh\n#input class MyCsvJavaPlan or MyCsvPlan\n#after completing\nhead docker/sample/customer/account/part-00000*\n
Your output should look like this.
account_id,balance,created_by,name,open_time,status\nACC06192462,853.9843359645766,eod,Hoyt Kertzmann MD,2023-07-22T11:17:01.713Z,closed\nACC15350419,632.5969895326234,eod,Dr. Claude White,2022-12-13T21:57:56.840Z,open\nACC25134369,592.0958847218986,eod,Fabian Rolfson,2023-04-26T04:54:41.068Z,open\nACC48021786,656.6413439322964,eod,Dewayne Stroman,2023-05-17T06:31:27.603Z,open\nACC26705211,447.2850352884595,event,Garrett Funk,2023-07-14T03:50:22.746Z,pending\nACC03150585,750.4568929015996,event,Natisha Reichel,2023-04-11T11:13:10.080Z,suspended\nACC29834210,686.4257811608622,event,Gisele Ondricka,2022-11-15T22:09:41.172Z,suspended\nACC39373863,583.5110618128994,event,Thaddeus Ortiz,2022-09-30T06:33:57.193Z,suspended\nACC39405798,989.2623959059525,eod,Shelby Reinger,2022-10-23T17:29:17.564Z,open\n
Also check the HTML report, found at docker/sample/report/index.html
, that gets generated to get an overview of what was executed.
Now that we have generated some accounts, let's also try to generate a set of transactions for those accounts in CSV format as well. The transactions could be in any other format, but to keep this simple, we will continue using CSV.
We can define our schema the same way along with any additional metadata.
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n);\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#records-per-column","title":"Records Per Column","text":"Usually, for a given account_id, full_name
, there should be multiple records for it as we want to simulate a customer having multiple transactions. We can achieve this through defining the number of records to generate in the count
function.
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n
"},{"location":"setup/guide/scenario/first-data-generation/#random-records-per-column","title":"Random Records Per Column","text":"Above, you will notice that we are generating 5 records per account_id, full_name
. This is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate this by defining a random number of records per column.
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n
Here we set the minimum number of records per column to be 0 and the maximum to 5.
"},{"location":"setup/guide/scenario/first-data-generation/#foreign-key","title":"Foreign Key","text":"In this scenario, we want the account_id
in account
to match the same column values in transaction
. We also want to match name
in account
to full_name
in transaction
. This can be done via plan configuration like below.
var myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"), //the task and columns we want linked\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\"))) //list of other tasks and their respective column names we want matched\n);\n
val myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"), //the task and columns we want linked\nList(transactionTask -> List(\"account_id\", \"full_name\")) //list of other tasks and their respective column names we want matched\n)\n
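Once the relationship is defined, every account_id, full_name pair in the transactions should also exist as an account_id, name pair in the accounts. The following plain-Java sketch shows one way to check that property on data you have already read into memory. It is not part of the Data Caterer API; `ForeignKeyCheck`, `key` and `allLinked` are hypothetical names.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ForeignKeyCheck {
    // Build a composite key from the two linked columns.
    static String key(String accountId, String name) {
        return accountId + "," + name;
    }

    // True when every transaction key also appears among the account keys.
    static boolean allLinked(List<String> accountKeys, List<String> transactionKeys) {
        Set<String> known = new HashSet<>(accountKeys);
        return known.containsAll(transactionKeys);
    }

    public static void main(String[] args) {
        // Values borrowed from the sample output in this guide.
        List<String> accounts = List.of(key("ACC29117767", "Willodean Sauer"));
        List<String> transactions = List.of(
            key("ACC29117767", "Willodean Sauer"),
            key("ACC29117767", "Willodean Sauer"));
        System.out.println(allLinked(accounts, transactions));
    }
}
```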
Now, stitching it all together for the execute
function, our final plan should look like this.
public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count().records(100));\n\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nvar myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\")))\n);\n\nexecute(myPlan, config, accountTask, transactionTask);\n}\n}\n
class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count.records(100))\n\nval transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nval myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"),\nList(transactionTask -> List(\"account_id\", \"full_name\"))\n)\n\nexecute(myPlan, config, accountTask, transactionTask)\n}\n
Let's try running it again.
#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyCsvJavaPlan or MyCsvPlan\n#after completing, let's pick an account and check the transactions for that account\naccount=$(tail -1 docker/sample/customer/account/part-00000* | awk -F \",\" '{print $1 \",\" $4}')\necho \"$account\"\ncat docker/sample/customer/transaction/part-00000* | grep \"$account\"\n
It should look something like this.
ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n
Congratulations! You have now made a data generator that has simulated a real world data scenario. You can check the DocumentationJavaPlanRun.java
or DocumentationPlanRun.scala
files as well to check that your plan is the same.
We can now look to consume this CSV data from a job or service. Usually, once we have consumed the data, we would also want to check and validate that our consumer has correctly ingested the data.
"},{"location":"setup/guide/scenario/first-data-generation/#validate","title":"Validate","text":"In this scenario, our consumer will read in the CSV file, do some transformations, and then save the data to Postgres. Let's try to configure data validations for the data that gets pushed into Postgres.
"},{"location":"setup/guide/scenario/first-data-generation/#postgres-setup","title":"Postgres setup","text":"First, we define our connection properties for Postgres. You can check out the full options available here.
var postgresValidateTask = postgres(\n\"my_postgres\", //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\", //username\n\"password\" //password\n).table(\"account\", \"transactions\");\n
val postgresValidateTask = postgres(\n\"my_postgres\", //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\", //username\n\"password\" //password\n).table(\"account\", \"transactions\")\n
We can connect and access the data inside the table account.transactions
. Now to define our data validations.
For full information about validation options and configurations, check here. Below, we have an example that should give you a good understanding of what validations are possible.
var postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation().col(\"account_id\").isNotNull(),\nvalidation().col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation().col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation().expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation().unique(\"account_id\", \"name\"),\nvalidation().groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n);\n
val postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation.col(\"account_id\").isNotNull,\nvalidation.col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation.col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation.expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation.unique(\"account_id\", \"name\"),\nvalidation.groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n)\n
"},{"location":"setup/guide/scenario/first-data-generation/#name_1","title":"name","text":"For all values in the name
column, we check if they match the regex [A-Z][a-z]+ [A-Z][a-z]+
. As we know in the real world, names do not always follow the same pattern, so we allow for an errorThreshold
before marking the validation as failed. Here, we define the errorThreshold
to be 0.2
, which means, if the error percentage is greater than 20%, then fail the validation. We also append on a helpful description so other developers/users can understand the context of the validation.
We check that all balance
values are greater than or equal to 0. This time, we have a slightly different errorThreshold
as it is set to 10
, which means, if the number of errors is greater than 10, then fail the validation.
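The two thresholds above behave differently: 0.2 is read as a share of records, while 10 is read as an absolute number of errors. The sketch below restates that in plain Java. The cut-off used here (values below 1 are treated as a fraction, anything else as a count) is an assumption made for illustration and is not confirmed by this guide; `ErrorThreshold` and `passes` are hypothetical names.

```java
public class ErrorThreshold {
    // Decide whether a validation passes for a given number of failing records.
    // Assumption: a threshold below 1 is a fraction of all records,
    // otherwise it is an absolute error count.
    static boolean passes(long errors, long total, double threshold) {
        if (threshold < 1) {
            // Fail only when the error share is greater than the threshold.
            return total == 0 || (double) errors / total <= threshold;
        }
        // Fail only when the error count is greater than the threshold.
        return errors <= threshold;
    }

    public static void main(String[] args) {
        System.out.println(passes(15, 100, 0.2)); // share of errors is 15%
        System.out.println(passes(25, 100, 0.2)); // share of errors is 25%
        System.out.println(passes(8, 100, 10));   // 8 errors against a count of 10
        System.out.println(passes(12, 100, 10));  // 12 errors against a count of 10
    }
}
```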
Sometimes, we may need to include the values of multiple columns to validate a certain condition. This is where we can use expr
to define a SQL expression that returns a boolean. In this scenario, we are checking if the status
column has value closed
, then the close_date
should be not null, otherwise, close_date
is null.
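The same condition can be written as a small plain-Java predicate, which may help when reasoning about which rows would fail. This only mirrors the logic of the SQL expression; `CloseDateRule` and `isValid` are hypothetical names, and the date value is a made-up example.

```java
public class CloseDateRule {
    // Mirrors: CASE WHEN status == 'closed' THEN isNotNull(close_date)
    //          ELSE isNull(close_date) END
    static boolean isValid(String status, String closeDate) {
        return "closed".equals(status) ? closeDate != null : closeDate == null;
    }

    public static void main(String[] args) {
        System.out.println(isValid("closed", "2023-05-14")); // closed with a close date
        System.out.println(isValid("closed", null));         // closed without a close date
        System.out.println(isValid("open", null));           // open without a close date
        System.out.println(isValid("open", "2023-05-14"));   // open with a close date
    }
}
```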
We check whether the combination of account_id
and name
are unique within the dataset. You can define one or more columns for unique
validations.
There may be some business rule that states the number of login_retry
should be less than 10 for each account. We can check this via a group by validation where we group by the account_id, name
, take the maximum value for login_retry
per account_id,name
combination, then check if it is less than 10.
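As a rough illustration of what the group by validation computes, the plain-Java sketch below groups rows by a key, takes the maximum login_retry per key, and checks that every maximum is below the limit. It does not use Data Caterer; `LoginRetryCheck` and `allBelow` are hypothetical names and the sample rows are invented.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LoginRetryCheck {
    // Each entry is (group key, login_retry). Returns true when the maximum
    // login_retry of every group is strictly less than the limit.
    static boolean allBelow(List<Map.Entry<String, Integer>> rows, int limit) {
        Map<String, Integer> maxPerKey = rows.stream()
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Integer::max));
        return maxPerKey.values().stream().allMatch(max -> max < limit);
    }

    public static void main(String[] args) {
        // Invented rows; the key stands in for the account_id, name combination.
        List<Map.Entry<String, Integer>> rows = List.of(
            Map.entry("ACC00000001,Jane Doe", 3),
            Map.entry("ACC00000001,Jane Doe", 7),
            Map.entry("ACC00000002,John Roe", 12));
        System.out.println(allBelow(rows, 10));
    }
}
```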
You can now look to play around with other configurations or data sources to meet your needs. Also, make sure to explore the docs further as it can guide you on what can be configured.
"},{"location":"setup/guide/scenario/records-per-column/","title":"Multiple Records Per Column","text":"Creating a data generator for a CSV file where there are multiple records per column values.
"},{"location":"setup/guide/scenario/records-per-column/#requirements","title":"Requirements","text":"First, we will clone the data-caterer-example repo which will already have the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git\n
"},{"location":"setup/guide/scenario/records-per-column/#plan-setup","title":"Plan Setup","text":"Create a new Java or Scala class.
src/main/java/io/github/datacatering/plan/MyMultipleRecordsPerColJavaPlan.java
src/main/scala/io/github/datacatering/plan/MyMultipleRecordsPerColPlan.scala
Make sure your class extends PlanRun
.
import io.github.datacatering.datacaterer.java.api.PlanRun;\n...\n\npublic class MyMultipleRecordsPerColJavaPlan extends PlanRun {\n{\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, transactionTask);\n}\n}\n
import io.github.datacatering.datacaterer.api.PlanRun\n...\n\nclass MyMultipleRecordsPerColPlan extends PlanRun {\n\nval transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"), field.name(\"full_name\").expression(\"#{Name.name}\"), field.name(\"amount\").`type`(DoubleType.instance).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType.instance).min(java.sql.Date.valueOf(\"2022-01-01\")), field.name(\"date\").`type`(DateType.instance).sql(\"DATE(time)\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(config, transactionTask)\n}\n
"},{"location":"setup/guide/scenario/records-per-column/#record-count","title":"Record Count","text":"By default, tasks will generate 1000 records. You can alter this value via the count
configuration which can be applied to individual tasks. For example, in Scala, csv(...).count(count.records(100))
to generate only 100 records.
In this scenario, for a given account_id, full_name
, there should be multiple records for it as we want to simulate a customer having multiple transactions. We can achieve this through defining the number of records to generate in the count
function.
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n
This will generate 1000 * 5 = 5000
records: the default record count (1000) is used for the initial set, and for each account_id, full_name
combination in those 1000 records, 5 records are generated.
Generating 5 records per column is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate this by defining a random number of records per column.
var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n
val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -> \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n
Here we set the minimum number of records per column to be 0 and the maximum to 5. This will follow a uniform distribution so the average number of records per account is 2.5. We could also define other metadata, just like we did with fields, when defining the generator. For example, we could set standardDeviation
and mean
for the number of records generated per column to follow a normal distribution.
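The arithmetic behind these counts can be sketched in a few lines of plain Java. This assumes the per-column count is drawn uniformly from the whole numbers between min and max with both ends included, and that the default of 1000 base records applies. `ExpectedRecordCount` and its methods are hypothetical names, and the totals are expectations, not guaranteed outputs.

```java
public class ExpectedRecordCount {
    // Mean of a uniform distribution over the integers min..max (inclusive).
    static double meanPerKey(int min, int max) {
        return (min + max) / 2.0;
    }

    // Expected total records: base records multiplied by the mean per key.
    static double expectedTotal(int baseRecords, int min, int max) {
        return baseRecords * meanPerKey(min, max);
    }

    public static void main(String[] args) {
        // Fixed 5 records per column is the special case min == max == 5.
        System.out.println(expectedTotal(1000, 5, 5));
        // Random 0 to 5 records per column.
        System.out.println(meanPerKey(0, 5));
        System.out.println(expectedTotal(1000, 0, 5));
    }
}
```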
Let's try running it.
#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyMultipleRecordsPerColJavaPlan or MyMultipleRecordsPerColPlan\n#after completing\nhead docker/sample/customer/transaction/part-00000*\n
It should look something like this.
ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n
You can now look to play around with other count configurations found here.
"},{"location":"setup/report/alert/","title":"Alert","text":"Alerts can be configured to help users receive feedback from their data testing results. Currently, Data Caterer supports Slack for alerts.
"},{"location":"setup/report/alert/#slack","title":"Slack","text":"Define a Slack token and one or more Slack channels that will receive an alert like the below.
var conf = configuration()\n.slackAlertToken(\"abc123\") //use appropriate Slack token (usually bot token)\n.slackAlertChannels(\"#test-alerts\", \"#pre-prod-testing\"); //define Slack channel(s) to receive alerts on\n\nexecute(conf, ...);\n
val conf = configuration\n.slackAlertToken(\"abc123\") //use appropriate Slack token (usually bot token)\n.slackAlertChannels(\"#test-alerts\", \"#pre-prod-testing\") //define Slack channel(s) to receive alerts on\n\nexecute(conf, ...)\n
"},{"location":"setup/report/html-report/","title":"Report","text":"Data Caterer can be configured to produce a report of the data generated to help users understand what was run, how much data was generated, where it was generated, validation results and any associated metadata.
"},{"location":"setup/report/html-report/#sample","title":"Sample","text":"Once run, it will produce a report like this.
"},{"location":"setup/validation/basic-validation/","title":"Basic Validations","text":"Run validations on a column to ensure the values adhere to your requirement. Can be set to complex validation logic via SQL expression as well if needed (see here).
"},{"location":"setup/validation/basic-validation/#equal","title":"Equal","text":"Ensure all data in column is equal to certain value. Value can be of any data type. Can use isEqualCol
to define SQL expression that can reference other columns.
validation().col(\"year\").isEqual(2021),\nvalidation().col(\"year\").isEqualCol(\"YEAR(date)\"),\n
validation.col(\"year\").isEqual(2021),\nvalidation.col(\"year\").isEqualCol(\"YEAR(date)\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year == 2021\"\n
"},{"location":"setup/validation/basic-validation/#not-equal","title":"Not Equal","text":"Ensure all data in column is not equal to certain value. Value can be of any data type. Can use isNotEqualCol
to define SQL expression that can reference other columns.
validation().col(\"year\").isNotEqual(2021),\nvalidation().col(\"year\").isNotEqualCol(\"YEAR(date)\"),\n
validation.col(\"year\").isNotEqual(2021),\nvalidation.col(\"year\").isNotEqualCol(\"YEAR(date)\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year != 2021\"\n
"},{"location":"setup/validation/basic-validation/#null","title":"Null","text":"Ensure all data in column is null.
validation().col(\"year\").isNull()\n
validation.col(\"year\").isNull\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNULL(year)\"\n
"},{"location":"setup/validation/basic-validation/#not-null","title":"Not Null","text":"Ensure all data in column is not null.
validation().col(\"year\").isNotNull()\n
validation.col(\"year\").isNotNull\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNOTNULL(year)\"\n
"},{"location":"setup/validation/basic-validation/#contains","title":"Contains","text":"Ensure all data in column contains a certain string. Column has to have type string.
validation().col(\"name\").contains(\"peter\")\n
validation.col(\"name\").contains(\"peter\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"CONTAINS(name, 'peter')\"\n
"},{"location":"setup/validation/basic-validation/#not-contains","title":"Not Contains","text":"Ensure all data in column does not contain certain string. Column has to have type string.
validation().col(\"name\").notContains(\"peter\")\n
validation.col(\"name\").notContains(\"peter\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!CONTAINS(name, 'peter')\"\n
"},{"location":"setup/validation/basic-validation/#unique","title":"Unique","text":"Ensure all data in column is unique.
validation().unique(\"account_id\", \"name\")\n
validation.unique(\"account_id\", \"name\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- unique: [\"account_id\", \"name\"]\n
"},{"location":"setup/validation/basic-validation/#less-than","title":"Less Than","text":"Ensure all data in column is less than certain value. Can use lessThanCol
to define SQL expression that can reference other columns.
validation().col(\"amount\").lessThan(100),\nvalidation().col(\"amount\").lessThanCol(\"balance + 1\"),\n
validation.col(\"amount\").lessThan(100),\nvalidation.col(\"amount\").lessThanCol(\"balance + 1\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount < 100\"\n- expr: \"amount < balance + 1\"\n
"},{"location":"setup/validation/basic-validation/#less-than-or-equal","title":"Less Than Or Equal","text":"Ensure all data in column is less than or equal to certain value. Can use lessThanOrEqualCol
to define SQL expression that can reference other columns.
validation().col(\"amount\").lessThanOrEqual(100),\nvalidation().col(\"amount\").lessThanOrEqualCol(\"balance + 1\"),\n
validation.col(\"amount\").lessThanOrEqual(100),\nvalidation.col(\"amount\").lessThanOrEqualCol(\"balance + 1\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount <= 100\"\n- expr: \"amount <= balance + 1\"\n
"},{"location":"setup/validation/basic-validation/#greater-than","title":"Greater Than","text":"Ensure all data in column is greater than certain value. Can use greaterThanCol
to define SQL expression that can reference other columns.
validation().col(\"amount\").greaterThan(100),\nvalidation().col(\"amount\").greaterThanCol(\"balance\"),\n
validation.col(\"amount\").greaterThan(100),\nvalidation.col(\"amount\").greaterThanCol(\"balance\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount > 100\"\n- expr: \"amount > balance\"\n
"},{"location":"setup/validation/basic-validation/#greater-than-or-equal","title":"Greater Than Or Equal","text":"Ensure all data in column is greater than or equal to certain value. Can use greaterThanOrEqualCol
to define SQL expression that can reference other columns.
validation().col(\"amount\").greaterThanOrEqual(100),\nvalidation().col(\"amount\").greaterThanOrEqualCol(\"balance\"),\n
validation.col(\"amount\").greaterThanOrEqual(100),\nvalidation.col(\"amount\").greaterThanOrEqualCol(\"balance\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount >= 100\"\n- expr: \"amount >= balance\"\n
"},{"location":"setup/validation/basic-validation/#between","title":"Between","text":"Ensure all data in column is between two values. Can use betweenCol
to define SQL expression that references other columns.
validation().col(\"amount\").between(100, 200),\nvalidation().col(\"amount\").betweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n
validation.col(\"amount\").between(100, 200),\nvalidation.col(\"amount\").betweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount BETWEEN 100 AND 200\"\n- expr: \"amount BETWEEN balance * 0.9 AND balance * 1.1\"\n
"},{"location":"setup/validation/basic-validation/#not-between","title":"Not Between","text":"Ensure all data in column is not between two values. Can use notBetweenCol
to define SQL expression that references other columns.
validation().col(\"amount\").notBetween(100, 200),\nvalidation().col(\"amount\").notBetweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n
validation.col(\"amount\").notBetween(100, 200),\nvalidation.col(\"amount\").notBetweenCol(\"balance * 0.9\", \"balance * 1.1\"),\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount NOT BETWEEN 100 AND 200\"\n- expr: \"amount NOT BETWEEN balance * 0.9 AND balance * 1.1\"\n
"},{"location":"setup/validation/basic-validation/#in","title":"In","text":"Ensure all data in column is in set of defined values.
JavaScalaYAMLvalidation().col(\"status\").in(\"open\", \"closed\")\n
validation.col(\"status\").in(\"open\", \"closed\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"status IN ('open', 'closed')\"\n
"},{"location":"setup/validation/basic-validation/#matches","title":"Matches","text":"Ensure all data in column matches certain regex expression.
JavaScalaYAMLvalidation().col(\"account_id\").matches(\"ACC[0-9]{8}\")\n
validation.col(\"account_id\").matches(\"ACC[0-9]{8}\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"REGEXP(account_id, 'ACC[0-9]{8}')\"\n
"},{"location":"setup/validation/basic-validation/#not-matches","title":"Not Matches","text":"Ensure all data in column does not match certain regex expression.
JavaScalaYAMLvalidation().col(\"account_id\").notMatches(\"^acc.*\")\n
validation.col(\"account_id\").notMatches(\"^acc.*\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!REGEXP(account_id, '^acc.*')\"\n
"},{"location":"setup/validation/basic-validation/#starts-with","title":"Starts With","text":"Ensure all data in column starts with certain string. Column has to have type string.
JavaScalaYAMLvalidation().col(\"account_id\").startsWith(\"ACC\")\n
validation.col(\"account_id\").startsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"STARTSWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#not-starts-with","title":"Not Starts With","text":"Ensure all data in column does not start with certain string. Column has to have type string.
JavaScalaYAMLvalidation().col(\"account_id\").notStartsWith(\"ACC\")\n
validation.col(\"account_id\").notStartsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!STARTSWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#ends-with","title":"Ends With","text":"Ensure all data in column ends with certain string. Column has to have type string.
JavaScalaYAMLvalidation().col(\"account_id\").endsWith(\"ACC\")\n
validation.col(\"account_id\").endsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ENDSWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#not-ends-with","title":"Not Ends With","text":"Ensure all data in column does not end with certain string. Column has to have type string.
JavaScalaYAMLvalidation().col(\"account_id\").notEndsWith(\"ACC\")\n
validation.col(\"account_id\").notEndsWith(\"ACC\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!ENDSWITH(account_id, 'ACC')\"\n
"},{"location":"setup/validation/basic-validation/#size","title":"Size","text":"Ensure all data in column has certain size. Column has to have type array or map.
JavaScalaYAMLvalidation().col(\"transactions\").size(5)\n
validation.col(\"transactions\").size(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) == 5\"\n
"},{"location":"setup/validation/basic-validation/#not-size","title":"Not Size","text":"Ensure all data in column does not have certain size. Column has to have type array or map.
JavaScalaYAMLvalidation().col(\"transactions\").notSize(5)\n
validation.col(\"transactions\").notSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) != 5\"\n
"},{"location":"setup/validation/basic-validation/#less-than-size","title":"Less Than Size","text":"Ensure all data in column has size less than certain value. Column has to have type array or map.
JavaScalaYAMLvalidation().col(\"transactions\").lessThanSize(5)\n
validation.col(\"transactions\").lessThanSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) < 5\"\n
"},{"location":"setup/validation/basic-validation/#less-than-or-equal-size","title":"Less Than Or Equal Size","text":"Ensure all data in column has size less than or equal to certain value. Column has to have type array or map.
JavaScalaYAMLvalidation().col(\"transactions\").lessThanOrEqualSize(5)\n
validation.col(\"transactions\").lessThanOrEqualSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) <= 5\"\n
"},{"location":"setup/validation/basic-validation/#greater-than-size","title":"Greater Than Size","text":"Ensure all data in column has size greater than certain value. Column has to have type array or map.
JavaScalaYAMLvalidation().col(\"transactions\").greaterThanSize(5)\n
validation.col(\"transactions\").greaterThanSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) > 5\"\n
"},{"location":"setup/validation/basic-validation/#greater-than-or-equal-size","title":"Greater Than Or Equal Size","text":"Ensure all data in column has size greater than or equal to certain value. Column has to have type array or map.
JavaScalaYAMLvalidation().col(\"transactions\").greaterThanOrEqualSize(5)\n
validation.col(\"transactions\").greaterThanOrEqualSize(5)\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) >= 5\"\n
"},{"location":"setup/validation/basic-validation/#luhn-check","title":"Luhn Check","text":"Ensure all data in column passes luhn check. Luhn check is used to validate credit card numbers and certain identification numbers (see here for more details).
JavaScalaYAMLvalidation().col(\"credit_card\").luhnCheck()\n
validation.col(\"credit_card\").luhnCheck\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"LUHN_CHECK(credit_card)\"\n
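For readers unfamiliar with the check itself, below is a minimal Java sketch of the standard Luhn checksum. It is an illustration only, not Data Caterer's implementation; the class name LuhnSketch and method name passesLuhn are hypothetical, and the sketch assumes the value contains digits only (no spaces or dashes).

```java
public class LuhnSketch {
    // Returns true when the digit string passes the Luhn checksum.
    // Assumption: the input holds only the characters 0-9.
    public static boolean passesLuhn(String digits) {
        if (digits == null || digits.isEmpty()) {
            return false;
        }
        int sum = 0;
        boolean doubleIt = false; // the rightmost digit is never doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (d < 0 || d > 9) {
                return false; // reject anything that is not a digit
            }
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9; // same as adding the two digits of the product
                }
            }
            sum += d;
            doubleIt = !doubleIt; // double every second digit from the right
        }
        return sum % 10 == 0;
    }
}
```

With this sketch, the commonly cited sample number 79927398713 passes, while changing its last digit makes it fail.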
"},{"location":"setup/validation/basic-validation/#has-type","title":"Has Type","text":"Ensure all data in column has certain data type.
JavaScalaYAMLvalidation().col(\"id\").hasType(\"string\")\n
validation.col(\"id\").hasType(\"string\")\n
---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"TYPEOF(id) == 'string'\"\n
"},{"location":"setup/validation/basic-validation/#expression","title":"Expression","text":"Ensure all data adheres to a defined SQL expression that returns a boolean. You can define complex logic here that combines multiple columns.
For example, CASE WHEN status == 'open' THEN balance > 0 ELSE balance == 0 END
would check all rows with status
open to have balance
greater than 0, otherwise, check the balance
is 0.
var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().expr(\"amount < 100\"),\nvalidation().expr(\"year == 2021\").errorThreshold(0.1), //equivalent to if error percentage is > 10%, then fail\nvalidation().expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200) //equivalent to if number of errors is > 200, then fail\n);\n\nvar conf = configuration().enableValidation(true);\n
val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -> \"true\"))\n.validations(\nvalidation.expr(\"amount < 100\"),\nvalidation.expr(\"year == 2021\").errorThreshold(0.1), //equivalent to if error percentage is > 10%, then fail\nvalidation.expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200) //equivalent to if number of errors is > 200, then fail\n)\n\nval conf = configuration.enableValidation(true)\n
---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount < 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1 #equivalent to if error percentage is > 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200 #equivalent to if number of errors is > 200, then fail\ndescription: \"Should be lots of Peters\"\n\n#enableValidation inside application.conf\n
"},{"location":"setup/validation/column-name-validation/","title":"Column Name Validations","text":"Run validations on the column names to check the column name count or the existence of column names.
"},{"location":"setup/validation/column-name-validation/#count-equal","title":"Count Equal","text":"Ensure column name count is equal to certain number.
JavaScalavalidation().columnNames().countEqual(3)\n
validation.columnNames.countEqual(3)\n
"},{"location":"setup/validation/column-name-validation/#not-equal","title":"Count Between","text":"Ensure column name count is between two numbers.
JavaScalavalidation().columnNames().countBetween(10, 12)\n
validation.columnNames.countBetween(10, 12)\n
"},{"location":"setup/validation/column-name-validation/#match-order","title":"Match Order","text":"Ensure all column names match a particular ordering and are complete.
JavaScalavalidation().columnNames().matchOrder(\"account_id\", \"amount\", \"name\")\n
validation.columnNames.matchOrder(\"account_id\", \"amount\", \"name\")\n
"},{"location":"setup/validation/column-name-validation/#match-set","title":"Match Set","text":"Ensure column names contain the set of expected names. Order is not checked.
JavaScalavalidation().columnNames().matchSet(\"account_id\", \"first_name\")\n
validation.columnNames.matchSet(\"account_id\", \"first_name\")\n
"},{"location":"setup/validation/external-source-validation/","title":"External Source Validations","text":"Use validations that are defined in external sources such as Great Expectations or OpenMetadata. This allows you to generate data for your upstream data sources and validate your pipelines based on the same rules that would be applied in production.
Info
Retrieving data validations from an external source is a paid feature. Try the free trial here.
"},{"location":"setup/validation/external-source-validation/#supported-sources","title":"Supported Sources","text":"Source Support OpenMetadata Great Expectations DBT Constraints SodaCL MonteCarlo"},{"location":"setup/validation/external-source-validation/#openmetadata","title":"OpenMetadata","text":"Use data quality rules defined from OpenMetadata to execute over dataset.
JavaScalavar jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(metadataSource().openMetadata(\n\"http://host.docker.internal:8585/api\",\nConstants.OPEN_METADATA_AUTH_TYPE_OPEN_METADATA(),\nMap.of(\nConstants.OPEN_METADATA_JWT_TOKEN(), \"abc123\",\nConstants.OPEN_METADATA_TABLE_FQN(), \"sample_data.ecommerce_db.shopify.raw_customer\"\n)\n));\n\nvar conf = configuration().enableGenerateValidations(true);\n
val jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(metadataSource.openMetadata(\n\"http://host.docker.internal:8585/api\",\nOPEN_METADATA_AUTH_TYPE_OPEN_METADATA,\nMap(\nOPEN_METADATA_JWT_TOKEN -> \"abc123\", //find under settings/bots/ingestion-bot/token\nOPEN_METADATA_TABLE_FQN -> \"sample_data.ecommerce_db.shopify.raw_customer\"\n)\n))\n\nval conf = configuration.enableGenerateValidations(true)\n
"},{"location":"setup/validation/external-source-validation/#great-expectations","title":"Great Expectations","text":"Use data quality rules defined in Great Expectations to execute over dataset.
JavaScalavar jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(metadataSource().greatExpectations(\"great-expectations/taxi-expectations.json\"));\n\nvar conf = configuration().enableGenerateValidations(true);\n
val jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(metadataSource.greatExpectations(\"great-expectations/taxi-expectations.json\"))\n\nval conf = configuration.enableGenerateValidations(true)\n
"},{"location":"setup/validation/group-by-validation/","title":"Group By Validation","text":"If you want to run aggregations based on a particular set of columns or just the whole dataset, you can do so via group by validations. An example would be checking that the sum of amount
is less than 1000 per account_id, year
. The validations applied can be one of the validations from the basic validation set found here.
Check the number of records across the whole dataset.
JavaScalavalidation().groupBy().count().lessThan(1000)\n
validation.groupBy().count().lessThan(1000)\n
"},{"location":"setup/validation/group-by-validation/#record-count-per-group","title":"Record count per group","text":"Check the number of records for each group.
JavaScalavalidation().groupBy(\"account_id\", \"year\").count().lessThan(10)\n
validation.groupBy(\"account_id\", \"year\").count().lessThan(10)\n
"},{"location":"setup/validation/group-by-validation/#sum","title":"Sum","text":"Check that the sum of a column's values for each group adheres to validation.
JavaScalavalidation().groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n
validation.groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n
"},{"location":"setup/validation/group-by-validation/#count","title":"Count","text":"Check the count for each group adheres to validation.
JavaScalavalidation().groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n
validation.groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n
"},{"location":"setup/validation/group-by-validation/#min","title":"Min","text":"Check the min for each group adheres to validation.
JavaScalavalidation().groupBy(\"account_id\", \"year\").min(\"amount\").greaterThan(0)\n
validation.groupBy(\"account_id\", \"year\").min(\"amount\").greaterThan(0)\n
"},{"location":"setup/validation/group-by-validation/#max","title":"Max","text":"Check the max for each group adheres to validation.
JavaScalavalidation().groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n
validation.groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n
"},{"location":"setup/validation/group-by-validation/#average","title":"Average","text":"Check the average for each group adheres to validation.
JavaScalavalidation().groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n
validation.groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n
"},{"location":"setup/validation/group-by-validation/#standard-deviation","title":"Standard deviation","text":"Check the standard deviation for each group adheres to validation.
JavaScalavalidation().groupBy(\"account_id\", \"year\").stddev(\"amount\").between(0.5, 0.6)\n
validation.groupBy(\"account_id\", \"year\").stddev(\"amount\").between(0.5, 0.6)\n
"},{"location":"setup/validation/upstream-data-source-validation/","title":"Upstream Data Source Validation","text":"If you want to run data validations based on data generated or data from another data source, you can use the upstream data source validations. An example would be generating a Parquet file that gets ingested by a job and inserted into Postgres. The validations can then check that each account_id
generated in the Parquet file exists in the account_number
column in Postgres. The validations can be chained with basic and group by validations or even other upstream data sources, to cover any complex validations.
Join across datasets by particular columns. Then run validations on the joined dataset. You will notice that the data source name is appended onto the column names when joined (i.e. my_first_json_customer_details
), to ensure column names do not clash and make it obvious which columns are being validated.
In the example below, we check that, for the same account_id
, customer_details.name
in the my_first_json
dataset equals the name
column in my_second_json
.
var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask) //upstream data generation task is `firstJsonTask`\n.joinColumns(\"account_id\") //use `account_id` column in both datasets to join corresponding records (outer join by default)\n.withValidation(\nvalidation().col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\") //validate the name in `my_second_json` is equal to `customer_details.name` in `my_first_json` when the `account_id` matches\n)\n);\n
val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask) //upstream data generation task is `firstJsonTask`\n.joinColumns(\"account_id\") //use `account_id` column in both datasets to join corresponding records (outer join by default)\n.withValidation(\nvalidation.col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\") //validate the name in `my_second_json` is equal to `customer_details.name` in `my_first_json` when the `account_id` matches\n)\n)\n
"},{"location":"setup/validation/upstream-data-source-validation/#join-expression","title":"Join expression","text":"Define join expression to link two datasets together. This can be any SQL expression that returns a boolean value. Useful in situations where join is based on transformations or complex logic.
In the below example, we have to use CONCAT
SQL function to combine 'ACC'
and account_number
to join with account_id
column in my_first_json
dataset.
var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask)\n.joinExpr(\"my_first_json_account_id == CONCAT('ACC', account_number)\") //generic SQL expression that returns a boolean\n.withValidation(\nvalidation().col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\")\n)\n);\n
val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask)\n.joinExpr(\"my_first_json_account_id == CONCAT('ACC', account_number)\") //generic SQL expression that returns a boolean\n.withValidation(\nvalidation.col(\"my_first_json_customer_details.name\")\n.isEqualCol(\"name\")\n)\n)\n
"},{"location":"setup/validation/upstream-data-source-validation/#different-join-type","title":"Different join type","text":"By default, an outer join is used to gather columns from both datasets together for validation. But there may be scenarios where you want to control the join type.
Possible join types include:
In the example below, we do an anti
join by column account_id
and check if there are no records. This essentially checks that all account_id
's from my_second_json
exist in my_first_json
. The second validation also does something similar but does an outer
join (by default) and checks that the joined dataset has 30 records.
var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.joinType(\"anti\")\n.withValidation(validation().count().isEqual(0)),\nvalidation().upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation().count().isEqual(30))\n);\n
val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.joinType(\"anti\")\n.withValidation(validation.count().isEqual(0)),\nvalidation.upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation.count().isEqual(30))\n)\n
"},{"location":"setup/validation/upstream-data-source-validation/#join-then-group-by-validation","title":"Join then group by validation","text":"We can apply aggregate or group by validations to the resulting joined dataset as the withValidation
method accepts any type of validation.
Here we group by account_id, my_first_json_balance
to check that when the amount
field is summed up per group, it is between 0.8 and 1.2 times the balance.
var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"balance\").type(DoubleType.instance()).min(10).max(1000),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n);\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask).joinColumns(\"account_id\")\n.withValidation(\nvalidation().groupBy(\"account_id\", \"my_first_json_balance\")\n.sum(\"amount\")\n.betweenCol(\"my_first_json_balance * 0.8\", \"my_first_json_balance * 1.2\")\n)\n);\n
val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"balance\").`type`(DoubleType).min(10).max(1000),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask).joinColumns(\"account_id\")\n.withValidation(\nvalidation.groupBy(\"account_id\", \"my_first_json_balance\")\n.sum(\"amount\")\n.betweenCol(\"my_first_json_balance * 0.8\", \"my_first_json_balance * 1.2\")\n)\n)\n
"},{"location":"setup/validation/upstream-data-source-validation/#chained-validations","title":"Chained validations","text":"Given that the withValidation
method accepts any other type of validation, you can chain other upstream data sources with it. Here we will show a third upstream data source being checked to ensure 30 records exist after joining them together by account_id
.
var firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"balance\").type(DoubleType.instance()).min(10).max(1000),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n)\n.count(count().records(10));\n\nvar thirdJsonTask = json(\"my_third_json\", \"/tmp/data/third_json\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"amount\").type(IntegerType.instance()).min(1).max(100),\nfield().name(\"name\").expression(\"#{Name.name}\")\n)\n.count(count().records(10));\n\nvar secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation().upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(\nvalidation().upstreamData(thirdJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation().count().isEqual(30))\n)\n);\n
val firstJsonTask = json(\"my_first_json\", \"/tmp/data/first_json\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"balance\").`type`(DoubleType).min(10).max(1000),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n)\n.count(count.records(10))\n\nval thirdJsonTask = json(\"my_third_json\", \"/tmp/data/third_json\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"amount\").`type`(IntegerType).min(1).max(100),\nfield.name(\"name\").expression(\"#{Name.name}\")\n)\n.count(count.records(10))\n\nval secondJsonTask = json(\"my_second_json\", \"/tmp/data/second_json\")\n.validations(\nvalidation.upstreamData(firstJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(\nvalidation.upstreamData(thirdJsonTask)\n.joinColumns(\"account_id\")\n.withValidation(validation.count().isEqual(30))\n)\n)\n
Check out a full example here for more details.
"},{"location":"use-case/business-value/","title":"Business Value","text":"Below is a list of the business related benefits from using Data Caterer which may be applicable for your use case.
Problem Data Caterer Solution Resources Effects Reliable test data creation - Profile existing data- Create scenarios- Generate data Software Engineers, QA, Testers Cost reduction in labor, more time spent on development, more bugs caught before production Faster development cycles - Generate data in local, test, UAT, pre-prod- Run different scenarios Software Engineers, QA, Testers More defects caught in lower environments, features pushed to production faster, common framework used across all environments Data compliance - Profiling existing data- Generate based on metadata- No complex masking- No production data used in lower environments Audit and compliance No chance for production data breaches Storage costs - Delete generated data- Test specific scenarios Infrastructure Lower data storage costs, less time spent on data management and clean up Schema evolution - Create metadata from data sources- Generate data based off fresh metadata Software Engineers, QA, Testers Less time spent altering tests due to schema changes, ease of use between environments and application versions"},{"location":"use-case/comparison/","title":"Comparison to similar tools","text":"I have tried to include all the companies found in the list here from Mostly AI blog post and used information that is publicly available.
The companies/products not shown below either have:
You may notice that the majority of data generators use machine learning (ML) models to learn from your existing datasets to generate new data. Below are some pros and cons to the approach.
Pros
Cons
Items below summarise the roadmap of Data Caterer. As each task gets completed, it will be documented and linked.
Feature Description Sub Tasks Data source support Batch or real time data sources that can be added to Data Caterer. Support data sources that users want - AWS, GCP and Azure related data services ( cloud storage)- Deltalake- RabbitMQ- ActiveMQ- MongoDB- Elasticsearch- Snowflake- Databricks- Pulsar Metadata discovery Allow for schema and data profiling from external metadata sources - HTTP (OpenAPI spec)- JMS- Read from samples- OpenLineage metadata (Marquez)- OpenMetadata- ODCS (Open Data Contract Standard)- Amundsen- Datahub- Solace Event Portal- Airflow- DBT Developer API Scala/Java interface for developers/testers to create data generation and validation tasks - Scala- Java Report generation Generate a report that summarises the data generation or validation results - Report for data generated and validation rules UI portal Allow users to access a UI to input data generation or validation tasks. Also be able to view report results - Metadata stored in database- Store data generation/validation run information in file/database Integration with data validation tools Derive data validation rules from existing data validation tools - Great Expectation- DBT constraints- SodaCL- MonteCarlo- OpenMetadata Data validation rule suggestions Based on metadata, generate data validation rules appropriate for the dataset - Suggest basic data validations (yet to document) Wait conditions before data validation Define certain conditions to be met before starting data validations - Webhook- File exists- Data exists via SQL expression- Pause Validation types Ability to define simple/complex data validations - Basic validations- Aggregates (sum of amount per account is > 500)- Ordering (transactions are ordered by date)- Relationship (at least one account entry in history table per account in accounts table)- Data profile (how close the generated data profile is compared to the expected data profile)- Column name (check column count, column names, ordering) Data generation record 
count Generate scenarios where there are one to many, many to many situations relating to record count. Also ability to cover all edge cases or scenarios - Cover all possible cases (i.e. record for each combination of oneOf values, positive/negative values etc.)- Ability to override edge cases Alerting When tasks have completed, ability to define alerts based on certain conditions - Slack- Email Metadata enhancements Based on data profiling or inference, can add to existing metadata - PII detection (can integrate with Presidio)- Relationship detection across data sources- SQL generation- Ordering information Data cleanup Ability to clean up generated data - Clean up generated data- Clean up data in consumer data sinks- Clean up data from real time sources (i.e. DELETE HTTP endpoint, delete events in JMS) Trial version Trial version of the full app for users to test out all the features - Trial app to try out all features Code generation Based on metadata or existing classes, code for data generation and validation could be generated - Code generation- Schema generation from Scala/Java class Real time response data validations Ability to define data validations based on the response from real time data sources (e.g. HTTP response) - HTTP response data validation"},{"location":"use-case/blog/shift-left-data-quality/","title":"Shifting Data Quality Left with Data Catering","text":""},{"location":"use-case/blog/shift-left-data-quality/#empowering-proactive-data-management","title":"Empowering Proactive Data Management","text":"In the ever-evolving landscape of data-driven decision-making, ensuring data quality is non-negotiable. Traditionally, data quality has been a concern addressed late in the development lifecycle, often leading to reactive measures and increased costs. However, a paradigm shift is underway with the adoption of a \"shift left\" approach, placing data quality at the forefront of the development process.
"},{"location":"use-case/blog/shift-left-data-quality/#today","title":"Today","text":"graph LR\n subgraph badQualityData[<b>Manually generated data, limited data scenarios, fragmented testing tools</b>]\n local[<b>Local</b>\\nManual test, unit test]\n dev[<b>Dev</b>\\nManual test, integration test]\n stg[<b>Staging</b>\\nSanity checks]\n end\n\n subgraph qualityData[<b>Reliable data, the true test</b>]\n prod[<b>Production</b>\\nData quality checks, monitoring, observaibility]\n end\n\n style badQualityData fill:#d9534f,fill-opacity:0.7\n style qualityData fill:#5cb85c,fill-opacity:0.7\n\n local --> dev\n dev --> stg\n stg --> prod
"},{"location":"use-case/blog/shift-left-data-quality/#with-data-caterer","title":"With Data Caterer","text":"graph LR\n subgraph qualityData[<b>Reliable data anywhere, common testing tool across all data sources</b>]\n direction LR\n local[<b>Local</b>\\nManual test, unit test]\n dev[<b>Dev</b>\\nManual test, integration test]\n stg[<b>Staging</b>\\nSanity checks]\n prod[<b>Production</b>\\nData quality checks, monitoring, observaibility]\n end\n\n style qualityData fill:#5cb85c,fill-opacity:0.7\n\n local --> dev\n dev --> stg\n stg --> prod
"},{"location":"use-case/blog/shift-left-data-quality/#understanding-the-shift-left-approach","title":"Understanding the Shift Left Approach","text":"\"Shift left\" is a philosophy that advocates for addressing tasks and concerns earlier in the development lifecycle. Applied to data quality, it means tackling data issues as early as possible, ideally during the development and testing phases. This approach aims to catch data anomalies, inaccuracies, or inconsistencies before they propagate through the system, reducing the likelihood of downstream errors.
"},{"location":"use-case/blog/shift-left-data-quality/#data-caterer-the-catalyst-for-shifting-left","title":"Data Caterer: The Catalyst for Shifting Left","text":"Enter Data Caterer, a metadata-driven data generation and validation tool designed to empower organizations in shifting data quality left. By incorporating Data Caterer into the early stages of development, teams can proactively test complex data flows, validate data sources, and ensure data quality before it reaches downstream processes.
"},{"location":"use-case/blog/shift-left-data-quality/#key-advantages-of-shifting-data-quality-left-with-data-caterer","title":"Key Advantages of Shifting Data Quality Left with Data Caterer","text":"As organizations strive for excellence in their data-driven endeavors, the shift left approach with Data Caterer becomes a strategic imperative. By instilling a proactive data quality culture, teams can minimize the risk of costly errors, enhance the reliability of their data, and streamline the entire development lifecycle.
In conclusion, the marriage of the shift left philosophy and Data Caterer brings forth a new era of data management, where data quality is not just a final checkpoint but an integral part of every development milestone. Embrace the shift left approach with Data Caterer and empower your teams to build robust, high-quality data solutions from the very beginning.
Shift Left, Validate Early, and Accelerate with Data Caterer.
"}]} \ No newline at end of file diff --git a/site/sitemap.xml.gz b/site/sitemap.xml.gz index 30df147a..b09541af 100644 Binary files a/site/sitemap.xml.gz and b/site/sitemap.xml.gz differ