diff --git a/docs/sample/report/report_screenshot.png b/docs/sample/report/report_screenshot.png index 8b1517b9..51396e23 100644 Binary files a/docs/sample/report/report_screenshot.png and b/docs/sample/report/report_screenshot.png differ diff --git a/docs/setup/guide/data-source/file/iceberg.md b/docs/setup/guide/data-source/file/iceberg.md index 7a45e801..2d587266 100644 --- a/docs/setup/guide/data-source/file/iceberg.md +++ b/docs/setup/guide/data-source/file/iceberg.md @@ -47,6 +47,7 @@ Create a new Java or Scala class. - Java: `src/main/java/io/github/datacatering/plan/MyIcebergJavaPlan.java` - Scala: `src/main/scala/io/github/datacatering/plan/MyIcebergPlan.scala` +- YAML: `docker/data/customer/plan/my-iceberg.yaml` Make sure your class extends `PlanRun`. @@ -68,6 +69,22 @@ Make sure your class extends `PlanRun`. } ``` +=== "YAML" + + In `docker/data/custom/plan/my-iceberg.yaml`: + ```yaml + name: "my_iceberg_plan" + description: "Create account data in Iceberg table" + tasks: + - name: "iceberg_account_table" + dataSourceName: "customer_accounts" + enabled: true + ``` + +=== "UI" + + Check next section. + This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use. @@ -105,6 +122,30 @@ Within our class, we can start by defining the connection properties to read/wri Additional options can be found [**here**](https://iceberg.apache.org/docs/1.5.0/spark-configuration/#catalog-configuration). +=== "YAML" + + In `application.conf`: + ``` + iceberg { + customer_accounts { + path = "/opt/app/data/customer/iceberg" + path = ${?ICEBERG_WAREHOUSE_PATH} + catalogType = "hadoop" + catalogType = ${?ICEBERG_CATALOG_TYPE} + catalogUri = "" + catalogUri = ${?ICEBERG_CATALOG_URI} + } + } + ``` + +=== "UI" + + 1. Go to `Connection` tab in the top bar + 2. Select data source as `Iceberg` + 1. Enter in data source name `customer_accounts` + 2. Select catalog type `hadoop` + 3. Enter warehouse path as `/opt/app/data/customer/iceberg` + #### Schema Depending on how you want to define the schema, follow the below: @@ -139,15 +180,50 @@ have unique values generated. execute(myPlan, config, accountTask, transactionTask) ``` +=== "YAML" + + In `application.conf`: + ``` + flags { + enableUniqueCheck = true + } + folders { + generatedReportsFolderPath = "/opt/app/data/report" + } + ``` + +=== "UI" + + 1. Click on `Advanced Configuration` towards the bottom of the screen + 2. Click on `Flag` and click on `Unique Check` + 3. Click on `Folder` and enter `/tmp/data-caterer/report` for `Generated Reports Folder Path` + ### Run Now we can run via the script `./run.sh` that is in the top level directory of the `data-caterer-example` to run the class we just created. -```shell -./run.sh -#input class MyIcebergJavaPlan or MyIcebergPlan -``` +=== "Java" + + ```shell + ./run.sh MyIcebergJavaPlan + ``` + +=== "Scala" + + ```shell + ./run.sh MyIcebergPlan + ``` + +=== "YAML" + + ```shell + ./run.sh my-iceberg.yaml + ``` + +=== "UI" + + 1. Click on `Execute` at the top Congratulations! You have now made a data generator that has simulated a real world data scenario. You can check the `IcebergJavaPlan.java` or `IcebergPlan.scala` files as well to check that your plan is the same. 
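If you want to quickly confirm that files were generated, you can list the Iceberg warehouse folder. This is only a
sketch: it assumes the example repo's usual volume mapping of `/opt/app/data` inside the container to `docker/sample`
on the host (the same mapping used in the other guides), so adjust the path if your warehouse path differs.

```shell
# assumed host location of the warehouse configured at /opt/app/data/customer/iceberg
ls docker/sample/customer/iceberg
```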
diff --git a/docs/setup/guide/data-source/metadata/open-data-contract-standard.md b/docs/setup/guide/data-source/metadata/open-data-contract-standard.md index 4e8a5f0b..9f4c85b0 100644 --- a/docs/setup/guide/data-source/metadata/open-data-contract-standard.md +++ b/docs/setup/guide/data-source/metadata/open-data-contract-standard.md @@ -1,6 +1,6 @@ --- title: "Using Open Data Contract Standard (ODCS) for Test Data Management" -description: "Example of using Open Data Contract Standard (ODCS) for data generation and testing tool." +description: "Example of using Open Data Contract Standard (ODCS) for data generation and testing." image: "https://data.catering/diagrams/logo/data_catering_logo.svg" --- @@ -10,7 +10,7 @@ image: "https://data.catering/diagrams/logo/data_catering_logo.svg" Generating data based on an external metadata source is a paid feature. -Creating a data generator for a JSON file based on metadata stored +Creating a data generator for a CSV file based on metadata stored in [Open Data Contract Standard (ODCS)](https://github.com/bitol-io/open-data-contract-standard). ## Requirements @@ -52,13 +52,13 @@ You can follow the local docker setup found [**here**](https://docs.open-metadata.org/v1.2.x/quick-start/local-docker-deployment) to help with setting up Open Data Contract Standard (ODCS) in your local environment. - ### Plan Setup Create a new Java or Scala class. - Java: `src/main/java/io/github/datacatering/plan/MyAdvancedODCSJavaPlanRun.java` - Scala: `src/main/scala/io/github/datacatering/plan/MyAdvancedODCSPlanRun.scala` +- YAML: `docker/data/customer/plan/my-odcs.yaml` Make sure your class extends `PlanRun`. @@ -88,248 +88,248 @@ Make sure your class extends `PlanRun`. } ``` +=== "YAML" + + In `docker/data/custom/plan/my-odcs.yaml`: + ```yaml + name: "my_odcs_plan" + description: "Create account data in CSV via ODCS metadata" + tasks: + - name: "csv_account_file" + dataSourceName: "customer_accounts" + enabled: true + ``` + + In `application.conf`: + ``` + flags { + enableUniqueCheck = true + } + folders { + generatedReportsFolderPath = "/opt/app/data/report" + } + ``` + +=== "UI" + + 1. Click on `Advanced Configuration` towards the bottom of the screen + 2. Click on `Flag` and click on `Unique Check` + 3. Click on `Folder` and enter `/tmp/data-caterer/report` for `Generated Reports Folder Path` + We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access. #### Schema -We can point the schema of a data source to our Open Data Contract Standard (ODCS) instance. We will use a JSON data source so that we can -show how nested data types are handled and how we could customise it. - -##### Single Schema +We can point the schema of a data source to our Open Data Contract Standard (ODCS) file. === "Java" ```java - import io.github.datacatering.datacaterer.api.model.Constants; - ... 
- - var jsonTask = json("my_json", "/opt/app/data/json", Map.of("saveMode", "overwrite")) - .schema(metadataSource().openMetadataJava( - "http://localhost:8585/api", //url - Constants.OPEN_METADATA_AUTH_TYPE_OPEN_METADATA(), //auth type - Map.of( //additional options (including auth options) - Constants.OPEN_METADATA_JWT_TOKEN(), "abc123", //get from settings/bots/ingestion-bot - Constants.OPEN_METADATA_TABLE_FQN(), "sample_data.ecommerce_db.shopify.raw_customer" //table fully qualified name - ) - )) - .count(count().records(10)); + var accountTask = csv("my_csv", "/opt/app/data/account-odcs", Map.of("header", "true")) + .schema(metadataSource().openDataContractStandard("/opt/app/mount/odcs/full-example.yaml")) + .count(count().records(100)); ``` === "Scala" ```scala - import io.github.datacatering.datacaterer.api.model.Constants.{OPEN_METADATA_AUTH_TYPE_OPEN_METADATA, OPEN_METADATA_JWT_TOKEN, OPEN_METADATA_TABLE_FQN, SAVE_MODE} - ... - - val jsonTask = json("my_json", "/opt/app/data/json", Map("saveMode" -> "overwrite")) - .schema(metadataSource.openMetadata( - "http://localhost:8585/api", //url - OPEN_METADATA_AUTH_TYPE_OPEN_METADATA, //auth type - Map( //additional options (including auth options) - OPEN_METADATA_JWT_TOKEN -> "abc123", //get from settings/bots/ingestion-bot - OPEN_METADATA_TABLE_FQN -> "sample_data.ecommerce_db.shopify.raw_customer" //table fully qualified name - ) - )) - .count(count.records(10)) + val accountTask = csv("customer_accounts", "/opt/app/data/customer/account-odcs", Map("header" -> "true")) + .schema(metadataSource.openDataContractStandard("/opt/app/mount/odcs/full-example.yaml")) + .count(count.records(100)) + ``` + +=== "YAML" + + In `docker/data/custom/task/file/csv/csv-odcs-account-task.yaml`: + ``` + name: "csv_account_file" + steps: + - name: "accounts" + type: "csv" + options: + path: "/opt/app/data/csv/account-odcs" + metadata_source_type: "open_data_contract_standard" + dataContractFile: "/opt/app/mount/odcs/full-example.yaml" + count: + records: 100 ``` -The above defines that the schema will come from Open Data Contract Standard (ODCS), which is a type of metadata source that contains -information about schemas. Specifically, it points to the `sample_data.ecommerce_db.shopify.raw_customer` table. You -can check out the schema [here](http://localhost:8585/table/sample_data.ecommerce_db.shopify.raw_customer/schema) to -see what it looks like. +=== "UI" + + 1. Click on `Connection` tab at the top + 1. Select `ODCS` as the data source and enter `example-odcs` + 1. Copy [this file](https://github.com/data-catering/data-caterer-example/blob/main/docker/mount/odcs/full-example.yaml) into `/tmp/odcs/full-example.yaml` + 1. Enter `/tmp/odcs/full-example.yaml` as the `Contract File` + +The above defines that the schema will come from Open Data Contract Standard (ODCS), which is a type of metadata source +that contains information about schemas. +[Specifically, it points to the schema provided here](https://github.com/data-catering/data-caterer-example/blob/main/docker/mount/odcs/full-example.yaml#L42) +in the `docker/mount/odcs` folder of data-caterer-example repo. ### Run Let's try run and see what happens. -```shell -cd .. 
-./run.sh -#input class MyAdvancedODCSJavaPlanRun or MyAdvancedODCSPlanRun -#after completing -cat docker/sample/json/part-00000-* -``` +=== "Java" + + ```shell + ./run.sh MyAdvancedODCSJavaPlanRun + head docker/sample/account-odcs/part-00000-* + ``` + +=== "Scala" + + ```shell + ./run.sh MyAdvancedODCSPlanRun + head docker/sample/account-odcs/part-00000-* + ``` + +=== "YAML" + + ```shell + ./run.sh my-odcs.yaml + head docker/sample/account-odcs/part-00000-* + ``` + +=== "UI" + + 1. Click on `Execute` at the top + ```shell + head /tmp/data-caterer/customer/account-odcs/part-00000* + ``` It should look something like this. -```json -{ - "comments": "Mh6jqpD5e4M", - "creditcard": "6771839575926717", - "membership": "Za3wCQUl9E EJj712", - "orders": [ - { - "product_id": "Aa6NG0hxfHVq", - "price": 16139, - "onsale": false, - "tax": 58134, - "weight": 40734, - "others": 45813, - "vendor": "Kh" - }, - { - "product_id": "zbHBY ", - "price": 17903, - "onsale": false, - "tax": 39526, - "weight": 9346, - "others": 52035, - "vendor": "jbkbnXAa" - }, - { - "product_id": "5qs3gakppd7Nw5", - "price": 48731, - "onsale": true, - "tax": 81105, - "weight": 2004, - "others": 20465, - "vendor": "nozCDMSXRPH Ev" - }, - { - "product_id": "CA6h17ANRwvb", - "price": 62102, - "onsale": true, - "tax": 96601, - "weight": 78849, - "others": 79453, - "vendor": " ihVXEJz7E2EFS" - } - ], - "platform": "GLt9", - "preference": { - "key": "nmPmsPjg C", - "value": true - }, - "shipping_address": [ - { - "name": "Loren Bechtelar", - "street_address": "Suite 526 293 Rohan Road, Wunschshire, NE 25532", - "city": "South Norrisland", - "postcode": "56863" - } - ], - "shipping_date": "2022-11-03", - "transaction_date": "2023-02-01", - "customer": { - "username": "lance.murphy", - "name": "Zane Brakus DVM", - "sex": "7HcAaPiO", - "address": "594 Loida Haven, Gilland, MA 26071", - "mail": "Un3fhbvK2rEbenIYdnq", - "birthdate": "2023-01-31" - } -} +``` +txn_ref_dt,rcvr_id,rcvr_cntry_code +2023-07-11,PB0Wo dMx,nWlbRGIinpJfP +2024-05-01,5GtkNkHfwuxLKdM,1a +2024-05-01,OxuATCLAUIhHzr,gSxn2ct +2024-05-22,P4qe,y9htWZhyjW ``` Looks like we have some data now. But we can do better and add some enhancements to it. ### Custom metadata -We can see from the data generated, that it isn't quite what we want. The metadata is not sufficient for us to produce -production-like data yet. Let's try to add some enhancements to it. +We can see from the data generated, that it isn't quite what we want. Sometimes, the metadata is not sufficient for us +to produce production-like data yet, and we want to manually edit it. Let's try to add some enhancements to it. -Let's make the `platform` field a choice field that can only be a set of certain values and the nested -field `customer.sex` is also from a predefined set of values. +Let's make the `rcvr_id` field follow the regex `RC[0-9]{8}` and the field `rcvr_cntry_code` should only be one of +either `AU, US or TW`. For the full guide on data generation options, +[check the following page](../../scenario/data-generation.md). === "Java" ```java - var jsonTask = json("my_json", "/opt/app/data/json", Map.of("saveMode", "overwrite")) - .schema( - metadata... - )) + var accountTask = csv("my_csv", "/opt/app/data/account-odcs", Map.of("header", "true")) + .schema(metadata...) 
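+            // the second .schema(...) below layers the manual rcvr_id and rcvr_cntry_code settings on top of the ODCS metadata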
.schema( - field().name("platform").oneOf("website", "mobile"), - field().name("customer").schema(field().name("sex").oneOf("M", "F", "O")) + field().name("rcvr_id").regex("RC[0-9]{8}"), + field().name("rcvr_cntry_code").oneOf("AU", "US", "TW") ) - .count(count().records(10)); + .count(count().records(100)); ``` === "Scala" ```scala - val jsonTask = json("my_json", "/opt/app/data/json", Map("saveMode" -> "overwrite")) + val accountTask = csv("customer_accounts", "/opt/app/data/customer/account-odcs", Map("header" -> "true")) + .schema(metadata...) .schema( - metadata... - )) - .schema( - field.name("platform").oneOf("website", "mobile"), - field.name("customer").schema(field.name("sex").oneOf("M", "F", "O")) + field.name("rcvr_id").regex("RC[0-9]{8}"), + field.name("rcvr_cntry_code").oneOf("AU", "US", "TW") ) - .count(count.records(10)) + .count(count.records(100)) ``` +=== "YAML" + + In `docker/data/custom/task/file/csv/csv-odcs-account-task.yaml`: + ``` + name: "csv_account_file" + steps: + - name: "accounts" + type: "csv" + options: + path: "/opt/app/data/csv/account-odcs" + count: + records: 100 + schema: + fields: + - name: "rcvr_id" + options: + regex: "RC[0-9]{8}" + - name: "rcvr_cntry_code" + options: + oneOf: + - "AU" + - "US" + - "TW" + ``` + +=== "UI" + + 1. Click on `Generation` and tick the `Manual` checkbox + 1. Click on `+ Field` + 1. Go to `rcvr_id` field + 1. Click on `+` dropdown next to `string` data type + 1. Click `Regex` and enter `RC[0-9]{8}` + 1. Click on `+ Field` + 1. Go to `rcvr_cntry_code` field + 1. Click on `+` dropdown next to `string` data type + 1. Click `One Of` and enter `AU,US,TW` + Let's test it out by running it again -```shell -./run.sh -#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun -cat docker/sample/json/part-00000-* -``` +=== "Java" + + ```shell + ./run.sh MyAdvancedODCSJavaPlanRun + head docker/sample/account-odcs/part-00000-* + ``` + +=== "Scala" + + ```shell + ./run.sh MyAdvancedODCSPlanRun + head docker/sample/account-odcs/part-00000-* + ``` + +=== "YAML" + + ```shell + ./run.sh my-odcs.yaml + head docker/sample/account-odcs/part-00000-* + ``` + +=== "UI" + + 1. Click on `Execute` at the top + ```shell + head /tmp/data-caterer/customer/account-odcs/part-00000* + ``` -```json -{ - "comments": "vqbPUm", - "creditcard": "6304867705548636", - "membership": "GZ1xOnpZSUOKN", - "orders": [ - { - "product_id": "rgOokDAv", - "price": 77367, - "onsale": false, - "tax": 61742, - "weight": 87855, - "others": 26857, - "vendor": "04XHR64ImMr9T" - } - ], - "platform": "mobile", - "preference": { - "key": "IB5vNdWka", - "value": true - }, - "shipping_address": [ - { - "name": "Isiah Bins", - "street_address": "36512 Ross Spurs, Hillhaven, IA 18760", - "city": "Averymouth", - "postcode": "75818" - }, - { - "name": "Scott Prohaska", - "street_address": "26573 Haley Ports, Dariusland, MS 90642", - "city": "Ashantimouth", - "postcode": "31792" - }, - { - "name": "Rudolf Stamm", - "street_address": "Suite 878 0516 Danica Path, New Christiaport, ID 10525", - "city": "Doreathaport", - "postcode": "62497" - } - ], - "shipping_date": "2023-08-24", - "transaction_date": "2023-02-01", - "customer": { - "username": "jolie.cremin", - "name": "Fay Klein", - "sex": "O", - "address": "Apt. 
174 5084 Volkman Creek, Hillborough, PA 61959", - "mail": "BiTmzb7", - "birthdate": "2023-04-07" - } -} +``` +txn_ref_dt,rcvr_id,rcvr_cntry_code +2024-02-15,RC02579393,US +2023-08-18,RC14320425,AU +2023-07-07,RC17915355,TW +2024-06-07,RC47347046,TW ``` -Great! Now we have the ability to get schema information from an external source, add our own metadata and generate +Great! Now we have the ability to get schema information from an external source, add our own metadata and generate data. ### Data validation -Another aspect of Open Data Contract Standard (ODCS) that can be leveraged is the definition of data quality rules. These rules can be -incorporated into your Data Caterer job as well by enabling data validations via `enableGenerateValidations` in -`configuration`. +[To find out what data validation options are available, check this link.](../../../validation.md) + +Another aspect of Open Data Contract Standard (ODCS) that can be leveraged is the definition of data quality rules. +Once the latest version of ODCS is released (version 3.x), there should be a vendor neutral definition of data quality +rules that Data Caterer can use. Once available, it will be as easy as enabling data validations +via `enableGenerateValidations` in `configuration`. === "Java" @@ -338,7 +338,7 @@ incorporated into your Data Caterer job as well by enabling data validations via .enableGenerateValidations(true) .generatedReportsFolderPath("/opt/app/data/report"); - execute(conf, jsonTask); + execute(conf, accountTask); ``` === "Scala" @@ -348,7 +348,7 @@ incorporated into your Data Caterer job as well by enabling data validations via .enableGenerateValidations(true) .generatedReportsFolderPath("/opt/app/data/report") - execute(conf, jsonTask) + execute(conf, accountTask) ``` Check out the full example under `AdvancedODCSSourcePlanRun` in the example repo. diff --git a/docs/setup/guide/scenario/data-generation.md b/docs/setup/guide/scenario/data-generation.md index 6108b51e..e5ae6050 100644 --- a/docs/setup/guide/scenario/data-generation.md +++ b/docs/setup/guide/scenario/data-generation.md @@ -45,10 +45,11 @@ First, we will clone the data-caterer-example repo which will already have the b ### Plan Setup -Create a new Java or Scala class. +Create a new Java or Scala class or plan YAML. - Java: `src/main/java/io/github/datacatering/plan/MyCsvPlan.java` - Scala: `src/main/scala/io/github/datacatering/plan/MyCsvPlan.scala` +- YAML: `docker/data/customer/plan/my-csv.yaml` Make sure your class extends `PlanRun`. @@ -70,6 +71,22 @@ Make sure your class extends `PlanRun`. } ``` +=== "YAML" + + In `docker/data/custom/plan/my-csv.yaml`: + ```yaml + name: "my_csv_plan" + description: "Create account data in CSV file" + tasks: + - name: "csv_account_file" + dataSourceName: "customer_accounts" + enabled: true + ``` + +=== "UI" + + Go to next section. + This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use. @@ -102,6 +119,26 @@ high level configurations. [Other additional options for CSV can be found here](https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option) +=== "YAML" + + In `application.conf`: + ``` + csv { + customer_accounts { + path = "/opt/app/data/customer/account" + path = ${?CSV_PATH} + header = "true" + } + } + ``` + +=== "UI" + + 1. Go to `Connection` tab in the top bar + 2. Select data source as `CSV` + 1. Enter in data source name `customer_accounts` + 3. 
Enter path as `/tmp/data-caterer/customer/account` + #### Schema Our CSV file that we generate should adhere to a defined schema where we can also define data types. @@ -137,6 +174,43 @@ data type defined. This is because the default data type is `StringType`. ) ``` +=== "YAML" + + In `docker/data/custom/task/file/csv/csv-account-task.yaml`: + ```yaml + name: "csv_account_file" + steps: + - name: "accounts" + type: "csv" + options: + path: "/opt/app/custom/csv/transactions" + schema: + fields: + - name: "account_id" + - name: "balance" + type: "double" + - name: "created_by" + - name: "name" + - name: "open_time" + type: "timestamp" + - name: "status" + ``` + +=== "UI" + + 1. Go to `Home` tab in the top bar + 2. Enter `my-csv` as the `Plan name` + 3. Under `Tasks`, enter `csv-account-task` as `Task name` and select data source as `customer_accounts` + 4. Click on `Generation` and tick the `Manual` checkbox + 5. Click on `+ Field` + 1. Add field `account_id` with type `string` + 1. Add field `balance` with type `double` + 1. Add field `created_by` with type `string` + 1. Add field `name` with type `string` + 1. Add field `open_time` with type `timestamp` + 1. Add field `status` with type `string` + + #### Field Metadata We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data @@ -161,6 +235,23 @@ attributes that add guidelines that the data generator will understand when gene field.name("account_id").regex("ACC[0-9]{8}").unique(true), ``` +=== "YAML" + + ```yaml + fields: + - name: "account_id" + options: + regex: "ACC[0-9]{8}" + unique: true + ``` + +=== "UI" + + 1. Go to `account_id` field + 2. Click on `+` dropdown next to `string` data type + 3. Click `Regex` and enter `ACC[0-9]{8}` + 4. Click `Unique` and select `true` + ##### balance - `balance` let's make the numbers not too large, so we can define a min and max for the generated numbers to be between @@ -178,6 +269,24 @@ attributes that add guidelines that the data generator will understand when gene field.name("balance").`type`(DoubleType).min(1).max(1000), ``` +=== "YAML" + + ```yaml + fields: + - name: "balance" + type: "double" + options: + min: 1 + max: 1000 + ``` + +=== "UI" + + 1. Go to `balance` field + 2. Click on `+` dropdown next to `double` data type + 3. Click `Min` and enter `1` + 4. Click `Max` and enter `1000` + ##### name - `name` is a string that also follows a certain pattern, so we could also define a regex but here we will choose to @@ -197,6 +306,21 @@ attributes that add guidelines that the data generator will understand when gene field.name("name").expression("#{Name.name}"), ``` +=== "YAML" + + ```yaml + fields: + - name: "name" + options: + expression: "#{Name.name}" + ``` + +=== "UI" + + 1. Go to `name` field + 2. Click on `+` dropdown next to `string` data type + 3. Click `Faker Expression` and enter `#{Name.name}` + ##### open_time - `open_time` is a timestamp that we want to have a value greater than a specific date. We can define a min date by @@ -215,6 +339,22 @@ attributes that add guidelines that the data generator will understand when gene field.name("open_time").`type`(TimestampType).min(java.sql.Date.valueOf("2022-01-01")), ``` +=== "YAML" + + ```yaml + fields: + - name: "open_time" + type: "timestamp" + options: + min: "2022-01-01" + ``` + +=== "UI" + + 1. Go to `open_time` field + 2. Click on `+` dropdown next to `timestamp` data type + 3. 
Click `Min` and enter `2022-01-01` + ##### status - `status` is a field that can only obtain one of four values, `open, closed, suspended or pending`. @@ -231,6 +371,25 @@ attributes that add guidelines that the data generator will understand when gene field.name("status").oneOf("open", "closed", "suspended", "pending") ``` +=== "YAML" + + ```yaml + fields: + - name: "status" + options: + oneOf: + - "open" + - "closed" + - "suspended" + - "pending" + ``` + +=== "UI" + + 1. Go to `status` field + 2. Click on `+` dropdown next to `string` data type + 3. Click `One Of` and enter `open,closed,suspended,pending` + ##### created_by - `created_by` is a field that is based on the `status` field where it follows the @@ -249,7 +408,22 @@ attributes that add guidelines that the data generator will understand when gene field.name("created_by").sql("CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END"), ``` -Putting it all the fields together, our class should now look like this. +=== "YAML" + + ```yaml + fields: + - name: "created_by" + options: + sql: "CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END" + ``` + +=== "UI" + + 1. Go to `created_by` field + 2. Click on `+` dropdown next to `string` data type + 3. Click `SQL` and enter `CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END` + +Putting it all the fields together, our structure should now look like this. === "Java" @@ -279,6 +453,54 @@ Putting it all the fields together, our class should now look like this. ) ``` +=== "YAML" + + In `docker/data/custom/task/file/csv/csv-account-task.yaml`: + ```yaml + name: "csv_account_file" + steps: + - name: "accounts" + type: "csv" + options: + path: "/opt/app/custom/csv/account" + count: + records: 100 + schema: + fields: + - name: "account_id" + generator: + type: "regex" + options: + regex: "ACC1[0-9]{9}" + unique: true + - name: "balance" + type: "double" + options: + min: 1 + max: 1000 + - name: "created_by" + options: + sql: "CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END" + - name: "name" + options: + expression: "#{Name.name}" + - name: "open_time" + type: "timestamp" + options: + min: "2022-01-01" + - name: "status" + options: + oneOf: + - "open" + - "closed" + - "suspended" + - "pending" + ``` + +=== "UI" + + Open `Task` and `Generation` to see all the fields. + #### Record Count We only want to generate 100 records, so that we can see what the output looks like. This is controlled at the @@ -304,6 +526,28 @@ We only want to generate 100 records, so that we can see what the output looks l .count(count.records(100)) ``` +=== "YAML" + + In `docker/data/custom/task/file/csv/csv-account-task.yaml`: + ```yaml + name: "csv_account_file" + steps: + - name: "accounts" + type: "csv" + options: + path: "/opt/app/custom/csv/transactions" + count: + records: 100 + schema: + fields: + ... + ``` + +=== "UI" + + 1. Under task `customer_accounts`, click on `Generation` + 2. Under title `Record Count`, set `Records` to `100` + #### Additional Configurations At the end of data generation, a report gets generated that summarises the actions it performed. We can control the @@ -326,6 +570,24 @@ have unique values generated. .enableUniqueCheck(true) ``` +=== "YAML" + + In `application.conf`: + ``` + flags { + enableUniqueCheck = true + } + folders { + generatedReportsFolderPath = "/opt/app/data/report" + } + ``` + +=== "UI" + + 1. Click on `Advanced Configuration` towards the bottom of the screen + 2. Click on `Flag` and click on `Unique Check` + 3. 
Click on `Folder` and enter `/tmp/data-caterer/report` for `Generated Reports Folder Path` + #### Execute To tell Data Caterer that we want to run with the configurations along with the `accountTask`, we have to call `execute` @@ -377,18 +639,46 @@ To tell Data Caterer that we want to run with the configurations along with the } ``` +=== "YAML" + + Plan and task file should be ready. + +=== "UI" + + 1. Click `Save` at the top + ### Run Now we can run via the script `./run.sh` that is in the top level directory of the `data-caterer-example` to run the -class we just -created. +class we just created. -```shell -./run.sh -#input class MyCsvJavaPlan or MyCsvPlan -#after completing -head docker/sample/customer/account/part-00000* -``` +=== "Java" + + ```shell + ./run.sh MyCsvJavaPlan + head docker/sample/customer/account/part-00000* + ``` + +=== "Scala" + + ```shell + ./run.sh MyCsvPlan + head docker/sample/customer/account/part-00000* + ``` + +=== "YAML" + + ```shell + ./run.sh my-csv.yaml + head docker/sample/customer/account/part-00000* + ``` + +=== "UI" + + 1. Click on `Execute` at the top + ```shell + head /tmp/data-caterer/customer/account/part-00000* + ``` Your output should look like this. @@ -443,6 +733,56 @@ We can define our schema the same way along with any additional metadata. ) ``` + +=== "YAML" + + In `docker/data/custom/task/file/csv/csv-account-task.yaml`: + ```yaml + name: "csv_account_file" + steps: + - name: "accounts" + type: "csv" + options: + path: "/opt/app/custom/csv/account" + ... + - name: "transactions" + type: "csv" + options: + path: "/opt/app/custom/csv/transactions" + schema: + fields: + - name: "account_id" + - name: "full_name" + - name: "amount" + type: "double" + options: + min: 1 + max: 100 + - name: "time" + type: "timestamp" + options: + min: "2022-01-01" + - name: "date" + type: "date" + options: + sql: "DATE(time)" + ``` + +=== "UI" + + 1. Go to `Connection` tab and add new `CSV` data source with path `/tmp/data-caterer/customer/transactions` + 2. Go to `Plan` tab and click on `Edit` for `my-csv` + 3. Click on `+ Task` towards the top + 4. Under the new task, enter `csv-transaction-task` as `Task name` and select data source as `customer_accounts` + 5. Click on `Generation` and tick the `Manual` checkbox + 6. Click on `+ Field` + 1. Add field `account_id` with type `string` + 1. Add field `balance` with type `double` + 1. Add field `created_by` with type `string` + 1. Add field `name` with type `string` + 1. Add field `open_time` with type `timestamp` + 1. Add field `status` with type `string` + #### Records Per Column Usually, for a given `account_id, full_name`, there should be multiple records for it as we want to simulate a customer @@ -453,9 +793,7 @@ function. ```java var transactionTask = csv("customer_transactions", "/opt/app/data/customer/transaction", Map.of("header", "true")) - .schema( - ... - ) + .schema(...) .count(count().recordsPerColumn(5, "account_id", "full_name")); ``` @@ -463,12 +801,38 @@ function. ```scala val transactionTask = csv("customer_transactions", "/opt/app/data/customer/transaction", Map("header" -> "true")) - .schema( - ... - ) + .schema(...) .count(count.recordsPerColumn(5, "account_id", "full_name")) ``` +=== "YAML" + + In `docker/data/custom/task/file/csv/csv-account-task.yaml`: + ``` + name: "csv_account_file" + steps: + - name: "accounts" + ... 
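+        # accounts step unchanged from earlier in this guide; the transactions step below adds the per-column count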
+ - name: "transactions" + type: "csv" + options: + path: "/opt/app/custom/csv/transactions" + count: + records: 100 + perColumn: + columnNames: + - "account_id" + - "name" + count: 5 + ``` + +=== "UI" + + 1. Under title `Record count`, click on `Advanced` + 2. Enter `account_id,name` in `Column(s)` + 3. Click on `Per unique set of values` checkbox + 4. Set `Records` to `5` + ##### Random Records Per Column Above, you will notice that we are generating 5 records per `account_id, full_name`. This is okay but still not quite @@ -495,6 +859,38 @@ can accommodate for this via defining a random number of records per column. .count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id", "full_name")) ``` +=== "YAML" + + In `docker/data/custom/task/file/csv/csv-account-task.yaml`: + ``` + name: "csv_account_file" + steps: + - name: "accounts" + ... + - name: "transactions" + type: "csv" + options: + path: "/opt/app/custom/csv/transactions" + count: + records: 100 + perColumn: + columnNames: + - "account_id" + - "name" + generator: + type: "random" + options: + min: 0 + max: 5 + ``` + +=== "UI" + + 1. Under title `Record count`, click on `Advanced` + 2. Enter `account_id,name` in `Column(s)` + 3. Click on `Per unique set of values between` checkbox + 4. Set `Min` to `0` and `Max to `5` + Here we set the minimum number of records per column to be 0 and the maximum to 5. #### Foreign Key @@ -521,6 +917,31 @@ below. ) ``` +=== "YAML" + + In `docker/data/custom/plan/my-csv.yaml`: + ```yaml + name: "my_csv_plan" + description: "Create account data in CSV file" + tasks: + - name: "csv_account_file" + dataSourceName: "customer_accounts" + enabled: true + + sinkOptions: + foreignKeys: + - - "customer_accounts.accounts.account_id,name" + - - "customer_accounts.transactions.account_id,full_name" + - [] + ``` + +=== "UI" + + 1. Click `Relationships` and then click `+ Relationship` + 2. Select `csv-account-task` and enter `account_id,name` in `Column(s)` + 3. Open `Generation` and click `+ Link` + 4. Select `csv-transaction-task` and enter `account_id,full_name` in `Column(s)` + Now, stitching it all together for the `execute` function, our final plan should look like this. === "Java" @@ -602,19 +1023,57 @@ Now, stitching it all together for the `execute` function, our final plan should } ``` -Let's try run again. +=== "YAML" + + Check content of `docker/data/custom/plan/my-csv.yaml` and `docker/data/custom/task/file/csv/csv-account-task.yaml`. + +=== "UI" + + Open UI dropdowns to see all details. + +Let's clean up the old data and try run again. 
```shell #clean up old data rm -rf docker/sample/customer/account -./run.sh -#input class MyCsvJavaPlan or MyCsvPlan -#after completing, let's pick an account and check the transactions for that account -account=$(tail -1 docker/sample/customer/account/part-00000* | awk -F "," '{print $1 "," $4}') -echo $account -cat docker/sample/customer/transaction/part-00000* | grep $account ``` +=== "Java" + + ```shell + ./run.sh MyCsvJavaPlan + account=$(tail -1 docker/sample/customer/account/part-00000* | awk -F "," '{print $1 "," $4}') + echo $account + cat docker/sample/customer/transaction/part-00000* | grep $account + ``` + +=== "Scala" + + ```shell + ./run.sh MyCsvPlan + account=$(tail -1 docker/sample/customer/account/part-00000* | awk -F "," '{print $1 "," $4}') + echo $account + cat docker/sample/customer/transaction/part-00000* | grep $account + ``` + +=== "YAML" + + ```shell + ./run.sh my-csv.yaml + account=$(tail -1 docker/sample/customer/account/part-00000* | awk -F "," '{print $1 "," $4}') + echo $account + cat docker/sample/customer/transaction/part-00000* | grep $account + ``` + +=== "UI" + + 1. Click on `Execute` at the top + ```shell + account=$(tail -1 /tmp/data-caterer/customer/account/part-00000* | awk -F "," '{print $1 "," $4}') + echo $account + cat /tmp/data-caterer/customer/transaction/part-00000* | grep $account + ``` + It should look something like this. ```shell
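# expected: the sampled account_id,name pair printed by echo, followed by its matching transaction rows from grep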