Add in guide for postgres and mysql, move data generation guide to separate doc, add option of how to run for UI
pflooky committed Jun 11, 2024
1 parent a5e40f7 commit 484362e
Showing 26 changed files with 2,110 additions and 1,909 deletions.
82 changes: 69 additions & 13 deletions docs/setup/connection.md
@@ -22,12 +22,13 @@ These configurations can be done via API or from configuration. Examples of both
| Database | Postgres | :white_check_mark: | :white_check_mark: |
| Database | Elasticsearch | :octicons-x-circle-fill-12:{ .red-cross } | :white_check_mark: |
| Database | MongoDB | :octicons-x-circle-fill-12:{ .red-cross } | :white_check_mark: |
| Database | Opensearch | :octicons-x-circle-fill-12:{ .red-cross } | :white_check_mark: |
| File | CSV | :white_check_mark: | :white_check_mark: |
| File | Delta Lake | :white_check_mark: | :white_check_mark: |
| File | Iceberg | :white_check_mark: | :white_check_mark: |
| File | JSON | :white_check_mark: | :white_check_mark: |
| File | ORC | :white_check_mark: | :white_check_mark: |
| File | Parquet | :white_check_mark: | :white_check_mark: |
| File | Delta Lake | :octicons-x-circle-fill-12:{ .red-cross } | :white_check_mark: |
| File | Hudi | :octicons-x-circle-fill-12:{ .red-cross } | :white_check_mark: |
| HTTP | REST API | :white_check_mark: | :octicons-x-circle-fill-12:{ .red-cross } |
| Messaging | Kafka | :white_check_mark: | :octicons-x-circle-fill-12:{ .red-cross } |
@@ -97,8 +98,9 @@ configurations can be found below.
csv("customer_transactions", "/data/customer/transaction")
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
csv {
customer_transactions {
@@ -124,8 +126,9 @@ configurations can be found below.
json("customer_transactions", "/data/customer/transaction")
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
json {
customer_transactions {
@@ -151,8 +154,9 @@ configurations can be found below.
orc("customer_transactions", "/data/customer/transaction")
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
orc {
customer_transactions {
@@ -178,8 +182,9 @@ configurations can be found below.
parquet("customer_transactions", "/data/customer/transaction")
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
parquet {
customer_transactions {
@@ -191,7 +196,7 @@

[Other available configuration for Parquet can be found here](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option)
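
As a hedged illustration only (this is plain Spark, not Data Caterer's API), the snippet below shows one of the documented Parquet data source options, `compression`, being applied in Scala; the application name and output path are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: demonstrates a documented Spark Parquet data source option in isolation.
// The option key "compression" and value "gzip" come from the Spark page linked above;
// the app name and output path are placeholders for this example.
val spark = SparkSession.builder()
  .appName("parquet-option-example")
  .master("local[*]")
  .getOrCreate()

spark.range(5).write
  .option("compression", "gzip") // documented Parquet data source option
  .mode("overwrite")
  .parquet("/tmp/parquet-option-example")

spark.stop()
```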

#### Delta (not supported yet)
#### Delta

=== "Java"

@@ -205,8 +210,9 @@ configurations can be found below.
delta("customer_transactions", "/data/customer/transaction")
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
delta {
customer_transactions {
@@ -216,6 +222,50 @@ configurations can be found below.
}
```

#### Iceberg

=== "Java"

```java
iceberg(
"customer_accounts", //name
"account.accounts", //table name
"/opt/app/data/customer/iceberg", //warehouse path
"hadoop", //catalog type
"", //catalogUri
Map.of() //additional options
);
```

=== "Scala"

```scala
iceberg(
"customer_accounts", //name
"account.accounts", //table name
"/opt/app/data/customer/iceberg", //warehouse path
"hadoop", //catalog type
"", //catalogUri
Map() //additional options
)
```

=== "YAML"

In `application.conf`:
```
iceberg {
customer_transactions {
path = "/opt/app/data/customer/iceberg"
path = ${?ICEBERG_WAREHOUSE_PATH}
catalogType = "hadoop"
catalogType = ${?ICEBERG_CATALOG_TYPE}
catalogUri = ""
catalogUri = ${?ICEBERG_CATALOG_URI}
}
}
```
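
The Scala builder above uses a Hadoop catalog with an empty catalog URI. As a hedged sketch (the values below are placeholders and are not from the original docs), the same signature could point at a Hive-backed catalog instead:

```scala
// Sketch only: reuses the iceberg(...) signature shown above with assumed values.
// "hive" as a catalog type and the thrift URI follow common Iceberg conventions;
// confirm support and adjust the values for your environment.
val icebergHiveTask = iceberg(
  "customer_accounts",                     // name
  "account.accounts",                      // table name
  "/opt/app/data/customer/iceberg",        // warehouse path
  "hive",                                  // catalog type (assumed)
  "thrift://localhost:9083",               // catalog URI (placeholder)
  Map("write.format.default" -> "parquet") // additional options (placeholder Iceberg table property)
)
```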

### RDBMS

Follows the same configuration used by Spark as
@@ -244,8 +294,9 @@ Sample can be found below
)
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
jdbc {
customer_postgres {
@@ -310,8 +361,9 @@ Following permissions are required when generating plan and tasks:
)
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
jdbc {
customer_mysql {
@@ -364,8 +416,9 @@ found [**here**](https://github.com/datastax/spark-cassandra-connector/blob/mast
)
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
org.apache.spark.sql.cassandra {
customer_cassandra {
@@ -426,8 +479,9 @@ found [**here**](https://spark.apache.org/docs/latest/structured-streaming-kafka
)
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
kafka {
customer_kafka {
@@ -476,8 +530,9 @@ via JNDI otherwise a connection cannot be created.
)
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
jms {
customer_solace {
@@ -520,8 +575,9 @@ The url is defined in the tasks to allow for generated data to be populated in t
)
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
http {
customer_api {
169 changes: 26 additions & 143 deletions docs/setup/guide/data-source/database/cassandra.md
@@ -21,9 +21,27 @@ for the tables you configure.

First, we will clone the data-caterer-example repo which will already have the base project setup required.

```shell
git clone git@github.com:data-catering/data-caterer-example.git
```
=== "Java"

```shell
git clone git@github.com:data-catering/data-caterer-example.git
```

=== "Scala"

```shell
git clone git@github.com:data-catering/data-caterer-example.git
```

=== "YAML"

```shell
git clone git@github.com:data-catering/data-caterer-example.git
```

=== "UI"

[Run Data Caterer UI via the 'Quick Start' found here.](../../../../get-started/quick-start.md)

If you already have a Cassandra instance running, you can skip to [this step](#plan-setup).

@@ -123,7 +141,7 @@ Within our class, we can start by defining the connection properties to connect

Let's create a task for inserting data into the `account.accounts` and `account.account_status_history` tables as
defined under `docker/data/cql/customer.cql`. These tables should already be set up for you if you followed this
[step](#cassandra-setup). We can check if the table is setup already via the following command:
[step](#cassandra-setup). We can check if the table is set up already via the following command:

```shell
docker exec docker-cassandraserver-1 cqlsh -e 'describe account.accounts; describe account.account_status_history;'
@@ -190,146 +208,11 @@ corresponds to `text` in Cassandra.
)
```
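
As a hedged aside (not part of the original guide): string fields such as `name` map to Cassandra's `text` type. By analogy with the `DoubleType` and `TimestampType` usages below, an explicit string type could be declared as follows, assuming a `StringType` is available in the API:

```scala
// Sketch only: explicit string typing, assumed by analogy with the typed fields below.
field.name("name").`type`(StringType)
```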

#### Field Metadata

We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data
that is closer to the structure of the data that would come in production? We can do this by defining various metadata
that guides the data generator when generating data.

##### account_id

`account_id` follows a particular pattern where it starts with `ACC` and has 8 digits after it.
This can be defined via a regex like below. Alongside, we also mark it as the primary key to ensure that
unique values are generated.

=== "Java"

```java
field().name("account_id").regex("ACC[0-9]{8}").primaryKey(true),
```

=== "Scala"

```scala
field.name("account_id").regex("ACC[0-9]{8}").primaryKey(true),
```

##### amount

For `amount`, the numbers shouldn't be too large, so we can define a min and max for the generated numbers to be between
`1` and `1000`.

=== "Java"

```java
field().name("amount").type(DoubleType.instance()).min(1).max(1000),
```

=== "Scala"

```scala
field.name("amount").`type`(DoubleType).min(1).max(1000),
```

##### name

`name` is a string that also follows a certain pattern, so we could also define a regex but here we will choose to
leverage the DataFaker library and create an `expression` to generate real-looking names. All possible faker expressions
can be found [**here**](../../../../sample/datafaker/expressions.txt).

=== "Java"

```java
field().name("name").expression("#{Name.name}"),
```

=== "Scala"

```scala
field.name("name").expression("#{Name.name}"),
```
Depending on how you want to define the schema, follow one of the options below:

##### open_time

`open_time` is a timestamp that we want to have a value greater than a specific date. We can define a min date by using
`java.sql.Date` like below.

=== "Java"

```java
field().name("open_time").type(TimestampType.instance()).min(java.sql.Date.valueOf("2022-01-01")),
```

=== "Scala"

```scala
field.name("open_time").`type`(TimestampType).min(java.sql.Date.valueOf("2022-01-01")),
```

##### status

`status` is a field that can only take one of four values: `open, closed, suspended or pending`.

=== "Java"

```java
field().name("status").oneOf("open", "closed", "suspended", "pending")
```

=== "Scala"

```scala
field.name("status").oneOf("open", "closed", "suspended", "pending")
```

##### created_by

`created_by` is a field that is based on the `status` field, following the logic: `if status is open or closed, then
created_by is eod, else created_by is event`. This can be achieved by defining a SQL expression like below.

=== "Java"

```java
field().name("created_by").sql("CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END"),
```

=== "Scala"

```scala
field.name("created_by").sql("CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END"),
```

Putting all the fields together, our class should now look like this.

=== "Java"

```java
var accountTask = cassandra("customer_cassandra", "host.docker.internal:9042")
.table("account", "accounts")
.schema(
field().name("account_id").regex("ACC[0-9]{8}").primaryKey(true),
field().name("amount").type(DoubleType.instance()).min(1).max(1000),
field().name("created_by").sql("CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END"),
field().name("name").expression("#{Name.name}"),
field().name("open_time").type(TimestampType.instance()).min(java.sql.Date.valueOf("2022-01-01")),
field().name("status").oneOf("open", "closed", "suspended", "pending")
);
```

=== "Scala"

```scala
val accountTask = cassandra("customer_cassandra", "host.docker.internal:9042")
.table("account", "accounts")
.schema(
field.name("account_id").regex("ACC[0-9]{8}").primaryKey(true),
field.name("amount").`type`(DoubleType).min(1).max(1000),
field.name("created_by").sql("CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END"),
field.name("name").expression("#{Name.name}"),
field.name("open_time").`type`(TimestampType).min(java.sql.Date.valueOf("2022-01-01")),
field.name("status").oneOf("open", "closed", "suspended", "pending")
)
```
- [Manual schema guide](../../scenario/data-generation.md)
- Automatically detect the schema from the data source by enabling `configuration.enableGeneratePlanAndTasks(true)` (see the sketch after this list)
- [Automatically detect schema from a metadata source](../../index.md#metadata)
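
For the second option, a minimal sketch (building only on the API already shown in this guide; any extra wiring is covered by the linked guides) could look like:

```scala
// Sketch only: enable metadata-driven generation so the schema is detected from
// the Cassandra tables instead of being hand-written via .schema(...).
val conf = configuration
  .enableGeneratePlanAndTasks(true) // detect schema and generate tasks from the data source

// The connection and table definition stay the same as earlier in this guide.
val accountTask = cassandra("customer_cassandra", "host.docker.internal:9042")
  .table("account", "accounts")
```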

#### Additional Configurations
