Add in guide for postgres and mysql, move data generation guide to separate doc, add option of how to run for UI
pflooky committed Jun 11, 2024
1 parent a5e40f7 commit 484362e
Showing 26 changed files with 2,110 additions and 1,909 deletions.
82 changes: 69 additions & 13 deletions docs/setup/connection.md
@@ -22,12 +22,13 @@ These configurations can be done via API or from configuration. Examples of both
| Database | Postgres | :white_check_mark: | :white_check_mark: |
| Database | Elasticsearch | :octicons-x-circle-fill-12:{ .red-cross } | :white_check_mark: |
| Database | MongoDB | :octicons-x-circle-fill-12:{ .red-cross } | :white_check_mark: |
| Database | Opensearch | :octicons-x-circle-fill-12:{ .red-cross } | :white_check_mark: |
| File | CSV | :white_check_mark: | :white_check_mark: |
| File | Delta Lake | :white_check_mark: | :white_check_mark: |
| File | Iceberg | :white_check_mark: | :white_check_mark: |
| File | JSON | :white_check_mark: | :white_check_mark: |
| File | ORC | :white_check_mark: | :white_check_mark: |
| File | Parquet | :white_check_mark: | :white_check_mark: |
| File | Delta Lake | :octicons-x-circle-fill-12:{ .red-cross } | :white_check_mark: |
| File | Hudi | :octicons-x-circle-fill-12:{ .red-cross } | :white_check_mark: |
| HTTP | REST API | :white_check_mark: | :octicons-x-circle-fill-12:{ .red-cross } |
| Messaging | Kafka | :white_check_mark: | :octicons-x-circle-fill-12:{ .red-cross } |
@@ -97,8 +98,9 @@ configurations can be found below.
csv("customer_transactions", "/data/customer/transaction")
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
csv {
customer_transactions {
@@ -124,8 +126,9 @@ configurations can be found below.
json("customer_transactions", "/data/customer/transaction")
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
json {
customer_transactions {
@@ -151,8 +154,9 @@ configurations can be found below.
orc("customer_transactions", "/data/customer/transaction")
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
orc {
customer_transactions {
@@ -178,8 +182,9 @@ configurations can be found below.
parquet("customer_transactions", "/data/customer/transaction")
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
parquet {
customer_transactions {
@@ -191,7 +196,7 @@

[Other available configuration for Parquet can be found here](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option)
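
As a hedged illustration only (this is plain Spark, not Data Caterer's API), the snippet below shows one of the documented Parquet data source options, `compression`, being applied in Scala; the application name and output path are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: demonstrates a documented Spark Parquet data source option in isolation.
// The option key "compression" and value "gzip" come from the Spark page linked above;
// the app name and output path are placeholders for this example.
val spark = SparkSession.builder()
  .appName("parquet-option-example")
  .master("local[*]")
  .getOrCreate()

spark.range(5).write
  .option("compression", "gzip") // documented Parquet data source option
  .mode("overwrite")
  .parquet("/tmp/parquet-option-example")

spark.stop()
```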

#### Delta (not supported yet)
#### Delta

=== "Java"

@@ -205,8 +210,9 @@ configurations can be found below.
delta("customer_transactions", "/data/customer/transaction")
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
delta {
customer_transactions {
@@ -216,6 +222,50 @@ configurations can be found below.
}
```

#### Iceberg

=== "Java"

```java
iceberg(
"customer_accounts", //name
"account.accounts", //table name
"/opt/app/data/customer/iceberg", //warehouse path
"hadoop", //catalog type
"", //catalogUri
Map.of() //additional options
);
```

=== "Scala"

```scala
iceberg(
"customer_accounts", //name
"account.accounts", //table name
"/opt/app/data/customer/iceberg", //warehouse path
"hadoop", //catalog type
"", //catalogUri
Map() //additional options
)
```

=== "YAML"

In `application.conf`:
```
iceberg {
customer_transactions {
path = "/opt/app/data/customer/iceberg"
path = ${?ICEBERG_WAREHOUSE_PATH}
catalogType = "hadoop"
catalogType = ${?ICEBERG_CATALOG_TYPE}
catalogUri = ""
catalogUri = ${?ICEBERG_CATALOG_URI}
}
}
```
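
The Scala builder above uses a Hadoop catalog with an empty catalog URI. As a hedged sketch (the values below are placeholders and are not from the original docs), the same signature could point at a Hive-backed catalog instead:

```scala
// Sketch only: reuses the iceberg(...) signature shown above with assumed values.
// "hive" as a catalog type and the thrift URI follow common Iceberg conventions;
// confirm support and adjust the values for your environment.
val icebergHiveTask = iceberg(
  "customer_accounts",                     // name
  "account.accounts",                      // table name
  "/opt/app/data/customer/iceberg",        // warehouse path
  "hive",                                  // catalog type (assumed)
  "thrift://localhost:9083",               // catalog URI (placeholder)
  Map("write.format.default" -> "parquet") // additional options (placeholder Iceberg table property)
)
```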

### RDBMS

Follows the same configuration used by Spark as
@@ -244,8 +294,9 @@ Sample can be found below
)
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
jdbc {
customer_postgres {
@@ -310,8 +361,9 @@ Following permissions are required when generating plan and tasks:
)
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
jdbc {
customer_mysql {
@@ -364,8 +416,9 @@ found [**here**](https://github.com/datastax/spark-cassandra-connector/blob/mast
)
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
org.apache.spark.sql.cassandra {
customer_cassandra {
@@ -426,8 +479,9 @@ found [**here**](https://spark.apache.org/docs/latest/structured-streaming-kafka
)
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
kafka {
customer_kafka {
@@ -476,8 +530,9 @@ via JNDI otherwise a connection cannot be created.
)
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
jms {
customer_solace {
@@ -520,8 +575,9 @@ The url is defined in the tasks to allow for generated data to be populated in t
)
```

=== "application.conf"
=== "YAML"

In `application.conf`:
```
http {
customer_api {
169 changes: 26 additions & 143 deletions docs/setup/guide/data-source/database/cassandra.md
@@ -21,9 +21,27 @@ for the tables you configure.

First, we will clone the data-caterer-example repo which will already have the base project setup required.

```shell
git clone git@github.com:data-catering/data-caterer-example.git
```
=== "Java"

```shell
git clone git@github.com:data-catering/data-caterer-example.git
```

=== "Scala"

```shell
git clone git@github.com:data-catering/data-caterer-example.git
```

=== "YAML"

```shell
git clone git@github.com:data-catering/data-caterer-example.git
```

=== "UI"

[Run Data Caterer UI via the 'Quick Start' found here.](../../../../get-started/quick-start.md)

If you already have a Cassandra instance running, you can skip to [this step](#plan-setup).

@@ -123,7 +141,7 @@ Within our class, we can start by defining the connection properties to connect

Let's create a task for inserting data into the `account.accounts` and `account.account_status_history` tables as
defined under `docker/data/cql/customer.cql`. These tables should already be set up for you if you followed this
[step](#cassandra-setup). We can check if the table is setup already via the following command:
[step](#cassandra-setup). We can check if the table is set up already via the following command:

```shell
docker exec docker-cassandraserver-1 cqlsh -e 'describe account.accounts; describe account.account_status_history;'
@@ -190,146 +208,11 @@ corresponds to `text` in Cassandra.
)
```
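
As a hedged aside (not part of the original guide): string fields such as `name` map to Cassandra's `text` type. By analogy with the `DoubleType` and `TimestampType` usages below, an explicit string type could be declared as follows, assuming a `StringType` is available in the API:

```scala
// Sketch only: explicit string typing, assumed by analogy with the typed fields below.
field.name("name").`type`(StringType)
```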

#### Field Metadata

We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data
that is closer to the structure of the data that would come in production? We can do this by defining various metadata
that guides the data generator when generating data.

##### account_id

`account_id` follows a particular pattern where it starts with `ACC` and has 8 digits after it.
This can be defined via a regex like below. Alongside, we also mark it as the primary key to ensure that
unique values are generated.

=== "Java"

```java
field().name("account_id").regex("ACC[0-9]{8}").primaryKey(true),
```

=== "Scala"

```scala
field.name("account_id").regex("ACC[0-9]{8}").primaryKey(true),
```

##### amount

For `amount`, the numbers shouldn't be too large, so we can define a min and max for the generated numbers to be between
`1` and `1000`.

=== "Java"

```java
field().name("amount").type(DoubleType.instance()).min(1).max(1000),
```

=== "Scala"

```scala
field.name("amount").`type`(DoubleType).min(1).max(1000),
```

##### name

`name` is a string that also follows a certain pattern, so we could also define a regex but here we will choose to
leverage the DataFaker library and create an `expression` to generate real-looking names. All possible faker expressions
can be found [**here**](../../../../sample/datafaker/expressions.txt).

=== "Java"

```java
field().name("name").expression("#{Name.name}"),
```

=== "Scala"

```scala
field.name("name").expression("#{Name.name}"),
```
Depending on how you want to define the schema, follow one of the options below:

##### open_time

`open_time` is a timestamp that we want to have a value greater than a specific date. We can define a min date by using
`java.sql.Date` like below.

=== "Java"

```java
field().name("open_time").type(TimestampType.instance()).min(java.sql.Date.valueOf("2022-01-01")),
```

=== "Scala"

```scala
field.name("open_time").`type`(TimestampType).min(java.sql.Date.valueOf("2022-01-01")),
```

##### status

`status` is a field that can only take one of four values: `open, closed, suspended or pending`.

=== "Java"

```java
field().name("status").oneOf("open", "closed", "suspended", "pending")
```

=== "Scala"

```scala
field.name("status").oneOf("open", "closed", "suspended", "pending")
```

##### created_by

`created_by` is a field that is based on the `status` field, following the logic: `if status is open or closed, then
created_by is eod, else created_by is event`. This can be achieved by defining a SQL expression like below.

=== "Java"

```java
field().name("created_by").sql("CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END"),
```

=== "Scala"

```scala
field.name("created_by").sql("CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END"),
```

Putting all the fields together, our class should now look like this.

=== "Java"

```java
var accountTask = cassandra("customer_cassandra", "host.docker.internal:9042")
.table("account", "accounts")
.schema(
field().name("account_id").regex("ACC[0-9]{8}").primaryKey(true),
field().name("amount").type(DoubleType.instance()).min(1).max(1000),
field().name("created_by").sql("CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END"),
field().name("name").expression("#{Name.name}"),
field().name("open_time").type(TimestampType.instance()).min(java.sql.Date.valueOf("2022-01-01")),
field().name("status").oneOf("open", "closed", "suspended", "pending")
);
```

=== "Scala"

```scala
val accountTask = cassandra("customer_cassandra", "host.docker.internal:9042")
.table("account", "accounts")
.schema(
field.name("account_id").regex("ACC[0-9]{8}").primaryKey(true),
field.name("amount").`type`(DoubleType).min(1).max(1000),
field.name("created_by").sql("CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END"),
field.name("name").expression("#{Name.name}"),
field.name("open_time").`type`(TimestampType).min(java.sql.Date.valueOf("2022-01-01")),
field.name("status").oneOf("open", "closed", "suspended", "pending")
)
```
- [Manual schema guide](../../scenario/data-generation.md)
- Automatically detect the schema from the data source by enabling `configuration.enableGeneratePlanAndTasks(true)` (see the sketch after this list)
- [Automatically detect schema from a metadata source](../../index.md#metadata)
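
For the second option, a minimal sketch (building only on the API already shown in this guide; any extra wiring is covered by the linked guides) could look like:

```scala
// Sketch only: enable metadata-driven generation so the schema is detected from
// the Cassandra tables instead of being hand-written via .schema(...).
val conf = configuration
  .enableGeneratePlanAndTasks(true) // detect schema and generate tasks from the data source

// The connection and table definition stay the same as earlier in this guide.
val accountTask = cassandra("customer_cassandra", "host.docker.internal:9042")
  .table("account", "accounts")
```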

#### Additional Configurations
