Commit

further refining tut/ref/docs
royendo committed Nov 14, 2024
1 parent 7062c24 commit 9ff71fa
Showing 10 changed files with 114 additions and 201 deletions.
6 changes: 6 additions & 0 deletions docs/docs/build/incremental-models/incremental-models.md
@@ -83,6 +83,12 @@ When you are testing with incremental models in Rill Developer, you will notice

![img](/img/tutorials/302/now-incremental.png)

:::tip What's the difference?
Once increments are enabled on a model, you gain the ability to refresh the model in increments instead of reloading the full data each time. This is handy when your data is massive and reingesting it would take time. For a project in production, this means less downtime when your dashboards need updating after the source data changes.

There are times when a full refresh may be required. In these cases, running a full refresh is equivalent to running a normal refresh with incremental disabled.
:::
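
As a hedged sketch, a minimal incremental model YAML might look like the following (the model name, refresh schedule, and source table are hypothetical, not taken from this page):

```yaml
# models/my_incremental_model.yaml: hypothetical example
type: model
incremental: true

# Re-ingest on a schedule; only new increments are loaded unless a full refresh is run
refresh:
  cron: '0 8 * * *'

sql: SELECT * FROM my_source_table  # hypothetical source
```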

When you select an incremental refresh, the following is run in the CLI:

```bash
```
@@ -8,8 +8,10 @@ sidebar_position: 05

Putting the two concepts together, it is possible to create an incremental partitioned model. Doing so allows you to not only partition the model but also refresh only the partition you need and incrementally ingest partitions.

:::note
If you need any assistance with setting up an incremental partitioned model, [reach out](contact.md) to us for assistance!

If you're looking for a working example, take a look at [my-rill-tutorial in our examples repository](https://github.com/rilldata/rill-examples).
:::
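
As an illustrative sketch (the bucket path and templating variable shown here are assumptions, not taken from this page), an incremental partitioned model combines both properties:

```yaml
# Hypothetical incremental partitioned model: one partition per source file
type: model
incremental: true
partitions:
  glob:
    connector: s3
    path: s3://my-bucket/data/**/*.parquet  # hypothetical bucket
sql: SELECT * FROM read_parquet('{{ .partition.uri }}')
```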


@@ -53,23 +55,21 @@ Refresh initiated. Check the project logs for status updates.
## How Incremental Partitioned Models Work

### Initial Ingestion:
When a model is first created, an initial ingestion occurs to bring in all of the data. This is also what occurs when you run a `Full Refresh`. Note in the image below that all **gray** portions of the partitioned source are saved as separate partitions in the partitioned model.

<div style={{ textAlign: "center" }}>
<img src="/img/build/advanced-models/initial-ingestion.png" width="600" />
</div>

### Additional Partition:
If you add an additional partition to the source table, on the next refresh Rill will detect the new partition and **only** add it to the model. As you can see in the diagram, the **blue** additional partition is added as its own partition in the partitioned model. If the other partitions have not been modified, they will not be touched.
<div style={{ textAlign: "center" }}>

<img src="/img/build/advanced-models/addition-partition.png" width="600" />
</div>

### Modify Existing Partition:
If you modify any of the already existing partitions (**yellow**), Rill will reingest just the modified file during the scheduled refresh by checking the `last_modified_date` parameter.
<div style={{ textAlign: "center" }}>

<img src="/img/build/advanced-models/modified-partition.png" width="600" />
</div>

10 changes: 6 additions & 4 deletions docs/docs/build/incremental-models/staging.md
@@ -7,7 +7,7 @@ sidebar_position: 10
Staging models are required in situations where a direct write from the input to the output is not supported, such as Snowflake to ClickHouse.

:::note Supported
Staging models are in ongoing development. While we do have support for the following, please [reach out to us](contact.md) if you have any specific requirements.

Snowflake --> S3 --> ClickHouse

@@ -21,7 +21,7 @@ Snowflake --> S3 --> ClickHouse

In the above example, during the ingestion from Snowflake to ClickHouse, we use a temporary staging table in S3: Rill writes from Snowflake to S3, then from S3 to ClickHouse. Once this procedure is complete, the temporary data is cleared from S3.

### Sample YAML:

```yaml
# Use DuckDB to generate a range of days from 1st Jan to today
@@ -41,7 +41,9 @@ stage:
connector: s3
path: s3://bucket/temp-data

# Produce the final output into ClickHouse; requires a clickhouse.yaml connector to be defined
output:
connector: clickhouse
```
55 changes: 38 additions & 17 deletions docs/docs/reference/project-files/advanced-models.md
@@ -5,38 +5,59 @@
hide_table_of_contents: true
---

In some cases, advanced models will be required when implementing advanced features such as incremental partitioned models or staging models.

## Properties

**`type`** - refers to the resource type and must be `model` _(required)_

**`refresh`** - Specifies the refresh schedule that Rill should follow to re-ingest and update the underlying source data _(optional)_.
- **`cron`** - a cron schedule expression, which should be encapsulated in single quotes, e.g. `'* * * * *'` _(optional)_
- **`every`** - a Go duration string, such as `24h` ([docs](https://pkg.go.dev/time#ParseDuration)) _(optional)_
```yaml
refresh:
  cron: '0 8 * * *'
```

**`timeout`** - the maximum time to wait for model ingestion _(optional)_.

**`incremental`** - set to `true` or `false` whether incremental modeling is required _(optional)_

**`state`** - refers to the explicitly defined state of your model; cannot be used with `partitions` _(optional)_.
- **`sql/glob`** - refers to the location of the data, depending on whether the data is in cloud storage or a data warehouse.

**`partitions`** - refers to how your data is partitioned; cannot be used with `state` _(optional)_.
- **`connector`** - refers to the connector that the partitions use _(optional)_.
- **`sql`** - refers to the SQL query used to access the data in your data warehouse; use `sql` or `glob` _(optional)_.
- **`glob`** - refers to the location of the data in your cloud storage; use `sql` or `glob` _(optional)_.
- **`path`** - in the case `glob` is selected, you will need to set the path of your source _(optional)_.
- **`partition`** - in the case `glob` is selected, you can define how to partition the table: `directory` or `hive` _(optional)_.

```yaml
partitions:
connector: duckdb
sql: SELECT range AS num FROM range(0,10)
```
```yaml
partitions:
glob:
connector: [s3/gcs]
path: [s3/gs]://path/to/file/**/*.parquet[.csv]
```

**`sql`** - refers to the SQL query for your model _(required)_.

**`partitions_watermark`** - refers to a customizable timestamp that can be set to check if an object has been updated _(optional)_.

**`partitions_concurrency`** - refers to the number of concurrent partitions that can be read at the same time _(optional)_.
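
For example, the two partition tuning properties might be combined with a glob partition like this (the path, column name, and concurrency value are illustrative assumptions, not prescriptive):

```yaml
partitions:
  glob:
    connector: s3
    path: s3://my-bucket/data/**/*.parquet  # hypothetical path
partitions_watermark: updated_on   # hypothetical timestamp field to detect updates
partitions_concurrency: 4          # read up to 4 partitions at once
```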

**`stage`** - in the case of staging models, where an input source does not support a direct write to the output and a staging table is required _(optional)_.
- **`connector`** - refers to the connector type for the staging table
- **`path`** - path of the temporary staging table

**`output`** - in the case of staging models, defines the output to which the staging table will write the temporary data _(optional)_.
- **`connector`** - refers to the connector type for the output table _(optional)_.
- **`incremental_strategy`** - refers to how the incremental refresh will behave, (merge or append) _(optional)_.
- **`unique_key`** - required if `incremental_strategy` is defined; refers to the unique column used to merge _(optional)_.
- **`materialize`** - refers to the output table being materialized _(optional)_.
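
Putting the staging properties together, a hedged sketch of a stage/output configuration might look like the following (the bucket path and key column are assumptions):

```yaml
stage:
  connector: s3
  path: s3://bucket/temp-data        # temporary staging location
output:
  connector: clickhouse
  incremental_strategy: merge        # merge or append
  unique_key: [event_id]             # hypothetical unique column for merging
  materialize: true
```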

**`materialize`** - refers to whether the model is materialized as a table _(optional)_.
122 changes: 0 additions & 122 deletions docs/docs/tutorials/other/deep-dive-incremental-modeling.md

This file was deleted.

@@ -5,19 +5,25 @@ sidebar_label: "Partitions and Incremental Models"
sidebar_position: 1
---

In order to help with data ingestion into Rill, we will introduce the concepts of [partitions](https://docs.rilldata.com/build/incremental-models/#what-are-partitions) and [incremental models](https://docs.rilldata.com/build/incremental-models/#what-is-an-incremental-model). Before diving into our ClickHouse project, let's understand what each of these is used for.

:::tip Review the Reference!
While we will go over the main points to get started, there are more customization possibilities, so we recommend reviewing the [reference guide](https://docs.rilldata.com/reference/project-files/advanced-models) and docs along with following the tutorial.
:::

## [Incremental Model](https://docs.rilldata.com/build/incremental-models/#what-is-an-incremental-model)

An incremental model is defined using the following key-value pair:

```yaml
incremental: true
```
Once this is enabled, Rill will configure the model YAML as an incrementing model.
In some of the examples, we will use both time-based and glob-based increments.
## [Partitioned Model](https://docs.rilldata.com/build/incremental-models/#what-are-partitions)
Partitions in models are enabled by defining the partition parameter as seen below:
@@ -28,16 +34,11 @@ partitions:
Depending on your data, this can be defined as a `sql:` statement or a `glob:` pattern. Once configured, Rill will try to partition your existing data into smaller subcategories, which allows you to refresh specific partitions instead of reingesting the whole dataset (only when incremental is enabled).
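
For instance, a glob-based partition definition might look like the following sketch (the bucket and path are hypothetical):

```yaml
partitions:
  glob:
    connector: gcs
    path: gs://my-bucket/events/**/*.parquet  # hypothetical path; one partition per file
```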

By running the following command, you can see all the available partitions:
```bash
rill project partitions <model_name>
```

Let's look at a few simple examples before diving into our ClickHouse project.

import DocsRating from '@site/src/components/DocsRating';
