[Docs] Update concept related docs info
tcodehuber committed Jul 12, 2024
1 parent 7e02c88 commit a73ebaa
Showing 12 changed files with 75 additions and 85 deletions.
14 changes: 7 additions & 7 deletions docs/en/concept/JobEnvConfig.md
@@ -1,23 +1,23 @@
# Job Env Config

-This document describes env configuration information, the common parameters can be used in all engines. In order to better distinguish between engine parameters, the additional parameters of other engine need to carry a prefix.
+This document describes env configuration information. The common parameters can be used in all engines. In order to better distinguish between engine parameters, the additional parameters of other engine need to carry a prefix.
In flink engine, we use `flink.` as the prefix. In the spark engine, we do not use any prefixes to modify parameters, because the official spark parameters themselves start with `spark.`

## Common Parameter

-The following configuration parameters are common to all engines
+The following configuration parameters are common to all engines.

### job.name

This parameter configures the task name.

### jars

-Third-party packages can be loaded via `jars`, like `jars="file://local/jar1.jar;file://local/jar2.jar"`
+Third-party packages can be loaded via `jars`, like `jars="file://local/jar1.jar;file://local/jar2.jar"`.

### job.mode

-You can configure whether the task is in batch mode or stream mode through `job.mode`, like `job.mode = "BATCH"` or `job.mode = "STREAMING"`
+You can configure whether the task is in batch or stream mode through `job.mode`, like `job.mode = "BATCH"` or `job.mode = "STREAMING"`
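
For instance, the common parameters above could be combined into a single `env` block like this (a minimal sketch; the values are illustrative):

```hocon
env {
  # Common parameters, valid for every engine
  job.name = "example_job"   # task name
  job.mode = "BATCH"         # or "STREAMING"
  # load extra third-party packages, if any
  jars = "file://local/jar1.jar;file://local/jar2.jar"
}
```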

### checkpoint.interval

@@ -47,11 +47,11 @@ you can set it to `CLIENT`. Please use `CLUSTER` mode as much as possible, becau

Specify the method of encryption, if you didn't have the requirement for encrypting or decrypting config files, this option can be ignored.

-For more details, you can refer to the documentation [config-encryption-decryption](../connector-v2/Config-Encryption-Decryption.md)
+For more details, you can refer to the documentation [Config Encryption Decryption](../connector-v2/Config-Encryption-Decryption.md)

## Flink Engine Parameter

-Here are some SeaTunnel parameter names corresponding to the names in Flink, not all of them, please refer to the official [flink documentation](https://flink.apache.org/) for more.
+Here are some SeaTunnel parameter names corresponding to the names in Flink, not all of them. Please refer to the official [Flink Documentation](https://flink.apache.org/).

| Flink Configuration Name | SeaTunnel Configuration Name |
|---------------------------------|---------------------------------------|
@@ -62,4 +62,4 @@ Here are some SeaTunnel parameter names corresponding to the names in Flink, not

## Spark Engine Parameter

-Because spark configuration items have not been modified, they are not listed here, please refer to the official [spark documentation](https://spark.apache.org/).
+Because Spark configuration items have not been modified, they are not listed here, please refer to the official [Spark Documentation](https://spark.apache.org/).
58 changes: 29 additions & 29 deletions docs/en/concept/config.md
@@ -5,24 +5,24 @@

# Intro to config file

-In SeaTunnel, the most important thing is the Config file, through which users can customize their own data
+In SeaTunnel, the most important thing is the config file, through which users can customize their own data
synchronization requirements to maximize the potential of SeaTunnel. So next, I will introduce you how to
-configure the Config file.
+configure the config file.

-The main format of the Config file is `hocon`, for more details of this format type you can refer to [HOCON-GUIDE](https://github.com/lightbend/config/blob/main/HOCON.md),
-BTW, we also support the `json` format, but you should know that the name of the config file should end with `.json`
+The main format of the config file is `hocon`, for more details you can refer to [HOCON-GUIDE](https://github.com/lightbend/config/blob/main/HOCON.md),
+BTW, we also support the `json` format, but you should keep in mind that the name of the config file should end with `.json`.

-We also support the `SQL` format, for details, please refer to the [SQL configuration](sql-config.md) file.
+We also support the `SQL` format, please refer to [SQL configuration](sql-config.md) for more details.

## Example

Before you read on, you can find config file
-examples [here](https://github.com/apache/seatunnel/tree/dev/config) and in distribute package's
+examples [Here](https://github.com/apache/seatunnel/tree/dev/config) from the binary package's
config directory.

-## Config file structure
+## Config File Structure

-The Config file will be similar to the one below.
+The config file is similar to the below one:

### hocon

@@ -125,12 +125,12 @@ sql = """ select * from "table" """
```

-As you can see, the Config file contains several sections: env, source, transform, sink. Different modules
-have different functions. After you understand these modules, you will understand how SeaTunnel works.
+As you can see, the config file contains several sections: env, source, transform, sink. Different modules
+have different functions. After you understand these modules, you will see how SeaTunnel works.
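
As a rough sketch of that overall shape, with the section bodies elided (see the linked examples for complete files):

```hocon
env {
  # engine and job level settings, e.g. job.mode = "BATCH"
}

source {
  # where and how to read data
}

transform {
  # optional: how to process rows between source and sink
}

sink {
  # where and how to write data
}
```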

### env

-Used to add some engine optional parameters, no matter which engine (Spark or Flink), the corresponding
+Used to add some engine optional parameters, no matter which engine (Zeta, Spark or Flink), the corresponding
optional parameters should be filled in here.

Note that we have separated the parameters by engine, and for the common parameters, we can configure them as before.
@@ -140,9 +140,9 @@ For flink and spark engine, the specific configuration rules of their parameters

### source

-source is used to define where SeaTunnel needs to fetch data, and use the fetched data for the next step.
-Multiple sources can be defined at the same time. The supported source at now
-check [Source of SeaTunnel](../connector-v2/source). Each source has its own specific parameters to define how to
+Source is used to define where SeaTunnel needs to fetch data, and use the fetched data for the next step.
+Multiple sources can be defined at the same time. The supported source can be found
+in [Source of SeaTunnel](../connector-v2/source). Each source has its own specific parameters to define how to
fetch data, and SeaTunnel also extracts the parameters that each source will use, such as
the `result_table_name` parameter, which is used to specify the name of the data generated by the current
source, which is convenient for follow-up used by other modules.
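
For instance, a source block using the `FakeSource` connector might look like the following (a sketch; the schema fields are illustrative):

```hocon
source {
  FakeSource {
    # name other modules can use to refer to this data set
    result_table_name = "fake"
    row.num = 16
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }
}
```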
@@ -180,35 +180,35 @@ sink {
fields = ["name", "age", "card"]
username = "default"
password = ""
-source_table_name = "fake1"
+source_table_name = "fake"
}
}
```

-Like source, transform has specific parameters that belong to each module. The supported source at now check.
-The supported transform at now check [Transform V2 of SeaTunnel](../transform-v2)
+Like source, transform has specific parameters that belong to each module. The supported transform can be found
+in [Transform V2 of SeaTunnel](../transform-v2)

### sink

Our purpose with SeaTunnel is to synchronize data from one place to another, so it is critical to define how
and where data is written. With the sink module provided by SeaTunnel, you can complete this operation quickly
-and efficiently. Sink and source are very similar, but the difference is reading and writing. So go check out
-our [supported sinks](../connector-v2/sink).
+and efficiently. Sink and source are very similar, but the difference is reading and writing. So please check out
+[Supported Sinks](../connector-v2/sink).
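
As a minimal illustration, a sink block writing to the console might look like this (a sketch using the `Console` sink; the table name is illustrative):

```hocon
sink {
  Console {
    # consume the data set produced under that name upstream
    source_table_name = "fake"
  }
}
```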

### Other

You will find that when multiple sources and multiple sinks are defined, which data is read by each sink, and
-which is the data read by each transform? We use `result_table_name` and `source_table_name` two key
-configurations. Each source module will be configured with a `result_table_name` to indicate the name of the
+which is the data read by each transform? We introduce two key configurations called `result_table_name` and
+`source_table_name`. Each source module will be configured with a `result_table_name` to indicate the name of the
data source generated by the data source, and other transform and sink modules can use `source_table_name` to
refer to the corresponding data source name, indicating that I want to read the data for processing. Then
transform, as an intermediate processing module, can use both `result_table_name` and `source_table_name`
-configurations at the same time. But you will find that in the above example Config, not every module is
+configurations at the same time. But you will find that in the above example config, not every module is
configured with these two parameters, because in SeaTunnel, there is a default convention, if these two
parameters are not configured, then the generated data from the last module of the previous node will be used.
This is much more convenient when there is only one source.
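
Putting it together, a sketch of how the two parameters chain modules (all names here are illustrative, and the `Sql` transform is just one possible middle step):

```hocon
source {
  FakeSource {
    result_table_name = "fake"    # this source produces "fake"
  }
}

transform {
  Sql {
    source_table_name = "fake"    # read "fake" ...
    result_table_name = "fake1"   # ... and produce "fake1"
    query = "select name, age from fake where age > 18"
  }
}

sink {
  Console {
    source_table_name = "fake1"   # write the transformed data set
  }
}
```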

-## Config variable substitution
+## Config Variable Substitution

In config file we can define some variables and replace it in run time. **This is only support `hocon` format file**.
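
For example, a config might declare placeholders that get filled in at submit time, as with the `-i` options shown further below (a sketch; quoting rules follow the notes at the end of this section):

```hocon
env {
  job.mode = "BATCH"
  job.name = ${jobName}   # replaced from the command line
}

source {
  FakeSource {
    result_table_name = "${resName}"
  }
}
```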

@@ -266,7 +266,7 @@ We can replace those parameters with this shell command:
-i nameVal=abc
-i username=seatunnel=2.3.1
-i password='$a^b%c.d~e0*9('
--e local
+-m local
```

Then the final submitted config is:
@@ -312,12 +312,12 @@ sink {
```

Some Notes:
-- quota with `'` if the value has special character (like `(`)
-- if the replacement variables is in `"` or `'`, like `resName` and `nameVal`, you need add `"`
-- the value can't have space `' '`, like `-i jobName='this is a job name' `, this will be replaced to `job.name = "this"`
-- If you want to use dynamic parameters,you can use the following format: -i date=$(date +"%Y%m%d").
+- Quota with `'` if the value has special character such as `(`
+- If the replacement variables is in `"` or `'`, like `resName` and `nameVal`, you need add `"`
+- The value can't have space `' '`, like `-i jobName='this is a job name' `, this will be replaced to `job.name = "this"`
+- If you want to use dynamic parameters, you can use the following format: -i date=$(date +"%Y%m%d").

## What's More

-If you want to know the details of this format configuration, Please
+If you want to know the details of the format configuration, please
see [HOCON](https://github.com/lightbend/config/blob/main/HOCON.md).
14 changes: 7 additions & 7 deletions docs/en/concept/connector-v2-features.md
@@ -1,9 +1,9 @@
# Intro To Connector V2 Features

-## Differences Between Connector V2 And Connector v1
+## Differences Between Connector V2 And V1

Since https://github.com/apache/seatunnel/issues/1608 We Added Connector V2 Features.
-Connector V2 is a connector defined based on the SeaTunnel Connector API interface. Unlike Connector V1, Connector V2 supports the following features.
+Connector V2 is a connector defined based on the SeaTunnel Connector API interface. Unlike Connector V1, V2 supports the following features:

* **Multi Engine Support** SeaTunnel Connector API is an engine independent API. The connectors developed based on this API can run in multiple engines. Currently, Flink and Spark are supported, and we will support other engines in the future.
* **Multi Engine Version Support** Decoupling the connector from the engine through the translation layer solves the problem that most connectors need to modify the code in order to support a new version of the underlying engine.
@@ -18,23 +18,23 @@ Source connectors have some common core features, and each source connector supp

If each piece of data in the data source will only be sent downstream by the source once, we think this source connector supports exactly once.

-In SeaTunnel, we can save the read **Split** and its **offset**(The position of the read data in split at that time,
-such as line number, byte size, offset, etc) as **StateSnapshot** when checkpoint. If the task restarted, we will get the last **StateSnapshot**
+In SeaTunnel, we can save the read **Split** and its **offset** (The position of the read data in split at that time,
+such as line number, byte size, offset, etc.) as **StateSnapshot** when checkpointing. If the task restarted, we will get the last **StateSnapshot**
and then locate the **Split** and **offset** read last time and continue to send data downstream.

For example `File`, `Kafka`.
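
As context, the checkpointing that produces these StateSnapshots is typically switched on through the job's env settings (a sketch; the interval value is illustrative and the exact behavior depends on the engine):

```hocon
env {
  job.mode = "STREAMING"
  # ask the engine to snapshot state every 10 seconds
  checkpoint.interval = 10000
}
```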

### column projection

-If the connector supports reading only specified columns from the data source (note that if you read all columns first and then filter unnecessary columns through the schema, this method is not a real column projection)
+If the connector supports reading only specified columns from the data source (Note that if you read all columns first and then filter unnecessary columns through the schema, this method is not a real column projection)

-For example `JDBCSource` can use sql define read columns.
+For example `JDBCSource` can use sql to define reading columns.

`KafkaSource` will read all content from topic and then use `schema` to filter unnecessary columns, This is not `column projection`.
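
A sketch of the JDBC case described above, where the SQL itself limits which columns are read (connection details are placeholders):

```hocon
source {
  Jdbc {
    url = "jdbc:mysql://localhost:3306/test"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "root"
    password = ""
    # only name and age are fetched from the database: real column projection
    query = "select name, age from source_table"
    result_table_name = "jdbc_source"
  }
}
```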

### batch

-Batch Job Mode, The data read is bounded and the job will stop when all data read complete.
+Batch Job Mode, The data read is bounded and the job will stop after completing all data read.

### stream


