diff --git a/docs/v0.17.0/_index.md b/docs/v0.17.0/_index.md new file mode 100644 index 0000000..fa62533 --- /dev/null +++ b/docs/v0.17.0/_index.md @@ -0,0 +1,16 @@ +--- +order: + [ + 'overview', + 'start', + 'platform', + 'write', + 'receive', + 'process', + 'ingest-and-distribute', + 'deploy', + 'security', + 'reference', + 'release-notes.md', + ] +--- diff --git a/docs/v0.17.0/deploy/_index.md b/docs/v0.17.0/deploy/_index.md new file mode 100644 index 0000000..fbb5b7f --- /dev/null +++ b/docs/v0.17.0/deploy/_index.md @@ -0,0 +1,6 @@ +--- +order: ["deploy-helm", "deploy-k8s", "deploy-docker", "quick-deploy-ssh"] +collapsed: false +--- + +Deploy diff --git a/docs/v0.17.0/deploy/deploy-docker.md b/docs/v0.17.0/deploy/deploy-docker.md new file mode 100644 index 0000000..ed6a399 --- /dev/null +++ b/docs/v0.17.0/deploy/deploy-docker.md @@ -0,0 +1,268 @@ +# Manual Deployment with Docker + +This document describes how to run HStreamDB cluster with docker. + +::: warning + +This tutorial only shows the main process of starting HStreamDB cluster with +docker, the parameters are not configured with any security in mind, so please +do not use them directly when deploying! + +::: + +## Set up a ZooKeeper ensemble + +`HServer` and `HStore` require ZooKeeper in order to store some metadata. We +need to set up a ZooKeeper ensemble first. + +You can find a tutorial online on how to build a proper ZooKeeper ensemble. As +an example, here we just quickly start a single-node ZooKeeper via docker. + +```shell +docker run --rm -d --name zookeeper --network host zookeeper +``` + +## Create data folders on storage nodes + +Storage nodes store data in shards. Typically each shard maps to a different +physical disk. Assume your data disk is mounted on `/mnt/data0` + +```shell +# creates the root folder for data +sudo mkdir -p /data/logdevice/ + +# writes the number of shards that this box will have +echo 1 | sudo tee /data/logdevice/NSHARDS + +# creates symlink for shard 0 +sudo ln -s /mnt/data0 /data/logdevice/shard0 + +# adds the user for the logdevice daemon +sudo useradd logdevice + +# changes ownership for the data directory and the disk +sudo chown -R logdevice /data/logdevice/ +sudo chown -R logdevice /mnt/data0/ +``` + +- See + [Create data folders](https://logdevice.io/docs/FirstCluster.html#4-create-data-folders-on-storage-nodes) + for details + +## Create a configuration file + +Here is a minimal configuration file example. Before using it, please modify it +to suit your situation. 
+ +```json +{ + "server_settings": { + "enable-nodes-configuration-manager": "true", + "use-nodes-configuration-manager-nodes-configuration": "true", + "enable-node-self-registration": "true", + "enable-cluster-maintenance-state-machine": "true" + }, + "client_settings": { + "enable-nodes-configuration-manager": "true", + "use-nodes-configuration-manager-nodes-configuration": "true", + "admin-client-capabilities": "true" + }, + "cluster": "logdevice", + "internal_logs": { + "config_log_deltas": { + "replicate_across": { + "node": 3 + } + }, + "config_log_snapshots": { + "replicate_across": { + "node": 3 + } + }, + "event_log_deltas": { + "replicate_across": { + "node": 3 + } + }, + "event_log_snapshots": { + "replicate_across": { + "node": 3 + } + }, + "maintenance_log_deltas": { + "replicate_across": { + "node": 3 + } + }, + "maintenance_log_snapshots": { + "replicate_across": { + "node": 3 + } + } + }, + "metadata_logs": { + "nodeset": [], + "replicate_across": { + "node": 3 + } + }, + "zookeeper": { + "zookeeper_uri": "ip://10.100.2.11:2181", + "timeout": "30s" + } +} +``` + +- If you have a multi-node ZooKeeper ensemble, use the list of ZooKeeper + ensemble nodes and ports to modify `zookeeper_uri` in the `zookeeper` section: + + ```json + "zookeeper": { + "zookeeper_uri": "ip://10.100.2.11:2181,10.100.2.12:2181,10.100.2.13:2181", + "timeout": "30s" + } + ``` + +- Detailed explanations of all the attributes can be found in the + [Cluster configuration](https://logdevice.io/docs/Config.html) docs. + +## Store the configuration file + +You can the store configuration file in ZooKeeper, or store it on each storage +nodes. + +### Store configuration file in ZooKeeper + +Suppose you have a configuration file on one of your ZooKeeper nodes with the +path `~/logdevice.conf`. Save the configuration file to the ZooKeeper by running +the following command. + +```shell +docker exec zookeeper zkCli.sh create /logdevice.conf "`cat ~/logdevice.conf`" +``` + +You can verify the create operation by: + +```shell +docker exec zookeeper zkCli.sh get /logdevice.conf +``` + +## Set up HStore cluster + +For the configuration file stored in ZooKeeper, assume that the value of the +`zookeeper_uri` field in the configuration file is `"ip:/10.100.2.11:2181"` and +the path to the configuration file in ZooKeeper is `/logdevice.conf`. + +For the configuration file stored on each node, assume that your file path is +`/data/logdevice/logdevice.conf`. 
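+
+If you choose to keep the configuration file on each node, it needs to exist at
+that path on every storage node before the daemons are started. A minimal way to
+distribute it (a sketch; substitute your own node addresses and source path) is
+to copy it over SSH:
+
+```shell
+# copy the local config file to every storage node (example addresses)
+for host in 192.168.0.3 192.168.0.4 192.168.0.5; do
+  scp ~/logdevice.conf "$host:/data/logdevice/logdevice.conf"
+done
+```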
+ +### Start admin server on a single node + +- Configuration file stored in ZooKeeper: + + ```shell-vue + docker run --rm -d --name storeAdmin --network host -v /data/logdevice:/data/logdevice \ + hstreamdb/hstream:{{ $version() }} /usr/local/bin/ld-admin-server \ + --config-path zk:10.100.2.11:2181/logdevice.conf \ + --enable-maintenance-manager \ + --maintenance-log-snapshotting \ + --enable-safety-check-periodic-metadata-update + ``` + + - If you have a multi-node ZooKeeper ensemble, Replace `--config-path` + parameter to: + `--config-path zk:10.100.2.11:2181,10.100.2.12:2181,10.100.2.13:2181/logdevice.conf` + +- Configuration file stored in each node: + + Replace `--config-path` parameter to + `--config-path /data/logdevice/logdevice.conf` + +### Start logdeviced on every node + +- Configuration file stored in ZooKeeper: + + ```shell-vue + docker run --rm -d --name hstore --network host -v /data/logdevice:/data/logdevice \ + hstreamdb/hstream:{{ $version() }} /usr/local/bin/logdeviced \ + --config-path zk:10.100.2.11:2181/logdevice.conf \ + --name store-0 \ + --address 192.168.0.3 \ + --local-log-store-path /data/logdevice + ``` + + - For each node, you should update the `--name` to a **different value** and + `--address` to the host IP address of that node. + +- Configuration file stored in each node: + + Replace `--config-path` parameter to + `--config-path /data/logdevice/logdevice.conf` + +### Bootstrap the cluster + +After starting the admin server and logdeviced for each storage node, now we can +bootstrap our cluster. + +On the admin server node, run: + +```shell +docker exec storeAdmin hadmin store nodes-config bootstrap --metadata-replicate-across 'node:3' +``` + +And you should see something like this: + +``` +Successfully bootstrapped the cluster, new nodes configuration version: 7 +Took 0.019s +``` + +You can check the cluster status by run: + +```shell +docker exec storeAdmin hadmin store status +``` + +And the result should be: + +``` ++----+---------+----------+-------+-----------+---------+---------------+ +| ID | NAME | PACKAGE | STATE | UPTIME | SEQ. | HEALTH STATUS | ++----+---------+----------+-------+-----------+---------+---------------+ +| 0 | store-0 | 99.99.99 | ALIVE | 2 min ago | ENABLED | HEALTHY | +| 1 | store-2 | 99.99.99 | ALIVE | 2 min ago | ENABLED | HEALTHY | +| 2 | store-1 | 99.99.99 | ALIVE | 2 min ago | ENABLED | HEALTHY | ++----+---------+----------+-------+-----------+---------+---------------+ +Took 7.745s +``` + +Now we finish setting up the `HStore` cluster. + +## Set up HServer cluster + +To start a single `HServer` instance, you can modify the start command to fit +your situation: + +```shell-vue +docker run -d --name hstream-server --network host \ + hstreamdb/hstream:{{ $version() }} /usr/local/bin/hstream-server \ + --bind-address $SERVER_HOST \ + --advertised-address $SERVER_HOST \ + --seed-nodes $SERVER_HOST \ + --metastore-uri zk://$ZK_ADDRESS \ + --store-config zk:$ZK_ADDRESS/logdevice.conf \ + --store-admin-host $ADMIN_HOST \ + --server-id 1 +``` + +- `$SERVER_HOST` :The host IP address of your server node, e.g `192.168.0.1` +- `metastore-uri`: The address of HMeta, it currently support `zk://$ZK_ADDRESS` for zookeeper and `rq://$RQ_ADDRESS` for rqlite (experimental). +- `$ZK_ADDRESS` :Your ZooKeeper ensemble address list, e.g + `10.100.2.11:2181,10.100.2.12:2181,10.100.2.13:2181` +- `--store-config` :The path to your `HStore` configuration file. 
Should match + the value of the `--config-path` parameter when starting the `HStore` cluster +- `--store-admin-host`: The IP address of the `HStore Admin Server` node +- `--server-id` :You should set a **unique identifier** for each server + instance + +You can start multiple server instances on different nodes in the same way. diff --git a/docs/v0.17.0/deploy/deploy-helm.md b/docs/v0.17.0/deploy/deploy-helm.md new file mode 100644 index 0000000..37413bd --- /dev/null +++ b/docs/v0.17.0/deploy/deploy-helm.md @@ -0,0 +1,106 @@ +# Running on Kubernetes by Helm + +This document describes how to run HStreamDB kubernetes using the helm chart +that we provide. The document assumes basic previous kubernetes knowledge. By +the end of this section, you'll have a fully running HStreamDB cluster on +kubernetes that's ready to receive reads/writes, process datas, etc. + +## Building your Kubernetes Cluster + +The first step is to have a running kubernetes cluster. You can use a managed +cluster (provided by your cloud provider), a self-hosted cluster or a local +kubernetes cluster using a tool like minikube. Make sure that kubectl points to +whatever cluster you're planning to use. + +Also, you need a storageClass, you can create by `kubectl`or by your cloud +provider web page if it has. minikube provides a storage class called `standard` +by default, which is used by the helm chart by default. + +## Starting HStreamDB + +### Clone code and get helm dependencies + +```sh +git clone https://github.com/hstreamdb/hstream.git +cd hstream/deploy/chart/hstream/ +helm dependency build . +``` + +### Deploy HStreamDB by Helm + +```sh +helm install my-hstream . +``` + +Helm chart also provides the `value.yaml` file where you can modify your +configuration, for example when you want to use other storage classes to deploy +the cluster, you can modify `logdevice.persistence.storageClass` and +`zookeeper.persistence.storageClass` in `value.yaml`, and use +`helm install my-hstream -f values.yaml .` to deploy. + +### Check Cluster Status + +The `helm install` command will deploy the zookeeper cluster, logdevice cluster +and hstream cluster, this can take some time, you can check the status of the +cluster with `kubectl get pods`, there will be some `Error` and +`CrashLoopBackOff` status during the cluster deployment, these will disappear +after some time, eventually you will see something like the following. + +``` +NAME READY STATUS RESTARTS AGE +my-hstream-0 1/1 Running 3 (16h ago) 16h +my-hstream-1 1/1 Running 2 (16h ago) 16h +my-hstream-2 1/1 Running 0 16h +my-hstream-logdevice-0 1/1 Running 3 (16h ago) 16h +my-hstream-logdevice-1 1/1 Running 3 (16h ago) 16h +my-hstream-logdevice-2 1/1 Running 0 16h +my-hstream-logdevice-3 1/1 Running 0 16h +my-hstream-logdevice-admin-server-6867fd9494-bk5mf 1/1 Running 3 (16h ago) 16h +my-hstream-zookeeper-0 1/1 Running 0 16h +my-hstream-zookeeper-1 1/1 Running 0 16h +my-hstream-zookeeper-2 1/1 Running 0 16h +``` + +You can check the status of the HStreamDB cluster with the `hadmin server` +command. + +```sh +kubectl exec -it hstream-1 -- bash -c "hadmin server status" +``` +``` ++---------+---------+------------------+ +| node_id | state | address | ++---------+---------+------------------+ +| 100 | Running | 172.17.0.4:6570 | +| 101 | Running | 172.17.0.10:6570 | +| 102 | Running | 172.17.0.12:6570 | ++---------+---------+------------------+ +``` + +## Manage HStore Cluster + +Now you can run `hadmin store` to manage the hstore cluster. 
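+
+For example, to list the available subcommands (run from one of the server pods;
+the pod name `my-hstream-0` follows the example deployment above):
+
+```sh
+kubectl exec -it my-hstream-0 -- bash -c "hadmin store --help"
+```
+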
+To check the state of the cluster, you can then run: + +```sh +kubectl exec -it my-hstream-0 -- bash -c "hadmin store --host my-hstream-logdevice-admin-server status" +``` +``` ++----+------------------------+----------+-------+--------------+----------+ +| ID | NAME | PACKAGE | STATE | UPTIME | LOCATION | ++----+------------------------+----------+-------+--------------+----------+ +| 0 | my-hstream-logdevice-0 | 99.99.99 | ALIVE | 16 hours ago | | +| 1 | my-hstream-logdevice-1 | 99.99.99 | DEAD | 16 hours ago | | +| 2 | my-hstream-logdevice-2 | 99.99.99 | DEAD | 16 hours ago | | +| 3 | my-hstream-logdevice-3 | 99.99.99 | DEAD | 16 hours ago | | ++----+------------------------+----------+-------+--------------+----------+ ++---------+-------------+---------------+------------+---------------+ +| SEQ. | DATA HEALTH | STORAGE STATE | SHARD OP. | HEALTH STATUS | ++---------+-------------+---------------+------------+---------------+ +| ENABLED | HEALTHY(1) | READ_WRITE(1) | ENABLED(1) | HEALTHY | +| ENABLED | HEALTHY(1) | READ_WRITE(1) | ENABLED(1) | HEALTHY | +| ENABLED | HEALTHY(1) | READ_WRITE(1) | ENABLED(1) | HEALTHY | +| ENABLED | HEALTHY(1) | READ_WRITE(1) | ENABLED(1) | HEALTHY | ++---------+-------------+---------------+------------+---------------+ +Took 16.727s +``` diff --git a/docs/v0.17.0/deploy/deploy-k8s.md b/docs/v0.17.0/deploy/deploy-k8s.md new file mode 100644 index 0000000..ddd0c94 --- /dev/null +++ b/docs/v0.17.0/deploy/deploy-k8s.md @@ -0,0 +1,226 @@ +# Running on Kubernetes + +This document describes how to run HStreamDB kubernetes using the specs that we +provide. The document assumes basic previous kubernetes knowledge. By the end of +this section, you'll have a fully running HStreamDB cluster on kubernetes that's +ready to receive reads/writes, process datas, etc. + +## Building your Kubernetes Cluster + +The first step is to have a running kubernetes cluster. You can use a managed +cluster (provided by your cloud provider), a self-hosted cluster or a local +kubernetes cluster using a tool like minikube. Make sure that kubectl points to +whatever cluster you're planning to use. + +Also, you need a storageClass named `hstream-store`, you can create by `kubectl` +or by your cloud provider web page if it has. + +::: tip + +For minikube user, you can use the default storage class called `standard`. + +::: + +## Install Zookeeper + +HStreamDB depends on Zookeeper for storing queries information and some internal +storage configuration. So we will need to provision a zookeeper ensemble that +HStreamDB will be able to access. For this demo, we will use +[helm](https://helm.sh/) (A package manager for kubernetes) to install +zookeeper. 
After installing helm run: + +```sh +helm repo add bitnami https://charts.bitnami.com/bitnami +helm repo update + +helm install zookeeper bitnami/zookeeper \ + --set image.tag=3.6 \ + --set replicaCount=3 \ + --set persistence.storageClass=hstream-store \ + --set persistence.size=20Gi +``` + +``` +NAME: zookeeper +LAST DEPLOYED: Tue Jul 6 10:51:37 2021 +NAMESPACE: test +STATUS: deployed +REVISION: 1 +TEST SUITE: None +NOTES: +** Please be patient while the chart is being deployed ** + +ZooKeeper can be accessed via port 2181 on the following DNS name from within your cluster: + + zookeeper.svc.cluster.local + +To connect to your ZooKeeper server run the following commands: + + export POD_NAME=$(kubectl get pods -l "app.kubernetes.io/name=zookeeper,app.kubernetes.io/instance=zookeeper,app.kubernetes.io/component=zookeeper" -o jsonpath="{.items[0].metadata.name}") + kubectl exec -it $POD_NAME -- zkCli.sh + +To connect to your ZooKeeper server from outside the cluster execute the following commands: + + kubectl port-forward svc/zookeeper 2181:2181 & + zkCli.sh 127.0.0.1:2181 +WARNING: Rolling tag detected (bitnami/zookeeper:3.6), please note that it is strongly recommended to avoid using rolling tags in a production environment. ++info https://docs.bitnami.com/containers/how-to/understand-rolling-tags-containers/ +``` + +This will by default install a 3 nodes zookeeper ensemble. Wait until all the +three pods are marked as ready: + +```sh +kubectl get pods +``` + +``` +NAME READY STATUS RESTARTS AGE +zookeeper-0 1/1 Running 0 22h +zookeeper-1 1/1 Running 0 4d22h +zookeeper-2 1/1 Running 0 16m +``` + +## Configuring and Starting HStreamDB + +Once all the zookeeper pods are ready, we're ready to start installing the +HStreamDB cluster. + +### Fetching The K8s Specs + +```sh +git clone git@github.com:hstreamdb/hstream.git +cd hstream/deploy/k8s +``` + +### Update Configuration + +If you used a different way to install zookeeper, make sure to update the +zookeeper connection string in storage config file `config.json` and server +service file `hstream-server.yaml`. + +It should look something like this: + +```sh +cat config.json | grep -A 2 zookeeper +``` +``` + "zookeeper": { + "zookeeper_uri": "ip://zookeeper-0.zookeeper-headless:2181,zookeeper-1.zookeeper-headless:2181,zookeeper-2.zookeeper-headless:2181", + "timeout": "30s" + } +``` + +```sh +cat hstream-server.yaml | grep -A 1 metastore-uri +``` + +``` +- "--metastore-uri" +- "zk://zookeeper-0.zookeeper-headless:2181,zookeeper-1.zookeeper-headless:2181,zookeeper-2.zookeeper-headless:2181" +``` + +::: tip + +The zookeeper connection string in storage config file and the service file can +be different. But for normal scenario, they are the same. + +::: + +By default, this spec installs a 3 nodes HStream server cluster and 4 nodes +storage cluster. If you want a bigger cluster, modify the `hstream-server.yaml` +and `logdevice-statefulset.yaml` file, and increase the number of replicas to +the number of nodes you want in the cluster. Also by default, we attach a 40GB +persistent storage to the nodes, if you want more you can change that under the +volumeClaimTemplates section. + +### Starting the Cluster + +```sh +kubectl apply -k . 
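+# "-k" applies the kustomization in the current directory,
+# i.e. the specs fetched in the previous step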
+``` + +When you run `kubectl get pods`, you should see something like this: + +``` +NAME READY STATUS RESTARTS AGE +hstream-server-0 1/1 Running 0 6d18h +hstream-server-1 1/1 Running 0 6d18h +hstream-server-2 1/1 Running 0 6d18h +logdevice-0 1/1 Running 0 6d18h +logdevice-1 1/1 Running 0 6d18h +logdevice-2 1/1 Running 0 6d18h +logdevice-3 1/1 Running 0 6d18h +logdevice-admin-server-deployment-5c5fb9f8fb-27jlk 1/1 Running 0 6d18h +zookeeper-0 1/1 Running 0 6d22h +zookeeper-1 1/1 Running 0 10d +zookeeper-2 1/1 Running 0 6d +``` + +### Bootstrapping cluster + +Once all the logdevice pods are running and ready, you'll need to bootstrap the +cluster to enable all the nodes. To do that, run: + +```sh-vue +kubectl run hstream-admin -it --rm --restart=Never --image=hstreamdb/hstream:{{ $version() }} -- \ + hadmin store --host logdevice-admin-server-service \ + nodes-config bootstrap --metadata-replicate-across 'node:3' +``` + +This will start a hstream-admin pod, that connects to the store admin server and +invokes the `nodes-config bootstrap` hadmin store command and sets the metadata +replication property of the cluster to be replicated across three different +nodes. On success, you should see something like: + +```txt +Successfully bootstrapped the cluster +pod "hstream-admin" deleted +``` + +Now, you can boostrap hstream server, by running the following command: + +```sh-vue +kubectl run hstream-admin -it --rm --restart=Never --image=hstreamdb/hstream:{{ $version() }} -- \ + hadmin server --host hstream-server-0.hstream-server init +``` + +On success, you should see something like: + +```txt +Cluster is ready! +pod "hstream-admin" deleted +``` + +Note that depending on how fast the storage cluster completes bootstrap, running +`hadmin init` may fail. So you may need to run the command more than once. + +## Managing the Storage Cluster + +```sh-vue +kubectl run hstream-admin -it --rm --restart=Never --image=hstreamdb/hstream:{{ $version() }} -- bash +``` + +Now you can run `hadmin store` to manage the cluster: + +```sh +hadmin store --help +``` + +To check the state of the cluster, you can then run: + +```sh +hadmin store --host logdevice-admin-server-service status +``` + +```txt ++----+-------------+-------+---------------+ +| ID | NAME | STATE | HEALTH STATUS | ++----+-------------+-------+---------------+ +| 0 | logdevice-0 | ALIVE | HEALTHY | +| 1 | logdevice-1 | ALIVE | HEALTHY | +| 2 | logdevice-2 | ALIVE | HEALTHY | +| 3 | logdevice-3 | ALIVE | HEALTHY | ++----+-------------+-------+---------------+ +Took 2.567s +``` diff --git a/docs/v0.17.0/deploy/quick-deploy-ssh.md b/docs/v0.17.0/deploy/quick-deploy-ssh.md new file mode 100644 index 0000000..75a6557 --- /dev/null +++ b/docs/v0.17.0/deploy/quick-deploy-ssh.md @@ -0,0 +1,460 @@ +# Deployment with hdt + +This document provides a way to start an HStreamDB cluster quickly using the deployment tool `hdt`. + +## Pre-Require + +- The local host needs to be able to connect to the remote server via SSH + +- Make sure remote server has docker installed. + +- Make sure that the logged-in user has `sudo` execute privileges and configure `sudo` to run without prompting for a password. + +- For nodes that deploy `HStore` instances, mount the data disks to `/mnt/data*`, where `*` matches an incremental number starting from zero. + - Each disk should be mounted to a separate directory. 
For example, if there are two data disks, `/dev/vdb` and `/dev/vdc`, then `/dev/vdb` should be mounted to `/mnt/data0`, and `/dev/vdc` should be mounted to `/mnt/data1`. + +## Deploy `hdt` on the control machine + +We'll use a deployment tool `hdt` to help us set up the cluster. The binaries are available here: https://github.com/hstreamdb/deployment-tool/releases. + +1. Log in to the control machine and download the binaries. + +2. Generate configuration template with command: + + ```shell + ./hdt init + ``` + + The current directory structure will be as follows after running the `init` command: + + ```markdown + ├── hdt + └── template + ├── config.yaml + ├── logdevice.conf + ├── alertmanager + | └── alertmanager.yml + ├── grafana + │   ├── dashboards + │   └── datasources + ├── prometheus + ├── hstream_console + ├── filebeat + ├── kibana + │   └── export.ndjson + └── script + ``` + +## Update `Config.yaml` + +`template/config.yaml` contains the template for the configuration file. Refer to the description of the fields in the file and modify the template according to your actual needs. + +As a simple example, we will be deploying a cluster on three nodes, each consisting of an HServer instance, an HStore instance, and a Meta-Store instance. In addition, we will deploy HStream Console, Prometheus, and HStream Exporter on another node. For hstream monitor stack, refer to [monitor components config](./quick-deploy-ssh.md#monitor-stack-components). + +The final configuration file may looks like: + +```yaml +global: + user: "root" + key_path: "~/.ssh/id_rsa" + ssh_port: 22 + +hserver: + - host: 172.24.47.175 + - host: 172.24.47.174 + - host: 172.24.47.173 + +hstore: + - host: 172.24.47.175 + enable_admin: true + - host: 172.24.47.174 + - host: 172.24.47.173 + +meta_store: + - host: 172.24.47.175 + - host: 172.24.47.174 + - host: 172.24.47.173 + +hstream_console: + - host: 172.24.47.172 + +prometheus: + - host: 172.24.47.172 + +hstream_exporter: + - host: 172.24.47.172 +``` + +## Set up cluster + +### set up cluster with ssh key-value pair + +```shell +./hdt start -c template/config.yaml -i ~/.ssh/id_rsa -u root +``` + +### set up cluster with passwd + +```shell +./hdt start -c template/config.yaml -p -u root +``` + +then type your password. + +use `./hdt start -h` for more information + +## Remove cluster + +remove cluster will stop cluster and remove ***ALL*** related data. + +### remove cluster with ssh key-value pair + +```shell +./hdt remove -c template/config.yaml -i ~/.ssh/id_rsa -u root +``` + + ### remove cluster with passwd + +```shell +./hdt remove -c template/config.yaml -p -u root +``` + +then type your password. + +## Detailed configuration items + +This section describes the meaning of each field in the configuration file in detail. The configuration file is divided into three main sections: global configuration items, monitoring component configuration items, and other component configuration items. + +### Global + +```yaml +global: + # # Username to login via SSH + user: "root" + # # The path of SSH identity file + key_path: "~/.ssh/hstream-aliyun.pem" + # # SSH service monitor port + ssh_port: 22 + # # Replication factors of store metadata + meta_replica: 1 + # # Local path to MetaStore config file + meta_store_config_path: "" + # # Local path to HStore config file + hstore_config_path: "" + # # HStore config file can be loaded from network filesystem, for example, the config file + # # can be stored in meta store and loaded via network request. 
Set this option to true will + # # force store load config file from its local filesystem. + disable_store_network_config_path: true + # # Local path to HServer config file + hserver_config_path: "" + # # use grpc-haskell framework + enable_grpc_haskell: false + # # Local path to ElasticSearch config file + elastic_search_config_path: "" + # # Only enable for linux kernel which support dscp reflection(linux kernel version + # # greater and equal than 4.x) + enable_dscp_reflection: false + # # Global container configuration + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + +The Global section is used to set the default configuration values for all other configuration items. + +- `meta_replica` set the replication factors of HStreamDB metadata logs. This value should not exceed the number of `hstore` instances. +- `meta_store_config_path`、`hstore_config_path` and `hserver_config_path` are configuration file path for `meta_store`、`hstore` and `hserver` in the control machine. If the paths are set, these configuration files will be synchronized to the specified location on the node where the respective instance is located, and the corresponding configuration items will be updated when the instance is started. +- `enable_grpc_haskell`: use `grpc-haskell` framework. The default value is false, which will use `hs-grpc` framework. +- `enable_dscp_reflection`: if your operation system version is greater and equal to linux 4.x, you can set this field to true. +- `container_config` let you set resource limitations for all containers. + +### Monitor + +```yaml +monitor: + # # Node exporter port + node_exporter_port: 9100 + # # Node exporter image + node_exporter_image: "prom/node-exporter" + # # Cadvisor port + cadvisor_port: 7000 + # # Cadvisor image + cadvisor_image: "gcr.io/cadvisor/cadvisor:v0.39.3" + # # List of nodes that won't be monitored. + excluded_hosts: [] + # # root directory for all monitor related config files. + remote_config_path: "/home/deploy/monitor" + # # root directory for all monitor related data files. + data_dir: "/home/deploy/data/monitor" + # # Set up grafana without login + grafana_disable_login: true + # # Global container configuration for monitor stacks. + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + +The Monitor section is used to specify the configuration options for the `cadvisor` and `node-exporter`. + +### HServer + +```yaml +hserver: + # # The ip address of the HServer + - host: 10.1.0.10 + # # HServer docker image + image: "hstreamdb/hstream" + # # The listener is an adderss that a server advertises to its clients so they can connect to the server. + # # Each listener is specified as "listener_name:hstream://host_name:port_number". The listener_name is + # # a name that identifies the listener, and the "host_name" and "port_number" are the IP address and + # # port number that reachable from the client's network. Multi listener will split by comma. 
+ # # For example: public_ip:hstream://39.101.190.70:6582 + advertised_listener: "" + # # HServer listen port + port: 6570 + # # HServer internal port + internal_port: 6571 + # # HServer configuration + server_config: + # # HServer log level, valid values: [critical|error|warning|notify|info|debug] + server_log_level: info + # # HStore log level, valid values: [critical|error|warning|notify|info|debug|spew] + store_log_level: info + # # Specific server compression algorithm, valid values: [none|lz4|lz4hc] + compression: lz4 + # # Root directory of HServer config files + remote_config_path: "/home/deploy/hserver" + # # Root directory of HServer data files + data_dir: "/home/deploy/data/hserver" + # # HServer container configuration + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + +The HServer section is used to specify the configuration options for the `hserver` instance. + +### HAdmin + +```yaml +hadmin: + - host: 10.1.0.10 + # # HAdmin docker image + image: "hstreamdb/hstream" + # # HAdmin listen port + admin_port: 6440 + # # Root directory of HStore config files + remote_config_path: "/home/deploy/hadmin" + # # Root directory of HStore data files + data_dir: "/home/deploy/data/hadmin" + # # HStore container configuration + container_config: + cpu_limit: 2.00 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + +The HAdmin section is used to specify the configuration options for the `hadmin` instance. + +- Hadmin is not a necessary component. You can configure `hstore` instance to take on the functionality of `hadmin` by setting the configuration option `enable_admin: true` within the hstore. + +- If you have both a HAdmin instance and a HStore instance running on the same node, please note that they cannot both use the same `admin_port` for monitoring purposes. To avoid conflicts, you will need to assign a unique `admin_port` value to each instance. + +### HStore + +```yaml +hstore: + - host: 10.1.0.10 + # # HStore docker image + image: "hstreamdb/hstream" + # # HStore admin port + admin_port: 6440 + # # Root directory of HStore config files + remote_config_path: "/home/deploy/hstore" + # # Root directory of HStore data files + data_dir: "/home/deploy/data/store" + # # Total used disks + disk: 1 + # # Total shards + shards: 2 + # # The role of the HStore instance. + role: "Both" # [Storage|Sequencer|Both] + # # When Enable_admin is turned on, the instance can receive and process admin requests + enable_admin: true + # # HStore container configuration + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + +The HStore section is used to specify the configuration options for the `hstore` instance. + +- `admin_port`: HStore service will listen on this port. +- `disk` and `shards`: Set total used disks and total shards. For example, `disk: 2` and `shards: 4` means the hstore will persistant data in two disks, and each disk will contain 2 shards. +- `role`: a HStore instance can act as a Storage, a Sequencer or both, default is both. +- `enable_admin`: If the 'true' value is assigned to this setting, the current hstore instance will be able to perform the same functions as hadmin. + +### Meta-store + +```yaml +meta_store: + - host: 10.1.0.10 + # # Meta-store docker image + image: "zookeeper:3.6" + # # Meta-store port, currently only works for rqlite. 
zk will + # # monitor on 4001 + port: 4001 + # # Raft port used by rqlite + raft_port: 4002 + # # Root directory of Meta-Store config files + remote_config_path: "/home/deploy/metastore" + # # Root directory of Meta-store data files + data_dir: "/home/deploy/data/metastore" + # # Meta-store container configuration + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + +The Meta-store section is used to specify the configuration options for the `meta-store` instance. + +- `port` and `raft_port`: these are used by `rqlite` + +### Monitor stack components + +```yaml +prometheus: + - host: 10.1.0.15 + # # Prometheus docker image + image: "prom/prometheus" + # # Prometheus service monitor port + port: 9090 + # # Root directory of Prometheus config files + remote_config_path: "/home/deploy/prometheus" + # # Root directory of Prometheus data files + data_dir: "/home/deploy/data/prometheus" + # # Prometheus container configuration + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true + +grafana: + - host: 10.1.0.15 + # # Grafana docker image + image: "grafana/grafana-oss:main" + # # Grafana service monitor port + port: 3000 + # # Root directory of Grafana config files + remote_config_path: "/home/deploy/grafana" + # # Root directory of Grafana data files + data_dir: "/home/deploy/data/grafana" + # # Grafana container configuration + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true + +alertmanager: + # # The ip address of the Alertmanager Server. + - host: 10.0.1.15 + # # Alertmanager docker image + image: "prom/alertmanager" + # # Alertmanager service monitor port + port: 9093 + # # Root directory of Alertmanager config files + remote_config_path: "/home/deploy/alertmanager" + # # Root directory of Alertmanager data files + data_dir: "/home/deploy/data/alertmanager" + # # Alertmanager container configuration + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true + +hstream_exporter: + - host: 10.1.0.15 + # # hstream_exporter docker image + image: "hstreamdb/hstream-exporter" + # # hstream_exporter service monitor port + port: 9250 + # # Root directory of hstream_exporter config files + remote_config_path: "/home/deploy/hstream-exporter" + # # Root directory of hstream_exporter data files + data_dir: "/home/deploy/data/hstream-exporter" + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + +Currently, HStreamDB monitor stack contains the following components:`node-exporter`, `cadvisor`, `hstream-exporter`, `grafana`, `alertmanager` and `hstream-exporter`. The global configuration of the monitor stack is available in [monitor](./quick-deploy-ssh.md#monitor) field. 
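+
+After the cluster is up, a quick way to confirm that the monitor stack is
+reachable is to probe the HTTP endpoints of Prometheus and Grafana (a sketch;
+substitute the host and ports you configured above):
+
+```shell
+# Prometheus readiness endpoint (port 9090 as configured above)
+curl -s http://10.1.0.15:9090/-/ready
+# Grafana health endpoint (port 3000 as configured above)
+curl -s http://10.1.0.15:3000/api/health
+```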
+ +### Elasticsearch, Kibana and Filebeat + +```yaml +elasticsearch: + - host: 10.1.0.15 + # # Elasticsearch service monitor port + port: 9200 + # # Elasticsearch docker image + image: "docker.elastic.co/elasticsearch/elasticsearch:8.5.0" + # # Root directory of Elasticsearch config files + remote_config_path: "/home/deploy/elasticsearch" + # # Root directory of Elasticsearch data files + data_dir: "/home/deploy/data/elasticsearch" + # # Elasticsearch container configuration + container_config: + cpu_limit: 2.00 + memory_limit: 8G + disable_restart: true + remove_when_exit: true + +kibana: + - host: 10.1.0.15 + # # Kibana service monitor port + port: 5601 + # # Kibana docker image + image: "docker.elastic.co/kibana/kibana:8.5.0" + # # Root directory of Kibana config files + remote_config_path: "/home/deploy/kibana" + # # Root directory of Kibana data files + data_dir: "/home/deploy/data/kibana" + # # Kibana container configuration + container_config: + cpu_limit: 2.00 + memory_limit: 8G + disable_restart: true + remove_when_exit: true + +filebeat: + - host: 10.1.0.10 + # # Filebeat docker image + image: "docker.elastic.co/beats/filebeat:8.5.0" + # # Root directory of Filebeat config files + remote_config_path: "/home/deploy/filebeat" + # # Root directory of Filebeat data files + data_dir: "/home/deploy/data/filebeat" + # # Filebeat container configuration + container_config: + cpu_limit: 2.00 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + diff --git a/docs/v0.17.0/index.md b/docs/v0.17.0/index.md new file mode 100644 index 0000000..f1ab673 --- /dev/null +++ b/docs/v0.17.0/index.md @@ -0,0 +1,76 @@ + +# Introduction to HStreamDB + +## Overview + +**HStreamDB is a streaming database designed for streaming data, with complete +lifecycle management for accessing, storing, processing, and distributing +large-scale real-time data streams**. It uses standard SQL (and its stream +extensions) as the primary interface language, with real-time as the main +feature, and aims to simplify the operation and management of data streams and +the development of real-time applications. + +## Why HStreamDB? + +Nowadays, data is continuously being generated from various sources, e.g. sensor +data from the IoT, user-clicking events on the Internet, etc.. We want to build +low-latency applications that respond quickly to these incoming streaming data +to provide a better user experience, real-time data insights and timely business +decisions. + +However, currently, it is not easy to build such stream processing applications. +To construct a basic stream processing architecture, we always need to combine +multiple independent components. For example, you would need at least a +streaming data capture subsystem, a message/event storage component, a stream +processing engine, and multiple derived data systems for different queries. + +None of these should be so complicated, and this is where HStreamDB comes into +play. Just as you can easily build a simple CRUD application based on a +traditional database, with HStreamDB, you can easily build a basic streaming +application without any other dependencies. + +## Key Features + +### Reliable, low-latency streaming data storage + +With an optimized storage engine design, HStreamDB provides low latency persistent storage of streaming data and replicates written data to multiple storage nodes to ensure data reliability. 
+ +It also supports hierarchical data storage and can automatically dump historical data to lower-cost storage services such as object storage, distributed file storage, etc. The storage capacity is infinitely scalable, enabling permanent storage of data. + +### Easy support and management of large scale data streams + +HStreamDB uses a stream-native design where data is organized and accessed as streams, supporting creating and managing large data streams. Stream creation is a very lightweight operation in HStreamDB, maintaining stable read and write latency despite large numbers of streams being read and written concurrently. + +The performance of HStreamDB streams is excellent thanks to its native design, supporting millions of streams in a single cluster. + +### Real-time, orderly data subscription delivery + +HStreamDB is based on the classic publish-subscribe model, providing low-latency data subscription delivery for data consumption and the ability to deliver data subscriptions in the event of cluster failures and errors. + +It also guarantees the orderly delivery of machines in the event of cluster failures and errors. + +### Powerful stream processing support built-in + +HStreamDB has designed a complete processing solution based on event time. It supports basic filtering and conversion operations, aggregations by key, calculations based on various time windows, joining between data streams, and processing disordered and late messages to ensure the accuracy of calculation results. Simultaneously, the stream processing solution of HStream is highly extensible, and users can extend the interface according to their own needs. + +### Real-time analysis based on materialized views + +HStreamDB will offer materialized view to support complex query and analysis operations on continuously updated data streams. The incremental computing engine updates the materialized view instantly according to the changes of data streams, and users can query the materialized view through SQL statements to get real-time data insights. + +### Easy integration with multiple external systems + +The stream-native design of HStreamDB and the powerful stream processing capabilities built-in make it ideally suited as a data hub for the enterprise, responsible for all data access and flow, connecting multiple upstream and downstream services and data systems. + +For this reason, HStreamDB also provides Connector components for interfacing with various external systems, such as MySQL, ClickHouse, etc., making it easy to integrate with external data systems. + +### Cloud-native architecture, unlimited horizontal scaling + +HStreamDB is built with a Cloud-Native architecture, where the compute and storage layers are separated and can be horizontally scaled independently. + +It also supports online cluster scaling, dynamic expansion and contraction, and is efficient in scaling without data repartitioning, mass copying, etc. + +### Fault tolerance and high availability + +HStreamDB has built-in automatic node failure detection and error recovery mechanisms to ensure high availability while using an optimized consistency model based on Paxos. + +Data is always securely replicated to multiple nodes, ensuring consistency and orderly delivery even in errors and failures. 
diff --git a/docs/v0.17.0/ingest-and-distribute/_index.md b/docs/v0.17.0/ingest-and-distribute/_index.md new file mode 100644 index 0000000..a95e6a7 --- /dev/null +++ b/docs/v0.17.0/ingest-and-distribute/_index.md @@ -0,0 +1,6 @@ +--- +order: ['overview.md', 'user_guides.md', 'connectors.md'] +collapsed: false +--- + +Ingest and Distribute data diff --git a/docs/v0.17.0/ingest-and-distribute/connectors.md b/docs/v0.17.0/ingest-and-distribute/connectors.md new file mode 100644 index 0000000..869c906 --- /dev/null +++ b/docs/v0.17.0/ingest-and-distribute/connectors.md @@ -0,0 +1,20 @@ +# Connectors + +Sources: + +| Name | Configuration | Image | +| ----------------- | --------------------------------------------------------------------------------------------------------------- | -------------------------------------------- | +| source-mysql | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/source_mysql_spec.md) | hstreamdb/connector:source-mysql:latest | +| source-postgresql | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/source_postgresql_spec.md) | hstreamdb/connector:source-postgresql:latest | +| source-sqlserver | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/source_sqlserver_spec.md) | hstreamdb/connector:source-sqlserver:latest | +| source-mongodb | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/source_mongodb_spec.md) | hstreamdb/connector:source-mongodb:latest | +| source-generator | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/source_generator_spec.md) | hstreamdb/connector:source-generator:latest | + +Sinks: + +| Name | Configuration | Image | +| --------------- | ------------------------------------------------------------------------------------------------------------- | ------------------------------------------ | +| sink-mysql | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/sink_mysql_spec.md) | hstreamdb/connector:sink-mysql:latest | +| sink-postgresql | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/sink_postgresql_spec.md) | hstreamdb/connector:sink-postgresql:latest | +| sink-mongodb | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/sink_mongodb_spec.md) | hstreamdb/connector:sink-mongodb:latest | +| sink-blackhole | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/sink_blackhole_spec.md) | hstreamdb/connector:sink-blackhole:latest | diff --git a/docs/v0.17.0/ingest-and-distribute/overview.md b/docs/v0.17.0/ingest-and-distribute/overview.md new file mode 100644 index 0000000..42d5a03 --- /dev/null +++ b/docs/v0.17.0/ingest-and-distribute/overview.md @@ -0,0 +1,121 @@ +# HStream IO Overview + +HStream IO is an internal data integration framework for HStreamDB, composed of connectors and IO runtime. +It allows interconnection with various external systems, +facilitating the efficient flow of data across the enterprise data stack and thereby unleashing the value of real-time-ness. + +## Motivation + +HStreamDB is a streaming database, +we want to build a reliable data integration framework to connect HStreamDB with external systems easily, +we also want to use HStreamDB to build a real-time data synchronization service (e.g. synchronizes data from MySQL to PostgreSQL). 
+ +Here are our goals for HStream IO: + +* easy to use +* scalability +* fault-tolerance +* extensibility +* streaming and batch +* delivery semantics + +HStream IO is highly inspired by Kafka Connect, Pulsar IO, Airbyte, etc. frameworks, +we will introduce the architecture and workflow of HStream IO, +and compare it with other frameworks to describe how HStream IO achieves the goals listed above. + +## Architect and Workflow + +HStream IO consists of two components: + +* IO Runtime: IO Runtime is a part of HStreamDB managing and empowering scalability, fault-tolerance, and load-balancing for connectors. +* Connectors: Connectors are used to synchronize data between HStreamDB and external systems. + +HStream IO provides two types of connectors: +* Source Connector - A source connector subscribes to data from other systems such as MySQL, and PostgreSQL, making the data available for data processing in HStreamDB. +* Sink Connector - A sink connector writes data to other systems from HStreamDB streams. + +For a clear understanding, +we would name a running connector process to be a task and the Docker image for the connector is a connector plugin. + +Here is a summary workflow of creating a source connector: + +1. Users can send a CREATE SOURCE CONNECTOR SQL to HStreamDB to create a connector +2. HStreamDB dispatches the request to a correct node +3. HStream IO Runtime handles the request to launch a connector task +4. the connector task will fetch data from source systems and store them in HStreamDB. + +## Design and Implement + +### Easy to use + +HStream IO is a part of HStreamDB, +so if you want to create a connector, +do not need to deploy an HStream IO cluster like Kafka Connect, +just send a SQL to HStreamDB, e.g.: + +``` +create source connector source01 from mysql with + ( "host" = "mysql-s1" + , "port" = 3306 + , "user" = "root" + , "password" = "password" + , "database" = "d1" + , "table" = "person" + , "stream" = "stream01" + ); +``` + +### Scalability, Availability, and Delivery Semantics + +Connectors are resources for HStreamDB Cluster, +HStreamDB Cluster provides high scalability and fault-tolerance for HStream IO, +for more details, please check HStreamDB docs. + +Users can manually create multiple connectors for sources or streams to use parallel synchronization to achieve better performance, +we will support a connector scheduler for dynamical parallel synchronization like Kafka Connect and Pulsar IO soon. + +When a connector is running, the offsets of the connector will be recorded in HStreamDB, +so if the connector failed unexpectedly, +HStream IO Runtime will detect the failure and recover it by recent offsets, +even if the node crashed, +HStreamDB cluster will rebalance the connectors on the node to other nodes and recover them. + +HStream IO supported at-least-once delivery semantics now, +we will support more delivery semantics(e.g. exactly-once delivery semantic) for some connectors later. + +### Streaming and Batch + +Many ELT frameworks like Airbyte are designed for batch systems, +they can not handle streaming data efficiently, +HStreamDB is a streaming database, +and a lot of streaming data need to be loaded into HStreamDB, +so HStream IO is designed to support both streaming and batch data, +and users can use it to build a real-time streaming data synchronization service. + +### Extensibility + +We want to establish a great ecosystem like Kafka Connect and Airbyte, +so an accessible connector API for deploying new connectors is necessary. 
+ +Kafka Connect design a java connector API, +you can not develop connectors in other languages easily, +Airbyte and Pulsar IO inspired us to build a connector plugin as a Docker image to support multiple languages +and design a protocol between HStream IO Runtime and connectors, +but it brings more challenges to simplify the connector API, +you can not implement a couple of Java interfaces to build a connector easily like Kafka Connect, +you have to care about how to build a Docker image, +handle command line arguments, +implement the protocol interfaces correctly, etc. + +So to avoid that we split the connector API into two parts: + +* HStream IO Protocol +* Connector Toolkit + +Compared with Airbyte's heavy protocol, +HStream IO Protocol is designed as simple as possible, +it provides basic management interfaces for launching and stopping connectors, +does not need to exchange record messages(it will bring more latencies), +the Connector Toolkit is designed to handle heaviest jobs(e.g. fetch data from source systems, write data into HStreamDB, recorded offsets, etc.) +to provide the simplest connector API, +so developers can use Connector Toolkit to implement new connectors easily like Kafka Connect. diff --git a/docs/v0.17.0/ingest-and-distribute/user_guides.md b/docs/v0.17.0/ingest-and-distribute/user_guides.md new file mode 100644 index 0000000..f9e5f5f --- /dev/null +++ b/docs/v0.17.0/ingest-and-distribute/user_guides.md @@ -0,0 +1,146 @@ +# HStream IO User Guides + +Data synchronization service is used to synchronize data between systems(e.g. databases) in real time, +which is useful for many cases, for example, MySQL is a widely-used database, +if your application is running on MySQL, and: + +* You found its query performance is not enough. + + You want to migrate your data to another database (e.g. PostgreSQL), but you need to switch your application seamlessly. + + Your applications highly depended on MySQL, migrating is difficult, so you have to migrate gradually. + + You don't need to migrate the whole MySQL data, instead, just copy some data from MySQL to other databases(e.g. HStreamDB) for data analysis. +* You need to upgrade your MySQL version for some new features seamlessly. +* You need to back up your MySQL data in multiple regions in real time. + +In those cases, you will find a data synchronization service is helpful, +HStream IO is an internal data integration framework for HStreamDB, +and it can be used as a data synchronization service, +this document will show you how to use HStream IO to build a data synchronization service from a MySQL table to a PostgreSQL table, +you will learn: + +* How to create a source connector that synchronizes records from a MySQL table to an HStreamDB stream. +* How to create a sink connector that synchronizes records from an HStreamDB stream to a PostgreSQL table. +* How to use HStream SQLs to manage the connectors. + +## Set up an HStreamDB Cluster + +You can check +[quick-start](https://hstream.io/docs/en/latest/start/quickstart-with-docker.html) +to find how to set up an HStreamDB cluster and connect to it. + +## Set up a MySQL + +Set up a MySQL instance with docker: + +```shell +docker run --network=hstream-quickstart --name mysql-s1 -e MYSQL_ROOT_PASSWORD=password -d mysql +``` + +Here we use the `hstream-quickstart` network if you set up your HStreamDB +cluster based on +[quick-start](https://hstream.io/docs/en/latest/start/quickstart-with-docker.html). 
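+
+If you are not sure whether that docker network exists on your machine, you can
+check for it first and create it only if needed (standard docker commands; skip
+this step if the quick-start already created the network):
+
+```shell
+# list existing networks and look for hstream-quickstart
+docker network ls | grep hstream-quickstart
+# create it only if it is not listed above
+docker network create hstream-quickstart
+```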
+ +Connect to the MySQL instance: + +```shell +docker exec -it mysql-s1 mysql -uroot -h127.0.0.1 -P3306 -ppassword +``` + +Create a database `d1`, a table `person` and insert some records: + +```sql +create database d1; +use d1; +create table person (id int primary key, name varchar(256), age int); +insert into person values (1, "John", 20), (2, "Jack", 21), (3, "Ken", 33); +``` + +the table `person` must include a primary key, or the `DELETE` statement may not +be synced correctly. + +## Set up a PostgreSQL + +Set up a PostgreSQL instance with docker: + +```shell +docker run --network=hstream-quickstart --name pg-s1 -e POSTGRES_PASSWORD=postgres -d postgres +``` + +Connect to the PostgreSQL instance: + +```shell +docker exec -it pg-s1 psql -h 127.0.0.1 -U postgres +``` + +`sink-postgresql` doesn't support the automatic creation of a table yet, so you +need to create the database `d1` and the table `person` first: + +```sql +create database d1; +\c d1; +create table person (id int primary key, name varchar(256), age int); +``` + +The table `person` must include a primary key. + +## Download Connector Plugins + +A connector plugin is a docker image, so before you can set up the connectors, +you should download and update their plugins with `docker pull`: + +```shell +docker pull hstreamdb/source-mysql:latest +docker pull hstreamdb/sink-postgresql:latest +``` + +[Here](https://hstream.io/docs/en/latest/io/connectors.html) is a table of all +available connectors. + +## Create Connectors + +After connecting an HStream Server, you can use create source/sink connector +SQLs to create connectors. + +Connect to the HStream server: + +```shell-vue +docker run -it --rm --network host hstreamdb/hstream:{{ $version() }} hstream sql --port 6570 +``` + +Create a source connector: + +```sql +create source connector source01 from mysql with ("host" = "mysql-s1", "port" = 3306, "user" = "root", "password" = "password", "database" = "d1", "table" = "person", "stream" = "stream01"); +``` + +The source connector will run an HStream IO task, which continually synchronizes +data from MySQL table `d1.person` to stream `stream01`. Whenever you update +records in MySQL, the change events will be recorded in stream `stream01` if the +connector is running. + +You can use `SHOW CONNECTORS` to check connectors and their status and use +`PAUSE` and `RESUME` to stop/restart connectors: + +```sql +PAUSE connector source01; +RESUME connector source01; +``` + +If resume the connector task, the task will not fetch data from the beginning, +instead, it will continue from the point when it was paused. + +Then you can create a sink connector that consumes the records from the stream +`stream01` to PostgreSQL table `d1.public.person`: + +```sql +create sink connector sink01 to postgresql with ("host" = "pg-s1", "port" = 5432, "user" = "postgres", "password" = "postgres", "database" = "d1", "table" = "person", "stream" = "stream01"); +``` + +With both `source01` and `sink01` connectors running, you get a synchronization +service from MySQL to PostgreSQL. 
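+
+To verify that the synchronization works, insert or update a few rows in the
+MySQL table and then query the PostgreSQL table (inside the `psql` session
+opened earlier, after `\c d1`); the exact rows you see depend on what you wrote
+to MySQL:
+
+```sql
+select * from person;
+```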
+ +You can use the `DROP CONNECTOR` statement to delete the connectors: + +```sql +drop connector source01; +drop connector sink01; +``` diff --git a/docs/v0.17.0/overview/_index.md b/docs/v0.17.0/overview/_index.md new file mode 100644 index 0000000..43e9b9a --- /dev/null +++ b/docs/v0.17.0/overview/_index.md @@ -0,0 +1,6 @@ +--- +order: ['concepts.md'] +collapsed: false +--- + +Overview diff --git a/docs/v0.17.0/overview/concepts.md b/docs/v0.17.0/overview/concepts.md new file mode 100644 index 0000000..1192788 --- /dev/null +++ b/docs/v0.17.0/overview/concepts.md @@ -0,0 +1,41 @@ +# Concepts + +This page explains key concepts in HStream, which we recommend you to understand before you start. + +## Record + +In HStream, a record is a unit of data that may contain arbitrary user data and is immutable. Each record is assigned a unique recordID in a stream. Additionally, a partition key is included in every record, represented as a string, and used to determine the stream shard where the record is stored. + +## Stream + +All records live in streams. A stream is essentially an unbound, append-only dataset. A stream can contain multiple shards and each shard can be located in different nodes. There are some attributes of a stream, such as: +- Replicas: how many replicas the data in the stream has +- Retention period: how long the data in the stream is retained + +## Subscription + +Clients can obtain the latest data in streams in real time by subscriptions. A subscription can automatically track and save the progress of clients processing records: clients indicate that a record has been successfully received and processed by replying to the subscription with a corresponding ACK, and the subscription will not continue to deliver records to clients that have already been acked. If the subscription has not received ACKs from clients after the specified time, it will redeliver last records. + +A subscription is immutable, which means you cannot reset its internal delivery progress. Multiple subscriptions can be created on the same stream, and they are independent of each other. + +Multiple clients can join the same subscription, and the system will distribute records accross clients based on different subscription modes. Currently, the default subscription mode is shard-based, which means that records in a shard will be delivered to the same client, and different shards can be assigned to different clients. + +## Query + +Unlike queries in traditional databases that operate on static datasets, return limited results, and immediately terminate execution, queries in HStream operate on unbound data streams, continuously update results as the source streams changes, and keep running until the user stops it explicitly. This kind of long-running queries against unbound streams is also known as the streaming query. + +By default, a query will write its computing results to a new stream continuously. Clients can subscribe to the result stream to obtain real-time updates of the computing results. + +## Materialized View + +Queries are also usually used to create materialized views. Unlike streams that store records in an append-only way, materialized views are more similar to tables in relational databases that hold results in a compacted form, which means that you can query them directly and get the latest results immediately. 
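+
+As a rough illustration (a hypothetical statement; see the SQL reference for the exact syntax supported by your version), a materialized view is typically created from a streaming query with an aggregation, for example:
+
+```sql
+-- hypothetical example: keep a continuously updated count of records per key from stream01
+CREATE VIEW view01 AS SELECT key1, COUNT(*) FROM stream01 GROUP BY key1;
+```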

Traditional databases also have materialized views, but the results are often stale and they have a significant impact on database performance, so they are rarely used in practice. However, in HStream, the results saved in materialized views are automatically updated in real time as the source streams change, which is very useful in many scenarios, such as building real-time dashboards.

## Connector

Connectors are responsible for the streaming data integration between HStream and external systems and can be divided into two categories according to the direction of integration: source connectors and sink connectors.

Source connectors are used for continuously ingesting data from other systems into HStream, and sink connectors are responsible for continuously distributing data from HStream to other systems. There are also different types of connectors for different systems, such as the PostgreSQL connector, the MongoDB connector, and so on.

The running of connectors is supervised, managed, and scheduled by HStream itself, without relying on any other systems.
diff --git a/docs/v0.17.0/platform/_index.md b/docs/v0.17.0/platform/_index.md new file mode 100644 index 0000000..d7f40da --- /dev/null +++ b/docs/v0.17.0/platform/_index.md @@ -0,0 +1,12 @@ +--- +order: + - stream-in-platform.md + - write-in-platform.md + - subscription-in-platform.md + - create-queries-in-platform.md + - create-views-in-platform.md + - create-connectors-in-platform.md +collapsed: false +--- + +HStream Platform diff --git a/docs/v0.17.0/platform/create-connectors-in-platform.md b/docs/v0.17.0/platform/create-connectors-in-platform.md new file mode 100644 index 0000000..d1e264c --- /dev/null +++ b/docs/v0.17.0/platform/create-connectors-in-platform.md @@ -0,0 +1,108 @@ +# Create and Manage Connectors

This page describes how to create and manage connectors in HStream Platform.

## Create a connector

There are two types of connectors: source connectors and sink connectors. A source connector is used to ingest data from external systems into HStream Platform, while a sink connector is used to distribute data from HStream Platform to external systems.

### Create a source connector

First, navigate to the **Sources** page and click the **New source** button to go to the **Create a new source** page.

On this page, first select the **Connector type**. Currently, HStream Platform supports the following source connectors:

- MongoDB
- MySQL
- PostgreSQL
- SQL Server
- Generator

Click one of them to select it, and the page will display the corresponding configuration form.

After filling in the configuration, click the **Create** button to create the source connector.

::: tip

For more details about the configuration of each source connector, please refer to [Connectors](../ingest-and-distribute/connectors.md).

:::

### Create a sink connector

Creating a sink connector is similar to creating a source connector. First, navigate to the **Sinks** page and click the **New sink** button to go to the **Create a new sink** page.

The next steps are the same as creating a source connector.

Currently, HStream Platform supports the following sink connectors:

- MongoDB
- MySQL
- PostgreSQL
- Blackhole

## View connectors

The **Sources** and **Sinks** pages display all the connectors in your account. For each connector, you can view the following information:

- The **Name** of the connector.
- The **Created time** of the connector.
- The **Status** of the connector.
+- The **Type** of the connector. +- **Actions**, which for the extra operations of the connector: + + - **Duplicate**: Duplicate the connector. + - **Delete**: Delete the connector. + +To view a specific connector, you can click the **Name** of the connector to go to the [details page](#view-connector-details). + +## View connector details + +The details page displays the detailed information of a connector: + +1. All the information in the [connectors](#view-connectors) page. +2. Different tabs are provided to display the related information of the connector: + + - [**Overview**](#view-connector-overview): Besides the basic information, also can view the metrics of the connector. + - **Config**: View the configuration of the connector. + - [**Consumption Process**](#view-connector-consumption-process): View the consumption process of the connector. + - **Logs**: View the tasks of the connector. + +## View connector overview + +The **Overview** page displays the metrics of a connector. The default duration is **last 5 minutes**. You can select different durations to control the time range of the metrics: + +- last 5 minutes +- last 1 hour +- last 3 hours +- last 6 hours +- last 12 hours +- last 1 day +- last 3 days +- last 1 week + +The metrics of the connector include (with last 5 minutes as an example), from left to right: + +- **Processed records throughput**: The number of records that the connector processes per second. +- **Processed bytes throughput**: The number of bytes that the connector processes per second. +- **Total records** (Sink): The number of records that the connector processes in the last 5 minutes. + +## View connector consumption process + +The **Consumption Process** page displays the consumption process of a connector. Different connectors have different consumption processes. + +## Delete a connector + +This section describes how to delete a connector. + +### Delete a connector from the Connectors page + +1. Navigate to the **Connectors** page. +2. Click the **Delete** icon of the connector you want to delete. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. + +### Delete a connector from the Connector Details page + +1. Navigate to the details page of the connector you want to delete. +2. Click the **Delete** button. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. diff --git a/docs/v0.17.0/platform/create-queries-in-platform.md b/docs/v0.17.0/platform/create-queries-in-platform.md new file mode 100644 index 0000000..f95ec68 --- /dev/null +++ b/docs/v0.17.0/platform/create-queries-in-platform.md @@ -0,0 +1,127 @@ +# Create and Manage Streaming Queries + +This page describes how to create and manage streaming queries in HStream Platform. + +## Create a query + +First, navigate to the **Queries** page and click the **Create query** button to +go to the **Create query** page. + +In this page, you can see 3 areas used throughout the creation process: + +- The **Stream Catalog** area on the left is used to select the streams you + want to use in the query. +- The **Query Editor** area on the top right is used to write the query. +- The **Query Result** area on the bottom right is used to display the query result. + +Below sections describe how these areas are used in the creation process. + +### Stream Catalog + +The **Stream Catalog** will display all the streams as a list. You can use one of +them as the source stream of the query. 
For example, if a stream is `test`, after +selecting it, the **Query Editor** will be filled with the following query: + +```sql +CREATE STREAM stream_iyngve AS SELECT * FROM test; +``` + +This can help you quickly create a query. You can also change the query to meet your needs. + +::: tip +The auto-generated query is commented by default. You need to uncomment it to make it work. +::: + +::: info +The auto-generated query will generate a stream with a `stream_` prefix and a random suffix after +creating it. You can change the name of the stream to meet your needs. +::: + +### Query Editor + +The **Query Editor** is used to write the query. + +Besides the textarea, there are still a right sidebar to assist you in writing the query. +To create a query, you need to provide a **Query name** to identify the query, an text field in the right sidebar will automatically generate a query name for you. You can also change it to meet your needs. + +Once you finish writing the query, click the **Save & Run** button to create the query and run it. + +### Query Result + +After creating the query, the **Query Result** area will display the query result in real time. + +The query result is displayed in a table. Each row represents a record in the stream. You can refer to [Get Records](./write-in-platform.md#get-records) to learn more about the record. + +If you want to stop viewing the query result, you can click the **Cancel** button to stop it. For re-viewing the query result, you can click the **View Live Result** button to view again. + +::: info + +When creating a materialized view, it will internally create a query and its result is the view. So the query result is the same as the view result. + +::: + +## View queries + +The **Queries** page displays all the queries in your account. For each query, you can view the following information: + +- The **Name** of the query. +- The **Created time** of the query. +- The **Status** of the query. +- The **SQL** of the query. +- **Actions**, which for the extra operations of the query: + + - **Terminate**: Terminate the query. + - **Duplicate**: Duplicate the query. + - **Delete**: Delete the query. + +To view a specific query, you can click the **Name** of the query to go to the [details page](#view-query-details). + +## View query details + +The details page displays the detailed information of a query: + +1. All the information in the [queries](#view-queries) page. +2. Different tabs are provided to display the related information of the query: + + - [**Overview**](#view-query-overview): Besides the basic information, also can view the metrics of the query. + - [**SQL**](#view-query-sql): View the SQL of the query. + +## View query overview + +The **Overview** page displays the metrics of a query. The default duration is **last 5 minutes**. You can select different durations to control the time range of the metrics: + +- last 5 minutes +- last 1 hour +- last 3 hours +- last 6 hours +- last 12 hours +- last 1 day +- last 3 days +- last 1 week + +The metrics of the query include (with last 5 minutes as an example), from left to right: + +- **Input records throughput**: The number of records that the query receives from the source stream per second. +- **Output records throughput**: The number of records that the query outputs to the result stream per second. +- **Total records**: The number of records to the query in the last 5 minutes. Including input and output records. 
+- **Execution errors**: The number of errors that the query encounters in the last 5 minutes. + +## View query SQL + +The **SQL** page displays the SQL of a query. You can only view the SQL of the query, but cannot edit it. + +## Delete a query + +This section describes how to delete a query. + +### Delete a query from the Queries page + +1. Navigate to the **Queries** page. +2. Click the **Delete** icon of the query you want to delete. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. + +### Delete a query from the Query Details page + +1. Navigate to the details page of the query you want to delete. +2. Click the **Delete** button. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. diff --git a/docs/v0.17.0/platform/create-views-in-platform.md b/docs/v0.17.0/platform/create-views-in-platform.md new file mode 100644 index 0000000..f9a6c78 --- /dev/null +++ b/docs/v0.17.0/platform/create-views-in-platform.md @@ -0,0 +1,70 @@ +# Create and Manage Materialized Views + +This page describes how to create and manage materialized views in HStream Platform. + +## Create a view + +Create a view is similar to create a query. The main difference is that the SQL is a `CREATE VIEW` statement. + +Please refer to [Create a query](./create-queries-in-platform.md#create-a-query) for more details. + +## View views + +The **Views** page displays all the views in your account. For each view, you can view the following information: + +- The **Name** of the view. +- The **Created time** of the view. +- The **Status** of the view. +- **Actions**, which for the extra operations of the view: + + - **Delete**: Delete the view. + +To view a specific view, you can click the **Name** of the view to go to the [details page](#view-view-details). + +## View view details + +The details page displays the detailed information of a view: + +1. All the information in the [views](#view-views) page. +2. Different tabs are provided to display the related information of the view: + + - [**Overview**](#view-view-overview): Besides the basic information, also can view the metrics of the view. + - [**SQL**](#view-view-sql): View the SQL of the view. + +## View view overview + +The **Overview** page displays the metrics of a view. The default duration is **last 5 minutes**. You can select different durations to control the time range of the metrics: + +- last 5 minutes +- last 1 hour +- last 3 hours +- last 6 hours +- last 12 hours +- last 1 day +- last 3 days +- last 1 week + +The metrics of the view include (with last 5 minutes as an example), from left to right: + +- **Execution queries throughput**: The number of queries that the view executes per second. +- **Execution queries**: The number of queries that the view executes in the last 5 minutes. + +## View view SQL + +The **SQL** page displays the SQL of a view. You can only view the SQL of the view, but cannot edit it. + +## Delete a view + +This section describes how to delete a view. + +### Delete a view from the Views page + +1. Navigate to the **Views** page. +2. Click the **Delete** icon of the view you want to delete. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. + +### Delete a view from the View Details page + +1. Navigate to the details page of the view you want to delete. +2. Click the **Delete** button. A confirmation dialog will pop up. +3. 
Confirm the deletion by clicking the **Confirm** button in the dialog. diff --git a/docs/v0.17.0/platform/stream-in-platform.md b/docs/v0.17.0/platform/stream-in-platform.md new file mode 100644 index 0000000..cd5a9ea --- /dev/null +++ b/docs/v0.17.0/platform/stream-in-platform.md @@ -0,0 +1,156 @@ +# Create and Manage Streams + +This tutorial guides you on how to create and manage streams in HStream Platform. + +## Preparation + +1. If you do not have an account, please [apply for a trial](../start/try-out-hstream-platform.md#apply-for-a-trial) first and log in. After logging in, click **Streams** on the left sidebar to enter the streams page. + +2. If you have already logged in, click **Streams** on the left sidebar to enter the **Streams** page. + +3. Click the **New stream** button to create a stream. + +## Create a stream + +After clicking the **New stream** button, you will be directed to the **New Stream** page. You need to set some necessary properties for your stream and create it: + +1. Specify the **stream name**. You can refer to [Guidelines to name a resource](../write/stream.md#guidelines-to-name-a-resource) to name a stream. + +2. Fill in with the number of **shards** you want this stream to have. The default value is **1**. + + > Shard is the primary storage unit for the stream. For more details, please refer to [Sharding in HStreamDB](../write/shards.md#sharding-in-hstreamdb). + +3. Fill in with the number of **replicas** for each stream. The default value is **3**. + +4. Fill in with the number of **retention** for each stream. Default value is **72**. Unit is **hour**. + +5. Click the **Confirm** button to create a stream. + +::: tip +For more details about **replicas** and **retention**, please refer to [Attributes of a Stream](../write/stream.md#attributes-of-a-stream). +::: + +::: warning +Currently, the number of **replicas** and **retention** are fixed for each stream in HStream Platform. We will gradually adjust these attributes in the future. +::: + +## View streams + +The **Streams** page lists all the streams in your account with a high-level overview. For each stream, you can view the following information: + +- The **name** of the stream. +- The **Creation time** of the stream. +- The number of **shards** in a stream. +- The number of **replicas** in a stream. +- The **Data retention period** of the records in a stream. +- **Actions**, which for the extra operations of the stream: + + - **Metrics**: View the metrics of the stream. + - **Subscriptions**: View the subscriptions of the stream. + - **Shards**: View the shard details of the stream. + - **Delete**: Delete the stream. + +To view a specific stream, click the name. [The details page of the stream](#view-stream-details) will be displayed. + +## View stream details + +The details page displays the detailed information of a stream: + +1. All the information in the [streams](#view-streams) page. +2. Different tabs are provided to display the related information of the stream: + + - [**Metrics**](#view-stream-metrics): View the metrics of the stream. + - [**Subscriptions**](#view-stream-subscriptions): View the subscriptions of the stream. + - [**Shards**](#view-stream-shards): View the shard details of the stream. + - [**Records**](#get-records-in-a-stream): Search records in the stream. + +### View stream metrics + +After clicking the **Metrics** tab, you can view the metrics of the stream. +The default duration is **last 5 minutes**. 
You can select different durations to control the time range of the metrics:

- last 5 minutes
- last 1 hour
- last 3 hours
- last 6 hours
- last 12 hours
- last 1 day
- last 3 days
- last 1 week

The metrics of the stream include (with last 5 minutes as an example), from left to right:

- The **Append records throughput** chart shows the number of records appended to the stream per second in the last 5 minutes.
- The **Append bytes throughput** chart shows the number of bytes appended to the stream per second in the last 5 minutes.
- The **Total requests** chart shows the number of requests to the stream in the last 5 minutes, including failed requests.
- The **Append requests throughput** chart shows the number of append requests to the stream per second in the last 5 minutes.

### View stream subscriptions

After clicking the **Subscriptions** tab, you can view the subscriptions of the stream.

To create a new subscription, please refer to [Create a Subscription](./subscription-in-platform.md#create-a-subscription).

For more details about the subscription, please refer to [Subscription Details](./subscription-in-platform.md#subscription-details).

### View stream shards

After clicking the **Shards** tab, you can view the shard details of the stream.

For each shard, you can view the following information:

- The **ID** of the shard.
- The **Range start** of the shard.
- The **Range end** of the shard.
- The current **Status** of the shard.

You can use the ID to get records. Please refer to [Get records in a stream](#get-records-in-a-stream) or [Get Records](./write-in-platform.md#get-records).

### Get records in a stream

After clicking the **Records** tab, you can get records in the stream.

::: tip

To get records from any stream, please refer to [Get Records](./write-in-platform.md#get-records).

:::

You can specify the following filters to get records:

- **Shard**: Select one of the shards in the stream you want to get records from.
- Special filters:
  - **Start record ID**: Get records after a specified record ID. The default is the first record.
  - **Start date**: Get records after a specified date.

After providing the filters (or leaving them at their defaults), click the **Get records** button to get records. Each record is displayed in a row with the following information:

- The **ID** of the record.
- The **Key** of the record.
- The **Value** of the record.
- The **Shard ID** of the record.
- The **Creation time** of the record.

## Delete a Stream

This section describes how to delete a stream.

::: warning
If a stream has subscriptions, this stream cannot be deleted.
:::

::: danger
Deleting a stream is irreversible, and the data cannot be recovered after deletion.
:::

### Delete a stream on the Streams page

1. Navigate to the **Streams** page.
2. Click the **Delete** icon of the stream you want to delete. A confirmation dialog will pop up.
3. Confirm the deletion by clicking the **Confirm** button in the dialog.

### Delete a stream on the Stream Details page

1. Navigate to the details page of the stream you want to delete.
2. Click the **Delete** button. A confirmation dialog will pop up.
3. Confirm the deletion by clicking the **Confirm** button in the dialog.
diff --git a/docs/v0.17.0/platform/subscription-in-platform.md b/docs/v0.17.0/platform/subscription-in-platform.md new file mode 100644 index 0000000..de1bc84 --- /dev/null +++ b/docs/v0.17.0/platform/subscription-in-platform.md @@ -0,0 +1,117 @@ +# Create and Manage Subscriptions

This tutorial guides you on how to create and manage subscriptions in HStream Platform.

## Preparation

1. If you do not have an account, please [apply for a trial](../start/try-out-hstream-platform.md#apply-for-a-trial) first and log in. After logging in, click **Subscriptions** on the left sidebar to enter the subscriptions page.

2. If you have already logged in, click **Subscriptions** on the left sidebar to enter the **Subscriptions** page.

3. Click the **New subscription** button to create a subscription.

## Create a subscription

After clicking the **New subscription** button, you will be directed to the **New subscription** page. You need to set some necessary properties for your subscription and create it:

1. Specify the **Subscription ID**. You can refer to [Guidelines to name a resource](../write/stream.md#guidelines-to-name-a-resource) to name a subscription.

2. Select a stream as the source from the dropdown list.

3. Fill in the **ACK timeout**. The default value is **60** seconds.

4. Fill in the number of **max unacked records**. The default value is **100**.

5. Click the **Confirm** button to create a subscription.

::: tip
For more details about **ACK timeout** and **max unacked records**, please refer to [Attributes of a Subscription](../receive/subscription.md#attributes-of-a-subscription).
:::

::: warning
Currently, the **ACK timeout** and **max unacked records** values are fixed for each subscription in HStream Platform. We will gradually adjust these attributes in the future.
:::

## View subscriptions

The **Subscriptions** page lists all the subscriptions in your account with a high-level overview. For each subscription, you can view the following information:

- The subscription's **ID**.
- The name of the **stream** source. You can click on the stream name to navigate to the [stream details](./stream-in-platform.md#view-stream-details) page.
- The **ACK timeout** of the subscription.
- The **Max unacked records** of the subscription.
- The **Creation time** of the subscription.
- **Actions**, which is used to expand the operations of the subscription:

  - **Metrics**: View the metrics of the subscription.
  - **Consumers**: View the consumers of the subscription.
  - **Delete**: Delete the subscription.

To view a specific subscription, click the subscription's name. [The details page of the subscription](#view-subscription-details) will be displayed.

## View subscription details

The details page displays detailed information about a subscription:

1. All the information in the [subscriptions](#view-subscriptions) page.
2. Different tabs are provided to view the related information of the subscription:

   - [**Metrics**](#view-subscription-metrics): View the metrics of the subscription.
   - [**Consumers**](#view-subscription-consumers): View the consumers of the subscription.
   - [**Consumption progress**](#view-the-consumption-progress-of-the-subscription): View the consumption progress of the subscription.

### View subscription metrics

After clicking the **Metrics** tab, you can view the metrics of the subscription.
The default duration is **last 5 minutes**.
You can select different durations to control the time range of the metrics: + +- last 5 minutes +- last 1 hour +- last 3 hours +- last 6 hours +- last 12 hours +- last 1 day +- last 3 days +- last 1 week + +The metrics of the subscription include (with last 5 minutes as an example), from left to right: + +- The **Outcoming bytes throughput** chart shows the number of bytes sent by the subscription per second in the last 5 minutes. +- The **Outcoming records throughput** chart shows the number of records sent by the subscription per second in the last 5 minutes. +- The **Acknowledgements throughput** chart shows the number of acknowledgements received in the subscription per second in the last 5 minutes. +- The **Resent records** chart shows the number of records resent in the subscription in the last 5 minutes. + +### View subscription consumers + +After clicking the **Consumers** tab, you can view the consumers of the subscription. + +For each consumer, you can view the following information: + +- The **Name** of the consumer. +- The **Type** of the consumer. +- The **URI** of the consumer. + +### View the consumption progress of the subscription + +After clicking the **Consumption progress** tab, you can view the consumption progress of the subscription. + +For each progress, you can view the following information: + +- The **Shard ID** of the progress. +- The **Last checkpoint** of the progress. + +## Delete a Subscription + +This section describes how to delete a subscription. + +### Delete a subscription on the Subscriptions page + +1. Go to the **Subscriptions** page. +2. Click the **Delete** icon of the subscription you want to delete. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. + +### Delete a subscription on Subscription Details page + +1. Go to the details page of the subscription you want to delete. +2. Click the **Delete** button. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. diff --git a/docs/v0.17.0/platform/write-in-platform.md b/docs/v0.17.0/platform/write-in-platform.md new file mode 100644 index 0000000..f6605b0 --- /dev/null +++ b/docs/v0.17.0/platform/write-in-platform.md @@ -0,0 +1,85 @@ +# Write Records to Streams + +After creating a stream, you can write records to it according to the needs of your application. +This page describes how to write and get records in HStream Platform. + +## Preparation + +To write records, you need to create a stream first. + +1. If you do not have a stream, please refer to [Create a Stream](./stream-in-platform.md#create-a-stream) to create a stream. + +2. Go into any stream you want to write records to on the **Streams** page. + +3. Click the **Write records** button to write records to the stream. + +## Write records + +A record is like a piece of JSON data. You can add arbitrary fields to a record, only ensure that the record is a valid JSON object. + +A record also ships with a partition key, which is used to determine which shard the record will be allocated to and improve the read/write performance. + +::: tip +For more details about the partition key, please refer to [Partition Key](../write/write.md#write-records-with-partition-keys). +::: + +Take the following steps to write records to a stream: + +1. Specify the optional **Key**. This is the partition key of the record. The server will automatically assign a default key to the record if not provided. + +2. Fill in the **Value**. 
This is the content of the record. It must be a valid JSON object. + +3. Click the **Produce** button to write the record to the stream. + +4. If the record is written successfully, you will see a success message below the **Produce** button. + +5. If the record is written failed, you will see a failure message below the **Produce** button. + +## Get Records + +After writing records to a stream, you can get records from the **Records** page or the **Stream Details** page. + +### Get records from the Records page + +After navigating to the **Records** page, you can get records from a stream. + +Below are several filters you can use to get records: + +- **Stream**: Select the stream you want to get records. +- **Shard**: Select one of the shards in the stream you want to get records. +- Special filters: + - **Start record ID**: Get records after a specified record ID. The default is the first record. + - **Start date**: Get records after a specified date. + +::: info +The **Stream** and **Shard** will be filled automatically after loading the page. +By default, the filled value is the first stream and the first shard in the stream. +You can change them to get records from other streams. +::: + +::: info +We default to showing at most **100 records** after getting. If you want to get more records, +please specify a recent record ID in the **Start record ID** field or a recent date in the **Start date** field. +::: + +After filling in the filters, click the **Get records** button to get records. + +For each record, you can view the following information: + +1. The **ID** of the record. +2. The **Key** of the record. +3. The **Value** of the record. +4. The **Shard ID** of the record. +5. The **Creation time** of the record. + +In the next section, you will learn how to get records from the Stream Details page. + +### Get records from the Stream Details page + +Similar to [Get records from Records page](#get-records-from-the-records-page), +you can also get records from the **Stream Details** page. + +The difference is that you can get records without specifying the stream. +The records will automatically be retrieved from the stream you are currently viewing. + +For more details, please refer to [Get records in a stream](./stream-in-platform.md#get-records-in-a-stream). diff --git a/docs/v0.17.0/process/_index.md b/docs/v0.17.0/process/_index.md new file mode 100644 index 0000000..fbb0662 --- /dev/null +++ b/docs/v0.17.0/process/_index.md @@ -0,0 +1,6 @@ +--- +order: ['sql.md'] +collapsed: false +--- + +Process data diff --git a/docs/v0.17.0/process/sql.md b/docs/v0.17.0/process/sql.md new file mode 100644 index 0000000..c549964 --- /dev/null +++ b/docs/v0.17.0/process/sql.md @@ -0,0 +1,202 @@ +# Perform Stream Processing by SQL + +This part provides a demo of performing real-time stream processing by SQL. You +will be introduced to some basic concepts such as **streams**, **queries** and +**materialized views** with some examples to demonstrate the power of our +processing engine, such as the ease to use and dealing with complex queries. + +## Overview + +One of the most important applications of stream processing is real-time +business information analysis. Imagine that we are managing a supermarket and +would like to analyze the sales information to adjust our marketing strategies. 

Suppose we have two **streams** of data:

```sql
info(product, category) // represents the category a product belongs to
visit(product, user, length) // represents the length of time when a customer looks at a product
```

Unlike tables in traditional relational databases, a stream is an endless series
of data that arrives over time. Next, we will run some analysis on the two
streams to get some useful information.

## Requirements

Ensure you have deployed HStreamDB successfully. The easiest way is to follow
[quickstart](../start/quickstart-with-docker.md) to start a local cluster. Of
course, you can also try other methods mentioned in the Deployment part.

## Step 1: Create related streams

We have mentioned that we have two streams, `info` and `visit`, in the
[overview](#overview). Now let's create them. Start an HStream SQL shell and run
the following statements:

```sql
CREATE STREAM info;
```

```
+-------------+---------+----------------+-------------+
| Stream Name | Replica | Retention Time | Shard Count |
+-------------+---------+----------------+-------------+
| info        | 1       | 604800 seconds | 1           |
+-------------+---------+----------------+-------------+
```

```sql
CREATE STREAM visit;
```

```
+-------------+---------+----------------+-------------+
| Stream Name | Replica | Retention Time | Shard Count |
+-------------+---------+----------------+-------------+
| visit       | 1       | 604800 seconds | 1           |
+-------------+---------+----------------+-------------+
```

We have successfully created two streams.

## Step 2: Create streaming queries

We can now create streaming **queries** on the streams. A query is a running
task that fetches data from the stream(s) and produces results continuously.
Let's create a trivial query that fetches data from the stream `info` and
outputs it:

```sql
SELECT * FROM info EMIT CHANGES;
```

The query will keep running until you interrupt it. Next, we can just leave it
there and start another query. It fetches data from the stream `visit` and
outputs the maximum length of time for each product. Start a new SQL shell and
run:

```sql
SELECT product, MAX(length) AS max_len FROM visit GROUP BY product EMIT CHANGES;
```

Neither of the queries will print any results since we have not inserted any
data yet. So let's do that.

## Step 3: Insert data into streams

There are multiple ways to insert data into the streams, such as client
libraries, and the data inserted will all be treated the same way during
processing. You can refer to [write data](../write/write.md) for client usage.

For consistency and ease of demonstration, we will use SQL statements here.
+ +Start a new SQL shell and run: + +```sql +INSERT INTO info (product, category) VALUES ('Apple', 'Fruit'); +INSERT INTO visit (product, user, length) VALUES ('Apple', 'Alice', 10); +INSERT INTO visit (product, user, length) VALUES ('Apple', 'Bob', 20); +INSERT INTO visit (product, user, length) VALUES ('Apple', 'Caleb', 10); +``` + +Switch to the shells with running queries You should be able to see the expected +outputs as follows: + +```sql +SELECT * FROM info EMIT CHANGES; +``` +``` +{"category":"Fruit","product":"Apple"} +``` + +```sql +SELECT product, MAX(length) AS max_len FROM visit GROUP BY product EMIT CHANGES; +``` +``` +{"max_len":{"$numberLong":"10"},"product":"Apple"} +{"max_len":{"$numberLong":"20"},"product":"Apple"} +{"max_len":{"$numberLong":"20"},"product":"Apple"} +``` + +Note that `max_len` changes from `10` to `20`, which is expected. + +## Step 4: Create materialized views + +Now let's do some more complex analysis. If we want to know the longest visit +time of each category **any time we need it**, the best way is to create +**materialized views**. + +A materialized view is an object which contains the result of a query. In +HStreamDB, the view is maintained and continuously updated in memory, which +means we can read the results directly from the view right when needed without +any extra computation. Thus getting results from a view is very fast. + +Here we can create a view like + +```sql +CREATE VIEW result AS SELECT info.category, MAX(visit.length) as max_length FROM info JOIN visit ON info.product = visit.product WITHIN (INTERVAL 1 HOUR) GROUP BY info.category; +``` +``` ++--------------------------+---------+--------------------------+---------------------------+ +| Query ID | Status | Created Time | SQL Text | ++--------------------------+---------+--------------------------+---------------------------+ +| cli_generated_xbexrdhwgz | RUNNING | 2023-07-06T07:46:13+0000 | CREATE VIEW result AS ... | ++--------------------------+---------+--------------------------+---------------------------+ +``` + +Note the query ID will be different from the one shown above. Now let's try to +get something from the view: + +```sql +SELECT * FROM result; +``` + +It outputs no data because we have not inserted any data into the streams since +**after** the view is created. Let's do it now: + +```sql +INSERT INTO info (product, category) VALUES ('Apple', 'Fruit'); +INSERT INTO info (product, category) VALUES ('Banana', 'Fruit'); +INSERT INTO info (product, category) VALUES ('Carrot', 'Vegetable'); +INSERT INTO info (product, category) VALUES ('Potato', 'Vegetable'); +INSERT INTO visit (product, user, length) VALUES ('Apple', 'Alice', 10); +INSERT INTO visit (product, user, length) VALUES ('Apple', 'Bob', 20); +INSERT INTO visit (product, user, length) VALUES ('Carrot', 'Bob', 50); +``` + +## Step 5: Get results from views + +Now let's find out what is in our view: + +```sql +SELECT * FROM result; +``` +``` +{"category":"Fruit","max_length":{"$numberLong":"20"}} +{"category":"Vegetable","max_length":{"$numberLong":"50"}} +``` + +It works. Now insert more data and repeat the inspection: + +```sql +INSERT INTO visit (product, user, length) VALUES ('Banana', 'Alice', 40); +INSERT INTO visit (product, user, length) VALUES ('Potato', 'Eve', 60); +``` + +And query again: + +```sql +SELECT * FROM result; +``` +``` +{"category":"Fruit","max_length":{"$numberLong":"40"}} +{"category":"Vegetable","max_length":{"$numberLong":"60"}} +``` + +The result is updated right away. 
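
When you are done with the demo, you may want to clean up. Assuming your HStreamDB
version supports the `DROP` statements described in the SQL reference, and after
terminating the running `EMIT CHANGES` queries in their shells, a cleanup could
look like the following:

```sql
-- Remove the materialized view and the demo streams created above
DROP VIEW result;
DROP STREAM info;
DROP STREAM visit;
```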
+ +## Related Pages + +For a detailed introduction to the SQL, see +[HStream SQL](../reference/sql/sql-overview.md). diff --git a/docs/v0.17.0/receive/_index.md b/docs/v0.17.0/receive/_index.md new file mode 100644 index 0000000..abd674f --- /dev/null +++ b/docs/v0.17.0/receive/_index.md @@ -0,0 +1,6 @@ +--- +order: ['subscription.md', 'consume.md', 'read.md'] +collapsed: false +--- + +Receive data diff --git a/docs/v0.17.0/receive/consume.md b/docs/v0.17.0/receive/consume.md new file mode 100644 index 0000000..4ba2e1a --- /dev/null +++ b/docs/v0.17.0/receive/consume.md @@ -0,0 +1,143 @@ +# Consume Records with Subscriptions + +## What is a Subscription? + +To consume data from a stream, you must create a subscription to the stream. +When initiated, every subscription will retrieve the data from the beginning. +Consumers which receive and process records connect to a stream through a +subscription. A stream can have multiple subscriptions, but a given subscription +belongs to a single stream. Similarly, a subscription corresponds to one +consumer group with multiple consumers. However, every consumer belongs to only +a single subscription. + +Please refer to [this page](./subscription.md) for detailed information about +creating and managing your subscriptions. + +## How to consume data with a subscription + +To consume data appended to a stream, HStreamDB Clients libraries have provided +asynchronous consumer API, which will initiate requests to join the consumer +group of the subscription specified. + +### Two HStream Record types and corresponding receivers + +As we [explained](../write/write.md#hstream-record), there are two types of records in +HStreamDB, HRecord and RawRecord. When initiating a consumer, corresponding +receivers are required. In the case where only HRecord Receiver is set, when the +consumer received a raw record, the consumer will ignore it and consume the next +record. Therefore, in principle, we do not recommend writing both HRecord and +RawRecord in the same stream. However, this is not strictly forbidden in +implementation, and you can provide both receivers to process both types of +records. + +## Simple Consumer Example + +To get higher throughput for your application, we provide asynchronous fetching +that does not require your application to block for new messages. Messages can +be received in your application using a long-running message receiver and +acknowledged one at a time, as shown in the example below. + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/ConsumeDataSimpleExample.java [Java] + +<<< @/../examples/go/examples/ExampleConsumer.go [Go] + +@snippet examples/py/snippets/guides.py common subscribe-records + +::: + +For better performance, Batched Ack is enabled by default with settings +`ackBufferSize` = 100 and `ackAgeLimit` = 100, which you can change when +initiating your consumers. + +::: code-group + +```java +Consumer consumer = + client + .newConsumer() + .subscription("you_subscription_id") + .name("your_consumer_name") + .hRecordReceiver(your_receiver) + // When ack() is called, the consumer will not send it to servers immediately, + // the ack request will be buffered until the ack count reaches ackBufferSize + // or the consumer is stopping or reached ackAgelimit + .ackBufferSize(100) + .ackAgeLimit(100) + .build(); +``` + +::: + +## Multiple consumers and shared consumption progress + +In HStream, a subscription is consumed by a consumer group. 
In this consumer +group, there could be multiple consumers which share the subscription's +progress. To increase the rate of consuming data from a subscription, we could +have a new consumer join the existing subscription. The code is for +demonstration of how consumers can join the consumer group. Usually, the case is +that users would have consumers from different clients. + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/ConsumeDataSharedExample.java [Java] + +<<< @/../examples/go/examples/ExampleConsumerGroup.go [Go] + +::: + +## Flow Control with `maxUnackedRecords` + +A common scenario is that your consumers may not process and acknowledge data as +fast as the server sends, or some unexpected problems causing the consumer +client to be unable to acknowledge the data received, which could cause problems +as such: + +The server would have to keep resending unacknowledged messages, and maintain +the information about unacknowledged messages, which would consume resources of +the server, and cause the server to face the issue of resource exhaustion. + +To mitigate the issue above, use the `maxUnackedRecords` setting of the +subscription to control the maximum number of allowed un-acknowledged records +when the consumers receive messages. Once the number exceeds the +`maxUnackedRecords`, the server will stop sending messages to consumers of the +current subscription. + +## Receiving messages in order + +Note: the order described below is just for a single consumer. If a subscription +has multiple consumers, the order can still be guaranteed in each, but the order +is no longer preserved if we see the consumer group as an entity. + +Consumers will receive messages with the same partition key in the order that +the HStream server receives them. Since HStream delivers hstream records with +at-least-once semantics, in some cases, when HServer does not receive the ack +for some record in the middle, it might deliver the record more than once. In +these cases, we can not guarantee the order either. + +## Handling errors + +When a consumer is running, and failure happens at the receiver, the default +behaviour is that the consumer will catch the exception, print an error log, and +continue consuming the next record instead of failing. + +Consumers could fail in other scenarios, such as network, deleted subscriptions, +etc. However, as a service, you may want the consumer to keep running, so you +can register a listener to handle a failed consumer: + +::: code-group + +```java +// add Listener for handling failed consumer +var threadPool = new ScheduledThreadPoolExecutor(1); +consumer.addListener( + new Service.Listener() { + public void failed(Service.State from, Throwable failure) { + System.out.println("consumer failed, with error: " + failure.getMessage()); + } + }, + threadPool); +``` + +::: diff --git a/docs/v0.17.0/receive/read.md b/docs/v0.17.0/receive/read.md new file mode 100644 index 0000000..0760ed8 --- /dev/null +++ b/docs/v0.17.0/receive/read.md @@ -0,0 +1,37 @@ +# Get Records from Shards of the Stream with Reader + +## What is a Reader + +To allow users to retrieve data from any stream shard, HStreamDB provides +readers for applications to manually manage the exact position of the record to +read from. Unlike subscription and consumption, a reader can be seen as a +lower-level API for getting records from streams. 
It gives users direct access
to any records in the stream, more precisely, any records from a specific shard
in the stream, and it does not require or rely on subscriptions and will not
send any acknowledgement back to the server. Therefore, the reader is helpful
for cases that require extra flexibility or rewinding of data reading.

When creating a reader instance, you must specify which shard the reader reads
from and which record it begins with. A reader supports the following starting
positions:

- The earliest available record in the shard
- The latest available record in the shard
- A user-specified record location in the shard

## Reader Example

To read from the shards, users are required to get the desired shard id with
[`listShards`](../write/shards.md#listshards).

The name of a reader should also follow the format specified by the [guidelines](../write/stream.md#guidelines-to-name-a-resource).

::: code-group

<<< @/../examples/java/app/src/main/java/docs/code/examples/ReadDataWithReaderExample.java [Java]

<<< @/../examples/go/examples/ExampleReadDataWithReader.go [Go]

@snippet examples/py/snippets/guides.py common read-reader

:::
diff --git a/docs/v0.17.0/receive/subscription.md b/docs/v0.17.0/receive/subscription.md new file mode 100644 index 0000000..d38972b --- /dev/null +++ b/docs/v0.17.0/receive/subscription.md @@ -0,0 +1,73 @@ +# Create and Manage Subscriptions

## Attributes of a Subscription

- ackTimeoutSeconds.

  Specifies the maximum amount of time the server waits for an acknowledgement
  before marking a record as unacknowledged, after which the record will be sent
  again.

- maxUnackedRecords.

  The maximum number of unacknowledged records allowed. Once this limit is
  exceeded, the server will stop sending records to the corresponding consumers.

## Create a subscription

Every subscription has to specify which stream to subscribe to, which means you
have to make sure the stream to be subscribed to has already been created.

For the subscription name, please refer to the [guidelines to name a resource](../write/stream.md#guidelines-to-name-a-resource).

When creating a subscription, you can provide the attributes mentioned like
this:

::: code-group

<<< @/../examples/java/app/src/main/java/docs/code/examples/CreateSubscriptionExample.java [Java]

<<< @/../examples/go/examples/ExampleCreateSubscription.go [Go]

@snippet examples/py/snippets/guides.py common create-subscription

:::

## Delete a subscription

To delete a subscription without the force flag, you need to make sure that
there is no active subscription consumer.

### Delete a subscription with the force flag

If you do want to delete a subscription with running consumers, enable force
deletion. While a subscription is being force deleted, it will be in a deleting
state and will close its running consumers, which means you will not be able to
join, delete, or create a subscription with the same name. After the deletion
completes, you can create a subscription with the same name. However, it will be
a brand-new subscription: even though it subscribes to the same stream, it will
not share the consumption progress with the deleted subscription.
+ +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/DeleteSubscriptionExample.java [Java] + +<<< @/../examples/go/examples/ExampleDeleteSubscription.go [Go] + +@snippet examples/py/snippets/guides.py common delete-subscription + +::: + +## List subscriptions + +To list all subscriptions in HStream + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/ListSubscriptionsExample.java [Java] + +<<< @/../examples/go/examples/ExampleListSubscriptions.go [Go] + +@snippet examples/py/snippets/guides.py common list-subscription + +::: diff --git a/docs/v0.17.0/reference/_index.md b/docs/v0.17.0/reference/_index.md new file mode 100644 index 0000000..7afc9a0 --- /dev/null +++ b/docs/v0.17.0/reference/_index.md @@ -0,0 +1,6 @@ +--- +order: ["architecture", "sql", "cli.md", "config.md", "metrics.md"] +collapsed: false +--- + +Reference diff --git a/docs/v0.17.0/reference/architecture/_index.md b/docs/v0.17.0/reference/architecture/_index.md new file mode 100644 index 0000000..87b56ec --- /dev/null +++ b/docs/v0.17.0/reference/architecture/_index.md @@ -0,0 +1,5 @@ +--- +order: ["overview.md", "hstore.md", "hserver.md"] +--- + +Architecture diff --git a/docs/v0.17.0/reference/architecture/hserver.md b/docs/v0.17.0/reference/architecture/hserver.md new file mode 100644 index 0000000..f319755 --- /dev/null +++ b/docs/v0.17.0/reference/architecture/hserver.md @@ -0,0 +1,34 @@ +# HStream Server + +HStream Server (HSQL), the core computation component of HStreamDB, is designed to be stateless. +The primary responsibility of HSQL is to support client connection management, security authentication, SQL parsing and optimization, +and operations for stream computation such as task creation, scheduling, execution, management, etc. + +## HStream Server (HSQL) top-down layered structures + +### Access Layer + +It is in charge of protocol processing, connection management, security authentication, +and access control for client requests. + +### SQL layer + +To perform most stream processing and real-time analysis tasks, clients interact with HStreamDB through SQL statements. +This layer is mainly responsible for compiling these SQL statements into logical data flow diagrams. +Like the classic database system model, it contains two core sub-components: SQL parser and SQL optimizer. +The SQL parser deals with the lexical and syntactic analysis and the compilation from SQL statements to relational algebraic expressions; +the SQL optimizer will optimize the generated execution plan based on various rules and contexts. + +### Stream Layer + +Stream layer includes the implementation of various stream processing operators, the data structures and DSL to express data flow diagrams, +and the support for user-defined functions as processing operators. +So, it is responsible for selecting the corresponding operator and optimization to generate the executable data flow diagram. + +### Runtime Layer + +It is the layer responsible for executing the computation task of data flow diagrams and returning the results. +The main components of the layer include task scheduler, state manager, and execution optimizer. +The schedule takes care of the tasks scheduling between available computation resources, +such as multiple threads of a single process, multiple processors of a single machine, +and multiple machines or containers of a distributed cluster. 
diff --git a/docs/v0.17.0/reference/architecture/hstore.md b/docs/v0.17.0/reference/architecture/hstore.md new file mode 100644 index 0000000..01532cb --- /dev/null +++ b/docs/v0.17.0/reference/architecture/hstore.md @@ -0,0 +1,48 @@ +# HStream Storage (HStore) + +HStream Storage (HStore), the core storage component of HStreamDB, is a low-latency +storage component explicitly designed for streaming data. +It can store large-scale real-time data in a distributed and persistent manner +and seamlessly interface with large-capacity secondary storage such as S3 through +the Auto-Tiering mechanism to achieve unified storage of historical and real-time data. + +The core storage model of HStore is a logging model that fits with streaming data. +Regard data stream as an infinitely growing log, the typical operations supported +include appending and reading by batches. +Also, since the data stream is immutable, it generally does not support update operations. + +## HStream Storage (HStore) consists of following layer + +### Streaming Data API layer + +This layer provides the core data stream management and read/write operations, +including stream creation/deletion and writing to/consuming data in the stream. +In the design of HStore, data streams are not stored as actual streams. +Therefore, the creation of a stream is a very light-weight operation. +There is no limit to the number of streams to be created in HStore. +Besides, it supports concurrent writes to numerous data streams and still maintains a stable low latency. +For the characteristics of data streams, HStore provides append operation to support fast data writing. +While reading from stream data, it gives a subscription-based operation +and pushes any new data written to the stream to the data consumer in real time. + +### Replicator Layer + +This layer implements the strongly consistent replication based on an optimized +Flexible Paxos consensus mechanism, +ensuring the fault tolerance and high availability to data, +and maximizes cluster availability through a non-deterministic data distribution policy. +Moreover, it supports replication groups reconfiguration online to achieve seamless +cluster data balancing and horizontal scaling. + +### Tier1 Local Storage Layer + +The layer fulfilled local persistent storage needs of data based on the optimized RocksDB storage engine, +which encapsulates the access interface of streaming data +and can support low-latency writing and reading a large amount of data. + +### Tier2 Offloader Layer + +This layer provides a unified interface encapsulation for various long-term storage systems, +such as HDFS, AWS S3, etc. +It supports automatic offloading of historical data to these secondary storage systems +and can also be accessed through a unified streaming data interface. diff --git a/docs/v0.17.0/reference/architecture/overview.md b/docs/v0.17.0/reference/architecture/overview.md new file mode 100644 index 0000000..29b7232 --- /dev/null +++ b/docs/v0.17.0/reference/architecture/overview.md @@ -0,0 +1,7 @@ +# Architecture Overview + +The figure below shows the overall architecture of HStreamDB. A single HStreamDB node consists of two core components, HStream Server (HSQL) and HStream Storage (HStorage). And an HStream cluster consists of several peer-to-peer HStreamDB nodes. Clients can connect to any HStreamDB node in the cluster and perform stream processing and analysis through your familiar SQL language. + +![](https://static.emqx.net/images/faab4a8b1d02f14bc5a4153fe37f21ca.png) + +
HStreamDB Structure Overview
diff --git a/docs/v0.17.0/reference/cli.md b/docs/v0.17.0/reference/cli.md new file mode 100644 index 0000000..df379be --- /dev/null +++ b/docs/v0.17.0/reference/cli.md @@ -0,0 +1,564 @@ +# HStream CLI + +We can run the following to use HStream CLI: + +```sh-vue +docker run -it --rm --name some-hstream-admin --network host hstreamdb/hstream:{{ $version() }} hstream --help +``` + +For ease of illustration, we execute an interactive bash shell in the HStream +container to use HStream admin, + +The following example usage is based on the cluster started in +[quick start](../start/quickstart-with-docker.md), please adjust +correspondingly. + +```sh +docker exec -it docker_hserver_1 bash +``` +``` +hstream --help +``` + +```txt +======= HStream CLI ======= + +Usage: hstream [--host SERVER-HOST] [--port INT] [--tls-ca STRING] + [--tls-key STRING] [--tls-cert STRING] [--retry-timeout INT] + [--service-url ARG] COMMAND + +Available options: + --host SERVER-HOST Server host value (default: "127.0.0.1") + --port INT Server port value (default: 6570) + --tls-ca STRING path name of the file that contains list of trusted + TLS Certificate Authorities + --tls-key STRING path name of the client TLS private key file + --tls-cert STRING path name of the client TLS public key certificate + file + --retry-timeout INT timeout to retry connecting to a server in seconds + (default: 60) + --service-url ARG The endpoint to connect to + -h,--help Show this help text + +Available commands: + sql Start HStream SQL Shell + node Manage HStream Server Cluster + init Init HStream Server Cluster + stream Manage Streams in HStreamDB + subscription Manage Subscriptions in HStreamDB +``` + +## Connection + +### HStream URL + +The HStream CLI Client supports connecting to the server cluster with a url in +the following format: + +``` +://: +``` + +| Components | Description | Required | +|------------|-------------|----------| +| scheme | The scheme of the connection. Currently, we have `hstream`. To enable security options, `hstreams` is also supported | Yes | +| endpoint | The endpoint of the server cluster, which can be the hostname or address of the server cluster. | If not given, the value will be set to the `--host` default `127.0.0.1` | +| port | The port of the server cluster. | If not given, the value will be set to the `--port` default `6570` | + +### Connection Parameters + +HStream commands accept connection parameters as separate command-line flags, in addition (or in replacement) to `--service-url`. + +::: tip + +In the cases where both `--service-url` and the options below are specified, the client will use the value in `--service-url`. + +::: + +| Option | Description | +|-|-| +| `--host` | The server host and port number to connect to. This can be the address of any node in the cluster. Default: `127.0.0.1` | +| `--port` | The server port to connect to. Default: `6570`| + +### Security Settings (optional) + +If the [security option](../security/overview.md) is enabled, here are +some options that should also be configured for CLI correspondingly. 
#### Encryption

If [server encryption](../security/encryption.md) is enabled, the `--tls-ca` option should be added to the CLI connection options:

```sh
hstream --tls-ca "<path to the trusted CA certificate file>"
```

#### Authentication

If [server authentication](../security/authentication.md) is enabled, the `--tls-key` and `--tls-cert` options should be added to the CLI connection options:

```sh
hstream --tls-key "<path to the client TLS private key file>" --tls-cert "<path to the client TLS certificate file>"
```

## Check Cluster Status

```sh
hstream node --help
```
```
Usage: hstream node COMMAND
  Manage HStream Server Cluster

Available options:
  -h,--help                Show this help text

Available commands:
  list                     List all running nodes in the cluster
  status                   Show the status of nodes specified, if not specified
                           show the status of all nodes
  check-running            Check if all nodes in the cluster are running,
                           and the number of nodes is at least as specified
```

```sh
hstream node list
```
```
+-----------+
| server_id |
+-----------+
| 100       |
| 101       |
+-----------+
```

```sh
hstream node status
```
```
+-----------+---------+-------------------+
| server_id | state   | address           |
+-----------+---------+-------------------+
| 100       | Running | 192.168.64.4:6570 |
| 101       | Running | 192.168.64.5:6572 |
+-----------+---------+-------------------+
```

```sh
hstream node check-running
```
```
All nodes in the cluster are running.
```

## Manage Streams

We can also manage streams through the hstream command-line tool.

```sh
hstream stream --help
```
```
Usage: hstream stream COMMAND
  Manage Streams in HStreamDB

Available options:
  -h,--help                Show this help text

Available commands:
  list                     Get all streams
  create                   Create a stream
  describe                 Get the details of a stream
  delete                   Delete a stream
```

### Create a stream

```sh
Usage: hstream stream create STREAM_NAME [-r|--replication-factor INT]
                             [-b|--backlog-duration INT] [-s|--shards INT]
  Create a stream

Available options:
  STREAM_NAME              The name of the stream
  -r,--replication-factor INT
                           The replication factor for the stream (default: 1)
  -b,--backlog-duration INT
                           The backlog duration of records in stream in seconds
                           (default: 0)
  -s,--shards INT          The number of shards the stream should have
                           (default: 1)
  -h,--help                Show this help text
```

Example: Create a demo stream with the default settings.

```sh
hstream stream create demo
```
```
+-------------+---------+----------------+-------------+
| Stream Name | Replica | Retention Time | Shard Count |
+-------------+---------+----------------+-------------+
| demo        | 1       | 0 seconds      | 1           |
+-------------+---------+----------------+-------------+
```

### Show and delete streams

```sh
hstream stream list
```
```
+-------------+---------+----------------+-------------+
| Stream Name | Replica | Retention Time | Shard Count |
+-------------+---------+----------------+-------------+
| demo2       | 1       | 0 seconds      | 1           |
+-------------+---------+----------------+-------------+
```

```sh
hstream stream delete demo
```
```
Done.
```

```sh
hstream stream list
```
```
+-------------+---------+----------------+-------------+
| Stream Name | Replica | Retention Time | Shard Count |
+-------------+---------+----------------+-------------+
```

## Manage Subscriptions

We can also manage subscriptions through the hstream command-line tool.
```sh
hstream subscription --help
```
```
Usage: hstream subscription COMMAND
  Manage Subscriptions in HStreamDB

Available options:
  -h,--help                Show this help text

Available commands:
  list                     Get all subscriptions
  create                   Create a subscription
  describe                 Get the details of a subscription
  delete                   Delete a subscription
```

### Create a subscription

```sh
Usage: hstream subscription create SUB_ID --stream STREAM_NAME
                                   [--ack-timeout INT]
                                   [--max-unacked-records INT]
                                   [--offset [earliest|latest]]
  Create a subscription

Available options:
  SUB_ID                   The ID of the subscription
  --stream STREAM_NAME     The stream associated with the subscription
  --ack-timeout INT        Timeout for acknowledgements in seconds
  --max-unacked-records INT
                           Maximum number of unacked records allowed per
                           subscription
  --offset [earliest|latest]
                           The offset of the subscription to start from
  -h,--help                Show this help text
```

Example: Create a subscription to the stream `demo` with the default settings.

```sh
hstream subscription create --stream demo sub_demo
```
```
+-----------------+-------------+-------------+---------------------+
| Subscription ID | Stream Name | Ack Timeout | Max Unacked Records |
+-----------------+-------------+-------------+---------------------+
| sub_demo        | demo        | 60 seconds  | 10000               |
+-----------------+-------------+-------------+---------------------+
```

### Show and delete subscriptions

```sh
hstream subscription list
```
```
+-----------------+-------------+-------------+---------------------+
| Subscription ID | Stream Name | Ack Timeout | Max Unacked Records |
+-----------------+-------------+-------------+---------------------+
| sub_demo        | demo        | 60 seconds  | 10000               |
+-----------------+-------------+-------------+---------------------+
```

```sh
hstream subscription delete sub_demo
```
```
Done.
```

```sh
hstream subscription list
```
```
+-----------------+-------------+-------------+---------------------+
| Subscription ID | Stream Name | Ack Timeout | Max Unacked Records |
+-----------------+-------------+-------------+---------------------+
```

## HStream SQL

HStreamDB also provides an interactive SQL shell for a series of operations, such as the management of streams and views, data insertion and retrieval, etc.
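For example (a sketch that assumes the default listener address of the quick-start cluster), the shell can be pointed at a specific server node with:

```sh
# start the interactive SQL shell against a specific node
# (the host and port below are assumptions matching the documented defaults)
hstream --host 127.0.0.1 --port 6570 sql
```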
```sh
hstream sql --help
```
```
Usage: hstream sql [--update-interval INT] [--retry-timeout INT]
  Start HStream SQL Shell

Available options:
  --update-interval INT    interval to update available servers in seconds
                           (default: 30)
  --retry-timeout INT      timeout to retry connecting to a server in seconds
                           (default: 60)
  -e,--execute STRING      execute the statement and quit
  --history-file STRING    history file path to write interactively executed
                           statements
  -h,--help                Show this help text
```

Once you enter the shell, you can see the following help info:

```sh
      __ _________________ _________ __ ___
     / / / / ___/_ __/ __ \/ ____/ | / |/ /
    / /_/ /\__ \ / / / /_/ / __/ / /| | / /|_/ /
   / __ /___/ // / / _, _/ /___/ ___ |/ / / /
  /_/ /_//____//_/ /_/ |_/_____/_/ |_/_/ /_/


Command
  :h                      To show these help info
  :q                      To exit command line interface
  :help [sql_operation]   To show full usage of sql statement

SQL STATEMENTS:
  To create a simplest stream:
    CREATE STREAM stream_name;

  To create a query select all fields from a stream:
    SELECT * FROM stream_name EMIT CHANGES;

  To insert values to a stream:
    INSERT INTO stream_name (field1, field2) VALUES (1, 2);

```

There are two kinds of commands:

1. Basic shell commands, starting with `:`
2. SQL statements, ending with `;`

### Basic CLI Operations

To quit the current CLI session:

```sh
:q
```

To print out the help info overview:

```sh
:h
```

To show the specific usage of some SQL statements:

```sh
:help CREATE
```
```
  CREATE STREAM <stream_name> [IF EXIST] [AS <select_query>] [ WITH ( {stream_options} ) ];
  CREATE {SOURCE|SINK} CONNECTOR <connector_name> [IF NOT EXIST] WITH ( {connector_options} );
  CREATE VIEW <view_name> AS <select_query>;
```

Available SQL operations include: `CREATE`, `DROP`, `SELECT`, `SHOW`, `INSERT`, `TERMINATE`.

### SQL Statements

All the processing and storage operations are done via SQL statements.

#### Stream

There are two ways to create a new data stream.

1. Create an ordinary stream:

```sql
CREATE STREAM stream_name;
```

This will create a stream with no particular function. You can `SELECT` data from the stream and `INSERT` data into it via the corresponding SQL statements.

2. Create a stream that also runs a query to select specified data from some other stream.

Adding a `SELECT` statement after `CREATE` with the keyword `AS` creates a stream that processes data from another stream.

For example:

```sql
CREATE STREAM stream_name AS SELECT * from demo;
```

In the example above, by adding an `AS` followed by a `SELECT` statement to the normal `CREATE` operation, it will create a stream that also selects all the data from `demo`.

After creating the stream, we can insert values into the stream.

```sql
INSERT INTO stream_name (field1, field2) VALUES (1, 2);
```

There is no restriction on the number of fields a query can insert. Also, the type of value is not restricted. However, you need to make sure that the number of fields and the number of values are aligned.

The deletion command is `DROP STREAM <stream_name>;`, which deletes a stream and terminates all the [queries](#queries) that depend on the stream.

For example:

```sql
SELECT * FROM demo EMIT CHANGES;
```

will be terminated if the stream `demo` is deleted with:

```sql
DROP STREAM demo;
```

If you try to delete a stream that does not exist, an error message will be returned.
To suppress the error, you can add `IF EXISTS` after the stream name:

```sql
DROP STREAM demo IF EXISTS;
```

#### Show all streams

You can also show all streams by using the `SHOW STREAMS` command.

```sql
SHOW STREAMS;
```
```
+-------------+---------+----------------+-------------+
| Stream Name | Replica | Retention Time | Shard Count |
+-------------+---------+----------------+-------------+
| demo        | 3       | 0sec           | 1           |
+-------------+---------+----------------+-------------+
```

#### Queries

Run a continuous query on a stream to select data from it.

After creating a stream, we can select data from the stream in real time. All the data inserted after the select query is created will be printed out when the insert operation happens. Select supports real-time processing of the data inserted into the stream.

For example, we can choose the fields and filter the data selected from the stream.

```sql
SELECT a FROM demo EMIT CHANGES;
```

This will only select field `a` from the stream `demo`.

How to terminate a query?

A query can be terminated if we know its query id:

```sql
TERMINATE QUERY <query_id>;
```

We can get the information of all queries with the `SHOW` command:

```sql
SHOW QUERIES;
```

Output (just for demonstration):

```
+------------------+------------+--------------------------+----------------------------------+
| Query ID         | Status     | Created Time             | SQL Text                         |
+------------------+------------+--------------------------+----------------------------------+
| 1361978122003419 | TERMINATED | 2022-07-28T06:03:42+0000 | select * from demo emit changes; |
+------------------+------------+--------------------------+----------------------------------+
```

Find the query to terminate, make sure its id is not already terminated, and pass the query id to `TERMINATE QUERY`.

Or, under some circumstances, you can choose to `TERMINATE ALL;`.

### View

A view is a projection of specified data from streams. For example,

```sql
CREATE VIEW v_demo AS SELECT SUM(a) FROM demo GROUP BY a;
```

the above command will create a view that keeps track of the sum of `a`, grouped by the values of `a`, starting from the point this query is executed.

The operations on a view are very similar to those on streams, except that we cannot use `SELECT ... EMIT CHANGES` as we do on streams, because a view is static and there are no changes to emit. Instead, for example, we select from the view with:

```sql
SELECT * FROM v_demo WHERE a = 1;
```

This will print the sum of `a` when `a` = 1.

If we want to create a view to record the sum of all `a`s, we can:

```sql
CREATE STREAM demo2 AS SELECT a, 1 AS b FROM demo;
CREATE VIEW v_demo2 AS SELECT SUM(a) FROM demo2 GROUP BY b;
SELECT * FROM demo2 WHERE b = 1;
```

diff --git a/docs/v0.17.0/reference/config.md b/docs/v0.17.0/reference/config.md new file mode 100644 index 0000000..6b901da --- /dev/null +++ b/docs/v0.17.0/reference/config.md @@ -0,0 +1,91 @@

# HStreamDB Configuration

The HStreamDB configuration file is located at `/etc/hstream/config.yaml` in the docker image since v0.6.3,
or you can [download](https://raw.githubusercontent.com/hstreamdb/hstream/main/conf/hstream.yaml) the config file.

## Configuration Table

### hserver

| Name | Default Value | Description |
| ---- | ------------- | ----------- |
| id | | The identifier of a single HServer node; the value must be given and can be overwritten by cli option `--server-id` |
| bind-address | "0.0.0.0" | The IP address or name of the host to which the HServer protocol handler is bound. The value can be overwritten by cli option `--bind-address` |
| advertised-address | "127.0.0.1" | Server listener address value; the value must be given and shouldn't be "0.0.0.0" if you intend to start a cluster or try to connect to the server from a different network. This value can be overwritten by cli option `--address` |
| gossip-address | | The address used for server internal communication; if not specified, it uses the value of `advertised-address`. The value can be overwritten by cli option `--gossip-address` |
| port | 6570 | Server port value; the value must be given and can be overwritten by cli option `--port` |
| internal-port | 6571 | Server port value for internal communication between server nodes; the value must be given and can be overwritten by cli option `--internal-port` |
| metastore-uri | | The server nodes in the same cluster share an HMeta unit, which is used for metadata storage and is essential for a server to start. Specify the HMeta protocol such as `zk://` or `rq://`, followed by comma-separated host:port pairs, each corresponding to an HMeta server, e.g. zk://127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183. The value must be given and can be overwritten by cli option `--metastore-uri` |
| connector-meta-store | | The metadata store for connectors (hstream io); the value must be given. |
| log-with-color | true | Optional. Controls whether the server node prints logs with color; can be overwritten by cli option `--log-with-color` |
| log-level | info | Optional. Controls the log print level of the server node; the default value can be overwritten by cli option `--log-level` |
| max-record-size | 1024*1024 (1MB) | The largest size of a record batch allowed by HStreamDB |
| enable-tls | false | TLS options: enable TLS, which requires the tls-key-path and tls-cert-path options |
| tls-key-path | | TLS options: key file path for TLS; can be generated by openssl |
| tls-cert-path | | The certificate signed by the CA for the key (tls-key-path) |
| advertise-listeners | | The advertised listeners for the server |

### hstore

The configuration for hstore is optional. When the values are not provided, HStreamDB will use the default values.
| Name | Default Value | Description |
| ---- | ------------- | ----------- |
| log-level | info | optional |

The store admin section specifies the client config used when connecting to the storage admin server:

| Name | Default Value | Description |
| ---- | ------------- | ----------- |
| host | "127.0.0.1" | optional |
| port | 6440 | optional |
| protocol-id | binaryProtocolId | optional |
| conn-timeout | 5000 | optional |
| send-timeout | 5000 | optional |
| recv-timeout | 5000 | optional |

### hstream-io

| Name | Description |
| ---- | ----------- |
| tasks-path | the io tasks work directory |
| tasks-network | io tasks run as docker containers, so the tasks-network should be a network that can connect to HStreamDB and the external systems |
| source-images | key-value map specifying the images used by the source connectors |
| sink-images | key-value map specifying the images used by the sink connectors |

## Resource Attributes

### Stream

| Name | Description |
| ---- | ----------- |
| name | The name of the stream |
| shard count | The number of shards in the stream |
| replication factor | The number of replicas |
| backlog retention | The retention time of the records in the stream in seconds |

### Subscription

| Name | Description |
| ---- | ----------- |
| id | The id of the subscription |
| stream name | The name of the stream to subscribe to |
| ackTimeoutSeconds | Maximum time the server will wait for an acknowledgement |
| maxUnackedRecords | The maximum number of unacknowledged records allowed |

## Command-Line Options

For ease of use, we allow users to pass some options to override the configuration in the configuration file when starting the server with `hstream-server`:

| Option | Meta var | Description |
| ------ | -------- | ----------- |
| config-path | PATH | hstream config path |
| bind-address | HOST | server host value |
| advertised-address | HOST | server listener address value |
| gossip-address | HOST | server gossip address value |
| port | INT | server port value |
| internal-port | INT | server channel port value for internal communication |
| server-id | UINT32 | ID of the hstream server node |
| store-admin-port | INT | store admin port value |
| metastore-uri | STR | Specify the HMeta protocol such as `zk://` or `rq://`, followed by comma-separated host:port pairs, each corresponding to an HMeta server, e.g. zk://127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183. |
| log-level | | Server log level |
| log-with-color | FLAG | Server log with color |

diff --git a/docs/v0.17.0/reference/metrics.md b/docs/v0.17.0/reference/metrics.md new file mode 100644 index 0000000..c2cd3c1 --- /dev/null +++ b/docs/v0.17.0/reference/metrics.md @@ -0,0 +1,124 @@

# HStream Metrics

Note: For metrics with intervals, such as stats in categories like stream and subscription, users can specify intervals (default intervals [1min, 5min, 10min]). The smaller the interval, the closer it gets to the rate in real time.
| Category | Metrics | Unit | Description |
| -------- | ------- | ---- | ----------- |
| stream_counter | append_total | # | Total number of append requests of a stream |
| stream_counter | append_failed | # | Total number of failed append requests of a stream |
| stream_counter | append_in_bytes | # | Total payload bytes successfully written to the stream |
| stream_counter | append_in_records | # | Total payload records successfully written to the stream |
| stream | append_in_bytes | B/s | Rate of bytes received and successfully written to the stream |
| stream | append_in_records | #/s | Rate of records received and successfully written to the stream |
| stream | append_in_requests | #/s (QPS) | Rate of append requests received per stream |
| stream | append_failed_requests | #/s (QPS) | Rate of failed append requests received per stream |
| subscription_counter | send_out_bytes | # | Number of bytes sent by the server per subscription |
| subscription_counter | send_out_records | # | Number of records successfully sent by the server per subscription |
| subscription_counter | send_out_records_failed | # | Number of records failed to send by the server per subscription |
| subscription_counter | resend_records | # | Number of successfully resent records per subscription |
| subscription_counter | resend_records_failed | # | Number of records failed to resend per subscription |
| subscription_counter | received_acks | # | Number of acknowledgements received per subscription |
| subscription_counter | request_messages | # | Number of streaming fetch requests received from clients per subscription |
| subscription_counter | response_messages | # | Number of streaming send requests successfully sent to clients per subscription, including resends |
| subscription | send_out_bytes | B/s | Rate of bytes sent by the server per subscription |
| subscription | acks / acknowledgements | #/s | Rate of acknowledgements received per subscription |
| subscription | request_messages | #/s | Rate of streaming fetch requests received from clients per subscription |
| subscription | response_messages | #/s | Rate of streaming send requests successfully sent to clients per subscription, including resends |
diff --git a/docs/v0.17.0/reference/sql/_index.md b/docs/v0.17.0/reference/sql/_index.md new file mode 100644 index 0000000..97f7f65 --- /dev/null +++ b/docs/v0.17.0/reference/sql/_index.md @@ -0,0 +1,5 @@ +--- +order: ["sql-overview.md", "sql-quick-reference.md", "statements", "functions"] +--- + +HStream SQL diff --git a/docs/v0.17.0/reference/sql/appendix.md b/docs/v0.17.0/reference/sql/appendix.md new file mode 100644 index 0000000..e1f2eaa --- /dev/null +++ b/docs/v0.17.0/reference/sql/appendix.md @@ -0,0 +1,210 @@ +Appendix +======== + +## Data Types + +| type | examples | +|-----------|---------------------------------------| +| NULL | NULL | +| INTEGER | 1, -1, 1234567 | +| FLOAT | 2.3, -3.56, 232.4 | +| NUMERIC | 1, 2.3 | +| BOOLEAN | TRUE, FALSE | +| BYTEA | '0xaa0xbb' :: BYTEA | +| STRING | "deadbeef" | +| DATE | DATE '2020-06-10' | +| TIME | TIME '11:18:30' | +| TIMESTAMP | TIMESTAMP '2022-01-01T12:00:00+08:00' | +| INTERVAL | INTERVAL 10 SECOND | +| JSON | '{"a": 1, "b": 2}' :: JSONB | +| ARRAY | [1, 2, 3] | + +## Keywords + +| keyword | description | +|-------------------|------------------------------------------------------------------------------------------| +| `ABS` | absolute value | +| `ACOS` | arccosine | +| `ACOSH` | inverse hyperbolic cosine | +| `AND` | logical and operator | +| `ARRAY_CONTAIN` | given an array, checks if a search value is contained in the array | +| `ARRAY_DISTINCT` | returns an array of all the distinct values | +| `ARRAY_EXCEPT` | `ARRAY_DISTINCT` except for those also present in the second array | +| `ARRAY_INTERSECT` | returns an array of all the distinct elements from the intersection of both input arrays | +| `ARRAY_JOIN` | creates a flat string representation of all elements contained in the given array | +| `ARRAY_LENGTH` | return the length of the given array | +| `ARRAY_MAX` | returns the maximum value from the given array of primitive elements | +| `ARRAY_MIN` | returns the minimum value from the given array of primitive elements | +| `ARRAY_REMOVE` | removes all elements from the input array equal to the second argument | +| `ARRAY_SORT` | sort the given array | +| `ARRAY_UNION` | returns an array of all the distinct elements from the union of both input arrays | +| `AS` | stream or field name alias | +| `ASIN` | arcsine | +| `ASINH` | inverse hyperbolic sine | +| `ATAN` | arctangent | +| `ATANH` | inverse hyperbolic tangent | +| `AVG` | average function | +| `BETWEEN` | range operator, used with `AND` | +| `BY` | do something by certain conditions, used with `GROUP` or `ORDER` | +| `CEIL` | rounds a number UPWARDS to the nearest integer | +| `COS` | cosine | +| `COSH` | hyperbolic cosine | +| `COUNT` | count function | +| `CREATE` | create a stream / connector | +| `DATE` | prefix of date constant | +| `DAY` | interval unit | +| `DROP` | drop a stream | +| `EXP` | exponent | +| `FLOOR` | rounds a number DOWNWARDS to the nearest integer | +| `FROM` | specify where to select data from | +| `GROUP` | group values by certain conditions, used with `BY` | +| `HAVING` | filter select values by a condition | +| `HOPPING` | hopping window | +| `IFNULL` | if the first argument is `NULL` returns the second, else the first | +| `INSERT` | insert data into a stream, used with `INTO` | +| `INTERVAL` | prefix of interval constant | +| `INTO` | insert data into a stream, used with `INSERT` | +| `IS_ARRAY` | to determine if the given value is an array of values | +| `IS_BOOL` | to determine if the given value is a boolean | +| `IS_DATE` | to determine 
if the given value is a date value | +| `IS_FLOAT` | to determine if the given value is a float | +| `IS_INT` | to determine if the given value is an integer | +| `IS_NUM` | to determine if the given value is a number | +| `IS_STR` | to determine if the given value is a string | +| `IS_TIME` | to determine if the given value is a time value | +| `JOIN` | for joining two streams | +| `LEFT` | joining type, used with `JOIN` | +| `LEFT_TRIM` | trim spaces from the left end of a string | +| `LOG` | logarithm with base e | +| `LOG10` | logarithm with base 10 | +| `LOG2` | logarithm with base 2 | +| `MAX` | maximum function | +| `MIN` | minimum function | +| `MINUTE` | interval unit | +| `MONTH` | interval unit | +| `NOT` | logical not operator | +| `NULLIF` | returns `NULL` if the first argument is equal to the second, otherwise the first | +| `OR` | logical or operator | +| `ORDER` | sort values by certain conditions, used with `BY` | +| `OUTER` | joining type, used with `JOIN` | +| `REVERSE` | reverse a string | +| `RIGHT_TRIM` | trim spaces from the right end of a string | +| `ROUND` | rounds a number to the nearest integer | +| `SECOND` | interval unit | +| `SELECT` | query a stream | +| `SHOW` | show something to stdout | +| `SIGN` | return the sign of a numeric value as an INTEGER | +| `SIN` | sine | +| `SINH` | hyperbolic sine | +| `SLIDING` | sliding window | +| `SQRT` | square root | +| `STREAM` | specify a stream, used with `CREATE` | +| `STRLEN` | get the length of a string | +| `SUM` | sum function | +| `TAN` | tangent | +| `TANH` | hyperbolic tangent | +| `TIME` | prefix of the time constant | +| `TO_LOWER` | convert a string to lowercase | +| `TO_STR` | convert a value to string | +| `TO_UPPER` | convert a string to uppercase | +| `TRIM` | trim spaces from both ends of a string | +| `TUMBLING` | tumbling window | +| `VALUES` | specify inserted data, used with `INSERT INTO` | +| `WEEK` | interval unit | +| `WHERE` | filter selected values by a condition | +| `WITH` | specify properties when creating a stream | +| `WITHIN` | specify time window when joining two streams | +| `YEAR` | interval unit | + +## Operators + +| operator | description | +|----------|------------------------------| +| `=` | equal to | +| `<>` | not equal to | +| `<` | less than | +| `>` | greater than | +| `<=` | less than or equal to | +| `>=` | greater than or equal to | +| `+` | addition | +| `-` | subtraction | +| `*` | multiplication | +| `.` | access field of a stream | +| `[]` | access item of an array | +| `AND` | logical and operator | +| `OR` | logical or operator | +| `::` | type casting | +| `->` | JSON access(as JSON) by key | +| `->>` | JSON access(as text) by key | +| `#>` | JSON access(as JSON) by path | +| `#>>` | JSON access(as text) by path | + +## Scalar Functions + +| function | description | +|-------------------|------------------------------------------------------------------------------------------| +| `ABS` | absolute value | +| `ACOS` | arccosine | +| `ACOSH` | inverse hyperbolic cosine | +| `ARRAY_CONTAIN` | given an array, checks if a search value is contained in the array | +| `ARRAY_DISTINCT` | returns an array of all the distinct values | +| `ARRAY_EXCEPT` | `ARRAY_DISTINCT` except for those also present in the second array | +| `ARRAY_INTERSECT` | returns an array of all the distinct elements from the intersection of both input arrays | +| `ARRAY_JOIN` | creates a flat string representation of all elements contained in the given array | +| `ARRAY_LENGTH` | return the length of 
the given array | +| `ARRAY_MAX` | returns the maximum value from the given array of primitive elements | +| `ARRAY_MIN` | returns the minimum value from the given array of primitive elements | +| `ARRAY_REMOVE` | removes all elements from the input array equal to the second argument | +| `ARRAY_SORT` | sort the given array | +| `ARRAY_UNION` | returns an array of all the distinct elements from the union of both input arrays | +| `ASIN` | arcsine | +| `ASINH` | inverse hyperbolic sine | +| `ATAN` | arctangent | +| `ATANH` | inverse hyperbolic tangent | +| `CEIL` | rounds a number UPWARDS to the nearest integer | +| `COS` | cosine | +| `COSH` | hyperbolic cosine | +| `EXP` | exponent | +| `FLOOR` | rounds a number DOWNWARDS to the nearest integer | +| `IFNULL` | if the first argument is `NULL` returns the second, else the first | +| `NULLIF` | returns `NULL` if the first argument is equal to the second, otherwise the first | +| `IS_ARRAY` | to determine if the given value is an array of values | +| `IS_BOOL` | to determine if the given value is a boolean | +| `IS_DATE` | to determine if the given value is a date value | +| `IS_FLOAT` | to determine if the given value is a float | +| `IS_INT` | to determine if the given value is an integer | +| `IS_NUM` | to determine if the given value is a number | +| `IS_STR` | to determine if the given value is a string | +| `IS_TIME` | to determine if the given value is a time value | +| `LEFT_TRIM` | trim spaces from the left end of a string | +| `LOG` | logarithm with base e | +| `LOG10` | logarithm with base 10 | +| `LOG2` | logarithm with base 2 | +| `REVERSE` | reverse a string | +| `RIGHT_TRIM` | trim spaces from the right end of a string | +| `ROUND` | rounds a number to the nearest integer | +| `SIGN` | return the sign of a numeric value as an INTEGER | +| `SIN` | sine | +| `SINH` | hyperbolic sine | +| `SQRT` | square root | +| `STRLEN` | get the length of a string | +| `TAN` | tangent | +| `TANH` | hyperbolic tangent | +| `TO_LOWER` | convert a string to lowercase | +| `TO_STR` | convert a value to string | +| `TO_UPPER` | convert a string to uppercase | +| `TOPK` | topk aggregate function | +| `TOPKDISTINCT` | topkdistinct aggregate function | +| `TRIM` | trim spaces from both ends of a string | + +## Aggregate Functions + +| function | description | +|----------------|--------------------------------| +| `AVG` | average | +| `COUNT` | count | +| `MAX` | maximum | +| `MIN` | minimum | +| `SUM` | sum | +| `TOPK` | top k values as array | +| `TOPKDISTINCT` | distinct top k values as array | diff --git a/docs/v0.17.0/reference/sql/functions/_index.md b/docs/v0.17.0/reference/sql/functions/_index.md new file mode 100644 index 0000000..2817bb3 --- /dev/null +++ b/docs/v0.17.0/reference/sql/functions/_index.md @@ -0,0 +1,5 @@ +--- +order: ["aggregation.md", "scalar.md"] +--- + +Functions diff --git a/docs/v0.17.0/reference/sql/functions/aggregation.md b/docs/v0.17.0/reference/sql/functions/aggregation.md new file mode 100644 index 0000000..c6b61ab --- /dev/null +++ b/docs/v0.17.0/reference/sql/functions/aggregation.md @@ -0,0 +1,48 @@ +Aggregate Functions +=================== +Aggregate functions perform a calculation on a set of values and return a single value. + +```sql +COUNT(expression) +COUNT(*) +``` + +Return the number of rows. +When `expression` is specified, the count returned will be the number of matched rows. +When `*` is specified, the count returned will be the total number of rows. 
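As a brief illustration (a sketch only — the stream and field names below are assumptions), an aggregate is typically used together with `GROUP BY`, for example inside a materialized view:

```sql
-- keep a continuously updated per-city row count
-- (the stream `visits` and the field `cityId` are hypothetical)
CREATE VIEW v_visit_count AS SELECT cityId, COUNT(*) FROM visits GROUP BY cityId;
```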
```sql
AVG(expression)
```

Return the average value of a given expression.

```sql
SUM(expression)
```

Return the sum of a given expression.

```sql
MAX(expression)
```

Return the max value of a given expression.

```sql
MIN(expression)
```

Return the min value of a given expression.

```sql
TOPK(expression_value, expression_k)
```

Return the top `K` (specified by `expression_k`) values of `expression_value` in an array.

```sql
TOPKDISTINCT(expression_value, expression_k)
```

Similar to `TOPK`, but only returns distinct values of `expression_value`.

diff --git a/docs/v0.17.0/reference/sql/functions/scalar.md b/docs/v0.17.0/reference/sql/functions/scalar.md new file mode 100644 index 0000000..10ee4c2 --- /dev/null +++ b/docs/v0.17.0/reference/sql/functions/scalar.md @@ -0,0 +1,327 @@

Scalar Functions
================

Scalar functions operate on one or more values and then return a single value. They can be used wherever a value expression is valid.

Scalar functions are divided into several kinds.

### Type Casting Functions

Our SQL supports explicit type casting in the form of `CAST(expr AS type)` or `expr :: type`. The target type can be one of the following:

- `INTEGER`
- `FLOAT`
- `NUMERIC`
- `BOOLEAN`
- `BYTEA`
- `STRING`
- `DATE`
- `TIME`
- `TIMESTAMP`
- `INTERVAL`
- `JSONB`
- `[]` (array)

### JSON Functions

To use JSON data conveniently, we support the following functions:

- `->`, which gets the corresponding field by key and returns it in JSON format.
- `->>`, which gets the corresponding field by key and returns it in text format.
- `#>`, which gets the corresponding field at the specified path and returns it in JSON format.
- `#>>`, which gets the corresponding field at the specified path and returns it in text format.

### Array Accessing Functions

To access fields of arrays, we support the following operations:

- `[]`, which accesses a single element of an array by index, and `[:]`, which slices a range of elements (either bound may be omitted).

### Trigonometric Functions

All trigonometric functions perform a calculation, operate on a single numeric value and then return a single numeric value.

For values outside the domain, `NaN` is returned.

```sql
SIN(num_expr)
SINH(num_expr)
ASIN(num_expr)
ASINH(num_expr)
COS(num_expr)
COSH(num_expr)
ACOS(num_expr)
ACOSH(num_expr)
TAN(num_expr)
TANH(num_expr)
ATAN(num_expr)
ATANH(num_expr)
```

### Arithmetic Functions

The following functions perform a calculation, operate on a single numeric value and then return a single numeric value.

```sql
ABS(num_expr)
```

Absolute value.

```sql
CEIL(num_expr)
```

The function application `CEIL(n)` returns the least integer not less than `n`.

```sql
FLOOR(num_expr)
```

The function application `FLOOR(n)` returns the greatest integer not greater than `n`.

```sql
ROUND(num_expr)
```

The function application `ROUND(n)` returns the nearest integer to `n`, or the even integer if `n` is equidistant between two integers.

```sql
SQRT(num_expr)
```

The square root of a numeric value.

```sql
LOG(num_expr)
LOG2(num_expr)
LOG10(num_expr)
EXP(num_expr)
```

```sql
SIGN(num_expr)
```

The function application `SIGN(n)` returns the sign of a numeric value as an Integer.
+ +- returns `-1` if `n` is negative +- returns `0` if `n` is exact zero +- returns `1` if `n` is positive +- returns `null` if `n` is exact `null` + +### Predicate Functions + +Function applications of the form `IS_A(x)` where `A` is the name of a type returns `TRUE` if the argument `x` is of type `A`, otherwise `FALSE`. + +```sql +IS_INT(val_expr) +IS_FLOAT(val_expr) +IS_NUM(val_expr) +IS_BOOL(val_expr) +IS_STR(val_expr) +IS_ARRAY(val_expr) +IS_DATE(val_expr) +IS_TIME(val_expr) +``` + +### String Functions + +```sql +TO_STR(val_expr) +``` + +Convert a value expression to a readable string. + +```sql +TO_LOWER(str) +``` +Convert a string to lower case, using simple case conversion. + +```sql +TO_UPPER(str) +``` + +Convert a string to upper case, using simple case conversion. + +```sql +TRIM(str) +``` + +Remove leading and trailing white space from a string. + +```sql +LEFT_TRIM(str) +``` + +Remove leading white space from a string. + +```sql +RIGHT_TRIM(str) +``` + +Remove trailing white space from a string. + +```sql +REVERSE(str) +``` + +Reverse the characters of a string. + +```sql +STRLEN(str) +``` + +Returns the number of characters in a string. + +```sql +TAKE(num_expr, str) +``` + +The function application `TAKE(n, s)` returns the prefix of the string of length `n`. + +```sql +TAKEEND(num_expr, str) +``` + +The function application `TAKEEND(n, s)` returns the suffix remaining after taking `n` characters from the end of the string. + +```sql +DROP(num_expr, str) +``` + +The function application `DROP(n, s)` returns the suffix of the string after the first `n` characters, or the empty string if n is greater than the length of the string. + +```sql +DROPEND(num_expr, str) +``` + +The function application `DROPEND(n, s)` returns the prefix remaining after dropping `n` characters from the end of the string. + +### Null Functions + +```sql +IFNULL(val_expr, val_expr) +``` + +The function application `IFNULL(x, y)` returns `y` if `x` is `NULL`, otherwise `x`. + +When the argument type is a complex type, for example, `ARRAY`, the contents of the complex type are not inspected. + +```sql +NULLIF(val_expr, val_expr) +``` + +The function application `NULLIF(x, y)` returns `NULL` if `x` is equal to `y`, otherwise `x`. + +When the argument type is a complex type, for example, `ARRAY`, the contents of the complex type are not inspected. + +### Time and Date Functions + +#### Time Format + +Formats are analogous to [strftime](https://man7.org/linux/man-pages/man3/strftime.3.html). + +| Format Name | Raw Format String | +| ----------------- | --------------------------- | +| simpleDateFormat | "%Y-%m-%d %H:%M:%S" | +| iso8061DateFormat | "%Y-%m-%dT%H:%M:%S%z" | +| webDateFormat | "%a, %d %b %Y %H:%M:%S GMT" | +| mailDateFormat | "%a, %d %b %Y %H:%M:%S %z" | + +```sql +DATETOSTRING(val_expr, str) +``` + +Formatting seconds since 1970-01-01 00:00:00 UTC to string in GMT with the second string argument as the given format name. + +```sql +STRINGTODATE(str, str) +``` + +Formatting string to seconds since 1970-01-01 00:00:00 UTC in GMT with the second string argument as the given format name. + +### Array Functions + +```sql +ARRAY_CONTAINS(arr_expr, val_expr) +``` + +Given an array, checks if the search value is contained in the array (of the same type). + +```sql +ARRAY_DISTINCT(arr_expr) +``` + +Returns an array of all the distinct values, including `NULL` if present, from the input array. The output array elements are in order of their first occurrence in the input. 
+ +Returns `NULL` if the argument is `NULL`. + +```sql +ARRAY_EXCEPT(arr_expr, arr_expr) +``` + +Returns an array of all the distinct elements from an array, except for those also present in a second array. The order of entries in the first array is preserved but duplicates are removed. + +Returns `NULL` if either input is `NULL`. + +```sql +ARRAY_INTERSECT(arr_expr, arr_expr) +``` + +Returns an array of all the distinct elements from the intersection of both input arrays. If the first list contains duplicates, so will the result. If the element is found in both the first and the second list, the element from the first list will be used. + +Returns `NULL` if either input is `NULL`. + +```sql +ARRAY_UNION(arr_expr, arr_expr) +``` + +Returns the array union of the two arrays. Duplicates, and elements of the first list, are removed from the second list, but if the first list contains duplicates, so will the result. + +Returns `NULL` if either input is `NULL`. + +```sql +ARRAY_JOIN(arr_expr) +ARRAY_JOIN(arr_expr, str) +``` + +Creates a flat string representation of all the primitive elements contained in the given array. The elements in the resulting string are separated by the chosen delimiter, which is an optional parameter that falls back to a comma `,`. + +```sql +ARRAY_LENGTH(arr_expr) +``` + +Returns the length of a finite list. + +Returns `NULL` if the argument is `NULL`. + +```sql +ARRAY_MAX(arr_expr) +``` + +Returns the maximum value from within a given array of elements. + +Returns `NULL` if the argument is `NULL`. + +```sql +ARRAY_MIN(arr_expr) +``` + +Returns the minimum value from within a given array of elements. + +Returns `NULL` if the argument is `NULL`. + +```sql +ARRAY_REMOVE(arr_expr, val_expr) +``` + +Removes all elements from the input array equal to the second argument. + +Returns `NULL` if the first argument is `NULL`. + + +```sql +ARRAY_SORT(arr_expr) +``` + +Sort an array. Elements are arranged from lowest to highest, keeping duplicates in the order they appeared in the input. + +Returns `NULL` if the first argument is `NULL`. diff --git a/docs/v0.17.0/reference/sql/sql-overview.md b/docs/v0.17.0/reference/sql/sql-overview.md new file mode 100644 index 0000000..87d069d --- /dev/null +++ b/docs/v0.17.0/reference/sql/sql-overview.md @@ -0,0 +1,196 @@ +# SQL Overview + +SQL is a domain-specific language used in programming and designed for managing +data held in a database management system. A standard for the specification of +SQL is maintained by the American National Standards Institute (ANSI). Also, +there are many variants and extensions to SQL to express more specific programs. + +The +[SQL grammar of HStreamDB](https://github.com/hstreamdb/hstream/blob/main/hstream-sql/etc/SQL-v1.cf) +is based on a subset of standard SQL with some extensions to support stream +operations. + +## Syntax + +SQL inputs are made up of a series of statements. Each statement is made up of a +series of tokens and ends in a semicolon (`;`). + +A token can be a keyword argument, an identifier, a literal, an operator, or a +special character. The details of the rules can be found in the +[BNFC grammar file](https://github.com/hstreamdb/hstream/blob/main/hstream-sql/etc/SQL-v1.cf). +Normally, tokens are separated by whitespace. 
The following examples are syntactically valid SQL statements:

```sql
SELECT * FROM my_stream;

CREATE STREAM abnormal_weather AS SELECT * FROM weather WHERE temperature > 30 AND humidity > 80 WITH (REPLICATE = 3);

INSERT INTO weather (cityId, temperature, humidity) VALUES (11254469, 12, 65);
```

## Keywords

Some tokens such as `SELECT`, `INSERT` and `WHERE` are reserved _keywords_, which have specific meanings in SQL syntax. Keywords are case insensitive, which means that `SELECT` and `select` are equivalent. A keyword can not be used as an identifier.

For a complete list of keywords, see the [appendix](appendix.md).

## Identifiers

Identifiers are tokens that represent user-defined objects such as streams, fields, and other ones. For example, `my_stream` can be used as a stream name, and `temperature` can represent a field in the stream.

By now, identifiers only support C-style naming rules. It means that an identifier name can only have letters (both uppercase and lowercase letters), digits, and the underscore. Besides, the first letter of an identifier should be either a letter or an underscore.

By now, identifiers are case-sensitive, which means that `my_stream` and `MY_STREAM` are different identifiers.

## Expressions

An expression is a value that can exist almost everywhere in a SQL query. It can be either a constant whose value is known before execution (such as an integer or a string literal) or a variable whose value is known during execution (such as a field of a stream).

### Integer

Integers are in the form of `digits`, where `digits` are one or more single-digit integers (0 through 9). Negatives such as `-1` are also supported. **Note that scientific notation is not supported yet**.

### Float

Floats are in the form of `<integer part>.<fractional part>`. Negative floats such as `-11.514` are supported. Note that

- **scientific notation is not supported yet**.
- **Forms such as `1.` and `.99` are not supported yet**.

### Boolean

A boolean value is either `TRUE` or `FALSE`.

### String

Strings are arbitrary character series surrounded by single quotes (`'`), such as `'anyhow'`.

### Date

Dates represent a date exact to a day in the form of `DATE '<year>-<month>-<day>'`, where `<year>`, `<month>` and `<day>` are all integer constants. Note that the leading `DATE` should not be omitted.

Example: `DATE '2021-01-02'`

### Time

Time constants represent a time exact to a second or a microsecond in the form of `TIME '<hour>:<minute>:<second>'` or `TIME '<hour>:<minute>:<second>.<microsecond>'`, where `<hour>`, `<minute>`, `<second>` and `<microsecond>` are all integer constants. Note that the leading `TIME` should not be omitted.

Example: `TIME '10:41:03'`, `TIME '01:02:03.456'`

### Timestamp

Timestamp constants represent values that contain both date and time parts. They can also contain an optional timezone part for convenience. A timestamp is in the form of `TIMESTAMP '<ISO 8601 date and time>'`. For more information, please refer to [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601).

Example: `TIMESTAMP '2023-06-30T12:30:45+02:00'`

### Interval

Intervals represent a time section in the form of `INTERVAL <number> <time_unit>`. Note that the leading `INTERVAL` should not be omitted.

Example: `INTERVAL 5 SECOND` (5 seconds)

### Array

Arrays represent a list of values, where each one of them is a valid expression. An array is in the form of `[<expression>, ...]`.

Example: `["aa", "bb", "cc"]`, `[1, 2]`

### Column (Field)

A column (or a field) represents a part of a value in a stream or materialized view.
It is similar to column of a table in traditional relational databases. A +column is in the form of `` or +`.`. When a column name is ambiguous(for +example it has the same name as a function application) the double quote `` " `` +can be used. + +Example: `temperature`, `stream_test.humidity`, `` "SUM(a)" `` + +### Subquery + +A subquery is a SQL clause start with `SELECT`, see +[here](./statements/select-stream.md). + +### Function or Operator Application + +An expression can also be formed by other expressions by applying functions or +operators on them. The details of function and operator can be found in the +following parts. + +Example: `SUM(stream_test.cnt)`, (`raw_stream::jsonb)->>'value'` + +## Operators and Functions + +Functions are special keywords that mean some computation, such as `SUM` and +`MIN`. And operators are infix functions composed of special characters, such as +`>=` and `<>`. + +For a complete list of functions and operators, see the [appendix](appendix.md). + +## Special Characters + +There are some special characters in the SQL syntax with particular meanings: + +- Parentheses (`()`) are used outside an expression for controlling the order of + evaluation or specifying a function application. +- Brackets (`[]`) are used with maps and arrays for accessing their + substructures, such as `some_map[temp]` and `some_array[1]`. **Note that it is + not supported yet**. +- Commas (`,`) are used for delineating a list of objects. +- The semicolons (`;`) represent the end of a SQL statement. +- The asterisk (`*`) represents "all fields", such as + `SELECT * FROM my_stream;`. +- The period (`.`) is used for accessing a field in a stream, such as + `my_stream.humidity`. +- The double quote (`` " ``) represents an "raw column name" in the `SELECT` + clause to distinguish a column name with functions from actual function + applications. For example, `SELECT SUM(a) FROM s;` means applying `SUM` + function on the column `a` from stream `s`. However if the stream `s` actually + contains a column called `SUM(a)` and you want to take it out, you can use + back quotes like `` SELECT "SUM(a)" FROM s; ``. + +## Comments + +A single-line comment begins with `//`: + +``` +// This is a comment +``` + +Also, C-style multi-line comments are supported: + +``` +/* This is another + comment +*/ +``` diff --git a/docs/v0.17.0/reference/sql/sql-quick-reference.md b/docs/v0.17.0/reference/sql/sql-quick-reference.md new file mode 100644 index 0000000..998d37f --- /dev/null +++ b/docs/v0.17.0/reference/sql/sql-quick-reference.md @@ -0,0 +1,68 @@ +SQL quick reference +=================== + +## CREATE STREAM + +Create a new HStreamDB stream with the stream name given. +An exception will be thrown if the stream is already created. +See [CREATE STREAM](statements/create-stream.md). + +```sql +CREATE STREAM stream_name [AS select_query] [WITH (stream_option [, ...])]; +``` + +## CREATE VIEW + +Create a new view with the view name given. A view is a physical object like a stream and it is updated with time. +An exception will be thrown if the view is already created. The name of a view can either be the same as a stream. +See [CREATE VIEW](statements/create-view.md). + +```sql +CREATE VIEW view_name AS select_query; +``` + +## SELECT + +Get records from a materialized view or a stream. Note that `SELECT` from streams can only used as a part of `CREATE STREAM` or `CREATE VIEW`. When you want to get results in a command-line session, create a materialized view first and then `SELECT` from it. 
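A minimal sketch of that workflow, assuming the `weather` stream used in the examples elsewhere in this reference (the view name is hypothetical):

```sql
-- build a materialized view over the stream, then query it interactively
CREATE VIEW v_city_temperature AS SELECT cityId, AVG(temperature) FROM weather GROUP BY cityId;

SELECT * FROM v_city_temperature WHERE cityId = 11254469;
```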
+See [SELECT (Stream)](statements/select-stream.md). + +```sql +SELECT <* | expression [ AS field_alias ] [, ...]> + FROM stream_ref + [ WHERE expression ] + [ GROUP BY field_name [, ...] ] + [ HAVING expression ]; +``` + +## INSERT + +Insert data into the specified stream. It can be a data record, a JSON value or binary data. +See [INSERT](statements/insert.md). + +```sql +INSERT INTO stream_name (field_name [, ...]) VALUES (field_value [, ...]); +INSERT INTO stream_name VALUES CAST ('json_value' AS JSONB); +INSERT INTO stream_name VALUES CAST ('binary_value' AS BYTEA); +INSERT INTO stream_name VALUES 'json_value' :: JSONB; +INSERT INTO stream_name VALUES 'binary_value' :: BYTEA; +``` + +## DROP + +Delete a given stream or view. There can be an optional `IF EXISTS` config to only delete the given category if it exists. + +```sql +DROP STREAM stream_name [IF EXISTS]; +DROP VIEW view_name [IF EXISTS]; +``` + +## SHOW + +Show the information of all streams, queries, views or connectors. + +```sql +SHOW STREAMS; +SHOW QUERIES; +SHOW VIEWS; +SHOW CONNECTORS; +``` diff --git a/docs/v0.17.0/reference/sql/statements/_index.md b/docs/v0.17.0/reference/sql/statements/_index.md new file mode 100644 index 0000000..1f96acc --- /dev/null +++ b/docs/v0.17.0/reference/sql/statements/_index.md @@ -0,0 +1,18 @@ +--- +order: + [ + 'create-stream.md', + 'create-view.md', + 'create-connector.md', + 'drop-stream.md', + 'drop-view.md', + 'drop-connector.md', + 'select-stream.md', + 'insert.md', + 'show.md', + 'pause.md', + 'resume.md', + ] +--- + +Statements diff --git a/docs/v0.17.0/reference/sql/statements/create-connector.md b/docs/v0.17.0/reference/sql/statements/create-connector.md new file mode 100644 index 0000000..164790f --- /dev/null +++ b/docs/v0.17.0/reference/sql/statements/create-connector.md @@ -0,0 +1,37 @@ +CREATE CONNECTOR +================ + +Create a new connector for fetching data from or writing data to an external system. A connector can be either a source or a sink one. + + +## Synopsis + +Create source connector: + +```sql +CREATE SOURCE CONNECTOR connector_name FROM source_name WITH (connector_option [, ...]); +``` + +Create sink connector: + +```sql +CREATE SINK CONNECTOR connector_name TO sink_name WITH (connector_option [, ...]); +``` + +## Notes + +- `connector_name` is a valid identifier. +- `source_name` is a valid identifier(`mysql`, `postgresql` etc.). +- There is are some connector options in the `WITH` clause separated by commas. + +check [Connectors](https://hstream.io/docs/en/latest/io/connectors.html) to find the connectors and their configuration options . + +## Examples + +```sql +create source connector source01 from mysql with ("host" = "mysql-s1", "port" = 3306, "user" = "root", "password" = "password", "database" = "d1", "table" = "person", "stream" = "stream01"); +``` + +```sql +create sink connector sink01 to postgresql with ("host" = "pg-s1", "port" = 5432, "user" = "postgres", "password" = "postgres", "database" = "d1", "table" = "person", "stream" = "stream01"); +``` diff --git a/docs/v0.17.0/reference/sql/statements/create-stream.md b/docs/v0.17.0/reference/sql/statements/create-stream.md new file mode 100644 index 0000000..8aea808 --- /dev/null +++ b/docs/v0.17.0/reference/sql/statements/create-stream.md @@ -0,0 +1,25 @@ +CREATE STREAM +============= + +Create a new hstream stream with the given name. An exception will be thrown if a stream with the same name already exists. 
## Synopsis

```sql
CREATE STREAM stream_name [ AS select_query ] WITH ([ REPLICATE = INT, DURATION = INTERVAL ]);
```

## Notes

- `stream_name` is a valid identifier.
- `select_query` is an optional `SELECT` (Stream) query. For more information, see the `SELECT` section. When `select_query` is specified, the created stream will be filled with records from the `SELECT` query continuously. Otherwise, the stream will only be created and kept empty.
- The `WITH` clause contains some stream options. Only the `REPLICATE` and `DURATION` options are supported now, which represent the replication factor and the retention time of the stream. If they are not specified, they will be set to the default values.
- Sources in `select_query` can be both stream(s) and materialized view(s).

## Examples

```sql
CREATE STREAM foo;

CREATE STREAM abnormal_weather AS SELECT * FROM weather WHERE temperature > 30 AND humidity > 80;
```

diff --git a/docs/v0.17.0/reference/sql/statements/create-view.md b/docs/v0.17.0/reference/sql/statements/create-view.md new file mode 100644 index 0000000..681a15a --- /dev/null +++ b/docs/v0.17.0/reference/sql/statements/create-view.md @@ -0,0 +1,37 @@

CREATE VIEW
===========

Create a new hstream view with the given name. An exception will be thrown if a view or stream with the same name already exists.

A view is **NOT** just an alias but is physically maintained in memory and updated incrementally. Thus queries on a view are really fast and do not require extra resources.

## Synopsis

```sql
CREATE VIEW view_name AS select_query;
```

## Notes

- `view_name` is a valid identifier.
- `select_query` is a valid `SELECT` query. For more information, see the `SELECT` section. There are no extra restrictions on `select_query`, but we recommend using at least one aggregate function and a `GROUP BY` clause. Otherwise, the query may behave unexpectedly and consume more resources. See the following examples:

```
// CREATE VIEW v1 AS SELECT id, SUM(sales) FROM s GROUP BY id;
// what the view contains at time
// [t1]                          [t2]                          [t3]
// {"id":1, "SUM(sales)": 10} -> {"id":1, "SUM(sales)": 10} -> {"id":1, "SUM(sales)": 30}
//                               {"id":2, "SUM(sales)": 8}     {"id":2, "SUM(sales)": 15}

// CREATE VIEW AS SELECT id, sales FROM s;
// what the view contains at time
// [t1]                     [t2]                     [t3]
// {"id":1, "sales": 10} -> {"id":1, "sales": 10} -> {"id":1, "sales": 10}
//                          {"id":2, "sales": 8}     {"id":1, "sales": 20}
//                                                   {"id":2, "sales": 8}
//                                                   {"id":2, "sales": 7}
```

## Examples

```sql
CREATE VIEW foo AS SELECT a, SUM(a), COUNT(*) FROM s1 GROUP BY b;
```

diff --git a/docs/v0.17.0/reference/sql/statements/drop-connector.md b/docs/v0.17.0/reference/sql/statements/drop-connector.md new file mode 100644 index 0000000..0038bf7 --- /dev/null +++ b/docs/v0.17.0/reference/sql/statements/drop-connector.md @@ -0,0 +1,20 @@

DROP CONNECTOR
===========

Drop a connector with the given name.

## Synopsis

```sql
DROP CONNECTOR connector_name;
```

## Notes

- `connector_name` is a valid identifier.

## Examples

```sql
DROP CONNECTOR foo;
```

diff --git a/docs/v0.17.0/reference/sql/statements/drop-stream.md b/docs/v0.17.0/reference/sql/statements/drop-stream.md new file mode 100644 index 0000000..a25b397 --- /dev/null +++ b/docs/v0.17.0/reference/sql/statements/drop-stream.md @@ -0,0 +1,23 @@

DROP STREAM
===========

Drop a stream with the given name. If `IF EXISTS` is present, the statement won't fail if the stream does not exist.
+ +## Synopsis + +```sql +DROP STREAM stream_name [ IF EXISTS ]; +``` + +## Notes + +- `stream_name` is a valid identifier. +- `IF EXISTS` annotation is optional. + +## Examples + +```sql +DROP STREAM foo; + +DROP STREAM foo IF EXISTS; +``` diff --git a/docs/v0.17.0/reference/sql/statements/drop-view.md b/docs/v0.17.0/reference/sql/statements/drop-view.md new file mode 100644 index 0000000..ed81a93 --- /dev/null +++ b/docs/v0.17.0/reference/sql/statements/drop-view.md @@ -0,0 +1,23 @@ +DROP VIEW +=========== + +Drop a view with the given name. If `IF EXISTS` is present, the statement won't fail if the view does not exist. + +## Synopsis + +```sql +DROP VIEW view_name [ IF EXISTS ]; +``` + +## Notes + +- `view_name` is a valid identifier. +- `IF EXISTS` annotation is optional. + +## Examples + +```sql +DROP VIEW foo; + +DROP VIEW foo IF EXISTS; +``` diff --git a/docs/v0.17.0/reference/sql/statements/insert.md b/docs/v0.17.0/reference/sql/statements/insert.md new file mode 100644 index 0000000..bcb48b5 --- /dev/null +++ b/docs/v0.17.0/reference/sql/statements/insert.md @@ -0,0 +1,30 @@ +INSERT +====== + +Insert a record into specified stream. + +## Synopsis + +```sql +INSERT INTO stream_name (field_name [, ...]) VALUES (field_value [, ...]); +INSERT INTO stream_name VALUES CAST ('json_value' AS JSONB); +INSERT INTO stream_name VALUES CAST ('binary_value' AS BYTEA); +INSERT INTO stream_name VALUES 'json_value' :: JSONB; +INSERT INTO stream_name VALUES 'binary_value' :: BYTEA; +``` + +## Notes + +- `field_value` represents the value of corresponding field, which is a [constant](../sql-overview.md#literals-constants). The correspondence between field type and inserted value is maintained by users themselves. +- `json_value` should be a valid JSON expression. And when inserting a JSON value, remember to put `'`s around it. +- `binary_value` can be any value in the form of a string. It will not be processed by HStreamDB and can only be fetched by certain client API. Remember to put `'`s around it. + +## Examples + +```sql +INSERT INTO weather (cityId, temperature, humidity) VALUES (11254469, 12, 65); +INSERT INTO foo VALUES CAST ('{"a": 1, "b": "abc"}' AS JSONB); +INSERT INTO foo VALUES '{"a": 1, "b": "abc"}' :: JSONB; +INSERT INTO bar VALUES CAST ('some binary value \x01\x02\x03' AS BYTEA); +INSERT INTO bar VALUES 'some binary value \x01\x02\x03' :: BYTEA; +``` diff --git a/docs/v0.17.0/reference/sql/statements/pause.md b/docs/v0.17.0/reference/sql/statements/pause.md new file mode 100644 index 0000000..f529522 --- /dev/null +++ b/docs/v0.17.0/reference/sql/statements/pause.md @@ -0,0 +1,22 @@ +PAUSE +================ + +Pause a running task(e.g. connector). + +## Synopsis + +Pause a task: + +```sql +PAUSE name; +``` + +## Notes + +- `name` is a valid identifier. + +## Examples + +```sql +PAUSE CONNECTOR source01; +``` diff --git a/docs/v0.17.0/reference/sql/statements/resume.md b/docs/v0.17.0/reference/sql/statements/resume.md new file mode 100644 index 0000000..3acf9a4 --- /dev/null +++ b/docs/v0.17.0/reference/sql/statements/resume.md @@ -0,0 +1,22 @@ +RESUME +================ + +Resume a paused task(e.g. connector). + +## Synopsis + +resume a paused task: + +```sql +RESUME name; +``` + +## Notes + +- `name` is a valid identifier. 
+
+## Examples
+
+```sql
+RESUME CONNECTOR source01;
+```
diff --git a/docs/v0.17.0/reference/sql/statements/select-stream.md b/docs/v0.17.0/reference/sql/statements/select-stream.md
new file mode 100644
index 0000000..a91978a
--- /dev/null
+++ b/docs/v0.17.0/reference/sql/statements/select-stream.md
@@ -0,0 +1,121 @@
+# SELECT (Stream)
+
+Get records from a materialized view or a stream. Note that `SELECT` from
+streams can only be used as a part of `CREATE STREAM` or `CREATE VIEW`.
+
+::: tip
+If you want to run an interactive query from the command-line shell, append
+`EMIT CHANGES` to the end of the following examples.
+:::
+
+## Synopsis
+
+```sql
+SELECT <* | identifier.* | expression [ AS field_alias ] [, ...]>
+  FROM stream_ref
+  [ WHERE expression ]
+  [ GROUP BY field_name [, ...] ]
+  [ HAVING expression ];
+```
+
+## Notes
+
+### About `expression`
+
+`expression` can be any expression described
+[here](../sql-overview.md#Expressions), such as `temperature`,
+`weather.humidity`, `42`, `1 + 2`, `SUM(productions)`, `COUNT(*)`, and even a
+subquery such as `SELECT * FROM stream_test WHERE a > 1`. In `WHERE` and
+`HAVING` clauses, `expression` must evaluate to a boolean value.
+
+### About `stream_ref`
+
+`stream_ref` specifies a source stream or materialized view:
+
+```
+ stream_ref ::=
+     <identifier>
+   | <stream_ref> AS <alias>
+   | <stream_ref> <join_type> <stream_ref> <join_condition>
+   | <stream_ref> WITHIN Interval
+   | ( <stream_ref> )
+```
+
+In short, a `stream_ref` is anything you can retrieve data from: an identifier,
+a join of two `stream_ref`s, a `stream_ref` with a time window, or a
+`stream_ref` with an alias. We describe each form below.
+
+#### JOIN
+
+The `JOIN` here follows the SQL standard, as used by familiar databases such as
+MySQL and PostgreSQL. It can be one of:
+
+- `CROSS JOIN`, which produces the Cartesian product of two streams and/or
+  materialized view(s). It is equivalent to `INNER JOIN ON TRUE`.
+- `[INNER] JOIN`, which produces all data in the qualified Cartesian product by
+  the join condition. Note a join condition must be specified.
+- `LEFT [OUTER] JOIN`, which produces all data in the qualified Cartesian
+  product by the join condition, plus one copy of each row in the left-hand
+  `stream_ref` for which there was no right-hand row that passed the join
+  condition (extended with nulls on the right). Note a join condition must be
+  specified.
+- `RIGHT [OUTER] JOIN`, which produces all data in the qualified Cartesian
+  product by the join condition, plus one copy of each row in the right-hand
+  `stream_ref` for which there was no left-hand row that passed the join
+  condition (extended with nulls on the left). Note a join condition must be
+  specified.
+- `FULL [OUTER] JOIN`, which produces all data in the qualified Cartesian
+  product by the join condition, plus one row for each unmatched left-hand row
+  (extended with nulls on the right), plus one row for each unmatched right-hand
+  row (extended with nulls on the left). Note a join condition must be
+  specified.
+
+A join condition can be any of the following (see the sketch after this list):
+
+- `ON <expression>`. The condition passes when the value of the expression is
+  `TRUE`.
+- `USING(column[, ...])`. The specified column(s) are matched.
+- `NATURAL`. The common columns of the two `stream_ref`s are matched. It is
+  equivalent to `USING(common_columns)`.
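+
+For instance, the sketch below joins two hypothetical streams, `orders` and
+`shipments`, on their shared `order_id` column inside a `CREATE STREAM`
+statement. The stream and column names are made up for illustration:
+
+```sql
+CREATE STREAM order_status AS
+  SELECT orders.order_id, orders.amount, shipments.status
+  FROM orders
+  JOIN shipments
+  USING(order_id)
+  WITHIN (INTERVAL 5 MINUTE);
+```
+
+Here `USING(order_id)` is the join condition, and the `WITHIN` clause bounds how
+far apart in time two records may be and still be joined; time windows are
+described next.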
+
+#### Time Windows
+
+A `stream_ref` can also have a time window. Currently, the following three
+time-window functions are supported:
+
+```
+TUMBLE(<stream_ref>, <some_interval>)
+HOP(<stream_ref>, <some_interval>, <some_interval>)
+SLIDE(<stream_ref>, <some_interval>)
+```
+
+Note that
+
+- `some_interval` represents a period of time. See
+  [Intervals](../sql-overview.md#intervals).
+
+## Examples
+
+- A simple query:
+
+```sql
+SELECT * FROM my_stream;
+```
+
+- Filtering rows:
+
+```sql
+SELECT temperature, humidity FROM weather WHERE temperature > 10 AND humidity < 75;
+```
+
+- Joining streams:
+
+```sql
+SELECT stream1.temperature, stream2.humidity FROM stream1 JOIN stream2 USING(humidity) WITHIN (INTERVAL 1 HOUR);
+```
+
+- Grouping records:
+
+```sql
+SELECT COUNT(*) FROM TUMBLE(weather, INTERVAL 10 SECOND) GROUP BY cityId;
+```
diff --git a/docs/v0.17.0/reference/sql/statements/show.md b/docs/v0.17.0/reference/sql/statements/show.md
new file mode 100644
index 0000000..3e92a13
--- /dev/null
+++ b/docs/v0.17.0/reference/sql/statements/show.md
@@ -0,0 +1,18 @@
+SHOW
+================
+
+Show resources (e.g. streams, connectors).
+
+## Synopsis
+
+Show resources:
+
+```sql
+SHOW <RESOURCE>;
+```
+
+## Examples
+
+```sql
+SHOW CONNECTORS;
+```
diff --git a/docs/v0.17.0/release-notes.md b/docs/v0.17.0/release-notes.md
new file mode 100644
index 0000000..d527b29
--- /dev/null
+++ b/docs/v0.17.0/release-notes.md
@@ -0,0 +1,728 @@
+# Release Notes
+
+## v0.17.0 [2023-08-23]
+
+### HServer
+
+- Add read-stream command for cli to read data from a specific stream
+- Add append command for cli to write data to a specific stream
+- Add ZookeeperSlotAlloc
+- Add trimStream and trimShard RPC
+- Add stream-v2 experimental feature
+- Add doesStreamPartitionValExist to check if a shard belongs to a stream
+- Add checkpointStore benchmark
+- Add readStreamByKey RPC
+- Add LookupKey RPC
+- Add support for ASan
+- Add support for internal admin commands
+- Add read stream metrics
+- Add trimShards RPC
+- Improve trimShards and trimStream RPCs to call trim concurrently
+- Improve CI by freeing disk space
+- Improve docker-compose.yaml to use a ZooKeeper cluster
+- Improve reader services by reducing memory usage
+- Improve third-party dependencies by bumping their versions
+- Improve enable-tls handling so it takes effect consistently across all scenarios
+- Refactor the format of ReceivedRecord
+- Refactor hstream configuration
+- Refactor Aeson instances to use deriving instead of template-haskell
+- Refactor cli ShardOffset parser
+- Fix redirect stderr to stdout for iotask
+- Fix modify default store replication factor to 1
+- Fix correctly handle the SubscribeStateStopped state in the sendRecords method
+- Fix getShardId method that may cause a cache inconsistency issue
+- Fix interrupting all consumers when a subscription is being deleted or has failed
+- Fix invalidating a consumer properly
+- Fix wrong ShardReader exit status code for grpc-haskell
+- Fix wrong behavior of LATEST OFFSET for shardReader
+- Fix incorrect number of record entries delivered by readStreamByKey
+- Fix memory leak in logIdHasGroup
+- Fix incorrect encoding of HRecord by cli append
+- Fix the main server to wait for all AdvertisedListeners
+- Fix readStream RPC failing to consume a multi-shard stream
+- Fix zoovisitor memory leak
+- Fix deadlock when a client sends init before the server has bootstrapped
+- Fix incorrect calculateShardId function
+- Fix make PEER_UNAVAILABLE a retryable exception
+- Fix ci build with hstream_enable_asan
+- Fix Ord instance for Rid
+
+### SQL && Processing Engine
+
+- Add schema with hstream_enable_schema flag
+- Add support for creating streams with schema
+- Add support for metastore-based schema
+- Add support for schema when creating stream
by select & creating view +- Add support window_start and window_end in binder and planner +- Add Unknown type in binder +- Improve display BoundDataType +- Improve use column name to lookupColumn in planner +- Improve polish type check +- Improve schema not found error message +- Fix refine array syntax +- Fix clean up processing threads properly on exceptions +- Fix stream.column in binder & planner +- Fix errors caused by SomeSQLException +- Fix incorrect catalog id in planner +- Fix incorrect result on joining after group by +- Fix a bug may produce incorrect view result + +### Connector +- Add sink-elasticsearch connector image in hstream config file +- Add extra-docker-args option ++ Add ui:group, ui:condition schema options ++ Add error stream for error handling ++ Add skip strategy for error handling ++ Add retry strategy for error handling ++ Add normal table support for sink-las ++ Add extra datetime field for sink-las ++ Add LATEST offset option ++ Add parallel writing ++ Add buffer options +- Improve connector robustness + +## v0.16.0 [2023-07-07] + +### HServer + +- Add ReadStream and ReadSingleShardStream RPC to read data from a stream +- Add a new RPC for get tail recordId of specific shard +- Add validation when lookup resource +- Add readShardStream RPC for grpc-haskell +- Add `meta`, `lookup`, `query`, `connector` subcommand for hadmin +- Add command for cli to get hstream version +- Add benchmark for logdevice create LogGroups +- Add dockerfile for arm64 build +- Add readShard command in hstream cli +- Add stats command for connector, query and view in hadmin cli +- Improve readShardStream RPC to accept max read records nums and until offset +- Improve read-shard cli command to support specify max read records nums and until offset +- Improve sql cli help info +- Improve dockerfile to speed up build +- Improve error messages in case of cli errors +- Improve the output of cli +- Imporve add more logs in streaming fetch handler +- Improve delete resource relaed stats in deleteHandlers +- Improve change some connector log level +- Refactor data structures of inflight recordIds in subscription +- Refactor replace SubscriptionOnDifferentNode exception to WrongServer exception +- Fix hs-grpc memory leak and core dump problem +- Fix error handling in streaming fetch handler +- Fix checking waitingConsumer list when invalid a consumer +- Fix redundant recordIds deletion +- Fix remove stream when deleting a query +- Fix check whether query exists before creating a new query +- Fix stop related threads after a subscription is deleted +- Fix bug that can cause CheckedRecordIds pile +- Fix check meta store when listConsumer +- Fix subscription created by query can be never acked +- Fix getSubscription with non-existent checkpoint logId + +### SQL && Processing Engine + +- Add `BETWEEN`、`NOT` operators +- Add `HAVING` clause in views +- Add `INSERT INTO SELECT` statement +- Add extra label for JSON data +- Add syntax tests +- Add planner tests +- Improve syntax for quotes +- Improve remove duplicate aggregates +- Improve restore with refined AST +- Refactor remove _view postfix of a view +- Refactor create connector syntax +- Fix alias problems in aggregates and `GROUP BY` statements +- Fix refine string literal +- Fix grammar conflicts +- Fix `IFNULL` operator not work +- Fix runtime error caused by no aggregate with group by +- Fix batched messaged stuck +- Fix incorrect view name && aggregate result +- Fix cast operation +- Fix json related operation not work +- Fix mark the state 
as TERMINATED if a source is missing on resuming + +### Connector +- Add sink-las connector +- Add sink-elasticsearch connector +- Add Connection, Primary Keys checking for sink-jdbc +- Add retry for sink connectors +- Add Batch Receiver for sinks +- Add full-featured JSON-schema for source-generator +- Replace Subscription with StreamShardReader +- Fix source-debezium offsets + +## v0.15.0 [2023-04-28] + +### HServer + +- Add support for automatic recovery of computing tasks(query, connector) on other nodes when a node in the cluster fails +- Add support for reading data from a given timestamp +- Add support for reconnecting nodes that were previously determined to have failed in the cluster +- Add a new RPC for reading stream shards +- Add metrics for query, view, connector +- Add support for fetching logs of connectors +- Add retry read from hstore when the subscription do resend +- Improve the storage of checkpoints for subscriptions +- Improve read performance of hstore reader +- Improve error handling of RPC methods +- Improve the process of nodes restart +- Improve requests validation in handlers +- Imporve the timestamp of records +- Improve the deletion of queries +- Refactor logging modules +- Fix the load distribution logic in case of cluster members change + +### SQL && Processing Engine + +- The v1 engine is used by default +- Add states saving and restoration of a query +- Add validation for select statements with group by clause +- Add retention time option for ``create stream`` statement +- Add a window_end column for aggregated results based on time window +- Add time window columns to the result stream when using time windows +- Improve the syntax of time windows in SQL +- Improve the syntax of time interval in SQL +- Improve the process of creating the result stream of a query +- Fix `as` in `join` clause +- Fix creating a view without a group by clause +- Fix an issue which can cause incomplete aggregated columns +- Fix alias of an aggregation expr not work +- Fix aggregation queries on views +- Fix errors when joining multiple streams (3 or more) +- Disable subqueries temporarily + +## v0.14.0 [2023-02-28] + +- HServer now uses the in-house Haskell GRPC framework by default +- Add deployment support for CentOS 7 +- Add stats for failed record delivery in subscriptions +- Remove `pushQuery` RPC from the protocol +- Fix the issue causing client stalls when multiple clients consume the same + subscription, and one fails to acknowledge +- Fix possible memory leaks caused by STM +- Fix cluster bootstrap issue causing incorrect status display +- Fix the issue that allows duplicate consumer names on the same subscription +- Fix the issue that allows readers to be created on non-existent shards +- Fix the issue causing the system to stall with the io check command + +## v0.13.0 [2023-01-18] + +- hserver is built with ghc 9.2 by default now +- Add support for getting the IP of the proxied client +- Add support for overloading the client's `user-agent` by setting `proxy-agent` +- Fix the statistics of retransmission and response metrics of subscriptions +- Fix some issues of the processing engine +- CLI: add `service-url` option + +## v0.12.0 [2022-12-29] + +- Add a new RPC interface for getting information about clients connected to the + subscription (including IP, type and version of client SDK, etc.) 
+- Add a new RPC interface for getting the progress of consumption on a + subscription +- Add a new RPC interface for listing the current `ShardReader`s +- Add TLS support for `advertised-listener`s +- Add support for file-based metadata storage, mainly for simplifying deployment + in local development and testing environments +- Add support for configuring the number of copies of the internal stream that + stores consumption progress +- Fix the problem that the consumption progress of subscriptions was not saved + correctly in some cases +- Improve the CLI tool: + - simplify some command options + - improve cluster interaction + - add retry for requests + - improve delete commands +- Switch to a new planner implementation for HStream SQL + - Improve stability and performance + - Improve the support for subqueries in the FROM clause + - add a new `EXPLAIN` statement for viewing logical execution plans + - more modular design for easy extension and optimization + +## v0.11.0 [2022-11-25] + +- Add support for getting the creation time of streams and subscriptions +- Add `subscription` subcommand in hstream CLI +- [**Breaking change**]Remove the compression option on the hserver side(should + use end-to-end compression instead) +- Remove logid cache +- Unify resource naming rules and improve the corresponding resource naming + checks +- [**Breaking change**]Rename hserver's startup parameters `host` and `address` + to `bind-address` and `advertised-address` +- Fix routing validation for some RPC requests +- Fix a possible failure when saving the progress of a subscription +- Fix incorrect results of `JOIN .. ON` +- Fix the write operation cannot be retried after got a timeout error + +## v0.10.0 [2022-10-28] + +### Highlights + +#### End-to-end compression + +In this release we have introduced a new feature called end-to-end compression, +which means data will be compressed in batches at the client side when it is +written, and the compressed data will be stored directly by HStore. In addition, +the client side can automatically decompress the data when it is consumed, and +the whole process is not perceptible to the user. + +In high-throughput scenarios, enabling end-to-end data compression can +significantly alleviate network bandwidth bottlenecks and improve read and write +performance.Our benchmark shows more than 4x throughput improvement in this +scenario, at the cost of increased CPU consumption on the client side. + +#### HStream SQL Enhancements + +In this release we have introduced many enhancements for HStream SQL, see +[here](#hstream-sql) for details. + +#### HServer based on a new gRPC library + +In this release we replaced the gRPC-haskell library used by HServer with a new +self-developed gRPC library, which brings not only better performance but also +improved long-term stability. + +#### Rqlite Based MetaStore + +In this release we have refactored the MetaStore component of HStreamDB to make +it more scalable and easier to use. We also **experimentally** support the use +of Rqlite instead of Zookeeper as the default MetaStore implementation, which +will make the deployment and maintenance of HStreamDB much easier. Now HServer, +HStore and HStream IO all use a unified MetaStore to store metadata. 
+ +### HServer + +#### New Features + +- Add [e2e compression](#end-to-end-compression) + +#### Enhancements + +- Refactor the server module with a new grpc library +- Adpate to the new metastore and add support for rqlite +- Improve the mechanism of cluster resources allocation +- Improve the cluster startup and initialization process +- Improve thread usage and scheduling for the gossip module + +#### Bug fixes + +- Fix a shard can be assigned to an invalid consumer +- Fix memory leak caused by the gossip module +- Add existence check for dependent streams when creating a view +- Fix an issue where new nodes could fail when joining a cluster +- Fix may overflow while decoding batchedRecord +- Check metadata first before initializing sub when recving fetch request to + avoid inconsistency +- Fix max-record-size option validation + +### HStream SQL + +- Full support of subqueries. A subquery can replace almost any expression now. +- Refinement of data types. It supports new types such as date, time, array and + JSON. It also supports explicit type casting and JSON-related operators. +- Adjustment of time windows. Now every source stream can have its own time + window rather than a global one. +- More general queries on materialized views. Now any SQL clauses applicable to + a stream can be performed on a materialized view, including nested subqueries + and time windows. +- Optimized JOIN clause. It supports standard JOINs such as CROSS, INNER, OUTER + and NATURAL. It also allows JOIN between streams and materialized views. + +### HStream IO + +- Add MongoDB source and sink +- Adapt to the new metastore + +### Java Client + +[hstream-java v0.10.0](https://github.com/hstreamdb/hstreamdb-java/releases/tag/v0.10.0) +has been released: + +#### New Features + +- Add support for e2e compression: zstd, gzip +- Add `StreamBuilder ` + +#### Enhancements + +- Use `directExecutor` as default executor for `grpcChannel` + +#### Bug fixes + +- Fix `BufferedProducer` memory is not released in time +- Fix missing `RecordId` in `Reader`'s results +- Fix dependency conflicts when using hstreamdb-java via maven + +### Go Client + +[hstream-go v0.3.0](https://github.com/hstreamdb/hstreamdb-go/releases/tag/v0.3.0) +has been released: + +- Add support for TLS +- Add support for e2e compression: zstd, gzip +- Improve tests + +### Python Client + +[hstream-py v0.3.0](https://github.com/hstreamdb/hstreamdb-py/releases/tag/v0.3.0) +has been released: + +- Add support for e2e compression: gzip +- Add support for hrecord in BufferedProducer + +### Rust Client + +Add a new [rust client](https://github.com/hstreamdb/hstreamdb-rust) + +### HStream CLI + +- Add support for TLS +- Add -e, --execute options for non-interactive execution of SQL statements +- Add support for keeping the history of entered commands +- Improve error messages +- Add stream subcommands + +### Other Tools + +- Add a new tool [hdt](https://github.com/hstreamdb/deployment-tool) for + deployment + +## v0.9.0 [2022-07-29] + +### HStreamDB + +#### Highlights + +- [Shards in Streams](#shards-in-streams) +- [HStream IO](#hstream-io) +- [New Stream Processing Engine](#new-stream-processing-engine) +- [Gossip-based HServer Clusters](#gossip-based-hserver-clusters) +- [Advertised Listeners](#advertised-listeners) +- [Improved HStream CLI](#improved-hstream-cli) +- [Monitoring with Grafana](#monitoring-with-grafana) +- [Deployment on K8s with Helm](#deployment-on-k8s-with-helm) + +#### Shards in Streams + +We have extended the sharding model in v0.8, which 
provides direct access and +management of the underlying shards of a stream, allowing a finer-grained +control of data distribution and stream scaling. Each shard will be assigned a +range of hashes in the stream, and every record whose hash of `partitionKey` +falls in the range will be stored in that shard. + +Currently, HStreamDB supports: + +- set the initial number of shards when creating a stream +- distribute written records among shards of the stream with `partitionKey`s +- direct access to records from any shard of the specified position +- check the shards and their key range in a stream + +In future releases, HStreamDB will support dynamic scaling of streams through +shard splitting and merging + +#### HStream IO + +HStream IO is the built-in data integration framework for HStreamDB, composed of +source connectors, sink connectors and the IO runtime. It allows interconnection +with various external systems and empowers more instantaneous unleashing of the +value of data with the facilitation of efficient data flow throughout the data +stack. + +In particular, this release provides connectors listed below: + +- Source connectors: + - [source-mysql](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/sink_mysql_spec.md) + - [source-postgresql](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/source_postgresql_spec.md) + - [source-sqlserver](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/source_sqlserver_spec.md) +- Sink connectors: + - [sink-mysql](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/sink_mysql_spec.md) + - [sink-postgresql](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/sink_postgresql_spec.md) + +You can refer to [the documentation](./ingest-and-distribute/overview.md) to learn more about +HStream IO. + +#### New Stream Processing Engine + +We have re-implemented the stream processing engine in an interactive and +differential style, which reduces the latency and improves the throughput +magnificently. The new engine also supports **multi-way join**, **sub-queries**, +and **more** general materialized views. + +The feature is still experimental. For try-outs, please refer to +[the SQL guides](./process/sql.md). + +#### Gossip-based HServer Clusters + +We refactor the hserver cluster with gossip-based membership and failure +detection based on [SWIM](https://ieeexplore.ieee.org/document/1028914), +replacing the ZooKeeper-based implementation in the previous version. The new +mechanism will improve the scalability of the cluster and as well as reduce +dependencies on external systems. + +#### Advertised Listeners + +The deployment and usage in production could involve a complex network setting. +For example, if the server cluster is hosted internally, it would require an +external IP address for clients to connect to the cluster. The use of docker and +cloud-hosting can make the situation even more complicated. To ensure that +clients from different networks can interact with the cluster, HStreamDB v0.9 +provides configurations for advertised listeners. With advertised listeners +configured, servers can return the corresponding address for different clients, +according to the port to which the client sent the request. + +#### Improved HStream CLI + +To make CLI more unified and more straightforward, we have migrated the old +HStream SQL Shell and some other node management functionalities to the new +HStream CLI. 
HStream CLI currently supports operations such as starting an +interacting SQL shell, sending bootstrap initiation and checking server node +status. You can refer to [the CLI documentation](./reference/cli.md) for +details. + +#### Monitoring with Grafana + +We provide a basic monitoring solution based on Prometheus and Grafana. Metrics +collected by HStreamDB will be stored in Prometheus by the exporter and +displayed on the Grafana board. + +#### Deployment on K8s with Helm + +We provide a helm chart to support deploying HStreamDB on k8s using Helm. You +can refer to [the documentation](./deploy/deploy-helm.md) for +details. + +### Java Client + +The +[Java Client v0.9.0](https://github.com/hstreamdb/hstreamdb-java/releases/tag/v0.9.0) +has been released, with support for HStreamDB v0.9. + +### Golang Client + +The +[Go Client v0.2.0](https://github.com/hstreamdb/hstreamdb-go/releases/tag/v0.2.0) +has been released, with support for HStreamDB v0.9. + +### Python Client + +The +[Python Client v0.2.0](https://github.com/hstreamdb/hstreamdb-py/releases/tag/v0.2.0) +has been released, with support for HStreamDB v0.9. + +## v0.8.0 [2022-04-29] + +### HServer + +#### New Features + +- Add [mutual TLS support](./security/overview.md) +- Add `maxUnackedRecords` option in Subscription: The option controls the + maximum number of unacknowledged records allowed. When the amount of unacked + records reaches the maximum setting, the server will stop sending records to + consumers, which can avoid the accumulation of unacked records impacting the + server's and consumers' performance. We suggest users adjust the option based + on the consumption performance of their application. +- Add `backlogDuration` option in Streams: the option determines how long + HStreamDB will store the data in the stream. The data will be deleted and + become inaccessible when it exceeds the time set. +- Add `maxRecordSize` option in Streams: Users can use the option to control the + maximum size of a record batch in the stream when creating a stream. If the + record size exceeds the value, the server will return an error. +- Add more metrics for HStream Server. +- Add compression configuration for HStream Server. + +#### Enhancements + +- [breaking changes] Simplify protocol, refactored codes and improve the + performance of the subscription +- Optimise the implementation and improve the performance of resending +- Improve the reading performance for the HStrore client. 
+- Improve how duplicated acknowledges are handled in the subscription +- Improve subscription deletion +- Improve stream deletion +- Improve the consistent hashing algorithm of the cluster +- Improve the handling of internal exceptions for the HStream Server +- Optimise the setup steps of the server +- Improve the implementation of the stats module + +#### Bug fixes + +- Fix several memory leaks caused by grpc-haskell +- Fix several zookeeper client issues +- Fix the problem that the checkpoint store already exists during server startup +- Fix the inconsistent handling of the default key during the lookupStream + process +- Fix the problem of stream writing error when the initialisation of hstore + loggroup is incompleted +- Fix the problem that hstore client writes incorrect data +- Fix an error in allocating to idle consumers on subscriptions +- Fix the memory allocation problem of hstore client's `appendBatchBS` function +- Fix the problem of losing retransmitted data due to the unavailability of the + original consumer +- Fix the problem of data distribution caused by wrong workload sorting + +### Java Client + +#### New Features + +- Add TLS support +- Add `FlowControlSetting` setting for `BufferedProducer` +- Add `maxUnackedRecords` setting for subscription +- Add `backlogDurantion` setting for stream +- Add force delete support for subscription +- Add force delete support for stream + +#### Enhancements + +- [Breaking change] Improve `RecordId` as opaque `String` +- Improve the performance of `BufferedProducer` +- Improve `Responder` with batched acknowledges for better performance +- Improve `BufferedProducerBuilder` to use `BatchSetting` with unified + `recordCountLimit`, `bytesCountLimit`, `ageLimit` settings +- Improve the description of API in javadoc + +#### Bug fixes + +- Fix `streamingFetch` is not canceled when `Consumer` is closed +- Fix missing handling for grpc exceptions in `Consumer` +- Fix the incorrect computation of accumulated record size in `BufferedProducer` + +### Go Client + +- hstream-go v0.1.0 has been released. For a more detailed introduction and + usage, please check the + [Github repository](https://github.com/hstreamdb/hstreamdb-go). + +### Admin Server + +- a new admin server has been released, see + [Github repository](https://github.com/hstreamdb/http-services) + +### Tools + +- Add [bench tools](https://github.com/hstreamdb/bench) +- [dev-deploy] Support limiting resources of containers +- [dev-deploy] Add configuration to restart containers +- [dev-deploy] Support uploading all configuration files in deploying +- [dev-deploy] Support deployments with Prometheus Integration + +## v0.7.0 [2022-01-28] + +### Features + +#### Add transparent sharding support + +HStreamDB has already supported the storage and management of large-scale data +streams. With the newly added cluster support in the last release, we decided to +improve a single stream's scalability and reading/writing performance with a +transparent sharding strategy. In HStreamDB v0.7, every stream is spread across +multiple server nodes, but it appears to users that a stream with partitions is +managed as an entity. Therefore, users do not need to specify the number of +shards or any sharding logic in advance. + +In the current implementation, each record in a stream should contain an +ordering key to specify a logical partition, and the HStream server will be +responsible for mapping these logical partitions to physical partitions when +storing data. 
+ +#### Redesign load balancing with the consistent hashing algorithm + +We have adapted our load balancing with a consistent hashing algorithm in this +new release. Both write and read requests are currently allocated by the +ordering key of the record carried in the request. + +In the previous release, our load balancing was based on the hardware usage of +the nodes. The main problem with this was that it relied heavily on a leader +node to collect it. At the same time, this policy requires the node to +communicate with the leader to obtain the allocation results. Overall the past +implementation was too complex and inefficient. Therefore, we have +re-implemented the load balancer, which simplifies the core algorithm and copes +well with redistribution when cluster members change. + +#### Add HStream admin tool + +We have provided a new admin tool to facilitate the maintenance and management +of HStreamDB. HAdmin can be used to monitor and manage the various resources of +HStreamDB, including Stream, Subscription and Server nodes. The HStream Metrics, +previously embedded in the HStream SQL Shell, have been migrated to the new +HAdmin. In short, HAdmin is for HStreamDB operators, and SQL Shell is for +HStreamDB end-users. + +#### Deployment and usage + +- Support quick deployment via the script, see: + [Manual Deployment with Docker](./deploy/deploy-docker.md) +- Support config HStreamDB with a configuration file, see: + [HStreamDB Configuration](./reference/config.md) +- Support one-step docker-compose for quick-start: + [Quick Start With Docker Compose](./start/quickstart-with-docker.md) + +**To make use of HStreamDB v0.7, please use +[hstreamdb-java v0.7.0](https://github.com/hstreamdb/hstreamdb-java) and above** + +## v0.6.0 [2021-11-04] + +### Features + +#### Add HServer cluster support + +As a cloud-native distributed streaming database, HStreamDB has adopted a +separate architecture for computing and storage from the beginning of design, to +support the independent horizontal expansion of the computing layer and storage +layer. In the previous version of HStreamDB, the storage layer HStore already +has the ability to scale horizontally. In this release, the computing layer +HServer will also support the cluster mode so that the HServer node of the +computing layer can be expanded according to the client request and the scale of +the computing task. + +HStreamDB's computing node HServer is designed to be stateless as a whole, so it +is very suitable for rapid horizontal expansion. The HServer cluster mode of +v0.6 mainly includes the following features: + +- Automatic node health detection and failure recovery +- Scheduling and balancing client requests or computing tasks according to the + node load conditions +- Support dynamic joining and exiting of nodes + +#### Add shared-subscription mode + +In the previous version, one subscription only allowed one client to consume +simultaneously, which limited the client's consumption capacity in the scenarios +with a large amount of data. Therefore, in order to support the expansion of the +client's consumption capacity, HStreamDB v0.6 adds a shared-subscription mode, +which allows multiple clients to consume in parallel on one subscription. + +All consumers included in the same subscription form a Consumer Group, and +HServer will distribute data to multiple consumers in the consumer group through +a round-robin manner. 
The consumer group members can be dynamically changed at +any time, and the client can join or exit the current consumer group at any +time. + +HStreamDB currently supports the "at least once" consumption semantics. After +the client consumes each data, it needs to reply to the ACK. If the Ack of a +certain piece of data is not received within the timeout, HServer will +automatically re-deliver the data to the available consumers. + +Members in the same consumer group share the consumption progress. HStream will +maintain the consumption progress according to the condition of the client's +Ack. The client can resume consumption from the previous location at any time. + +It should be noted that the order of data is not maintained in the shared +subscription mode of v0.6. Subsequent shared subscriptions will support a +key-based distribution mode, which can support the orderly delivery of data with +the same key. + +#### Add statistical function + +HStreamDB v0.6 also adds a basic data statistics function to support the +statistics of key indicators such as stream write rate and consumption rate. +Users can view the corresponding statistical indicators through HStream CLI, as +shown in the figure below. + +![](./statistics.png) + +#### Add REST API for data writing + +HStreamDB v0.6 adds a REST API for writing data to HStreamDB. diff --git a/docs/v0.17.0/security/_index.md b/docs/v0.17.0/security/_index.md new file mode 100644 index 0000000..d944a7d --- /dev/null +++ b/docs/v0.17.0/security/_index.md @@ -0,0 +1,6 @@ +--- +order: ["overview.md", "encryption.md", "authentication.md"] +collapsed: false +--- + +Security diff --git a/docs/v0.17.0/security/authentication.md b/docs/v0.17.0/security/authentication.md new file mode 100644 index 0000000..6a79ec9 --- /dev/null +++ b/docs/v0.17.0/security/authentication.md @@ -0,0 +1,83 @@ +# Authentication + +After enabling TLS, clients can verify connecting servers and keep messages encrypted, +but servers can not verify clients, +so authentication is designed to provide a mechanism that servers can authenticate trusted clients. + +Authentication provides another feature that gives a client a role name, +then hstream will be based on the role to implement authorization. + +hstream only support TLS authentication, which is an extension of default TLS, +to enable TLS authentication, +you need to create the corresponding key and certificate for a role, +then give them to trusted clients, +clients use the key and certificate(binding to a role) to connect to servers. 
+ +## Create a trusted role + +Generate a key: + +```shell +openssl genrsa -out role01.key.pem 2048 +``` + +Convert it to PKCS 8 format(Java client require that): + +```shell +openssl pkcs8 -topk8 -inform PEM -outform PEM \ + -in role01.key.pem -out role01.key-pk8.pem -nocrypt +``` + +Generate the certificate request(Common Name is the role name): + +```shell +openssl req -config openssl.cnf \ + -key role01.key.pem -new -sha256 -out role01.csr.pem +``` + +Generate the signed certificate: + +```shell +openssl ca -config openssl.cnf -extensions usr_cert \ + -days 1000 -notext -md sha256 \ + -in role01.csr.pem -out signed.role01.cert.pem +``` + +## Configuration + +For hstream server, you can set `tls-ca-path` to enable TLS authentication, e.g.: + +```yaml +# TLS options +# +# enable tls, which requires tls-key-path and tls-cert-path options +enable-tls: true +# +# key file path for tls, can be generated by openssl +tls-key-path: /path/to/the/server.key.pem +# +# the signed certificate by CA for the key(tls-key-path) +tls-cert-path: /path/to/the/signed.server.cert.pem +# +# optional for tls, if tls-ca-path is not empty, then enable TLS authentication, +# in the handshake phase, +# the server will request and verify the client's certificate. +tls-ca-path: /path/to/the/ca.cert.pem +``` + +For Java client: + +```java +HStreamClient.builder() + .serviceUrl(serviceUrl) + // enable tls + .enableTLS() + .tlsCaPath("/path/to/ca.pem") + + // for authentication + .enableTlsAuthentication() + .tlsKeyPath("path/to/role01.key-pk8.pem") + .tlsCertPath("path/to/signed.role01.cert.pem") + + .build() +``` diff --git a/docs/v0.17.0/security/encryption.md b/docs/v0.17.0/security/encryption.md new file mode 100644 index 0000000..160c0b5 --- /dev/null +++ b/docs/v0.17.0/security/encryption.md @@ -0,0 +1,110 @@ +# Encryption + +hstream supported encryption between servers and clients using TLS, +in this chapter, we will not introduce more details about TLS, +instead, we will only show steps and configurations to enable it. + +## Steps + +If you don't have any existed CA(Certificate Authority), +you can create one locally, +and TLS requires that each server have a key +and the corresponding signed certificate, +openssl is a good tool to generate them, +after that, you need to configure the files paths +in the servers and clients sides to enable it. + +### Create a local CA + +Create or choose a directory for storing keys and certificates: + +```shell +mkdir tls +cd tls +``` + +Create a database file and serial number file: + +```shell +touch index.txt +echo 1000 > serial +``` + +Get the template openssl.cnf file(**the template file is intended for testing and development, +do not use it in the production environment directly**): + +```shell +wget https://raw.githubusercontent.com/hstreamdb/hstream/main/conf/openssl.cnf +``` + +Generate the CA key file: + +```shell +openssl genrsa -aes256 -out ca.key.pem 4096 +``` + +Generate the CA certificate file: + +```shell +openssl req -config openssl.cnf -key ca.key.pem \ + -new -x509 -days 7300 -sha256 -extensions v3_ca \ + -out ca.cert.pem +``` + +### Create key pair and sign certificate for a server + +Here we only generate a key and certificate for one server, +you should create them for all hstream servers that have a different hostname, +or create a certificate including all hostnames(IP or DNS) in SANs. 
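+
+For example, when you reach the certificate-request step below, one way to cover
+several hosts with a single certificate is to add a `subjectAltName` extension to
+the request. The hostnames and IP here are placeholders, and the `-addext` flag
+requires OpenSSL 1.1.1 or later:
+
+```shell
+# Sketch only: request a certificate whose SANs cover several hosts at once.
+openssl req -config openssl.cnf \
+    -key server01.key.pem -new -sha256 -out server01.csr.pem \
+    -addext "subjectAltName=DNS:hserver1.example.com,DNS:hserver2.example.com,IP:10.100.2.11"
+```
+
+Depending on your `openssl.cnf`, the CA section may also need `copy_extensions = copy`
+so that the SANs are carried into the signed certificate.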
+ +Generate the server key file: + +```shell +openssl genrsa -out server01.key.pem 2048 +``` + +Generate the server certificate request, +when you input Common Name, +you should write the correct hostname(e.g., localhost): + +```shell +openssl req -config openssl.cnf \ + -key server01.key.pem -new -sha256 -out server01.csr.pem +``` + +generate the server certificate with the generated CA: + +```shell +openssl ca -config openssl.cnf -extensions server_cert \ + -days 1000 -notext -md sha256 \ + -in server01.csr.pem -out signed.server01.cert.pem +``` + +### Configure the server and clients + +The options for servers: + +```yaml +# TLS options +# +# enable tls, which requires tls-key-path and tls-cert-path options +enable-tls: true + +# +# key file path for tls, can be generated by openssl +tls-key-path: /path/to/the/server01.key.pem + +# the signed certificate by CA for the key(tls-key-path) +tls-cert-path: /path/to/the/signed.server01.cert.pem +``` + +Java client: +```java +HStreamClient.builder() + .serviceUrl(serviceUrl) + // optional, enable tls + .enableTls() + .tlsCaPath("/path/to/ca.cert.pem") + + .build() +``` diff --git a/docs/v0.17.0/security/overview.md b/docs/v0.17.0/security/overview.md new file mode 100644 index 0000000..9607924 --- /dev/null +++ b/docs/v0.17.0/security/overview.md @@ -0,0 +1,10 @@ +# Overview + +Considering performance and convenience, +hstream will not enable security features(encryption, authentication, etc.) by default, +but if your clients communicate with hstream servers by an insecure network, +you should enable them. + +hstream supported security features: ++ Encryption: to prevent eavesdropping and tampering by man-in-the-middle attacks between clients and servers. ++ Authentication: to provide a mechanism that servers can authenticate trusted clients and an interface for authorization. diff --git a/docs/v0.17.0/start/_index.md b/docs/v0.17.0/start/_index.md new file mode 100644 index 0000000..ed9ee03 --- /dev/null +++ b/docs/v0.17.0/start/_index.md @@ -0,0 +1,9 @@ +--- +order: + - try-out-hstream-platform.md + - quickstart-with-docker.md + - hstream-console.md +collapsed: false +--- + +Get started diff --git a/docs/v0.17.0/start/hstream-console-screenshot.png b/docs/v0.17.0/start/hstream-console-screenshot.png new file mode 100644 index 0000000..9490fa4 Binary files /dev/null and b/docs/v0.17.0/start/hstream-console-screenshot.png differ diff --git a/docs/v0.17.0/start/hstream-console.md b/docs/v0.17.0/start/hstream-console.md new file mode 100644 index 0000000..62a73c0 --- /dev/null +++ b/docs/v0.17.0/start/hstream-console.md @@ -0,0 +1,33 @@ +# Get Started on HStream Console + +HStream Console is a web-based management tool for HStreamDB. It provides a graphical user interface to manage HStreamDB clusters. +With HStream Console, you can easily create and manage streams, and write SQL queries to process data in real time. Besides operating HStreamDB, +HStream Console also provides metrics for each resource in the cluster, which helps you to monitor the cluster status. + +![HStream Console Overview](./hstream-console-screenshot.png) + +## Features + +### Manage HStreamDB resources directly + +HStream Console provides a graphical user interface to manage HStreamDB resources, including streams, subscriptions, and queries. +You can easily create and delete resources in the cluster, write data to streams, and write SQL queries to process data. + +It can also help you to search for resources in the cluster, and provide a detailed view of each resource. 
+ +### Monitor resources in the cluster + +In every resource view, HStream Console provides a metrics panel to monitor the resource status in real-time. With the metrics panel, +you can intuitively visualize the resource status, and easily find out the bottleneck of the cluster. + +### Data synchronization + +With connectors in HStream Console, you can gain the ability to synchronize data between HStreamDB and other data sources, such as MySQL, PostgreSQL, and Elasticsearch. +Check out [HStream IO Overview](../ingest-and-distribute/overview.md) to learn more about connectors. + +## Next steps + +To learn more about HStreamDB's resources, follow the links below: + +- [Streams](../write/stream.md) +- [Subscriptions](../receive/subscription.md) diff --git a/docs/v0.17.0/start/quickstart-with-docker.md b/docs/v0.17.0/start/quickstart-with-docker.md new file mode 100644 index 0000000..3185e09 --- /dev/null +++ b/docs/v0.17.0/start/quickstart-with-docker.md @@ -0,0 +1,255 @@ +# Quickstart with Docker-Compose + +## Requirement + +For optimal performance, we suggest utilizing a Linux kernel version of 4.14 or +higher when initializing an HStreamDB Cluster. + +::: tip +In the case it is not possible for the user to use a Linux kernel version of +4.14 or above, we recommend adding the option `--enable-dscp-reflection=false` +to HStore while starting the HStreamDB Cluster. +::: + +## Installation + +### Install docker + +::: tip +If you have already installed docker, you can skip this step. +::: + +See [Install Docker Engine](https://docs.docker.com/engine/install/), and +install it for your operating system. Please carefully check that you have met +all prerequisites. + +Confirm that the Docker daemon is running: + +```sh +docker version +``` + +::: tip +On Linux, Docker needs root privileges. You can also run Docker as a +non-root user, see [Post-installation steps for Linux][non-root-docker]. +::: + +### Install docker compose + +::: tip +If you have already installed docker compose, you can skip this step. +::: + +See [Install Docker Compose](https://docs.docker.com/compose/install/), and +install it for your operating system. Please carefully check that you met all +prerequisites. + +```sh +docker-compose version +``` + +## Start HStreamDB Services + +::: warning +Do NOT use this configuration in your production environment! +::: + +Create a docker-compose.yaml file for docker compose, you can +[download][quick-start.yaml] or paste the following contents: + +<<< @/../assets/quick-start.yaml.template{yaml-vue} + +then run: + +```sh +docker-compose -f quick-start.yaml up +``` + +If you see some thing like this, then you have a running hstream: + +```txt +hserver_1 | [INFO][2021-11-22T09:15:18+0000][app/server.hs:137:3][thread#67]************************ +hserver_1 | [INFO][2021-11-22T09:15:18+0000][app/server.hs:145:3][thread#67]Server started on port 6570 +hserver_1 | [INFO][2021-11-22T09:15:18+0000][app/server.hs:146:3][thread#67]************************* +``` + +::: tip +You can also run in background: +```sh +docker-compose -f quick-start.yaml up -d +``` +::: + +::: tip +If you want to show logs of server, run: +```sh +docker-compose -f quick-start.yaml logs -f hserver +``` +::: + +## Connect HStreamDB with HSTREAM CLI + +HStreamDB can be directly managed using the `hstream` command-line interface (CLI), which is included in the `hstreamdb/hstream` image. 
+ +Start an instance of `hstreamdb/hstream` using Docker: + +```sh-vue +docker run -it --rm --name some-hstream-cli --network host hstreamdb/hstream:{{ $version() }} bash +``` + +## Create stream + +To create a stream, you can use `hstream stream create` command. Now we will create a stream with 2 shard + +```sh +hstream stream create demo --shards 2 +``` + +```sh ++-------------+---------+----------------+-------------+ +| Stream Name | Replica | Retention Time | Shard Count | ++-------------+---------+----------------+-------------+ +| demo | 1 | 604800 seconds | 2 | ++-------------+---------+----------------+-------------+ +``` + +## Write data to streams + +The `hstream stream append` command can be used to write data to a stream in a interactive shell. +```sh +hstream stream append demo --separator "@" +``` +- With the `--separator` option, you can specify the separator for key. The default separator is "@". Using the separator, you can assign a key to each record. Record with same key will be append into same shard of the stream. + +```sh +key1@{"temperature": 22, "humidity": 80} +key1@{"temperature": 32, "humidity": 21, "tag": "test1"} +hello world! +``` +Here we have written three pieces of data. The first two are in JSON format and are associated with key1. The last one does not specify a key. + +For additional information, you can use `hstream stream append -h`. + +## Read data from a stream + +To read data from a particular stream, the `hstream stream read-stream` command is used. + +```sh +hstream stream read-stream demo +``` + +```sh +timestamp: "1692774821444", id: 1928822601796943-8589934593-0, key: "key1", record: {"humidity":80.0,"temperature":22.0} +timestamp: "1692774844649", id: 1928822601796943-8589934594-0, key: "key1", record: {"humidity":21.0,"tag":"test1","temperature":32.0} +timestamp: "1692774851017", id: 1928822601796943-8589934595-0, key: "", record: hello world! +``` + +You can also set a read offset, which can be one of the following types: + +- `earliest`: This seeks to the first record of the stream. +- `latest`: This seeks to the end of the stream. +- `timestamp`: This seeks to a record with a specific creation timestamp. + +For instance: + +```sh +hstream stream read-stream demo --from 1692774844649 --total 1 +``` + +```sh +timestamp: "1692774844649", id: 1928822601796943-8589934594-0, key: "key1", record: {"humidity":21.0,"tag":"test1","temperature":32.0} +``` + +## Start HStreamDB's interactive SQL CLI + +```sh-vue +docker run -it --rm --name some-hstream-cli --network host hstreamdb/hstream:{{ $version() }} hstream --port 6570 sql +``` + +If everything works fine, you will enter an interactive CLI and see help +information like + +```txt + __ _________________ _________ __ ___ + / / / / ___/_ __/ __ \/ ____/ | / |/ / + / /_/ /\__ \ / / / /_/ / __/ / /| | / /|_/ / + / __ /___/ // / / _, _/ /___/ ___ |/ / / / + /_/ /_//____//_/ /_/ |_/_____/_/ |_/_/ /_/ + +Command + :h To show these help info + :q To exit command line interface + :help [sql_operation] To show full usage of sql statement + +SQL STATEMENTS: + To create a simplest stream: + CREATE STREAM stream_name; + + To create a query select all fields from a stream: + SELECT * FROM stream_name EMIT CHANGES; + + To insert values to a stream: + INSERT INTO stream_name (field1, field2) VALUES (1, 2); + +> +``` + +## Run a continuous query over the stream + +Now we can run a continuous query over the stream we just created by `SELECT` +query. 
+ +The query will output all records from the `demo` stream whose humidity is above +70 percent. + +```sql +SELECT * FROM demo WHERE humidity > 70 EMIT CHANGES; +``` + +It seems that nothing happened. But do not worry because there is no data in the +stream now. Next, we will fill the stream with some data so the query can +produce output we want. + +## Start another CLI session + +Start another CLI session, this CLI will be used for inserting data into the +stream. + +```sh +docker exec -it some-hstream-cli hstream --port 6570 sql +``` + +## Insert data into the stream + +Run each of the given `INSERT` statement in the new CLI session and keep an eye +on the CLI session created in (2). + +```sql +INSERT INTO demo (temperature, humidity) VALUES (22, 80); +INSERT INTO demo (temperature, humidity) VALUES (15, 20); +INSERT INTO demo (temperature, humidity) VALUES (31, 76); +INSERT INTO demo (temperature, humidity) VALUES ( 5, 45); +INSERT INTO demo (temperature, humidity) VALUES (27, 82); +INSERT INTO demo (temperature, humidity) VALUES (28, 86); +``` + +If everything works fine, the continuous query will output matching records in +real time: + +```json +{"humidity":{"$numberLong":"80"},"temperature":{"$numberLong":"22"}} +{"humidity":{"$numberLong":"76"},"temperature":{"$numberLong":"31"}} +{"humidity":{"$numberLong":"82"},"temperature":{"$numberLong":"27"}} +{"humidity":{"$numberLong":"86"},"temperature":{"$numberLong":"28"}} +``` + +[non-root-docker]: https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user +[quick-start.yaml]: https://raw.githubusercontent.com/hstreamdb/docs-next/main/assets/quick-start.yaml + +## Start Discovery HStreamDB using CONSOLE + +The HStreamDB Console is the management panel for HStreamDB. You can use it to manage most resources of HStreamDB, perform data reading and writing, execute SQL queries, and more. + +You can open the Console panel by entering http://localhost:5177 into your browser, for more details about the Console, please check [Get Started on HStream Console](./hstream-console.md). + +Now, you can start exploring HStreamDB with joy. diff --git a/docs/v0.17.0/start/try-out-hstream-platform.md b/docs/v0.17.0/start/try-out-hstream-platform.md new file mode 100644 index 0000000..e86c49d --- /dev/null +++ b/docs/v0.17.0/start/try-out-hstream-platform.md @@ -0,0 +1,85 @@ +# Get Started on HStream Platform + +This page guides you on how to try out the HStream Platform quickly from scratch. +You will learn how to create a stream, write records to the stream, and query records from the stream. + +## Apply for a Trial + +Before starting, you need to apply for a trial account for the HStream Platform. +If you already have an account, you can skip this step. + +### Create a new account + + + +::: info +By creating an account, you agree to the [Terms of Service](https://www.emqx.com/en/policy/terms-of-use) and [Privacy Policy](https://www.emqx.com/en/policy/privacy-policy). +::: + +To create a new account, please fill in the required information on the form provided on the [Sign Up](https://account.hstream.io/signup) page, all fields are shown below: + +- **Username**: Your username. +- **Email**: Your email address. This email address will be used for the HStream Platform login. +- **Password**: Your password. The password must be at least eight characters long. +- **Company (Optional)**: Your company name. + +After completing the necessary fields, click the **Sign Up** button to proceed with creating your new account. 
In case of a successful account creation, you will be redirected to the login page. + +### Log in to the HStream Platform + +To log in to the HStream Platform after creating an account, please fill in the required information on the form provided on the [Log In](https://account.hstream.io/login) page, all fields are shown below: + +- **Email**: Your email address. +- **Password**: Your password. + +Once you have successfully logged in, you will be redirected to the home of HStream Platform. + +## Create a stream + +To create a new stream, follow the steps below: + +1. Head to the **Streams** page and locate the **New stream** button. +2. Once clicked, you will be directed to the **New stream** page. +3. Here, simply provide a name for the stream and leave the other fields as default. +4. Finally, click on the **Create** button to finalize the stream creation process. + +The stream will be created immediately, and you will see the stream listed on the **Streams** page. + +::: tip +For more information about how to create a stream, see [Create a Stream](../platform/stream-in-platform.md#create-a-stream). +::: + +## Write records to the stream + +After creating a stream, you can write records to the stream. Go to the stream details page by clicking the stream name in the table and +then click the **Write records** button. A drawer will appear, and you can write records to the stream in the drawer. + +In this example, we will write the following record to the stream: + +```json +{ "name": "Alice", "age": 18 } +``` + +Please fill it in the **Value** Field and click the **Produce** button. + +If the record is written successfully, you will see a success message and the response +of the request. + +Next, we can query this record from the stream. + +::: tip +For more information about how to write records to a stream, see [Write Records to Streams](../platform/write-in-platform.md). +::: + +## Get records from the stream + +After writing records to the stream, you can get records from the stream. Go back +to the stream page, click the **Records** tab, you will see a empty table. + +Click the **Get records** button, and then the record written in the previous step will be displayed. + +## Next steps + +- Explore the [stream in details](../platform/stream-in-platform.md#view-stream-details). +- [Create a subscription](../platform/subscription-in-platform.md#create-a-subscription) to consume records from the stream. +- [Query records](../platform/write-in-platform.md#query-records) from streams. diff --git a/docs/v0.17.0/statistics.png b/docs/v0.17.0/statistics.png new file mode 100644 index 0000000..adaf375 Binary files /dev/null and b/docs/v0.17.0/statistics.png differ diff --git a/docs/v0.17.0/write/_index.md b/docs/v0.17.0/write/_index.md new file mode 100644 index 0000000..e2d4f89 --- /dev/null +++ b/docs/v0.17.0/write/_index.md @@ -0,0 +1,6 @@ +--- +order: ['stream.md', 'shards.md', 'write.md'] +collapsed: false +--- + +Write data diff --git a/docs/v0.17.0/write/shards.md b/docs/v0.17.0/write/shards.md new file mode 100644 index 0000000..5724a95 --- /dev/null +++ b/docs/v0.17.0/write/shards.md @@ -0,0 +1,39 @@ +# Manage Shards of the Stream + +## Sharding in HStreamDB + +A stream is a logical concept for producer and consumer, and under the hood, +these data passing through are stored in the shards of the stream in an +append-only fashion. + +A shard is essentially the primary storage unit which contains all the +corresponding records with some partition keys. 
Every stream will contain
multiple shards spread across multiple server nodes. Since we believe that a
stream is in itself a sufficiently concise and powerful abstraction, the
sharding logic is minimally visible to the user. For example, during writing or
consumption, each stream appears to the user as a single manageable entity.

However, for cases where users need more fine-grained control and better
flexibility, we offer interfaces to inspect the details of the shards of a
stream, as well as shard-level interfaces such as the Reader.

## Specify the Number of Shards When Creating a Stream

To decide how many shards a stream should have, an attribute
`shardCount` is provided when creating a
[stream](./stream.md#attributes-of-a-stream).

## List Shards

To list all the shards of a stream:

::: code-group

<<< @/../examples/java/app/src/main/java/docs/code/examples/ListShardsExample.java [Java]

<<< @/../examples/go/examples/ExampleListShards.go [Go]

@snippet examples/py/snippets/guides.py common list-shards

::: diff --git a/docs/v0.17.0/write/stream.md b/docs/v0.17.0/write/stream.md new file mode 100644 index 0000000..67e26f3 --- /dev/null +++ b/docs/v0.17.0/write/stream.md @@ -0,0 +1,92 @@ +# Create and Manage Streams

## Guidelines to name a resource

An HStream resource name uniquely identifies an HStream resource, such as a
stream, a subscription or a reader. The resource name must meet the following
requirements:

- Start with a letter
- Length must be no longer than 255 characters
- Contain only the following characters: Letters `[A-Za-z]`, numbers `[0-9]`, underscores `_`

For cases where the resource name is used as part of a SQL statement, such as in
the [HStream SQL Shell](../reference/cli.md#hstream-sql-shell), the name may not
be parsed properly (for example, when it conflicts with keywords); in such cases,
enclose the resource name in double quotes `"`.

## Attributes of a Stream

- Replication factor

  For fault tolerance and higher availability, every stream can be replicated
  across nodes in the cluster. A typical production setting is a replication
  factor of 3, i.e., there will always be three copies of your data, which is
  helpful in case things go wrong or you want to do maintenance on the nodes.
  This replication is performed at the level of the stream.

- Backlog retention

  This configuration controls how long a stream in HStreamDB retains records
  after they are appended. Once a record exceeds the backlog retention duration,
  HStreamDB will discard it regardless of whether it has been consumed.

  - Default = 7 days
  - Minimum value = 1 second
  - Maximum value = 21 days

- Shard Count

  The number of shards that a stream will have.

## Create a Stream

Create a stream before you write records or create a subscription.

::: code-group

<<< @/../examples/java/app/src/main/java/docs/code/examples/CreateStreamExample.java [Java]

<<< @/../examples/go/examples/ExampleCreateStream.go [Go]

@snippet examples/py/snippets/guides.py common create-stream

:::

## Delete a Stream

Deletion is only allowed when a stream has no associated subscriptions, unless
the force flag is set.

### Delete a stream with the force flag

If you need to delete a stream with subscriptions, enable force deletion.
Existing stream subscriptions can still read from the backlog after deleting a
stream with the force flag enabled.
However, these subscriptions will see the
stream name as `__deleted_stream__`; creating new subscriptions on the deleted
stream is no longer allowed, nor can new records be written to the stream.

::: code-group

<<< @/../examples/java/app/src/main/java/docs/code/examples/DeleteStreamExample.java [Java]

<<< @/../examples/go/examples/ExampleDeleteStream.go [Go]

@snippet examples/py/snippets/guides.py common delete-stream

:::

## List Streams

To get all streams in HStreamDB:

::: code-group

<<< @/../examples/java/app/src/main/java/docs/code/examples/ListStreamsExample.java [Java]

<<< @/../examples/go/examples/ExampleListStreams.go [Go]

@snippet examples/py/snippets/guides.py common list-streams

::: diff --git a/docs/v0.17.0/write/write.md b/docs/v0.17.0/write/write.md new file mode 100644 index 0000000..13ee0af --- /dev/null +++ b/docs/v0.17.0/write/write.md @@ -0,0 +1,98 @@ +# Write Records to Streams

This document provides information about how to write data to streams in
HStreamDB using hstreamdb-java or clients implemented in other languages.

You can also read the following pages to get a more thorough understanding:

- How to [create and manage Streams](./stream.md).
- How to [consume the data written to a Stream](../receive/consume.md).

To write data to HStreamDB, we need to pack messages as HStream Records and use
a producer that creates and sends them to servers.

## HStream Record

All data in streams is in the form of HStream Records. There are two kinds of
HStream Record:

- **HRecord**: You can think of an hrecord as a piece of JSON data, just like
  the document in some NoSQL databases.
- **Raw Record**: Arbitrary binary data.

## End-to-End Compression

To reduce transfer overhead and maximize bandwidth utilization, HStreamDB
supports the compression of written HStream records. Users can set the
compression algorithm when creating a `BufferedProducer`. Currently, HStreamDB
supports both `gzip` and `zstd` compression algorithms. Compressed records are
automatically decompressed by the client when they are consumed from HStreamDB.

## Write HStream Records

There are two ways to write records to servers. For simplicity, you can use the
`Producer` from `client.newProducer()` to start with. A `Producer` does not
provide any configuration options; it simply sends records to servers as soon
as possible, and all these records are sent in parallel, which means they are
unordered. In practice, the `BufferedProducer` from
`client.newBufferedProducer()` is almost always the better choice. A
`BufferedProducer` will buffer records in order as a batch and send the batch
to servers. When a record is written to the stream, HStream Server will
generate a corresponding record id for the record and send it back to the
client. The record id is unique in the stream.

## Write Records Using a Producer

::: code-group

<<< @/../examples/java/app/src/main/java/docs/code/examples/WriteDataSimpleExample.java [Java]

<<< @/../examples/go/examples/ExampleWriteProducer.go [Go]

@snippet examples/py/snippets/guides.py common append-records

:::

## Write Records Using a Buffered Producer

In almost all scenarios, we would recommend using `BufferedProducer` whenever
possible because it offers higher throughput and provides a very flexible
configuration that allows you to adjust between throughput and latency as
needed.
You can configure the following two settings of BufferedProducer to +control and set the trigger and the buffer size. With `BatchSetting`, you can +determine when to send the batch based on the maximum number of records, byte +size in the batch and the maximum age of the batch. By configuring +`FlowControlSetting`, you can set the buffer for all records. The following code +example shows how you can use `BatchSetting` to set responding triggers to +notify when the producer should flush and `FlowControlSetting` to limit maximum +bytes in a BufferedProducer. + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/WriteDataBufferedExample.java [Java] + +<<< @/../examples/go/examples/ExampleWriteBatchProducer.go [Go] + +@snippet examples/py/snippets/guides.py common buffered-append-records + +::: + +## Write Records with Partition Keys + +Partition keys are optional, and if not given, the server will automatically +assign a default key. Records with the same partition key can be guaranteed to +be written orderly in BufferedProducer. + +Another important feature of HStreamDB, sharding, uses these partition keys to +decide which shards the record will be allocated to and improve write/read +performance. See [Manage Shards of a Stream](./shards.md) for a more detailed +explanation. + +You can easily write records with keys using the following example: + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/WriteDataWithKeyExample.java [Java] + +<<< @/../examples/go/examples/ExampleWriteBatchProducerMultiKey.go [Go] + +::: diff --git a/docs/zh/v0.17.0/_index.md b/docs/zh/v0.17.0/_index.md new file mode 100644 index 0000000..fa62533 --- /dev/null +++ b/docs/zh/v0.17.0/_index.md @@ -0,0 +1,16 @@ +--- +order: + [ + 'overview', + 'start', + 'platform', + 'write', + 'receive', + 'process', + 'ingest-and-distribute', + 'deploy', + 'security', + 'reference', + 'release-notes.md', + ] +--- diff --git a/docs/zh/v0.17.0/deploy/_index.md b/docs/zh/v0.17.0/deploy/_index.md new file mode 100644 index 0000000..a909e7a --- /dev/null +++ b/docs/zh/v0.17.0/deploy/_index.md @@ -0,0 +1,6 @@ +--- +order: ["deploy-helm", "deploy-k8s", "deploy-docker", "quick-deploy-ssh"] +collapsed: false +--- + +部署 diff --git a/docs/zh/v0.17.0/deploy/deploy-docker.md b/docs/zh/v0.17.0/deploy/deploy-docker.md new file mode 100644 index 0000000..40ced03 --- /dev/null +++ b/docs/zh/v0.17.0/deploy/deploy-docker.md @@ -0,0 +1,254 @@ +# 用 Docker 手动部署 + +本文描述了如何用 Docker 运行 HStreamDB 集群。 + +::: warning + +本教程只展示了用 Docker 启动 HStreamDB 集群的主要过程。参数的配置没有考虑到任何安全问题,所以请 +请不要在部署时直接使用它们 + +::: + +## 设置一个 ZooKeeper 集群 + +`HServer` 和 `HStore` 需要 ZooKeeper 来存储一些元数据,所以首先我们需要配置一个 ZooKeeper 集群。 + +你可以在网上找到关于如何建立一个合适的 ZooKeeper 集群的教程。 + +这里我们只是通过 docker 快速启动一个单节点的 ZooKeeper 为例。 + +```shell +docker run --rm -d --name zookeeper --network host zookeeper +``` + +## 在存储节点上创建数据文件夹 + +存储节点会把数据存储分片(Shard)中。通常情况下,每个分片映射到不同的物理磁盘。 +假设你的数据盘被挂载(mount)在`/mnt/data0` 上 + +```shell +# creates the root folder for data +sudo mkdir -p /data/logdevice/ + +# writes the number of shards that this box will have +echo 1 | sudo tee /data/logdevice/NSHARDS + +# creates symlink for shard 0 +sudo ln -s /mnt/data0 /data/logdevice/shard0 + +# adds the user for the logdevice daemon +sudo useradd logdevice + +# changes ownership for the data directory and the disk +sudo chown -R logdevice /data/logdevice/ +sudo chown -R logdevice /mnt/data0/ +``` + +- See + [Create data 
folders](https://logdevice.io/docs/FirstCluster.html#4-create-data-folders-on-storage-nodes) + for details + +## 创建配置文件 + +这里是一个配置文件的最小示例。 + +在使用它之前,请根据你的情况进行修改。 + +```json +{ + "server_settings": { + "enable-nodes-configuration-manager": "true", + "use-nodes-configuration-manager-nodes-configuration": "true", + "enable-node-self-registration": "true", + "enable-cluster-maintenance-state-machine": "true" + }, + "client_settings": { + "enable-nodes-configuration-manager": "true", + "use-nodes-configuration-manager-nodes-configuration": "true", + "admin-client-capabilities": "true" + }, + "cluster": "logdevice", + "internal_logs": { + "config_log_deltas": { + "replicate_across": { + "node": 3 + } + }, + "config_log_snapshots": { + "replicate_across": { + "node": 3 + } + }, + "event_log_deltas": { + "replicate_across": { + "node": 3 + } + }, + "event_log_snapshots": { + "replicate_across": { + "node": 3 + } + }, + "maintenance_log_deltas": { + "replicate_across": { + "node": 3 + } + }, + "maintenance_log_snapshots": { + "replicate_across": { + "node": 3 + } + } + }, + "metadata_logs": { + "nodeset": [], + "replicate_across": { + "node": 3 + } + }, + "zookeeper": { + "zookeeper_uri": "ip://10.100.2.11:2181", + "timeout": "30s" + } +} +``` + +- 如果你有一个多节点的 ZooKeeper,修改 `zookeeper_uri`部分为 ZooKeeper 集群的节点和端口列表: + + ```json + "zookeeper": { + "zookeeper_uri": "ip://10.100.2.11:2181,10.100.2.12:2181,10.100.2.13:2181", + "timeout": "30s" + } + ``` + +- 所有属性的详细解释可以在[集群配置](https://logdevice.io/docs/Config.html) 文档中找到。 + +## 存储配置文件 + +你可以将配置文件存储在 ZooKeeper 中,或存储在每个存储节点上。 + +### 在 ZooKeeper 中存储配置文件 + +假设你的一个 ZooKeeper 节点上有一个路径为 `~/logdevice.conf` 的配置文件。 + +通过运行以下命令将配置文件保存到 ZooKeeper 中: + +```shell +docker exec zookeeper zkCli.sh create /logdevice.conf "`cat ~/logdevice.conf`" +``` + +通过以下命令验证创建是否成功: + +```shell +docker exec zookeeper zkCli.sh get /logdevice.conf +``` + +## 配置 HStore 集群 + +对于存储在 ZooKeeper 中的配置文件,假设配置文件中 `zookeeper_uri` 字段的值是 `"ip:/10.100.2.11:2181"` ,ZooKeeper 中配置文件的路径是 `/logdevice.conf` 。 + +对于存储在每个节点上的配置文件,假设你的文件路径是 `/data/logdevice/logdevice.conf'。 + +### 在单个节点上启动 admin 服务器 + +- 配置文件存储在 ZooKeeper 中: + + ```shell-vue + docker run --rm -d --name storeAdmin --network host -v /data/logdevice:/data/logdevice \ + hstreamdb/hstream:{{ $version() }} /usr/local/bin/ld-admin-server \ + --config-path zk:10.100.2.11:2181/logdevice.conf \ + --enable-maintenance-manager \ + --maintenance-log-snapshotting \ + --enable-safety-check-periodic-metadata-update + ``` + + - 如果你有一个多节点的 ZooKeeper,请将`--config-path`替换为: + `--config-path zk:10.100.2.11:2181,10.100.2.12:2181,10.100.2.13:2181/logdevice.conf` + +- 存储在每个节点的配置文件: + + 更改 `--config-path` 参数为 `--config-path /data/logdevice/logdevice.conf` + +### 在每个节点上启动 logdeviced + +- 存储在 ZooKeeper 中的配置文件: + + ```shell-vue + docker run --rm -d --name hstore --network host -v /data/logdevice:/data/logdevice \ + hstreamdb/hstream:{{ $version() }} /usr/local/bin/logdeviced \ + --config-path zk:10.100.2.11:2181/logdevice.conf \ + --name store-0 \ + --address 192.168.0.3 \ + --local-log-store-path /data/logdevice + ``` + + - 对于每个节点,你应该将`--name`更新为一个不同的值,并将`--address`更新为该节点的 IP 地址。 + +- 存储在每个节点的配置文件: + + 更改 `--config-path` 参数为 `--config-path /data/logdevice/logdevice.conf` + +### Bootstrap 集群 + +在启动管理服务器和每个存储节点的 logdeviced 之后,现在我们可以 bootstrap 我们的集群。 + +在管理服务器节点上,运行。 + +```shell +docker exec storeAdmin hadmin store nodes-config bootstrap --metadata-replicate-across 'node:3' +``` + +你应该看到像这样的信息: + +``` +Successfully bootstrapped the cluster, new nodes configuration version: 7 
+Took 0.019s +``` + +你可以通过运行以下命令来检查集群的状态: + +```shell +docker exec storeAdmin hadmin store status +``` + +而结果应该是: + +``` ++----+---------+----------+-------+-----------+---------+---------------+ +| ID | NAME | PACKAGE | STATE | UPTIME | SEQ. | HEALTH STATUS | ++----+---------+----------+-------+-----------+---------+---------------+ +| 0 | store-0 | 99.99.99 | ALIVE | 2 min ago | ENABLED | HEALTHY | +| 1 | store-2 | 99.99.99 | ALIVE | 2 min ago | ENABLED | HEALTHY | +| 2 | store-1 | 99.99.99 | ALIVE | 2 min ago | ENABLED | HEALTHY | ++----+---------+----------+-------+-----------+---------+---------------+ +Took 7.745s +``` + +现在我们完成了对 `HStore` 集群的设置。 + +## 配置 HServer 集群 + +要启动一个单一的 `HServer` 实例,你可以修改启动命令以适应你的情况。 + +```shell-vue +docker run -d --name hstream-server --network host \ + hstreamdb/hstream:{{ $version() }} /usr/local/bin/hstream-server \ + --bind-address $SERVER_HOST \ + --advertised-address $SERVER_HOST \ + --seed-nodes $SERVER_HOST \ + --metastore-uri zk://$ZK_ADDRESS \ + --store-config zk:$ZK_ADDRESS/logdevice.conf \ + --store-admin-host $ADMIN_HOST \ + --replicate-factor 3 \ + --server-id 1 +``` + +- `$SERVER_HOST`:你的服务器节点的主机 IP 地址,例如 `192.168.0.1`。 +- `metastore-uri`: 你的元信息存储 HMeta 地址,例如使用 `zk://$ZK_ADDRESS` 指定 zookeeper 存储元数据。同时实现性支持使用 rqlite `rq://$RQ_ADDRESS`。 +- `$ZK_ADDRESS` :你的 ZooKeeper 集群地址列表,例如 `10.100.2.11:2181,10.100.2.12:2181,10.100.2.13:2181`。 +- `--store-config`:你的 `HStore` 配置文件的路径。应该与启动 `HStore` 集群 `--config-path` 参数的值一致。 +- `--store-admin-host`:`HStore Admin Server` 节点的 IP 地址。 +- `--server-id`:你应该为每个服务器实例设置一个的**唯一标识符** + +你可以以同样的方式在不同的节点上启动多个服务器实例。 diff --git a/docs/zh/v0.17.0/deploy/deploy-helm.md b/docs/zh/v0.17.0/deploy/deploy-helm.md new file mode 100644 index 0000000..e6a4505 --- /dev/null +++ b/docs/zh/v0.17.0/deploy/deploy-helm.md @@ -0,0 +1,94 @@ +# 在 Kubernetes 上通过 helm 部署 + +本文档描述了如何使用我们提供的 helm chart 来运行 HStreamDB kubernetes。该文档假设读者 +有基本的 kubernetes 知识。在本节结束时,你将拥有一个完全运行在 kubernetes 上的 +HStreamDB 集群,它已经准备就绪,可以接收读/写,处理数据,等等。 + +## 建立你的 Kubernetes 集群 + +第一步是要有一个正在运行的 kubernetes 集群。你可以使用一个托管的集群(由你的云提 +供商提供),一个自我托管的集群或一个本地的 kubernetes 集群,比如 minikube。请确 +保 kubectl 指向你计划使用的任何集群。 + +另外,你需要一个存储类,你可以通过 "kubectl "创建。或者通过你的云服务提供商的网页来创建,如果它有的话。 +minikube 默认提供一个名为 `standard` 的存储类,helm chart 默认使用此存储类。 + +## 通过 Helm 部署 HStreamDB + +### Clone 代码并获取 helm 依赖 + +```sh +git clone https://github.com/hstreamdb/hstream.git +cd hstream/deploy/chart/hstream/ +helm dependency build . +``` + +### 通过 Helm 部署 HStreamDB + +```sh +helm install my-hstream . 
+``` + +Helm chart 还提供了 `value.yaml` 文件,你可以在这个文件中修改你的配置,比如当你希望使用其他的存储类来部署集群时,你可以修改 `value.yaml` 中的 `logdevice.persistence.storageClass` 和 `zookeeper.persistence.storageClass`,并使用 `helm install my-hstream -f values.yaml .` 来署。 + +### 检查集群状态 + +`helm install` 命令会部署 zookeeper 集群、logdevice 集群和 hstream 集群,这可能会花费一定的时间,你可以通过 `kubectl get pods` 来检查集群的状况,在集群部署的过程中,会有一些 `Error` 和 `CrashLoopBackOff` 的状态,这些状态会在一定时间后消失,最终你会看到类似如下的内容: + +``` +NAME READY STATUS RESTARTS AGE +my-hstream-0 1/1 Running 3 (16h ago) 16h +my-hstream-1 1/1 Running 2 (16h ago) 16h +my-hstream-2 1/1 Running 0 16h +my-hstream-logdevice-0 1/1 Running 3 (16h ago) 16h +my-hstream-logdevice-1 1/1 Running 3 (16h ago) 16h +my-hstream-logdevice-2 1/1 Running 0 16h +my-hstream-logdevice-3 1/1 Running 0 16h +my-hstream-logdevice-admin-server-6867fd9494-bk5mf 1/1 Running 3 (16h ago) 16h +my-hstream-zookeeper-0 1/1 Running 0 16h +my-hstream-zookeeper-1 1/1 Running 0 16h +my-hstream-zookeeper-2 1/1 Running 0 16h +``` + +你可以通过 `hadmin server` 命令来检查 HStreamDB 集群的状态。 + +```sh +kubectl exec -it hstream-1 -- bash -c "hadmin server status" +``` +``` ++---------+---------+------------------+ +| node_id | state | address | ++---------+---------+------------------+ +| 100 | Running | 172.17.0.4:6570 | +| 101 | Running | 172.17.0.10:6570 | +| 102 | Running | 172.17.0.12:6570 | ++---------+---------+------------------+ +``` + +## 管理存储集群 + +现在你可以运行 `hadmin store` **来管理这个集群**, +例如要检查 store 集群的状态,你可以运行: + +```sh +kubectl exec -it my-hstream-0 -- bash -c "hadmin store --host my-hstream-logdevice-admin-server status" +``` +``` ++----+------------------------+----------+-------+--------------+----------+ +| ID | NAME | PACKAGE | STATE | UPTIME | LOCATION | ++----+------------------------+----------+-------+--------------+----------+ +| 0 | my-hstream-logdevice-0 | 99.99.99 | ALIVE | 16 hours ago | | +| 1 | my-hstream-logdevice-1 | 99.99.99 | DEAD | 16 hours ago | | +| 2 | my-hstream-logdevice-2 | 99.99.99 | DEAD | 16 hours ago | | +| 3 | my-hstream-logdevice-3 | 99.99.99 | DEAD | 16 hours ago | | ++----+------------------------+----------+-------+--------------+----------+ ++---------+-------------+---------------+------------+---------------+ +| SEQ. | DATA HEALTH | STORAGE STATE | SHARD OP. | HEALTH STATUS | ++---------+-------------+---------------+------------+---------------+ +| ENABLED | HEALTHY(1) | READ_WRITE(1) | ENABLED(1) | HEALTHY | +| ENABLED | HEALTHY(1) | READ_WRITE(1) | ENABLED(1) | HEALTHY | +| ENABLED | HEALTHY(1) | READ_WRITE(1) | ENABLED(1) | HEALTHY | +| ENABLED | HEALTHY(1) | READ_WRITE(1) | ENABLED(1) | HEALTHY | ++---------+-------------+---------------+------------+---------------+ +Took 16.727s +``` diff --git a/docs/zh/v0.17.0/deploy/deploy-k8s.md b/docs/zh/v0.17.0/deploy/deploy-k8s.md new file mode 100644 index 0000000..3b6e9c0 --- /dev/null +++ b/docs/zh/v0.17.0/deploy/deploy-k8s.md @@ -0,0 +1,221 @@ +# 在 Kubernetes 上运行 + +本文档描述了如何使用我们提供的 specs 来运行 HStreamDB kubernetes。该文档假设读者 +有基本的 kubernetes 知识。在本节结束时,你将拥有一个完全运行在 kubernetes 上的 +HStreamDB 集群,它已经准备就绪,可以接收读/写,处理数据,等等。 + +## 建立你的 Kubernetes 集群 + +第一步是要有一个正在运行的 kubernetes 集群。你可以使用一个托管的集群(由你的云提 +供商提供),一个自我托管的集群或一个本地的 kubernetes 集群,比如 minikube。请确 +保 kubectl 指向你计划使用的任何集群。 + +另外,你需要一个名为 "hstream-store "的存储类,你可以通过 "kubectl "创建。或者通 +过你的云服务提供商的网页来创建,如果它有的话。 + +::: tip + +对于使用 minikube 的用户, 你可以用默认的存储类 `standard`. 
+ +::: + +## 安装 Zookeeper + +HStreamDB 依赖于 Zookeeper 来存储查询信息和一些内部的存储配置,所以我们需要提供 +一个 Zookeeper 集群,以便 HStreamDB 能够访问。 + +在这个演示中,我们将使用[helm](https://helm.sh/)(一个用于 kubernetes 的软件包管 +理器)来安装 zookeeper。安装完 helm 后,运行: + +```sh +helm repo add bitnami https://charts.bitnami.com/bitnami +helm repo update + +helm install zookeeper bitnami/zookeeper \ + --set image.tag=3.6 \ + --set replicaCount=3 \ + --set persistence.storageClass=hstream-store \ + --set persistence.size=20Gi +``` + +``` +NAME: zookeeper +LAST DEPLOYED: Tue Jul 6 10:51:37 2021 +NAMESPACE: test +STATUS: deployed +REVISION: 1 +TEST SUITE: None +NOTES: +** Please be patient while the chart is being deployed ** + +ZooKeeper can be accessed via port 2181 on the following DNS name from within your cluster: + + zookeeper.svc.cluster.local + +To connect to your ZooKeeper server run the following commands: + + export POD_NAME=$(kubectl get pods -l "app.kubernetes.io/name=zookeeper,app.kubernetes.io/instance=zookeeper,app.kubernetes.io/component=zookeeper" -o jsonpath="{.items[0].metadata.name}") + kubectl exec -it $POD_NAME -- zkCli.sh + +To connect to your ZooKeeper server from outside the cluster execute the following commands: + + kubectl port-forward svc/zookeeper 2181:2181 & + zkCli.sh 127.0.0.1:2181 +WARNING: Rolling tag detected (bitnami/zookeeper:3.6), please note that it is strongly recommended to avoid using rolling tags in a production environment. ++info https://docs.bitnami.com/containers/how-to/understand-rolling-tags-containers/ +``` + +这将默认安装一个 3 个节点的 zookeeper 集合。等到所有的节点标记为 Ready。 + +```sh +kubectl get pods +``` + +``` +NAME READY STATUS RESTARTS AGE +zookeeper-0 1/1 Running 0 22h +zookeeper-1 1/1 Running 0 4d22h +zookeeper-2 1/1 Running 0 16m +``` + +## 配置和启动 HStreamDB + +一旦所有的 zookeeper pods 都准备好了,我们就可以开始安装 HStreamDB 集群。 + +### 拿到 k8s spec + +```sh +git clone git@github.com:hstreamdb/hstream.git +cd hstream/deploy/k8s +``` + +### 更新配置 + +如果你使用了不同的方式来安装 zookeeper,请确保更新存储配置文件 `config.json` 中 +的 zookeeper 连接字符串和服务器配置文件`hstream-server.yaml`。 + +它应该看起来像这样: + +```sh +cat config.json | grep -A 2 zookeeper +``` +``` + "zookeeper": { + "zookeeper_uri": "ip://zookeeper-0.zookeeper-headless:2181,zookeeper-1.zookeeper-headless:2181,zookeeper-2.zookeeper-headless:2181", + "timeout": "30s" + } +``` + +```sh +cat hstream-server.yaml | grep -A 1 metastore-uri +``` +``` +- "--metastore-uri" +- "zk://zookeeper-0.zookeeper-headless:2181,zookeeper-1.zookeeper-headless:2181,zookeeper-2.zookeeper-headless:2181" +``` + +::: tip + +Storage 配置文件和服务文件中的 zookeeper 连接字符串可以是不同的。但对于正常情况 +下,它们是一样的。 + +::: + +在默认情况下。本规范安装了一个 3 个节点的 HStream 服务器集群和 4 个节点的存储集 +群。如果你想要一个更大的集群,修改 `hstream-server.yaml` 和 +`logdevice-statefulset.yaml` 文件,并将复制的数量增加到你想要的集群中的节点数。 +另外,默认情况下,我们给节点附加一个 40GB 的持久性存储,如果你想要更多,你可以在 +volumeClaimTemplates 部分进行修改。 + +### 启动集群 + +```sh +kubectl apply -k . 
+``` + +当你运行`kubectl get pods`时,你应该看到类似如下: + +``` +NAME READY STATUS RESTARTS AGE +hstream-server-0 1/1 Running 0 6d18h +hstream-server-1 1/1 Running 0 6d18h +hstream-server-2 1/1 Running 0 6d18h +logdevice-0 1/1 Running 0 6d18h +logdevice-1 1/1 Running 0 6d18h +logdevice-2 1/1 Running 0 6d18h +logdevice-3 1/1 Running 0 6d18h +logdevice-admin-server-deployment-5c5fb9f8fb-27jlk 1/1 Running 0 6d18h +zookeeper-0 1/1 Running 0 6d22h +zookeeper-1 1/1 Running 0 10d +zookeeper-2 1/1 Running 0 6d +``` + +### Bootstrap 集群 + +一旦所有的 logdevice pods 运行并准备就绪,你将需要 Bootstrap 集群以启用所有的存 +储节点。要做到这一点,请运行: + +```sh-vue +kubectl run hstream-admin -it --rm --restart=Never --image=hstreamdb/hstream:{{ $version() }} -- \ + hadmin store --host logdevice-admin-server-service \ + nodes-config \ + bootstrap --metadata-replicate-across 'node:3' +``` + +这将启动一个 hstream-admin pod,它连接到管理服务器并调用 +`nodes-config bootstrap` hadmin store 命令,并将集群的元数据复制属性设置为跨三个 +不同的节点进行复制。 + +成功后,你应该看到类似如下: + +```txt +Successfully bootstrapped the cluster +pod "hstream-admin" deleted +``` + +现在,你可以 bootstrap server 节点: + +```sh-vue +kubectl run hstream-admin -it --rm --restart=Never --image=hstreamdb/hstream:{{ $version() }} -- \ + hadmin server --host hstream-server-0.hstream-server init +``` + +成功后,你应该看到类似如下: + +```txt +Cluster is ready! +pod "hstream-admin" deleted +``` + +注意:取决于硬件条件,存储节点可能没有及时准备就绪,所以运行 `hadmin init` 可能 +会返回失败。这时需要等待几秒,再次运行即可。 + +## 管理存储集群 + +```sh-vue +kubectl run hstream-admin -it --rm --restart=Never --image=hstreamdb/hstream:{{ $version() }} -- bash +``` + +现在你可以运行 `hadmin store` 来管理这个集群: + +```sh +hadmin store --help +``` + +要检查集群的状态,你可以运行: + +```sh +hadmin store --host logdevice-admin-server-service status +``` + +```txt ++----+-------------+-------+---------------+ +| ID | NAME | STATE | HEALTH STATUS | ++----+-------------+-------+---------------+ +| 0 | logdevice-0 | ALIVE | HEALTHY | +| 1 | logdevice-1 | ALIVE | HEALTHY | +| 2 | logdevice-2 | ALIVE | HEALTHY | +| 3 | logdevice-3 | ALIVE | HEALTHY | ++----+-------------+-------+---------------+ +Took 2.567s +``` diff --git a/docs/zh/v0.17.0/deploy/quick-deploy-ssh.md b/docs/zh/v0.17.0/deploy/quick-deploy-ssh.md new file mode 100644 index 0000000..75a6557 --- /dev/null +++ b/docs/zh/v0.17.0/deploy/quick-deploy-ssh.md @@ -0,0 +1,460 @@ +# Deployment with hdt + +This document provides a way to start an HStreamDB cluster quickly using the deployment tool `hdt`. + +## Pre-Require + +- The local host needs to be able to connect to the remote server via SSH + +- Make sure remote server has docker installed. + +- Make sure that the logged-in user has `sudo` execute privileges and configure `sudo` to run without prompting for a password. + +- For nodes that deploy `HStore` instances, mount the data disks to `/mnt/data*`, where `*` matches an incremental number starting from zero. + - Each disk should be mounted to a separate directory. For example, if there are two data disks, `/dev/vdb` and `/dev/vdc`, then `/dev/vdb` should be mounted to `/mnt/data0`, and `/dev/vdc` should be mounted to `/mnt/data1`. + +## Deploy `hdt` on the control machine + +We'll use a deployment tool `hdt` to help us set up the cluster. The binaries are available here: https://github.com/hstreamdb/deployment-tool/releases. + +1. Log in to the control machine and download the binaries. + +2. 
Generate configuration template with command: + + ```shell + ./hdt init + ``` + + The current directory structure will be as follows after running the `init` command: + + ```markdown + ├── hdt + └── template + ├── config.yaml + ├── logdevice.conf + ├── alertmanager + | └── alertmanager.yml + ├── grafana + │   ├── dashboards + │   └── datasources + ├── prometheus + ├── hstream_console + ├── filebeat + ├── kibana + │   └── export.ndjson + └── script + ``` + +## Update `Config.yaml` + +`template/config.yaml` contains the template for the configuration file. Refer to the description of the fields in the file and modify the template according to your actual needs. + +As a simple example, we will be deploying a cluster on three nodes, each consisting of an HServer instance, an HStore instance, and a Meta-Store instance. In addition, we will deploy HStream Console, Prometheus, and HStream Exporter on another node. For hstream monitor stack, refer to [monitor components config](./quick-deploy-ssh.md#monitor-stack-components). + +The final configuration file may looks like: + +```yaml +global: + user: "root" + key_path: "~/.ssh/id_rsa" + ssh_port: 22 + +hserver: + - host: 172.24.47.175 + - host: 172.24.47.174 + - host: 172.24.47.173 + +hstore: + - host: 172.24.47.175 + enable_admin: true + - host: 172.24.47.174 + - host: 172.24.47.173 + +meta_store: + - host: 172.24.47.175 + - host: 172.24.47.174 + - host: 172.24.47.173 + +hstream_console: + - host: 172.24.47.172 + +prometheus: + - host: 172.24.47.172 + +hstream_exporter: + - host: 172.24.47.172 +``` + +## Set up cluster + +### set up cluster with ssh key-value pair + +```shell +./hdt start -c template/config.yaml -i ~/.ssh/id_rsa -u root +``` + +### set up cluster with passwd + +```shell +./hdt start -c template/config.yaml -p -u root +``` + +then type your password. + +use `./hdt start -h` for more information + +## Remove cluster + +remove cluster will stop cluster and remove ***ALL*** related data. + +### remove cluster with ssh key-value pair + +```shell +./hdt remove -c template/config.yaml -i ~/.ssh/id_rsa -u root +``` + + ### remove cluster with passwd + +```shell +./hdt remove -c template/config.yaml -p -u root +``` + +then type your password. + +## Detailed configuration items + +This section describes the meaning of each field in the configuration file in detail. The configuration file is divided into three main sections: global configuration items, monitoring component configuration items, and other component configuration items. + +### Global + +```yaml +global: + # # Username to login via SSH + user: "root" + # # The path of SSH identity file + key_path: "~/.ssh/hstream-aliyun.pem" + # # SSH service monitor port + ssh_port: 22 + # # Replication factors of store metadata + meta_replica: 1 + # # Local path to MetaStore config file + meta_store_config_path: "" + # # Local path to HStore config file + hstore_config_path: "" + # # HStore config file can be loaded from network filesystem, for example, the config file + # # can be stored in meta store and loaded via network request. Set this option to true will + # # force store load config file from its local filesystem. 
+ disable_store_network_config_path: true + # # Local path to HServer config file + hserver_config_path: "" + # # use grpc-haskell framework + enable_grpc_haskell: false + # # Local path to ElasticSearch config file + elastic_search_config_path: "" + # # Only enable for linux kernel which support dscp reflection(linux kernel version + # # greater and equal than 4.x) + enable_dscp_reflection: false + # # Global container configuration + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + +The Global section is used to set the default configuration values for all other configuration items. + +- `meta_replica` set the replication factors of HStreamDB metadata logs. This value should not exceed the number of `hstore` instances. +- `meta_store_config_path`、`hstore_config_path` and `hserver_config_path` are configuration file path for `meta_store`、`hstore` and `hserver` in the control machine. If the paths are set, these configuration files will be synchronized to the specified location on the node where the respective instance is located, and the corresponding configuration items will be updated when the instance is started. +- `enable_grpc_haskell`: use `grpc-haskell` framework. The default value is false, which will use `hs-grpc` framework. +- `enable_dscp_reflection`: if your operation system version is greater and equal to linux 4.x, you can set this field to true. +- `container_config` let you set resource limitations for all containers. + +### Monitor + +```yaml +monitor: + # # Node exporter port + node_exporter_port: 9100 + # # Node exporter image + node_exporter_image: "prom/node-exporter" + # # Cadvisor port + cadvisor_port: 7000 + # # Cadvisor image + cadvisor_image: "gcr.io/cadvisor/cadvisor:v0.39.3" + # # List of nodes that won't be monitored. + excluded_hosts: [] + # # root directory for all monitor related config files. + remote_config_path: "/home/deploy/monitor" + # # root directory for all monitor related data files. + data_dir: "/home/deploy/data/monitor" + # # Set up grafana without login + grafana_disable_login: true + # # Global container configuration for monitor stacks. + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + +The Monitor section is used to specify the configuration options for the `cadvisor` and `node-exporter`. + +### HServer + +```yaml +hserver: + # # The ip address of the HServer + - host: 10.1.0.10 + # # HServer docker image + image: "hstreamdb/hstream" + # # The listener is an adderss that a server advertises to its clients so they can connect to the server. + # # Each listener is specified as "listener_name:hstream://host_name:port_number". The listener_name is + # # a name that identifies the listener, and the "host_name" and "port_number" are the IP address and + # # port number that reachable from the client's network. Multi listener will split by comma. 
+ # # For example: public_ip:hstream://39.101.190.70:6582 + advertised_listener: "" + # # HServer listen port + port: 6570 + # # HServer internal port + internal_port: 6571 + # # HServer configuration + server_config: + # # HServer log level, valid values: [critical|error|warning|notify|info|debug] + server_log_level: info + # # HStore log level, valid values: [critical|error|warning|notify|info|debug|spew] + store_log_level: info + # # Specific server compression algorithm, valid values: [none|lz4|lz4hc] + compression: lz4 + # # Root directory of HServer config files + remote_config_path: "/home/deploy/hserver" + # # Root directory of HServer data files + data_dir: "/home/deploy/data/hserver" + # # HServer container configuration + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + +The HServer section is used to specify the configuration options for the `hserver` instance. + +### HAdmin + +```yaml +hadmin: + - host: 10.1.0.10 + # # HAdmin docker image + image: "hstreamdb/hstream" + # # HAdmin listen port + admin_port: 6440 + # # Root directory of HStore config files + remote_config_path: "/home/deploy/hadmin" + # # Root directory of HStore data files + data_dir: "/home/deploy/data/hadmin" + # # HStore container configuration + container_config: + cpu_limit: 2.00 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + +The HAdmin section is used to specify the configuration options for the `hadmin` instance. + +- Hadmin is not a necessary component. You can configure `hstore` instance to take on the functionality of `hadmin` by setting the configuration option `enable_admin: true` within the hstore. + +- If you have both a HAdmin instance and a HStore instance running on the same node, please note that they cannot both use the same `admin_port` for monitoring purposes. To avoid conflicts, you will need to assign a unique `admin_port` value to each instance. + +### HStore + +```yaml +hstore: + - host: 10.1.0.10 + # # HStore docker image + image: "hstreamdb/hstream" + # # HStore admin port + admin_port: 6440 + # # Root directory of HStore config files + remote_config_path: "/home/deploy/hstore" + # # Root directory of HStore data files + data_dir: "/home/deploy/data/store" + # # Total used disks + disk: 1 + # # Total shards + shards: 2 + # # The role of the HStore instance. + role: "Both" # [Storage|Sequencer|Both] + # # When Enable_admin is turned on, the instance can receive and process admin requests + enable_admin: true + # # HStore container configuration + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + +The HStore section is used to specify the configuration options for the `hstore` instance. + +- `admin_port`: HStore service will listen on this port. +- `disk` and `shards`: Set total used disks and total shards. For example, `disk: 2` and `shards: 4` means the hstore will persistant data in two disks, and each disk will contain 2 shards. +- `role`: a HStore instance can act as a Storage, a Sequencer or both, default is both. +- `enable_admin`: If the 'true' value is assigned to this setting, the current hstore instance will be able to perform the same functions as hadmin. + +### Meta-store + +```yaml +meta_store: + - host: 10.1.0.10 + # # Meta-store docker image + image: "zookeeper:3.6" + # # Meta-store port, currently only works for rqlite. 
zk will + # # monitor on 4001 + port: 4001 + # # Raft port used by rqlite + raft_port: 4002 + # # Root directory of Meta-Store config files + remote_config_path: "/home/deploy/metastore" + # # Root directory of Meta-store data files + data_dir: "/home/deploy/data/metastore" + # # Meta-store container configuration + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + +The Meta-store section is used to specify the configuration options for the `meta-store` instance. + +- `port` and `raft_port`: these are used by `rqlite` + +### Monitor stack components + +```yaml +prometheus: + - host: 10.1.0.15 + # # Prometheus docker image + image: "prom/prometheus" + # # Prometheus service monitor port + port: 9090 + # # Root directory of Prometheus config files + remote_config_path: "/home/deploy/prometheus" + # # Root directory of Prometheus data files + data_dir: "/home/deploy/data/prometheus" + # # Prometheus container configuration + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true + +grafana: + - host: 10.1.0.15 + # # Grafana docker image + image: "grafana/grafana-oss:main" + # # Grafana service monitor port + port: 3000 + # # Root directory of Grafana config files + remote_config_path: "/home/deploy/grafana" + # # Root directory of Grafana data files + data_dir: "/home/deploy/data/grafana" + # # Grafana container configuration + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true + +alertmanager: + # # The ip address of the Alertmanager Server. + - host: 10.0.1.15 + # # Alertmanager docker image + image: "prom/alertmanager" + # # Alertmanager service monitor port + port: 9093 + # # Root directory of Alertmanager config files + remote_config_path: "/home/deploy/alertmanager" + # # Root directory of Alertmanager data files + data_dir: "/home/deploy/data/alertmanager" + # # Alertmanager container configuration + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true + +hstream_exporter: + - host: 10.1.0.15 + # # hstream_exporter docker image + image: "hstreamdb/hstream-exporter" + # # hstream_exporter service monitor port + port: 9250 + # # Root directory of hstream_exporter config files + remote_config_path: "/home/deploy/hstream-exporter" + # # Root directory of hstream_exporter data files + data_dir: "/home/deploy/data/hstream-exporter" + container_config: + cpu_limit: 200 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + +Currently, HStreamDB monitor stack contains the following components:`node-exporter`, `cadvisor`, `hstream-exporter`, `grafana`, `alertmanager` and `hstream-exporter`. The global configuration of the monitor stack is available in [monitor](./quick-deploy-ssh.md#monitor) field. 
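
After the monitor stack is deployed, it can be useful to confirm that the configured endpoints are reachable from the control machine. The following is only a sketch: it assumes the example host `10.1.0.15` and the default ports shown in the configuration above, and it assumes that `hstream-exporter` exposes the standard Prometheus `/metrics` path.

```shell
# Prometheus readiness endpoint (port 9090 in the example config above)
curl -s http://10.1.0.15:9090/-/ready

# Grafana health endpoint (port 3000 in the example config above)
curl -s http://10.1.0.15:3000/api/health

# hstream-exporter metrics endpoint (port 9250 in the example config above)
curl -s http://10.1.0.15:9250/metrics | head
```
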
+ +### Elasticsearch, Kibana and Filebeat + +```yaml +elasticsearch: + - host: 10.1.0.15 + # # Elasticsearch service monitor port + port: 9200 + # # Elasticsearch docker image + image: "docker.elastic.co/elasticsearch/elasticsearch:8.5.0" + # # Root directory of Elasticsearch config files + remote_config_path: "/home/deploy/elasticsearch" + # # Root directory of Elasticsearch data files + data_dir: "/home/deploy/data/elasticsearch" + # # Elasticsearch container configuration + container_config: + cpu_limit: 2.00 + memory_limit: 8G + disable_restart: true + remove_when_exit: true + +kibana: + - host: 10.1.0.15 + # # Kibana service monitor port + port: 5601 + # # Kibana docker image + image: "docker.elastic.co/kibana/kibana:8.5.0" + # # Root directory of Kibana config files + remote_config_path: "/home/deploy/kibana" + # # Root directory of Kibana data files + data_dir: "/home/deploy/data/kibana" + # # Kibana container configuration + container_config: + cpu_limit: 2.00 + memory_limit: 8G + disable_restart: true + remove_when_exit: true + +filebeat: + - host: 10.1.0.10 + # # Filebeat docker image + image: "docker.elastic.co/beats/filebeat:8.5.0" + # # Root directory of Filebeat config files + remote_config_path: "/home/deploy/filebeat" + # # Root directory of Filebeat data files + data_dir: "/home/deploy/data/filebeat" + # # Filebeat container configuration + container_config: + cpu_limit: 2.00 + memory_limit: 8G + disable_restart: true + remove_when_exit: true +``` + diff --git a/docs/zh/v0.17.0/index.md b/docs/zh/v0.17.0/index.md new file mode 100644 index 0000000..f1ab673 --- /dev/null +++ b/docs/zh/v0.17.0/index.md @@ -0,0 +1,76 @@ + +# Introduction to HStreamDB + +## Overview + +**HStreamDB is a streaming database designed for streaming data, with complete +lifecycle management for accessing, storing, processing, and distributing +large-scale real-time data streams**. It uses standard SQL (and its stream +extensions) as the primary interface language, with real-time as the main +feature, and aims to simplify the operation and management of data streams and +the development of real-time applications. + +## Why HStreamDB? + +Nowadays, data is continuously being generated from various sources, e.g. sensor +data from the IoT, user-clicking events on the Internet, etc.. We want to build +low-latency applications that respond quickly to these incoming streaming data +to provide a better user experience, real-time data insights and timely business +decisions. + +However, currently, it is not easy to build such stream processing applications. +To construct a basic stream processing architecture, we always need to combine +multiple independent components. For example, you would need at least a +streaming data capture subsystem, a message/event storage component, a stream +processing engine, and multiple derived data systems for different queries. + +None of these should be so complicated, and this is where HStreamDB comes into +play. Just as you can easily build a simple CRUD application based on a +traditional database, with HStreamDB, you can easily build a basic streaming +application without any other dependencies. + +## Key Features + +### Reliable, low-latency streaming data storage + +With an optimized storage engine design, HStreamDB provides low latency persistent storage of streaming data and replicates written data to multiple storage nodes to ensure data reliability. 
+ +It also supports hierarchical data storage and can automatically dump historical data to lower-cost storage services such as object storage, distributed file storage, etc. The storage capacity is infinitely scalable, enabling permanent storage of data. + +### Easy support and management of large scale data streams + +HStreamDB uses a stream-native design where data is organized and accessed as streams, supporting creating and managing large data streams. Stream creation is a very lightweight operation in HStreamDB, maintaining stable read and write latency despite large numbers of streams being read and written concurrently. + +The performance of HStreamDB streams is excellent thanks to its native design, supporting millions of streams in a single cluster. + +### Real-time, orderly data subscription delivery + +HStreamDB is based on the classic publish-subscribe model, providing low-latency data subscription delivery for data consumption and the ability to deliver data subscriptions in the event of cluster failures and errors. + +It also guarantees the orderly delivery of machines in the event of cluster failures and errors. + +### Powerful stream processing support built-in + +HStreamDB has designed a complete processing solution based on event time. It supports basic filtering and conversion operations, aggregations by key, calculations based on various time windows, joining between data streams, and processing disordered and late messages to ensure the accuracy of calculation results. Simultaneously, the stream processing solution of HStream is highly extensible, and users can extend the interface according to their own needs. + +### Real-time analysis based on materialized views + +HStreamDB will offer materialized view to support complex query and analysis operations on continuously updated data streams. The incremental computing engine updates the materialized view instantly according to the changes of data streams, and users can query the materialized view through SQL statements to get real-time data insights. + +### Easy integration with multiple external systems + +The stream-native design of HStreamDB and the powerful stream processing capabilities built-in make it ideally suited as a data hub for the enterprise, responsible for all data access and flow, connecting multiple upstream and downstream services and data systems. + +For this reason, HStreamDB also provides Connector components for interfacing with various external systems, such as MySQL, ClickHouse, etc., making it easy to integrate with external data systems. + +### Cloud-native architecture, unlimited horizontal scaling + +HStreamDB is built with a Cloud-Native architecture, where the compute and storage layers are separated and can be horizontally scaled independently. + +It also supports online cluster scaling, dynamic expansion and contraction, and is efficient in scaling without data repartitioning, mass copying, etc. + +### Fault tolerance and high availability + +HStreamDB has built-in automatic node failure detection and error recovery mechanisms to ensure high availability while using an optimized consistency model based on Paxos. + +Data is always securely replicated to multiple nodes, ensuring consistency and orderly delivery even in errors and failures. 
diff --git a/docs/zh/v0.17.0/ingest-and-distribute/_index.md b/docs/zh/v0.17.0/ingest-and-distribute/_index.md new file mode 100644 index 0000000..e8eb26c --- /dev/null +++ b/docs/zh/v0.17.0/ingest-and-distribute/_index.md @@ -0,0 +1,6 @@ +--- +order: ["overview.md", "user_guides.md", "connectors.md"] +collapsed: false +--- + +Ingest and Distribute data diff --git a/docs/zh/v0.17.0/ingest-and-distribute/connectors.md b/docs/zh/v0.17.0/ingest-and-distribute/connectors.md new file mode 100644 index 0000000..869c906 --- /dev/null +++ b/docs/zh/v0.17.0/ingest-and-distribute/connectors.md @@ -0,0 +1,20 @@ +# Connectors + +Sources: + +| Name | Configuration | Image | +| ----------------- | --------------------------------------------------------------------------------------------------------------- | -------------------------------------------- | +| source-mysql | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/source_mysql_spec.md) | hstreamdb/connector:source-mysql:latest | +| source-postgresql | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/source_postgresql_spec.md) | hstreamdb/connector:source-postgresql:latest | +| source-sqlserver | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/source_sqlserver_spec.md) | hstreamdb/connector:source-sqlserver:latest | +| source-mongodb | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/source_mongodb_spec.md) | hstreamdb/connector:source-mongodb:latest | +| source-generator | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/source_generator_spec.md) | hstreamdb/connector:source-generator:latest | + +Sinks: + +| Name | Configuration | Image | +| --------------- | ------------------------------------------------------------------------------------------------------------- | ------------------------------------------ | +| sink-mysql | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/sink_mysql_spec.md) | hstreamdb/connector:sink-mysql:latest | +| sink-postgresql | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/sink_postgresql_spec.md) | hstreamdb/connector:sink-postgresql:latest | +| sink-mongodb | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/sink_mongodb_spec.md) | hstreamdb/connector:sink-mongodb:latest | +| sink-blackhole | [configuration](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/sink_blackhole_spec.md) | hstreamdb/connector:sink-blackhole:latest | diff --git a/docs/zh/v0.17.0/ingest-and-distribute/overview.md b/docs/zh/v0.17.0/ingest-and-distribute/overview.md new file mode 100644 index 0000000..42d5a03 --- /dev/null +++ b/docs/zh/v0.17.0/ingest-and-distribute/overview.md @@ -0,0 +1,121 @@ +# HStream IO Overview + +HStream IO is an internal data integration framework for HStreamDB, composed of connectors and IO runtime. +It allows interconnection with various external systems, +facilitating the efficient flow of data across the enterprise data stack and thereby unleashing the value of real-time-ness. + +## Motivation + +HStreamDB is a streaming database, +we want to build a reliable data integration framework to connect HStreamDB with external systems easily, +we also want to use HStreamDB to build a real-time data synchronization service (e.g. synchronizes data from MySQL to PostgreSQL). 
+ +Here are our goals for HStream IO: + +* easy to use +* scalability +* fault-tolerance +* extensibility +* streaming and batch +* delivery semantics + +HStream IO is highly inspired by Kafka Connect, Pulsar IO, Airbyte, etc. frameworks, +we will introduce the architecture and workflow of HStream IO, +and compare it with other frameworks to describe how HStream IO achieves the goals listed above. + +## Architect and Workflow + +HStream IO consists of two components: + +* IO Runtime: IO Runtime is a part of HStreamDB managing and empowering scalability, fault-tolerance, and load-balancing for connectors. +* Connectors: Connectors are used to synchronize data between HStreamDB and external systems. + +HStream IO provides two types of connectors: +* Source Connector - A source connector subscribes to data from other systems such as MySQL, and PostgreSQL, making the data available for data processing in HStreamDB. +* Sink Connector - A sink connector writes data to other systems from HStreamDB streams. + +For a clear understanding, +we would name a running connector process to be a task and the Docker image for the connector is a connector plugin. + +Here is a summary workflow of creating a source connector: + +1. Users can send a CREATE SOURCE CONNECTOR SQL to HStreamDB to create a connector +2. HStreamDB dispatches the request to a correct node +3. HStream IO Runtime handles the request to launch a connector task +4. the connector task will fetch data from source systems and store them in HStreamDB. + +## Design and Implement + +### Easy to use + +HStream IO is a part of HStreamDB, +so if you want to create a connector, +do not need to deploy an HStream IO cluster like Kafka Connect, +just send a SQL to HStreamDB, e.g.: + +``` +create source connector source01 from mysql with + ( "host" = "mysql-s1" + , "port" = 3306 + , "user" = "root" + , "password" = "password" + , "database" = "d1" + , "table" = "person" + , "stream" = "stream01" + ); +``` + +### Scalability, Availability, and Delivery Semantics + +Connectors are resources for HStreamDB Cluster, +HStreamDB Cluster provides high scalability and fault-tolerance for HStream IO, +for more details, please check HStreamDB docs. + +Users can manually create multiple connectors for sources or streams to use parallel synchronization to achieve better performance, +we will support a connector scheduler for dynamical parallel synchronization like Kafka Connect and Pulsar IO soon. + +When a connector is running, the offsets of the connector will be recorded in HStreamDB, +so if the connector failed unexpectedly, +HStream IO Runtime will detect the failure and recover it by recent offsets, +even if the node crashed, +HStreamDB cluster will rebalance the connectors on the node to other nodes and recover them. + +HStream IO supported at-least-once delivery semantics now, +we will support more delivery semantics(e.g. exactly-once delivery semantic) for some connectors later. + +### Streaming and Batch + +Many ELT frameworks like Airbyte are designed for batch systems, +they can not handle streaming data efficiently, +HStreamDB is a streaming database, +and a lot of streaming data need to be loaded into HStreamDB, +so HStream IO is designed to support both streaming and batch data, +and users can use it to build a real-time streaming data synchronization service. + +### Extensibility + +We want to establish a great ecosystem like Kafka Connect and Airbyte, +so an accessible connector API for deploying new connectors is necessary. 
+ +Kafka Connect design a java connector API, +you can not develop connectors in other languages easily, +Airbyte and Pulsar IO inspired us to build a connector plugin as a Docker image to support multiple languages +and design a protocol between HStream IO Runtime and connectors, +but it brings more challenges to simplify the connector API, +you can not implement a couple of Java interfaces to build a connector easily like Kafka Connect, +you have to care about how to build a Docker image, +handle command line arguments, +implement the protocol interfaces correctly, etc. + +So to avoid that we split the connector API into two parts: + +* HStream IO Protocol +* Connector Toolkit + +Compared with Airbyte's heavy protocol, +HStream IO Protocol is designed as simple as possible, +it provides basic management interfaces for launching and stopping connectors, +does not need to exchange record messages(it will bring more latencies), +the Connector Toolkit is designed to handle heaviest jobs(e.g. fetch data from source systems, write data into HStreamDB, recorded offsets, etc.) +to provide the simplest connector API, +so developers can use Connector Toolkit to implement new connectors easily like Kafka Connect. diff --git a/docs/zh/v0.17.0/ingest-and-distribute/user_guides.md b/docs/zh/v0.17.0/ingest-and-distribute/user_guides.md new file mode 100644 index 0000000..f9e5f5f --- /dev/null +++ b/docs/zh/v0.17.0/ingest-and-distribute/user_guides.md @@ -0,0 +1,146 @@ +# HStream IO User Guides + +Data synchronization service is used to synchronize data between systems(e.g. databases) in real time, +which is useful for many cases, for example, MySQL is a widely-used database, +if your application is running on MySQL, and: + +* You found its query performance is not enough. + + You want to migrate your data to another database (e.g. PostgreSQL), but you need to switch your application seamlessly. + + Your applications highly depended on MySQL, migrating is difficult, so you have to migrate gradually. + + You don't need to migrate the whole MySQL data, instead, just copy some data from MySQL to other databases(e.g. HStreamDB) for data analysis. +* You need to upgrade your MySQL version for some new features seamlessly. +* You need to back up your MySQL data in multiple regions in real time. + +In those cases, you will find a data synchronization service is helpful, +HStream IO is an internal data integration framework for HStreamDB, +and it can be used as a data synchronization service, +this document will show you how to use HStream IO to build a data synchronization service from a MySQL table to a PostgreSQL table, +you will learn: + +* How to create a source connector that synchronizes records from a MySQL table to an HStreamDB stream. +* How to create a sink connector that synchronizes records from an HStreamDB stream to a PostgreSQL table. +* How to use HStream SQLs to manage the connectors. + +## Set up an HStreamDB Cluster + +You can check +[quick-start](https://hstream.io/docs/en/latest/start/quickstart-with-docker.html) +to find how to set up an HStreamDB cluster and connect to it. + +## Set up a MySQL + +Set up a MySQL instance with docker: + +```shell +docker run --network=hstream-quickstart --name mysql-s1 -e MYSQL_ROOT_PASSWORD=password -d mysql +``` + +Here we use the `hstream-quickstart` network if you set up your HStreamDB +cluster based on +[quick-start](https://hstream.io/docs/en/latest/start/quickstart-with-docker.html). 
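
If the `hstream-quickstart` network does not already exist in your environment (for example, because the cluster was started in a different way), you may need to create it first. This is just a sketch using the standard Docker CLI; the network name is the one assumed throughout this guide:

```shell
# Create the user-defined bridge network shared by the HStreamDB and MySQL containers
docker network create hstream-quickstart
```
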
+ +Connect to the MySQL instance: + +```shell +docker exec -it mysql-s1 mysql -uroot -h127.0.0.1 -P3306 -ppassword +``` + +Create a database `d1`, a table `person` and insert some records: + +```sql +create database d1; +use d1; +create table person (id int primary key, name varchar(256), age int); +insert into person values (1, "John", 20), (2, "Jack", 21), (3, "Ken", 33); +``` + +the table `person` must include a primary key, or the `DELETE` statement may not +be synced correctly. + +## Set up a PostgreSQL + +Set up a PostgreSQL instance with docker: + +```shell +docker run --network=hstream-quickstart --name pg-s1 -e POSTGRES_PASSWORD=postgres -d postgres +``` + +Connect to the PostgreSQL instance: + +```shell +docker exec -it pg-s1 psql -h 127.0.0.1 -U postgres +``` + +`sink-postgresql` doesn't support the automatic creation of a table yet, so you +need to create the database `d1` and the table `person` first: + +```sql +create database d1; +\c d1; +create table person (id int primary key, name varchar(256), age int); +``` + +The table `person` must include a primary key. + +## Download Connector Plugins + +A connector plugin is a docker image, so before you can set up the connectors, +you should download and update their plugins with `docker pull`: + +```shell +docker pull hstreamdb/source-mysql:latest +docker pull hstreamdb/sink-postgresql:latest +``` + +[Here](https://hstream.io/docs/en/latest/io/connectors.html) is a table of all +available connectors. + +## Create Connectors + +After connecting an HStream Server, you can use create source/sink connector +SQLs to create connectors. + +Connect to the HStream server: + +```shell-vue +docker run -it --rm --network host hstreamdb/hstream:{{ $version() }} hstream sql --port 6570 +``` + +Create a source connector: + +```sql +create source connector source01 from mysql with ("host" = "mysql-s1", "port" = 3306, "user" = "root", "password" = "password", "database" = "d1", "table" = "person", "stream" = "stream01"); +``` + +The source connector will run an HStream IO task, which continually synchronizes +data from MySQL table `d1.person` to stream `stream01`. Whenever you update +records in MySQL, the change events will be recorded in stream `stream01` if the +connector is running. + +You can use `SHOW CONNECTORS` to check connectors and their status and use +`PAUSE` and `RESUME` to stop/restart connectors: + +```sql +PAUSE connector source01; +RESUME connector source01; +``` + +If resume the connector task, the task will not fetch data from the beginning, +instead, it will continue from the point when it was paused. + +Then you can create a sink connector that consumes the records from the stream +`stream01` to PostgreSQL table `d1.public.person`: + +```sql +create sink connector sink01 to postgresql with ("host" = "pg-s1", "port" = 5432, "user" = "postgres", "password" = "postgres", "database" = "d1", "table" = "person", "stream" = "stream01"); +``` + +With both `source01` and `sink01` connectors running, you get a synchronization +service from MySQL to PostgreSQL. 
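
Before cleaning up, you can verify that data actually flows end to end by writing a new row on the MySQL side and checking that it shows up on the PostgreSQL side. This is a sketch based on the container names used in this guide (`mysql-s1`, `pg-s1`); the row with id `4` is only an illustrative value:

```shell
# Insert a new row into the source table in MySQL
docker exec mysql-s1 mysql -uroot -ppassword -e 'insert into d1.person values (4, "Tom", 25);'

# After a short delay, the same row should appear in the PostgreSQL sink table
docker exec pg-s1 psql -U postgres -d d1 -c 'select * from person;'
```
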
+
+You can use the `DROP CONNECTOR` statement to delete the connectors:
+
+```sql
+drop connector source01;
+drop connector sink01;
+```
diff --git a/docs/zh/v0.17.0/overview/_index.md b/docs/zh/v0.17.0/overview/_index.md
new file mode 100644
index 0000000..602ec8e
--- /dev/null
+++ b/docs/zh/v0.17.0/overview/_index.md
@@ -0,0 +1,6 @@
+---
+order: ["concepts.md"]
+collapsed: false
+---
+
+概览
diff --git a/docs/zh/v0.17.0/overview/concepts.md b/docs/zh/v0.17.0/overview/concepts.md
new file mode 100644
index 0000000..1192788
--- /dev/null
+++ b/docs/zh/v0.17.0/overview/concepts.md
@@ -0,0 +1,41 @@
+# Concepts
+
+This page explains key concepts in HStream, which we recommend you understand before you start.
+
+## Record
+
+In HStream, a record is a unit of data that may contain arbitrary user data and is immutable. Each record is assigned a unique recordID in a stream. Additionally, a partition key is included in every record, represented as a string, and used to determine the stream shard where the record is stored.
+
+## Stream
+
+All records live in streams. A stream is essentially an unbounded, append-only dataset. A stream can contain multiple shards, and each shard can be located on a different node. A stream has some attributes, such as:
+- Replicas: how many replicas the data in the stream has
+- Retention period: how long the data in the stream is retained
+
+## Subscription
+
+Clients can obtain the latest data in streams in real time through subscriptions. A subscription can automatically track and save the progress of clients processing records: clients indicate that a record has been successfully received and processed by replying to the subscription with a corresponding ACK, and the subscription will not deliver records that have already been acked again. If the subscription has not received ACKs from clients after the specified time, it will redeliver those records.
+
+A subscription is immutable, which means you cannot reset its internal delivery progress. Multiple subscriptions can be created on the same stream, and they are independent of each other.
+
+Multiple clients can join the same subscription, and the system will distribute records across clients based on different subscription modes. Currently, the default subscription mode is shard-based, which means that records in a shard will be delivered to the same client, and different shards can be assigned to different clients.
+
+## Query
+
+Unlike queries in traditional databases that operate on static datasets, return limited results, and immediately terminate execution, queries in HStream operate on unbounded data streams, continuously update their results as the source streams change, and keep running until the user stops them explicitly. This kind of long-running query against unbounded streams is also known as a streaming query.
+
+By default, a query will continuously write its computing results to a new stream. Clients can subscribe to the result stream to obtain real-time updates of the computing results.
+
+## Materialized View
+
+Queries are also commonly used to create materialized views. Unlike streams that store records in an append-only way, materialized views are more similar to tables in relational databases that hold results in a compacted form, which means that you can query them directly and get the latest results immediately.
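+
+For example, a materialized view can be defined with a streaming SQL statement
+like the following minimal sketch (the stream `orders` and its fields here are
+purely illustrative and not part of this document's examples):
+
+```sql
+CREATE VIEW order_stats AS
+  SELECT product, MAX(amount) AS max_amount
+  FROM orders
+  GROUP BY product;
+```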
+
+Traditional databases also have materialized views, but the results are often stale and maintaining them has a significant impact on database performance, so they are rarely used. However, in HStream, the results saved in materialized views are automatically updated in real time as the source streams change, which is very useful in many scenarios, such as building real-time dashboards.
+
+## Connector
+
+Connectors are responsible for streaming data integration between HStream and external systems. They can be divided into two categories according to the direction of integration: source connectors and sink connectors.
+
+Source connectors are used for continuously ingesting data from other systems into HStream, and sink connectors are responsible for continuously distributing data from HStream to other systems. There are also different types of connectors for different systems, such as the PostgreSQL connector, the MongoDB connector, and so on.
+
+The running of connectors is supervised, managed and scheduled by HStream itself, without relying on any other systems.
diff --git a/docs/zh/v0.17.0/platform/_index.md b/docs/zh/v0.17.0/platform/_index.md
new file mode 100644
index 0000000..d7f40da
--- /dev/null
+++ b/docs/zh/v0.17.0/platform/_index.md
@@ -0,0 +1,12 @@
+---
+order:
+  - stream-in-platform.md
+  - write-in-platform.md
+  - subscription-in-platform.md
+  - create-queries-in-platform.md
+  - create-views-in-platform.md
+  - create-connectors-in-platform.md
+collapsed: false
+---
+
+HStream Platform
diff --git a/docs/zh/v0.17.0/platform/create-connectors-in-platform.md b/docs/zh/v0.17.0/platform/create-connectors-in-platform.md
new file mode 100644
index 0000000..be1258d
--- /dev/null
+++ b/docs/zh/v0.17.0/platform/create-connectors-in-platform.md
@@ -0,0 +1,108 @@
+# 创建和管理 Connector
+
+This page describes how to create and manage connectors in HStream Platform.
+
+## Create a connector
+
+There are two types of connectors: source connectors and sink connectors. A source connector is used to ingest data from external systems into HStream Platform, while a sink connector is used to distribute data from HStream Platform to external systems.
+
+### Create a source connector
+
+First, navigate to the **Sources** page and click the **New source** button to go to the **Create a new source** page.
+
+On this page, you should first select the **Connector type**. Currently, HStream Platform supports the following source connectors:
+
+- MongoDB
+- MySQL
+- PostgreSQL
+- SQL Server
+- Generator
+
+Click one of them to select it, and the page will display the corresponding configuration form.
+
+After filling in the configuration, click the **Create** button to create the source connector.
+
+::: tip
+
+For more details about the configuration of each source connector, please refer to [Connectors](../ingest-and-distribute/connectors.md).
+
+:::
+
+### Create a sink connector
+
+Creating a sink connector is similar to creating a source connector. First, navigate to the **Sinks** page and click the **New sink** button to go to the **Create a new sink** page.
+
+Then the next steps are the same as for creating a source connector.
+
+Currently, HStream Platform supports the following sink connectors:
+
+- MongoDB
+- MySQL
+- PostgreSQL
+- Blackhole
+
+## View connectors
+
+The **Sources** and **Sinks** pages display all the connectors in your account. For each connector, you can view the following information:
+
+- The **Name** of the connector.
+- The **Created time** of the connector.
+- The **Status** of the connector. +- The **Type** of the connector. +- **Actions**, which for the extra operations of the connector: + + - **Duplicate**: Duplicate the connector. + - **Delete**: Delete the connector. + +To view a specific connector, you can click the **Name** of the connector to go to the [details page](#view-connector-details). + +## View connector details + +The details page displays the detailed information of a connector: + +1. All the information in the [connectors](#view-connectors) page. +2. Different tabs are provided to display the related information of the connector: + + - [**Overview**](#view-connector-overview): Besides the basic information, also can view the metrics of the connector. + - **Config**: View the configuration of the connector. + - [**Consumption Process**](#view-connector-consumption-process): View the consumption process of the connector. + - **Logs**: View the tasks of the connector. + +## View connector overview + +The **Overview** page displays the metrics of a connector. The default duration is **last 5 minutes**. You can select different durations to control the time range of the metrics: + +- last 5 minutes +- last 1 hour +- last 3 hours +- last 6 hours +- last 12 hours +- last 1 day +- last 3 days +- last 1 week + +The metrics of the connector include (with last 5 minutes as an example), from left to right: + +- **Processed records throughput**: The number of records that the connector processes per second. +- **Processed bytes throughput**: The number of bytes that the connector processes per second. +- **Total records** (Sink): The number of records that the connector processes in the last 5 minutes. + +## View connector consumption process + +The **Consumption Process** page displays the consumption process of a connector. Different connectors have different consumption processes. + +## Delete a connector + +This section describes how to delete a connector. + +### Delete a connector from the Connectors page + +1. Navigate to the **Connectors** page. +2. Click the **Delete** icon of the connector you want to delete. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. + +### Delete a connector from the Connector Details page + +1. Navigate to the details page of the connector you want to delete. +2. Click the **Delete** button. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. diff --git a/docs/zh/v0.17.0/platform/create-queries-in-platform.md b/docs/zh/v0.17.0/platform/create-queries-in-platform.md new file mode 100644 index 0000000..1fb9165 --- /dev/null +++ b/docs/zh/v0.17.0/platform/create-queries-in-platform.md @@ -0,0 +1,127 @@ +# 创建和管理 Streaming Query + +This page describes how to create and manage streaming queries in HStream Platform. + +## Create a query + +First, navigate to the **Queries** page and click the **Create query** button to +go to the **Create query** page. + +In this page, you can see 3 areas used throughout the creation process: + +- The **Stream Catalog** area on the left is used to select the streams you + want to use in the query. +- The **Query Editor** area on the top right is used to write the query. +- The **Query Result** area on the bottom right is used to display the query result. + +Below sections describe how these areas are used in the creation process. + +### Stream Catalog + +The **Stream Catalog** will display all the streams as a list. 
You can use one of
+them as the source stream of the query. For example, if a stream is `test`, after
+selecting it, the **Query Editor** will be filled with the following query:
+
+```sql
+CREATE STREAM stream_iyngve AS SELECT * FROM test;
+```
+
+This can help you quickly create a query. You can also change the query to meet your needs.
+
+::: tip
+The auto-generated query is commented out by default. You need to uncomment it to make it work.
+:::
+
+::: info
+The auto-generated query will create a stream whose name has a `stream_` prefix and a random suffix.
+You can change the name of the stream to meet your needs.
+:::
+
+### Query Editor
+
+The **Query Editor** is used to write the query.
+
+Besides the text area, there is also a right sidebar to assist you in writing the query.
+To create a query, you need to provide a **Query name** to identify the query; a text field in the right sidebar will automatically generate a query name for you. You can also change it to meet your needs.
+
+Once you finish writing the query, click the **Save & Run** button to create the query and run it.
+
+### Query Result
+
+After creating the query, the **Query Result** area will display the query result in real time.
+
+The query result is displayed in a table. Each row represents a record in the stream. You can refer to [Get Records](./write-in-platform.md#get-records) to learn more about the record.
+
+If you want to stop viewing the query result, you can click the **Cancel** button to stop it. To view the query result again, click the **View Live Result** button.
+
+::: info
+
+When you create a materialized view, a query is created internally and its result is the view. So the query result is the same as the view result.
+
+:::
+
+## View queries
+
+The **Queries** page displays all the queries in your account. For each query, you can view the following information:
+
+- The **Name** of the query.
+- The **Created time** of the query.
+- The **Status** of the query.
+- The **SQL** of the query.
+- **Actions**, which contains the extra operations of the query:
+
+  - **Terminate**: Terminate the query.
+  - **Duplicate**: Duplicate the query.
+  - **Delete**: Delete the query.
+
+To view a specific query, you can click the **Name** of the query to go to the [details page](#view-query-details).
+
+## View query details
+
+The details page displays the detailed information of a query:
+
+1. All the information in the [queries](#view-queries) page.
+2. Different tabs are provided to display the related information of the query:
+
+   - [**Overview**](#view-query-overview): Besides the basic information, you can also view the metrics of the query.
+   - [**SQL**](#view-query-sql): View the SQL of the query.
+
+## View query overview
+
+The **Overview** page displays the metrics of a query. The default duration is **last 5 minutes**. You can select different durations to control the time range of the metrics:
+
+- last 5 minutes
+- last 1 hour
+- last 3 hours
+- last 6 hours
+- last 12 hours
+- last 1 day
+- last 3 days
+- last 1 week
+
+The metrics of the query include (with last 5 minutes as an example), from left to right:
+
+- **Input records throughput**: The number of records that the query receives from the source stream per second.
+- **Output records throughput**: The number of records that the query outputs to the result stream per second.
+- **Total records**: The total number of input and output records of the query in the last 5 minutes.
+- **Execution errors**: The number of errors that the query encounters in the last 5 minutes. + +## View query SQL + +The **SQL** page displays the SQL of a query. You can only view the SQL of the query, but cannot edit it. + +## Delete a query + +This section describes how to delete a query. + +### Delete a query from the Queries page + +1. Navigate to the **Queries** page. +2. Click the **Delete** icon of the query you want to delete. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. + +### Delete a query from the Query Details page + +1. Navigate to the details page of the query you want to delete. +2. Click the **Delete** button. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. diff --git a/docs/zh/v0.17.0/platform/create-views-in-platform.md b/docs/zh/v0.17.0/platform/create-views-in-platform.md new file mode 100644 index 0000000..9cc81bd --- /dev/null +++ b/docs/zh/v0.17.0/platform/create-views-in-platform.md @@ -0,0 +1,70 @@ +# 创建和管理 Materialized View + +This page describes how to create and manage materialized views in HStream Platform. + +## Create a view + +Create a view is similar to create a query. The main difference is that the SQL is a `CREATE VIEW` statement. + +Please refer to [Create a query](./create-queries-in-platform.md#create-a-query) for more details. + +## View views + +The **Views** page displays all the views in your account. For each view, you can view the following information: + +- The **Name** of the view. +- The **Created time** of the view. +- The **Status** of the view. +- **Actions**, which for the extra operations of the view: + + - **Delete**: Delete the view. + +To view a specific view, you can click the **Name** of the view to go to the [details page](#view-view-details). + +## View view details + +The details page displays the detailed information of a view: + +1. All the information in the [views](#view-views) page. +2. Different tabs are provided to display the related information of the view: + + - [**Overview**](#view-view-overview): Besides the basic information, also can view the metrics of the view. + - [**SQL**](#view-view-sql): View the SQL of the view. + +## View view overview + +The **Overview** page displays the metrics of a view. The default duration is **last 5 minutes**. You can select different durations to control the time range of the metrics: + +- last 5 minutes +- last 1 hour +- last 3 hours +- last 6 hours +- last 12 hours +- last 1 day +- last 3 days +- last 1 week + +The metrics of the view include (with last 5 minutes as an example), from left to right: + +- **Execution queries throughput**: The number of queries that the view executes per second. +- **Execution queries**: The number of queries that the view executes in the last 5 minutes. + +## View view SQL + +The **SQL** page displays the SQL of a view. You can only view the SQL of the view, but cannot edit it. + +## Delete a view + +This section describes how to delete a view. + +### Delete a view from the Views page + +1. Navigate to the **Views** page. +2. Click the **Delete** icon of the view you want to delete. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. + +### Delete a view from the View Details page + +1. Navigate to the details page of the view you want to delete. +2. Click the **Delete** button. A confirmation dialog will pop up. +3. 
Confirm the deletion by clicking the **Confirm** button in the dialog. diff --git a/docs/zh/v0.17.0/platform/stream-in-platform.md b/docs/zh/v0.17.0/platform/stream-in-platform.md new file mode 100644 index 0000000..ec40cd2 --- /dev/null +++ b/docs/zh/v0.17.0/platform/stream-in-platform.md @@ -0,0 +1,156 @@ +# 创建和管理 Stream + +This tutorial guides you on how to create and manage streams in HStream Platform. + +## Preparation + +1. If you do not have an account, please [apply for a trial](../start/try-out-hstream-platform.md#apply-for-a-trial) first and log in. After logging in, click **Streams** on the left sidebar to enter the streams page. + +2. If you have already logged in, click **Streams** on the left sidebar to enter the **Streams** page. + +3. Click the **New stream** button to create a stream. + +## Create a stream + +After clicking the **New stream** button, you will be directed to the **New Stream** page. You need to set some necessary properties for your stream and create it: + +1. Specify the **stream name**. You can refer to [Guidelines to name a resource](../write/stream.md#命名资源准则) to name a stream. + +2. Fill in with the number of **shards** you want this stream to have. The default value is **1**. + + > Shard is the primary storage unit for the stream. For more details, please refer to [Sharding in HStreamDB](../write/shards.md#sharding-in-hstreamdb). + +3. Fill in with the number of **replicas** for each stream. The default value is **3**. + +4. Fill in with the number of **retention** for each stream. Default value is **72**. Unit is **hour**. + +5. Click the **Confirm** button to create a stream. + +::: tip +For more details about **replicas** and **retention**, please refer to [Attributes of a Stream](../write/stream.md#stream-的属性). +::: + +::: warning +Currently, the number of **replicas** and **retention** are fixed for each stream in HStream Platform. We will gradually adjust these attributes in the future. +::: + +## View streams + +The **Streams** page lists all the streams in your account with a high-level overview. For each stream, you can view the following information: + +- The **name** of the stream. +- The **Creation time** of the stream. +- The number of **shards** in a stream. +- The number of **replicas** in a stream. +- The **Data retention period** of the records in a stream. +- **Actions**, which for the extra operations of the stream: + + - **Metrics**: View the metrics of the stream. + - **Subscriptions**: View the subscriptions of the stream. + - **Shards**: View the shard details of the stream. + - **Delete**: Delete the stream. + +To view a specific stream, click the name. [The details page of the stream](#view-stream-details) will be displayed. + +## View stream details + +The details page displays the detailed information of a stream: + +1. All the information in the [streams](#view-streams) page. +2. Different tabs are provided to display the related information of the stream: + + - [**Metrics**](#view-stream-metrics): View the metrics of the stream. + - [**Subscriptions**](#view-stream-subscriptions): View the subscriptions of the stream. + - [**Shards**](#view-stream-shards): View the shard details of the stream. + - [**Records**](#get-records-in-a-stream): Search records in the stream. + +### View stream metrics + +After clicking the **Metrics** tab, you can view the metrics of the stream. +The default duration is **last 5 minutes**. 
You can select different durations to control the time range of the metrics: + +- last 5 minutes +- last 1 hour +- last 3 hours +- last 6 hours +- last 12 hours +- last 1 day +- last 3 days +- last 1 week + +The metrics of the stream include (with last 5 minutes as an example), from left to right: + +- The **Append records throughout** chart shows the number of records to the stream per second in the last 5 minutes. +- The **Append bytes throughout** chart shows the number of bytes to the stream per second in the last 5 minutes. +- The **Total requests** chart shows the number of requests to the stream in the last 5 minutes. Including failed requests. +- The **Append requests throughout** chart shows the number of requests to the stream per second in the last 5 minutes. + +### View stream subscriptions + +After clicking the **Subscriptions** tab, you can view the subscriptions of the stream. + +To create a new subscription, please refer to [Create a Subscription](./subscription-in-platform.md#create-a-subscription). + +For more details about the subscription, please refer to [Subscription Details](./subscription-in-platform.md#subscription-details) + +### View stream shards + +After clicking the **Shards** tab, you can view the shard details of the stream. + +For each shard, you can view the following information: + +- The **ID** of the shard. +- The **Range start** of the shard. +- The **Range end** of the shard. +- The current **Status** of the shard. + +You can use the ID to get records. Please refer to [Get records in a stream](#get-records-in-a-stream) or [Get Records](./write-in-platform.md#get-records). + +### Get records in a stream + +After clicking the **Records** tab, you can get records in the stream. + +::: tip + +To get records from any streams, please refer to [Get Records](./write-in-platform.md#get-records). + +::: + +You can specify the following filters to get records: + +- **Shard**: Select one of the shards in the stream you want to get records from. +- Special filters: + - **Start record ID**: Get records after a specified record ID. The default is the first record. + - **Start date**: Get records after a specified date. + +After providing the filters (or not), click the **Get records** button to get records. Each record is displayed in a row with the following information: + +- The **ID** of the record. +- The **Key** of the record. +- The **Value** of the record. +- The **Shard ID** of the record. +- The **Creation time** of the record. + +## Delete a Stream + +This section describes how to delete a stream. + +::: warning +If a stream has subscriptions, this stream cannot be deleted. +::: + +::: danger +Deleting a stream is irreversible, and the data cannot be recovered after deletion. +::: + +### Delete a stream on the Streams page + +1. Navigate to the **Streams** page. +2. Click the **Delete** icon of the stream you want to delete. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. + +### Delete a stream on the Stream Details page + +1. Navigate to the details page of the stream you want to delete. +2. Click the **Delete** button. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. 
diff --git a/docs/zh/v0.17.0/platform/subscription-in-platform.md b/docs/zh/v0.17.0/platform/subscription-in-platform.md new file mode 100644 index 0000000..e77e132 --- /dev/null +++ b/docs/zh/v0.17.0/platform/subscription-in-platform.md @@ -0,0 +1,117 @@ +# 创建和管理 Subscription + +This tutorial guides you on how to create and manage subscriptions in HStream Platform. + +## Preparation + +1. If you do not have an account, please [apply for a trial](../start/try-out-hstream-platform.md#apply-for-a-trial) first and log in. After logging in, click **Subscriptions** on the left sidebar to enter the subscriptions page. + +2. If you have already logged in, click **Subscriptions** on the left sidebar to enter the **Subscriptions** page. + +3. Click the **New subscription** button to create a subscription. + +## Create a subscription + +After clicking the **New subscription** button, you will be directed to the **New subscription** page. You need to set some necessary properties for your stream and create it: + +1. Specify the **Subscription ID**. You can refer to [Guidelines to name a resource](../write/stream.md#命名资源准则) to name a subscription. + +2. Select a stream as the source from the dropdown list. + +3. Fill in with the **ACK timeout**. The default value is **60**. Unit is **second**. + +4. Fill in the number of **max unacked records**. The default value is **100**. + +5. Click the **Confirm** button to create a subscription. + +::: tip +For more details about **ACK timeout** and **max unacked records**, please refer to [Attributes of a Subscription](../receive/subscription.md#subscription-的属性). +::: + +::: warning +Currently, the number of **ACK timeout** and **max unacked records** are fixed for each subscription in HStream Platform. We will gradually adjust these attributes in the future. +::: + +## View subscriptions + +The **Subscriptions** page lists all the subscriptions in your account with a high-level overview. For each subscription, you can view the following information: + +- The subscription's **ID**. +- The name of the **stream** source. You can click on the stream name to navigate the [stream details](./stream-in-platform.md#view-stream-details) page. +- The **ACK timeout** of the subscription. +- The **Max unacked records** of the subscription. +- The **Creation time** of the subscription. +- **Actions**, which is used to expand the operations of the subscription: + + - **Metrics**: View the metrics of the subscription. + - **Consumers**: View the consumers of the subscription. + - **Delete**: Delete the subscription. + +To view a specific subscription, click the subscription's name. [The details page of the subscription](#view-subscription-details) will be displayed. + +## View subscription details + +The details page displays detailed information about a subscription: + +1. All the information in the [subscriptions](#view-subscriptions) page. +2. Different tabs are provided to view the related information of the subscription: + + - [**Metrics**](#view-subscription-metrics): View the metrics of the subscription. + - [**Consumers**](#view-subscription-consumers): View the consumers of the subscription. + - [**Consumption progress**](#view-the-consumption-progress-of-the-subscription): View the consumption progress of the subscription. + +### View subscription metrics + +After clicking the **Metrics** tab, you can view the metrics of the subscription. +The default duration is **last 5 minutes**. 
You can select different durations to control the time range of the metrics: + +- last 5 minutes +- last 1 hour +- last 3 hours +- last 6 hours +- last 12 hours +- last 1 day +- last 3 days +- last 1 week + +The metrics of the subscription include (with last 5 minutes as an example), from left to right: + +- The **Outcoming bytes throughput** chart shows the number of bytes sent by the subscription per second in the last 5 minutes. +- The **Outcoming records throughput** chart shows the number of records sent by the subscription per second in the last 5 minutes. +- The **Acknowledgements throughput** chart shows the number of acknowledgements received in the subscription per second in the last 5 minutes. +- The **Resent records** chart shows the number of records resent in the subscription in the last 5 minutes. + +### View subscription consumers + +After clicking the **Consumers** tab, you can view the consumers of the subscription. + +For each consumer, you can view the following information: + +- The **Name** of the consumer. +- The **Type** of the consumer. +- The **URI** of the consumer. + +### View the consumption progress of the subscription + +After clicking the **Consumption progress** tab, you can view the consumption progress of the subscription. + +For each progress, you can view the following information: + +- The **Shard ID** of the progress. +- The **Last checkpoint** of the progress. + +## Delete a Subscription + +This section describes how to delete a subscription. + +### Delete a subscription on the Subscriptions page + +1. Go to the **Subscriptions** page. +2. Click the **Delete** icon of the subscription you want to delete. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. + +### Delete a subscription on Subscription Details page + +1. Go to the details page of the subscription you want to delete. +2. Click the **Delete** button. A confirmation dialog will pop up. +3. Confirm the deletion by clicking the **Confirm** button in the dialog. diff --git a/docs/zh/v0.17.0/platform/write-in-platform.md b/docs/zh/v0.17.0/platform/write-in-platform.md new file mode 100644 index 0000000..08878ae --- /dev/null +++ b/docs/zh/v0.17.0/platform/write-in-platform.md @@ -0,0 +1,85 @@ +# 向 Stream 写入 Records + +After creating a stream, you can write records to it according to the needs of your application. +This page describes how to write and get records in HStream Platform. + +## Preparation + +To write records, you need to create a stream first. + +1. If you do not have a stream, please refer to [Create a Stream](./stream-in-platform.md#create-a-stream) to create a stream. + +2. Go into any stream you want to write records to on the **Streams** page. + +3. Click the **Write records** button to write records to the stream. + +## Write records + +A record is like a piece of JSON data. You can add arbitrary fields to a record, only ensure that the record is a valid JSON object. + +A record also ships with a partition key, which is used to determine which shard the record will be allocated to and improve the read/write performance. + +::: tip +For more details about the partition key, please refer to [Partition Key](../write/write.md#write-records-with-partition-keys). +::: + +Take the following steps to write records to a stream: + +1. Specify the optional **Key**. This is the partition key of the record. The server will automatically assign a default key to the record if not provided. + +2. Fill in the **Value**. 
This is the content of the record. It must be a valid JSON object.
+
+3. Click the **Produce** button to write the record to the stream.
+
+4. If the record is written successfully, you will see a success message below the **Produce** button.
+
+5. If the record fails to be written, you will see a failure message below the **Produce** button.
+
+## Get Records
+
+After writing records to a stream, you can get records from the **Records** page or the **Stream Details** page.
+
+### Get records from the Records page
+
+After navigating to the **Records** page, you can get records from a stream.
+
+Below are several filters you can use to get records:
+
+- **Stream**: Select the stream you want to get records from.
+- **Shard**: Select one of the shards in the stream you want to get records from.
+- Special filters:
+  - **Start record ID**: Get records after a specified record ID. The default is the first record.
+  - **Start date**: Get records after a specified date.
+
+::: info
+The **Stream** and **Shard** will be filled automatically after loading the page.
+By default, the filled value is the first stream and the first shard in the stream.
+You can change them to get records from other streams.
+:::
+
+::: info
+By default, at most **100 records** are shown. If you want to get more records,
+please specify a recent record ID in the **Start record ID** field or a recent date in the **Start date** field.
+:::
+
+After filling in the filters, click the **Get records** button to get records.
+
+For each record, you can view the following information:
+
+1. The **ID** of the record.
+2. The **Key** of the record.
+3. The **Value** of the record.
+4. The **Shard ID** of the record.
+5. The **Creation time** of the record.
+
+In the next section, you will learn how to get records from the Stream Details page.
+
+### Get records from the Stream Details page
+
+Similar to [Get records from Records page](#get-records-from-the-records-page),
+you can also get records from the **Stream Details** page.
+
+The difference is that you can get records without specifying the stream.
+The records will automatically be retrieved from the stream you are currently viewing.
+
+For more details, please refer to [Get records in a stream](./stream-in-platform.md#get-records-in-a-stream).
diff --git a/docs/zh/v0.17.0/process/_index.md b/docs/zh/v0.17.0/process/_index.md
new file mode 100644
index 0000000..396abbd
--- /dev/null
+++ b/docs/zh/v0.17.0/process/_index.md
@@ -0,0 +1,6 @@
+---
+order: ["sql.md"]
+collapsed: false
+---
+
+Process data
diff --git a/docs/zh/v0.17.0/process/sql.md b/docs/zh/v0.17.0/process/sql.md
new file mode 100644
index 0000000..c549964
--- /dev/null
+++ b/docs/zh/v0.17.0/process/sql.md
@@ -0,0 +1,202 @@
+# Perform Stream Processing by SQL
+
+This part provides a demo of performing real-time stream processing by SQL. You
+will be introduced to some basic concepts such as **streams**, **queries** and
+**materialized views**, with some examples to demonstrate the power of our
+processing engine, such as its ease of use and its handling of complex queries.
+
+## Overview
+
+One of the most important applications of stream processing is real-time
+business information analysis. Imagine that we are managing a supermarket and
+would like to analyze the sales information to adjust our marketing strategies.
+
+Suppose we have two **streams** of data:
+
+```sql
+info(product, category) // represents the category a product belongs to
+visit(product, user, length) // represents the length of time when a customer looks at a product
+```
+
+Unlike tables in traditional relational databases, a stream is an endless series
+of data that arrives over time. Next, we will run some analysis on the two
+streams to get some useful information.
+
+## Requirements
+
+Ensure you have deployed HStreamDB successfully. The easiest way is to follow
+[quickstart](../start/quickstart-with-docker.md) to start a local cluster. Of
+course, you can also try other methods mentioned in the Deployment part.
+
+## Step 1: Create related streams
+
+We have mentioned that we have two streams, `info` and `visit`, in the
+[overview](#overview). Now let's create them. Start an HStream SQL shell and run
+the following statements:
+
+```sql
+CREATE STREAM info;
+```
+
+```
++-------------+---------+----------------+-------------+
+| Stream Name | Replica | Retention Time | Shard Count |
++-------------+---------+----------------+-------------+
+| info        | 1       | 604800 seconds | 1           |
++-------------+---------+----------------+-------------+
+```
+
+```sql
+CREATE STREAM visit;
+```
+
+```
++-------------+---------+----------------+-------------+
+| Stream Name | Replica | Retention Time | Shard Count |
++-------------+---------+----------------+-------------+
+| visit       | 1       | 604800 seconds | 1           |
++-------------+---------+----------------+-------------+
+```
+
+We have successfully created two streams.
+
+## Step 2: Create streaming queries
+
+We can now create streaming **queries** on the streams. A query is a running
+task that fetches data from the stream(s) and produces results continuously.
+Let's create a trivial query that fetches data from stream `info` and outputs
+them:
+
+```sql
+SELECT * FROM info EMIT CHANGES;
+```
+
+The query will keep running until you interrupt it. Next, we can just leave it
+there and start another query. It fetches data from the stream `visit` and
+outputs the maximum visit length for each product. Start a new SQL shell and
+run:
+
+```sql
+SELECT product, MAX(length) AS max_len FROM visit GROUP BY product EMIT CHANGES;
+```
+
+Neither of the queries will print any results since we have not inserted any
+data yet. So let's do that.
+
+## Step 3: Insert data into streams
+
+There are multiple ways to insert data into the streams, such as using client
+libraries, and the inserted data will all be treated the same during processing.
+You can refer to [write data](../write/write.md) for client usage.
+
+For consistency and ease of demonstration, we will use SQL statements here.
+ +Start a new SQL shell and run: + +```sql +INSERT INTO info (product, category) VALUES ('Apple', 'Fruit'); +INSERT INTO visit (product, user, length) VALUES ('Apple', 'Alice', 10); +INSERT INTO visit (product, user, length) VALUES ('Apple', 'Bob', 20); +INSERT INTO visit (product, user, length) VALUES ('Apple', 'Caleb', 10); +``` + +Switch to the shells with running queries You should be able to see the expected +outputs as follows: + +```sql +SELECT * FROM info EMIT CHANGES; +``` +``` +{"category":"Fruit","product":"Apple"} +``` + +```sql +SELECT product, MAX(length) AS max_len FROM visit GROUP BY product EMIT CHANGES; +``` +``` +{"max_len":{"$numberLong":"10"},"product":"Apple"} +{"max_len":{"$numberLong":"20"},"product":"Apple"} +{"max_len":{"$numberLong":"20"},"product":"Apple"} +``` + +Note that `max_len` changes from `10` to `20`, which is expected. + +## Step 4: Create materialized views + +Now let's do some more complex analysis. If we want to know the longest visit +time of each category **any time we need it**, the best way is to create +**materialized views**. + +A materialized view is an object which contains the result of a query. In +HStreamDB, the view is maintained and continuously updated in memory, which +means we can read the results directly from the view right when needed without +any extra computation. Thus getting results from a view is very fast. + +Here we can create a view like + +```sql +CREATE VIEW result AS SELECT info.category, MAX(visit.length) as max_length FROM info JOIN visit ON info.product = visit.product WITHIN (INTERVAL 1 HOUR) GROUP BY info.category; +``` +``` ++--------------------------+---------+--------------------------+---------------------------+ +| Query ID | Status | Created Time | SQL Text | ++--------------------------+---------+--------------------------+---------------------------+ +| cli_generated_xbexrdhwgz | RUNNING | 2023-07-06T07:46:13+0000 | CREATE VIEW result AS ... | ++--------------------------+---------+--------------------------+---------------------------+ +``` + +Note the query ID will be different from the one shown above. Now let's try to +get something from the view: + +```sql +SELECT * FROM result; +``` + +It outputs no data because we have not inserted any data into the streams since +**after** the view is created. Let's do it now: + +```sql +INSERT INTO info (product, category) VALUES ('Apple', 'Fruit'); +INSERT INTO info (product, category) VALUES ('Banana', 'Fruit'); +INSERT INTO info (product, category) VALUES ('Carrot', 'Vegetable'); +INSERT INTO info (product, category) VALUES ('Potato', 'Vegetable'); +INSERT INTO visit (product, user, length) VALUES ('Apple', 'Alice', 10); +INSERT INTO visit (product, user, length) VALUES ('Apple', 'Bob', 20); +INSERT INTO visit (product, user, length) VALUES ('Carrot', 'Bob', 50); +``` + +## Step 5: Get results from views + +Now let's find out what is in our view: + +```sql +SELECT * FROM result; +``` +``` +{"category":"Fruit","max_length":{"$numberLong":"20"}} +{"category":"Vegetable","max_length":{"$numberLong":"50"}} +``` + +It works. Now insert more data and repeat the inspection: + +```sql +INSERT INTO visit (product, user, length) VALUES ('Banana', 'Alice', 40); +INSERT INTO visit (product, user, length) VALUES ('Potato', 'Eve', 60); +``` + +And query again: + +```sql +SELECT * FROM result; +``` +``` +{"category":"Fruit","max_length":{"$numberLong":"40"}} +{"category":"Vegetable","max_length":{"$numberLong":"60"}} +``` + +The result is updated right away. 
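+
+When you are done with the demo, you can optionally clean up the resources
+created above. The following is just a sketch: it assumes your shell supports
+the `DROP VIEW` statement in addition to `DROP STREAM`, and note that dropping
+a stream also terminates the queries that depend on it:
+
+```sql
+DROP VIEW result;
+DROP STREAM info;
+DROP STREAM visit;
+```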
+ +## Related Pages + +For a detailed introduction to the SQL, see +[HStream SQL](../reference/sql/sql-overview.md). diff --git a/docs/zh/v0.17.0/receive/_index.md b/docs/zh/v0.17.0/receive/_index.md new file mode 100644 index 0000000..2807467 --- /dev/null +++ b/docs/zh/v0.17.0/receive/_index.md @@ -0,0 +1,6 @@ +--- +order: ["subscription.md", "consume.md", "read.md"] +collapsed: false +--- + +Receive data diff --git a/docs/zh/v0.17.0/receive/consume.md b/docs/zh/v0.17.0/receive/consume.md new file mode 100644 index 0000000..e643ed7 --- /dev/null +++ b/docs/zh/v0.17.0/receive/consume.md @@ -0,0 +1,149 @@ +# 通过订阅(Subscription)从 HStreamDB 消费数据 + +## 什么是一个订阅(Subscription)? + +要从一个 stream 中消费数据,你必须为该 stream 创建一个订阅。创建成功后,每个订阅 +都将从头开始检索数据。接收和处理消息的消费者(consumer)通过一个订阅与一个 +stream 相关联。 + +一个 stream 可以有多个订阅,但一个给定的订阅只属于一个 stream。同样地,一个订阅 +对应一个具有多个消费者的 consumer group,但每个消费者只属于一个订阅。 + +请参考[这个页面](./subscription.md),了解关于创建和管理订阅的详细信息。 + +## 如何用一个订阅来消费数据 + +为了消费写入 stream 中的数据,HStreamDB 客户端库提供了异步 Consumer API,它将发 +起请求加入指定订阅的 consumer group。 + +### 两种 HStream 记录类型和相应的 Receiver + +正如我们所介绍的,在 HStreamDB 中有两种 Record 类型,HRecord 和 Raw Record。当启 +动一个消费者时,需要相应的 Receiver。在只设置了 HRecord Receiver 的情况下,当消 +费者收到一条 raw record 时,消费者将忽略它并消费下一条 record。因此,原则上,我 +们不建议在同一个 stream 中同时写入 HRecord 和 raw record。然而,这并没有在实现的 +层面上严格禁止,用户仍然可以提供两种 receiver 来同时处理两种类型的 record。 + +## 简单的数据消费实例 + +异步的 Consumer API 不需要你的应用程序为新到来的 record 进行阻塞,可以让你的应用 +程序获得更高的吞吐量。Records 可以在你的应用程序中使用一个长期运行的 records +receiver 来接收,并逐条 ack,如下面的例子中所示。 + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/ConsumeDataSimpleExample.java [Java] + +<<< @/../examples/go/examples/ExampleConsumer.go [Go] + +@snippet examples/py/snippets/guides.py common subscribe-records + +::: + +For better performance, Batched Ack is enabled by default with setting +`ackBufferSize` = 100 and `ackAgeLimit` = 100, which you can change when +initiating your consumers. 
+ +::: code-group + +```java +Consumer consumer = + client + .newConsumer() + .subscription("you_subscription_id") + .name("your_consumer_name") + .hRecordReceiver(your_receiver) + // When ack() is called, the consumer will not send it to servers immediately, + // the ack request will be buffered until the ack count reaches ackBufferSize + // or the consumer is stopping or reached ackAgelimit + .ackBufferSize(100) + .ackAgeLimit(100) + .build(); +``` + +::: + +为了获得更好的性能,默认情况下启用了 Batched Ack,和 ackBufferSize = 100 和 +ackAgeLimit = 100 的设置,你可以在启动你的消费者时更新它。 + +::: code-group + +```java +Consumer consumer = + client + .newConsumer() + .subscription("you_subscription_id") + .name("your_consumer_name") + .hRecordReceiver(your_receiver) + // When ack() is called, the consumer will not send it to servers immediately, + // the ack request will be buffered until the ack count reaches ackBufferSize + // or the consumer is stopping or reached ackAgelimit + .ackBufferSize(100) + .ackAgeLimit(100) + .build(); +``` + +::: + +## 多个消费者和共享订阅 + +如先前提到的,在 HStream 中,一个订阅是对应了一个 consumer group 消费的。在这个 +consumer group 中,可能会有多个消费者,并且他们共享订阅的进度。当想要提高从订阅 +中消费数据的速度时,我们可以让一个新的消费者加入现有的订阅。这段代码是用来演示新 +的消费者是如何加入 consumer group 的。更常见的情况是,用户使用来自不同客户端的消 +费者去共同消费一个订阅。 + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/ConsumeDataSharedExample.java [Java] + +<<< @/../examples/go/examples/ExampleConsumerGroup.go [Go] + +::: + +## 使用 `maxUnackedRecords` 的来实现流控 + +一个常发生的状况是,消费者处理和确认数据的速度很可能跟不上服务器发送的速度,或者 +一些意外的问题导致消费者无法确认收到的数据,这可能会导致以下问题: + +服务器将不得不不断重发未确认的消息,并维护未确认的消息的信息,这将消耗服务器的资 +源,并导致服务器面临资源耗尽的问题。 + +为了缓解上述问题,使用订阅的 `maxUnackedRecords` 设置来控制消费者接收消息时允许 +的未确认 records 的最大数量。一旦数量超过 `maxUnackedRecords`,服务器将停止向当 +前订阅的消费者们发送消息。 + +## 按顺序接收消息 + +注意:下面描述的接收顺序只针对单个消费者。如果一个订阅有多个消费者,在每个消费者 +中仍然可以保证顺序,但如果我们把 consumer group 看成一个整体,那么顺序性就不再保 +证了。 + +消费者将按照 HStream 服务器收到信息的顺序接收具有相同分区键的 record。由于 +HStream 以至少一次的语义发送 hstream record,在某些情况下,当 HServer 可能没有收 +到中间某些 record 的 ack 时,它将可能多次发送这条 record。而在这些情况下,我们也 +不能保证顺序。 + +## 处理错误 + +当消费者正在运行时,如果 receiver 失败了,默认的行为是消费者会将将捕获异常,打印 +错误日志,并继续消费下一条记录而不是导致消费者也失败。 + +在其他情况下可能会导致消费者的失败,例如网络、订阅被删除等。然而,作为一个服务, +你可能希望消费者继续运行,所以你可以设置一个监听器来处理一个消费者失败的情况。 + +::: code-group + +```java +// add Listener for handling failed consumer +var threadPool = new ScheduledThreadPoolExecutor(1); +consumer.addListener( + new Service.Listener() { + public void failed(Service.State from, Throwable failure) { + System.out.println("consumer failed, with error: " + failure.getMessage()); + } + }, + threadPool); +``` + +::: diff --git a/docs/zh/v0.17.0/receive/read.md b/docs/zh/v0.17.0/receive/read.md new file mode 100644 index 0000000..0760ed8 --- /dev/null +++ b/docs/zh/v0.17.0/receive/read.md @@ -0,0 +1,37 @@ +# Get Records from Shards of the Stream with Reader + +## What is a Reader + +To allow users to retrieve data from any stream shard, HStreamDB provides +readers for applications to manually manage the exact position of the record to +read from. Unlike subscription and consumption, a reader can be seen as a +lower-level API for getting records from streams. It gives users direct access +to any records in the stream, more precisely, any records from a specific shard +in the stream, and it does not require or rely on subscriptions and will not +send any acknowledgement back to the server. Therefore, the reader is helpful +for the case that requires better flexibility or rewinding of data reading. 
+ +When a user creates a reader instance, it is required that the user needs to +specify which record and which shard the reader begins from. A reader provides +starting position with the following options: + +- The earliest available record in the shard +- The latest available record in the shard +- User-specified record location in the shard + +## Reader Example + +To read from the shards, users are required to get the desired shard id with +[`listShards`](../write/shards.md#listshards). + +The name of a reader should also follow the format specified by the [guidelines](../write/stream.md#guidelines-to-name-a-resource) + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/ReadDataWithReaderExample.java [Java] + +<<< @/../examples/go/examples/ExampleReadDataWithReader.go [Go] + +@snippet examples/py/snippets/guides.py common read-reader + +::: diff --git a/docs/zh/v0.17.0/receive/subscription.md b/docs/zh/v0.17.0/receive/subscription.md new file mode 100644 index 0000000..3e6f92c --- /dev/null +++ b/docs/zh/v0.17.0/receive/subscription.md @@ -0,0 +1,65 @@ +# 创建和管理 Subscription + +## Subscription 的属性 + +- ackTimeoutSeconds + + 指定 HServer 将 records 标记为 unacked 的最大等待时间,之后该记录将被再次发送。 + +- maxUnackedRecords。 + + 允许的未 acked record 的最大数量。超过设定的大小后,服务器将停止向相应的消费者 + 发送 records。 + +## 创建一个 Subscription + +每个 subscription 都必须指定要订阅哪个 stream,这意味着你必须确保要订阅的 stream +已经被创建。 + +关于订阅的名称,请参考[资源命名准则](../write/stream.md#命名资源的准则) + +当创建一个 subscription 时,你可以像这样提供提到的属性: + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/CreateSubscriptionExample.java [Java] + +<<< @/../examples/go/examples/ExampleCreateSubscription.go [Go] + +@snippet examples/py/snippets/guides.py common create-subscription + +::: + +## 删除一个订阅 + +要删除一个的订阅,你需要确保没有活跃的订阅消费者,除非启用强制删除。 + +## 强制删除一个 Subscription + +如果你确实想删除一个 subscription,并且有消费者正在运行,请启用强制删除。当强制 +删除一个 subscription 时,该订阅将处于删除中的状态,并关闭正在运行的消费者,这意 +味着你将无法加入、删除或创建一个同名的 subscription 。在删除完成后,你可以用同样 +的名字创建一个订阅,这个订阅将是一个全新的订阅。即使他们订阅的是同一个流,这个新 +的订阅也不会与被删除的订阅共享消费进度。 + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/DeleteSubscriptionExample.java [Java] + +<<< @/../examples/go/examples/ExampleDeleteSubscription.go [Go] + +@snippet examples/py/snippets/guides.py common delete-subscription + +::: + +## 列出 HStream 中的 subscription 信息 + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/ListSubscriptionsExample.java [Java] + +<<< @/../examples/go/examples/ExampleListSubscriptions.go [Go] + +@snippet examples/py/snippets/guides.py common list-subscription + +::: diff --git a/docs/zh/v0.17.0/reference/_index.md b/docs/zh/v0.17.0/reference/_index.md new file mode 100644 index 0000000..2b69445 --- /dev/null +++ b/docs/zh/v0.17.0/reference/_index.md @@ -0,0 +1,6 @@ +--- +order: ["architecture", "sql", "cli.md", "config.md", "metrics.md"] +collapsed: false +--- + +参考 diff --git a/docs/zh/v0.17.0/reference/architecture/_index.md b/docs/zh/v0.17.0/reference/architecture/_index.md new file mode 100644 index 0000000..149940f --- /dev/null +++ b/docs/zh/v0.17.0/reference/architecture/_index.md @@ -0,0 +1,5 @@ +--- +order: ["overview.md", "hstore.md", "hserver.md"] +--- + +架构 diff --git a/docs/zh/v0.17.0/reference/architecture/hserver.md b/docs/zh/v0.17.0/reference/architecture/hserver.md new file mode 100644 index 0000000..244d88b --- /dev/null +++ b/docs/zh/v0.17.0/reference/architecture/hserver.md @@ -0,0 +1,22 @@ +# HStream Server + +HStream Server (HSQL) 作为 HStreamDB 
的核心计算组件,其本身被设计为无状态的。它主要负责客户端的连接管理,安全认证,SQL 解析,SQL 优化,以及流计算任务的创建、调度、执行和管理等。 +HStream Server + +## HStream Server (HSQL) 自顶向下可具体分为以下几层结构 + +### 接入层 + +主要负责客户端请求的协议处理、连接管理、以及安全认证和访问控制。 + +### SQL 层 + +客户端主要通过 SQL 语句与 HStreamDB 交互,来完成大部分流处理和实时分析的任务。该层主要负责将用户提交的 SQL 语句编译成逻辑数据流图。 与经典的数据库系统一样,这里包含两个核心的子组件:SQL 解析器 和 SQL 优化器。 SQL 解析器负责负责完成词法分析、语法分析,将 SQL 语句编译到对应的关系代数表达式;SQL 优化器负责根据各种规则和 Context 信息对生成的执行计划进行优化。 + +### Stream 层 + +该层包含各种常见的流处理算子的实现,以及表达数据流图的数据结构和 DSL,还支持用户自定义函数作为处理算子。 主要负责为 SQL 层传递下来的逻辑数据流图选择对应的算子实现和优化,生成可执行的数据流图。 + +### Runtime 层 + +该层负责实际执行数据流图的计算任务并返回结果。主要包含任务调度器、状态管理器以及执行优化器等组件。其中调度器负责计算任务在可用计算资源之间的调度,可能是在单个处理的多线程之间调度,也可能是在单机的多处理器之间调度,或者是在分布式集群的多台机器或容器之间调度。状态管理器负责协调流出里算子的状态维护和容错。执行优化器可以通过自动化并行等手段加速数据流图的执行。 diff --git a/docs/zh/v0.17.0/reference/architecture/hstore.md b/docs/zh/v0.17.0/reference/architecture/hstore.md new file mode 100644 index 0000000..b118978 --- /dev/null +++ b/docs/zh/v0.17.0/reference/architecture/hstore.md @@ -0,0 +1,24 @@ +# HStream Storage + +HStream Storage (HStore) 作为 HStreamDB 的核心存储组件,它是专门为流式数据设计的低延时存储组件,不但能够分布式持久化存储大规模实时数据,而且能够通过 Auto-Tiering 机制,无缝对接 S3 之类的大容量二级存储,实现历史数据和实时数据的统一存储。 + +HStream Storage (HStore) 的核心存储模型是非常贴合流式数据的日志模型,数据流本身可以看作是一个无限增长的日志,它支持的典型操作包括追加写和区间读,同时数据流是不可变的,一般不支持更新操作。 +HStream Storage (HStore) + +## HStream Storage (HStore) 可分为以下几个层次 + +### Streaming Data API 层 + +该层提供核心的数据流管理和读写操作,包括数据流的创建、删除,以及向数据流中写入数据和消费数据流中的数据。在 HStore 对创建的数据流的数量没有限制,同时能支持大量数据流的并发写入,在大量数据流并发写入的时候依然能够保持稳定的低延迟,HStore 的存储设计中并没有按照数据流来做存储, 因此数据流的创建是非常轻量的操作。针对数据流的特点,HStore 提供了 append 操作支持数据快速写入,同时在读取流数据方面,提供了基于订阅语义的 read 操作,数据流中新写入的数据会被实时推送给数据消费者。 + +### 复制层 + +该层主要基于优化的 Flexible Paxos 共识引擎实现了流数据的强一致复制,保证数据的容错和可高可用性。同时通过非确定性的数据分布策略,最大化了集群数据的可用性。而且支持复制组在线重配置,实现了无缝的集群数据均衡和水平扩展。 + +### 本地存储层 + +该层主要负责数据的本地持久化存储,实现上基于优化的 RocksDB 存储引擎 封装了流数据的存取接口,可支持大量数据低延迟的写入和读取。 + +### 二级存储层 + +该层为多种长期存储系统提供了统一的接口封装,比如 HDFS, AWS S3 等,支持将历史数据自动卸载到这些二级存储系统上,同时也可以通过统一的 Streaming Data 接口来访问。 diff --git a/docs/zh/v0.17.0/reference/architecture/overview.md b/docs/zh/v0.17.0/reference/architecture/overview.md new file mode 100644 index 0000000..5c3aa22 --- /dev/null +++ b/docs/zh/v0.17.0/reference/architecture/overview.md @@ -0,0 +1,7 @@ +# Architecture + +HStreamDB 的整体架构如下图所示,单个 HStreamDB 节点主要由 HStream Server (HSQL) 和 HStream Storage (HStore) 两个核心部件组成,一个 HStream 集群由若干个对等的 HStreamDB 节点组成, 客户端可连接至集群中任意一个 HStreamDB 节点, 并通过熟悉的 SQL 语言来完成各种从简单到复杂的流处理和分析任务。 + +![](https://static.emqx.net/images/faab4a8b1d02f14bc5a4153fe37f21ca.png) + +
HStreamDB 整体架构
diff --git a/docs/zh/v0.17.0/reference/cli.md b/docs/zh/v0.17.0/reference/cli.md new file mode 100644 index 0000000..df379be --- /dev/null +++ b/docs/zh/v0.17.0/reference/cli.md @@ -0,0 +1,564 @@ +# HStream CLI + +We can run the following to use HStream CLI: + +```sh-vue +docker run -it --rm --name some-hstream-admin --network host hstreamdb/hstream:{{ $version() }} hstream --help +``` + +For ease of illustration, we execute an interactive bash shell in the HStream +container to use HStream admin, + +The following example usage is based on the cluster started in +[quick start](../start/quickstart-with-docker.md), please adjust +correspondingly. + +```sh +docker exec -it docker_hserver_1 bash +``` +``` +hstream --help +``` + +```txt +======= HStream CLI ======= + +Usage: hstream [--host SERVER-HOST] [--port INT] [--tls-ca STRING] + [--tls-key STRING] [--tls-cert STRING] [--retry-timeout INT] + [--service-url ARG] COMMAND + +Available options: + --host SERVER-HOST Server host value (default: "127.0.0.1") + --port INT Server port value (default: 6570) + --tls-ca STRING path name of the file that contains list of trusted + TLS Certificate Authorities + --tls-key STRING path name of the client TLS private key file + --tls-cert STRING path name of the client TLS public key certificate + file + --retry-timeout INT timeout to retry connecting to a server in seconds + (default: 60) + --service-url ARG The endpoint to connect to + -h,--help Show this help text + +Available commands: + sql Start HStream SQL Shell + node Manage HStream Server Cluster + init Init HStream Server Cluster + stream Manage Streams in HStreamDB + subscription Manage Subscriptions in HStreamDB +``` + +## Connection + +### HStream URL + +The HStream CLI Client supports connecting to the server cluster with a url in +the following format: + +``` +://: +``` + +| Components | Description | Required | +|------------|-------------|----------| +| scheme | The scheme of the connection. Currently, we have `hstream`. To enable security options, `hstreams` is also supported | Yes | +| endpoint | The endpoint of the server cluster, which can be the hostname or address of the server cluster. | If not given, the value will be set to the `--host` default `127.0.0.1` | +| port | The port of the server cluster. | If not given, the value will be set to the `--port` default `6570` | + +### Connection Parameters + +HStream commands accept connection parameters as separate command-line flags, in addition (or in replacement) to `--service-url`. + +::: tip + +In the cases where both `--service-url` and the options below are specified, the client will use the value in `--service-url`. + +::: + +| Option | Description | +|-|-| +| `--host` | The server host and port number to connect to. This can be the address of any node in the cluster. Default: `127.0.0.1` | +| `--port` | The server port to connect to. Default: `6570`| + +### Security Settings (optional) + +If the [security option](../security/overview.md) is enabled, here are +some options that should also be configured for CLI correspondingly. 
+ +#### Encryption + +If [server encryption](../security/encryption.md) is enabled, the +`--tls-ca` option should be added to CLI connection options: + +```sh +hstream --tls-ca "" +``` + +### Authentication + +If [server authentication](../security/authentication.md) is enabled, +the `--tls-key` and `--tls-cert` options should be added to CLI connection +options: + +```sh +hstream --tls-key "" --tls-cert "" +``` + +## Check Cluster Status + +```sh +hstream node --help +``` +``` +Usage: hstream node COMMAND + Manage HStream Server Cluster + +Available options: + -h,--help Show this help text + +Available commands: + list List all running nodes in the cluster + status Show the status of nodes specified, if not specified + show the status of all nodes + check-running Check if all nodes in the the cluster are running, + and the number of nodes is at least as specified +``` + +```sh +hstream node list +``` +``` ++-----------+ +| server_id | ++-----------+ +| 100 | +| 101 | ++-----------+ +``` + +```sh +hstream node status +``` +``` ++-----------+---------+-------------------+ +| server_id | state | address | ++-----------+---------+-------------------+ +| 100 | Running | 192.168.64.4:6570 | +| 101 | Running | 192.168.64.5:6572 | ++-----------+---------+-------------------+ +``` + +```sh +hstream node check-running +``` +``` +All nodes in the cluster are running. +``` + +## Manage Streams + +We can also manage streams through the hstream command line tool. + +```sh +hstream stream --help +``` +``` +Usage: hstream stream COMMAND + Manage Streams in HStreamDB + +Available options: + -h,--help Show this help text + +Available commands: + list Get all streams + create Create a stream + describe Get the details of a stream + delete Delete a stream +``` + +### Create a stream + +```sh +Usage: hstream stream create STREAM_NAME [-r|--replication-factor INT] + [-b|--backlog-duration INT] [-s|--shards INT] + Create a stream + +Available options: + STREAM_NAME The name of the stream + -r,--replication-factor INT + The replication factor for the stream (default: 1) + -b,--backlog-duration INT + The backlog duration of records in stream in seconds + (default: 0) + -s,--shards INT The number of shards the stream should have + (default: 1) + -h,--help Show this help text +``` + +Example: Create a demo stream with the default settings. + +```sh +hstream stream create demo +``` +``` ++-------------+---------+----------------+-------------+ +| Stream Name | Replica | Retention Time | Shard Count | ++-------------+---------+----------------+-------------+ +| demo | 1 | 0 seconds | 1 | ++-------------+---------+----------------+-------------+ +``` + +### Show and delete streams + +```sh +hstream stream list +``` +``` ++-------------+---------+----------------+-------------+ +| Stream Name | Replica | Retention Time | Shard Count | ++-------------+---------+----------------+-------------+ +| demo2 | 1 | 0 seconds | 1 | ++-------------+---------+----------------+-------------+ +``` + +```sh +hstream stream delete demo +``` +``` +Done. +``` + +```sh +hstream stream list +``` +``` ++-------------+---------+----------------+-------------+ +| Stream Name | Replica | Retention Time | Shard Count | ++-------------+---------+----------------+-------------+ +``` + +## Manage Subscription + +We can also manage streams through the hstream command line tool. 
+ +```sh +hstream stream --help +``` +``` +Usage: hstream subscription COMMAND + Manage Subscriptions in HStreamDB + +Available options: + -h,--help Show this help text + +Available commands: + list Get all subscriptions + create Create a subscription + describe Get the details of a subscription + delete Delete a subscription +``` + +### Create a subscription + +```sh +Usage: hstream subscription create SUB_ID --stream STREAM_NAME + [--ack-timeout INT] + [--max-unacked-records INT] + [--offset [earliest|latest]] + Create a subscription + +Available options: + SUB_ID The ID of the subscription + --stream STREAM_NAME The stream associated with the subscription + --ack-timeout INT Timeout for acknowledgements in seconds + --max-unacked-records INT + Maximum number of unacked records allowed per + subscription + --offset [earliest|latest] + The offset of the subscription to start from + -h,--help Show this help text +``` + +Example: Create a subscription to the stream `demo` with the default settings. + +```sh +hstream subscription create --stream demo sub_demo +``` +``` ++-----------------+-------------+-------------+---------------------+ +| Subscription ID | Stream Name | Ack Timeout | Max Unacked Records | ++-----------------+-------------+-------------+---------------------+ +| sub_demo | demo | 60 seconds | 10000 | ++-----------------+-------------+-------------+---------------------+ +``` + +### Show and delete streams + +```sh +hstream subscription list +``` +``` ++-----------------+-------------+-------------+---------------------+ +| Subscription ID | Stream Name | Ack Timeout | Max Unacked Records | ++-----------------+-------------+-------------+---------------------+ +| sub_demo | demo | 60 seconds | 10000 | ++-----------------+-------------+-------------+---------------------+ +``` + +```sh +hstream subscription delete sub_demo +``` +``` +Done. +``` + +```sh +hstream subscription list +``` +``` ++-----------------+-------------+-------------+---------------------+ +| Subscription ID | Stream Name | Ack Timeout | Max Unacked Records | ++-----------------+-------------+-------------+---------------------+ +``` + +## HStream SQL + +HStreamDB also provides an interactive SQL shell for a series of operations, +such as the management of streams and views, data insertion and retrieval, etc. 
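+For example, you can start an interactive session directly, or run a single statement and quit using the `-e` flag listed in the help below (assuming a running local server):
+
+```sh
+# start the interactive SQL shell
+hstream sql
+
+# run one statement non-interactively and quit
+hstream sql -e "SHOW STREAMS;"
+```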
+ +```sh +hstream sql --help +``` +``` +Usage: hstream sql [--update-interval INT] [--retry-timeout INT] + Start HStream SQL Shell + +Available options: + --update-interval INT interval to update available servers in seconds + (default: 30) + --retry-timeout INT timeout to retry connecting to a server in seconds + (default: 60) + -e,--execute STRING execute the statement and quit + --history-file STRING history file path to write interactively executed + statements + -h,--help Show this help text +``` + +Once you entered shell, you can see the following help info: + +```sh + __ _________________ _________ __ ___ + / / / / ___/_ __/ __ \/ ____/ | / |/ / + / /_/ /\__ \ / / / /_/ / __/ / /| | / /|_/ / + / __ /___/ // / / _, _/ /___/ ___ |/ / / / + /_/ /_//____//_/ /_/ |_/_____/_/ |_/_/ /_/ + + +Command + :h To show these help info + :q To exit command line interface + :help [sql_operation] To show full usage of sql statement + +SQL STATEMENTS: + To create a simplest stream: + CREATE STREAM stream_name; + + To create a query select all fields from a stream: + SELECT * FROM stream_name EMIT CHANGES; + + To insert values to a stream: + INSERT INTO stream_name (field1, field2) VALUES (1, 2); + +``` + +There are two kinds of commands: + +1. Basic shell commands, starting with `:` +2. SQL statements end with `;` + +### Basic CLI Operations + +To quit the current CLI session: + +```sh +:q +``` + +To print out help info overview: + +```sh +:h +``` + +To show the specific usage of some SQL statements: + +```sh +:help CREATE +``` +``` + CREATE STREAM [IF EXIST] [AS ] [ WITH ( {stream_options} ) ]; + CREATE {SOURCE|SINK} CONNECTOR [IF NOT EXIST] WITH ( {connector_options} ); + CREATE VIEW AS ; +``` + +Available SQL operations include: `CREATE`, `DROP`, `SELECT`, `SHOW`, `INSERT`, +`TERMINATE`. + +### SQL Statements + +All the processing and storage operations are done via SQL statements. + +#### Stream + +There are two ways to create a new data stream. + +1. Create an ordinary stream: + +```sql +CREATE STREAM stream_name; +``` + +This will create a stream with no particular function. You can `SELECT` data +from the stream and `INSERT` to via the corresponding SQL statement. + +2. Create a stream, and this stream will also run a query to select specified + data from some other stream. + +Adding a Select statement after Create with a keyword `AS` can create a stream +will create a stream that processes data from another stream. + +For example: + +```sql +CREATE STREAM stream_name AS SELECT * from demo; +``` + +In the example above, by adding an `AS` followed by a `SELECT` statement to the +normal `CREATE` operation, it will create a stream that will also select all the +data from the `demo`. + +After Creating the stream, we can insert values into the stream. + +```sql +INSERT INTO stream_name (field1, field2) VALUES (1, 2); +``` + +There is no restriction on the number of fields a query can insert. Also, the +type of value is not restricted. However, you need to make sure that the number +of fields and the number of values are aligned. + +The deletion command is `DROP STREAM ;`, which deletes a stream, +and terminates all the [queries](#queries) that depend on the stream. + +For example: + +```sql +SELECT * FROM demo EMIT CHANGES; +``` + +will be terminated if the stream demo is deleted; + +```sql +DROP STREAM demo; +``` + +If you try to delete a stream that does not exist, an error message will be +returned. 
To turn it off, you can use add `IF EXISTS` after the stream_name: + +```sql +DROP STREAM demo IF EXISTS; +``` + +#### Show all streams + +You can also show all streams by using the `SHOW STREAMS` command. + +```sql +SHOW STEAMS; +``` +``` ++-------------+---------+----------------+-------------+ +| Stream Name | Replica | Retention Time | Shard Count | ++-------------+---------+----------------+-------------+ +| demo | 3 | 0sec | 1 | ++-------------+---------+----------------+-------------+ +``` + +#### Queries + +Run a continuous query on the stream to select data from a stream: + +After creating a stream, we can select data from the stream in real-time. All +the data inserted after the select query is created will be printed out when the +insert operation happens. Select supports real-time processing of the data +inserted into the stream. + +For example, we can choose the field and filter the data selected from the +stream. + +```sql +SELECT a FROM demo EMIT CHANGES; +``` + +This will only select field `a` from the stream demo. + +How to terminate a query? + +A query can be terminated if we know the query id: + +```sql +TERMINATE QUERY ; +``` + +We can get all the query information by command `SHOW`: + +```sql +SHOW QUERIES; +``` + +output just for demonstration : + +``` ++------------------+------------+--------------------------+----------------------------------+ +| Query ID | Status | Created Time | SQL Text | ++------------------+------------+--------------------------+----------------------------------+ +| 1361978122003419 | TERMINATED | 2022-07-28T06:03:42+0000 | select * from demo emit changes; | ++------------------+------------+--------------------------+----------------------------------+ +``` + +Find the query to terminate, make sure is id not already terminated, and pass +the query id to `TERMINATE QUERY` + +Or under some circumstances, you can choose to `TERMINATE ALL;`. + +### View + +The view is a projection of specified data from streams. For example, + +```sql +CREATE VIEW v_demo AS SELECT SUM(a) FROM demo GROUP BY a; +``` + +the above command will create a view that keeps track of the sum of `a` (which +have the same values, because of groupby) and have the same value from the point +this query is executed. + +The operations on view are very similar to those on streams. + +Except we can not use `SELECT ... EMIT CHANGES` performed on streams because a +view is static and there are no changes to emit. Instead, for example, we select +from the view with: + +```sql +SELECT * FROM v_demo WHERE a = 1; +``` + +This will print the sum of `a` when `a` = 1. + +If we want to create a view to record the sum of `a`s, we can: + +```sql +CREATE STREAM demo2 AS SELECT a, 1 AS b FROM demo; +CREATE VIEW v_demo2 AS SELECT SUM(a) FROM demo2 GROUP BY b; +SELECT * FROM demo2 WHERE b = 1; +``` diff --git a/docs/zh/v0.17.0/reference/config.md b/docs/zh/v0.17.0/reference/config.md new file mode 100644 index 0000000..6b901da --- /dev/null +++ b/docs/zh/v0.17.0/reference/config.md @@ -0,0 +1,91 @@ +# HStreamDB Configuration + +HStreamDB configuration file is located at path `/etc/hstream/config.yaml` in the docker image from v0.6.3. 
+or you can [download](https://raw.githubusercontent.com/hstreamdb/hstream/main/conf/hstream.yaml) the config file + +## Configuration Table + +### hserver + +| Name | Default Value | Description | +| ---- | ------------- | ----------- | +| id | | The identifier of a single HServer node, the value must be given and can be overwritten by cli option `--server-id | +| bind-address | "0.0.0.0" | The IP address or name of the host to which the HServer protocol handler is bound. The value can be overwritten by cli option `--bind-address` | +| advertised-address | "127.0.0.1" | Server listener address value, the value must be given and shouldn't be "0.0.0.0", if you intend to start a cluster or trying to connect to the server from a different network. This value can be overwritten by cli option `--address` | +| gossip-address | | The address used for server internal communication, if not specified, it uses the value of `advertised-address`. The value can be overwritten by cli option "--gossip-address" | +| port | 6570 | Server port value, the value must be given and can be overwritten by cli option `--port` +| internal-port | 6571 | Server port value for internal communications between server nodes, the value must be given and can be overwritten by cli option `--internal-port` | +| metastore-uri | | The server nodes in the same cluster shares an HMeta uniy, this is used for metadata storage and is essential for a server to start. Specify the HMeta protocal such as `zk://` or `rq://`, following with Comma separated host:port pairs, each corresponding to a hmeta server. e.g. zk://127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183. The value must be given and can be overwritten by cli option `--metastore-uri` | +| onnector-meta-store | | The metadata store for connectors (hstream io), the value must be given. | +| log-with-color | true | optional, The options used to control whether print logs with color by the server node, can be overwritten by cli option `--log-with-color` | +| log-level | info | optional, the setting control lof print level by the server node the default value can be overwritten by cli option `--log-level` | +| max-record-size | 1024*1024 (1MB) | The largest size of a record batch allowed by HStreamDB| +| enable-tls | false | TLS options: Enable tls, which requires tls-key-path and tls-cert-path options | +| tls-key-path | | TLS options: Key file path for tls, can be generated by openssl | +| tls-cert-path | | The signed certificate by CA for the key(tls-key-path) | +| advertise-listeners | | The advertised listeners for the server | + +### hstore + +The configuration for hstore is optional. When the values are not provided, hstreamdb will use the default values. 
+ +| Name | Default Value | Description | +| ---- | ------------- | ----------- | +|log-level| info | optional | + +Store admin section specifies the client config when connecting to the storage admin server +| Name | Default Value | Description | +| ---- | ------------- | ----------- | +| host | "127.0.0.1" | optional | +| port | 6440 | optional | +| protocol-id | binaryProtocolId | optional | +| conn-timeout | 5000 | optional | +| send-timeout | 5000 | optional | +| recv-timeout | 5000 | optional | + +### hstream-io + +| Name | Description | +| ---- | ----------- | +| tasks-path | the io tasks work directory | +| tasks-network | io tasks run as docker containers, so the tasks-network should be the network that can connect to HStreamDB and external systems | +| source-images | key-value map specify the images used by the source connectors | +| sink-images | key-value map specify the images used by the sink connectors | + +## Resource Attributes + +### Stream + +| Name | Description | +| ---- | ----------- | +| name | The name of the stream | +| shard count | The number of shards in the stream | +| replication factor | The number of the replicas | +| backlog retention | The retention time of the records in the stream in seconds| + +### Subscription + +| Name | Description | +| ---- | ----------- | +| id | The id of the subscription | +| stream name | The name of the stream to subscribe | +| ackTimeoutSeconds | Maximum time in the server will wait for an acknowledgement | +| maxUnackedRecords | The maximum amount of unacknowledged records allowed | + +## Command-Line Options + +For ease of use, we allow users to pass some options to override the configuration in the configuration file when starting the server with `hstream-server` : + +| Option | Meta var | Description | +| ------ | -------- | ----------- | +| config-path | PATH | hstream config path | +| bind-address | HOST | server host value | +| advertised-address | HOST | server listener address value | +| gossip-address | HOST | server gossip address value | +| port | INT | server port value | +| internal-port | INT | server channel port value for internal communication | +| server-id | UINT32 | ID of the hstream server node | +| store-admin-port | INT | store admin port value | +| metastore-uri | STR | Specify the HMeta protocal such as `zk://` or `rq://`, following with Comma separated host:port pairs, each corresponding to a hmeta server. e.g. zk://127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183. | +| log-level | | Server log level | +| log-with-color | FLAG | Server log with color | diff --git a/docs/zh/v0.17.0/reference/metrics.md b/docs/zh/v0.17.0/reference/metrics.md new file mode 100644 index 0000000..c2cd3c1 --- /dev/null +++ b/docs/zh/v0.17.0/reference/metrics.md @@ -0,0 +1,124 @@ +# HStream Metrics + +Note: For metrics with intervals, such as stats in categories like stream and +subscription, users can specify intervals (default intervals [1min, 5min, +10min]). The smaller the interval, the closer it gets to the rate in real-time. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Category | Metrics | Unit | Description |
+| -------- | ------- | ---- | ----------- |
+| stream_counter | append_total | # | Total number of append requests of a stream |
+| | append_failed | # | Total number of failed append request of a stream |
+| | append_in_bytes | # | Total payload bytes successfully written to the stream |
+| | append_in_records | # | Total payload records successfully written to the stream |
+| stream | append_in_bytes | B/s | Rate of bytes received and successfully written to the stream |
+| | append_in_records | #/s | Rate of records received and successfully written to the stream |
+| | append_in_requests | #/s (QPS) | Rate of append requests received per stream |
+| | append_failed_requests | #/s (QPS) | Rate of failed append requests received per stream |
+| subscription_counter | send_out_bytes | # | Number of bytes sent by the server per subscription |
+| | send_out_records | # | Number of records successfully sent by the server per subscription |
+| | send_out_records_failed | # | Number of records failed to send by the server per subscription |
+| | resend_records | # | Number of successfully resent records per subscription |
+| | resend_records_failed | # | Number of records failed to resend per subscription |
+| | received_acks | # | Number of acknowledgements received per subscription |
+| | request_messages | # | Number of streaming fetch requests received from clients per subscription |
+| | response_messages | # | Number of streaming send requests successfully sent to clients per subscription, including resends |
+| subscription | send_out_bytes | B/s | Rate of bytes sent by the server per subscription |
+| | acks / acknowledgements | #/s | Rate of acknowledgements received per subscription |
+| | request_messages | #/s | Rate of streaming fetch requests received from clients per subscription |
+| | response_messages | #/s | Rate of streaming send requests successfully sent to clients per subscription, including resends |
diff --git a/docs/zh/v0.17.0/reference/sql/_index.md b/docs/zh/v0.17.0/reference/sql/_index.md new file mode 100644 index 0000000..97f7f65 --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/_index.md @@ -0,0 +1,5 @@ +--- +order: ["sql-overview.md", "sql-quick-reference.md", "statements", "functions"] +--- + +HStream SQL diff --git a/docs/zh/v0.17.0/reference/sql/appendix.md b/docs/zh/v0.17.0/reference/sql/appendix.md new file mode 100644 index 0000000..e1f2eaa --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/appendix.md @@ -0,0 +1,210 @@ +Appendix +======== + +## Data Types + +| type | examples | +|-----------|---------------------------------------| +| NULL | NULL | +| INTEGER | 1, -1, 1234567 | +| FLOAT | 2.3, -3.56, 232.4 | +| NUMERIC | 1, 2.3 | +| BOOLEAN | TRUE, FALSE | +| BYTEA | '0xaa0xbb' :: BYTEA | +| STRING | "deadbeef" | +| DATE | DATE '2020-06-10' | +| TIME | TIME '11:18:30' | +| TIMESTAMP | TIMESTAMP '2022-01-01T12:00:00+08:00' | +| INTERVAL | INTERVAL 10 SECOND | +| JSON | '{"a": 1, "b": 2}' :: JSONB | +| ARRAY | [1, 2, 3] | + +## Keywords + +| keyword | description | +|-------------------|------------------------------------------------------------------------------------------| +| `ABS` | absolute value | +| `ACOS` | arccosine | +| `ACOSH` | inverse hyperbolic cosine | +| `AND` | logical and operator | +| `ARRAY_CONTAIN` | given an array, checks if a search value is contained in the array | +| `ARRAY_DISTINCT` | returns an array of all the distinct values | +| `ARRAY_EXCEPT` | `ARRAY_DISTINCT` except for those also present in the second array | +| `ARRAY_INTERSECT` | returns an array of all the distinct elements from the intersection of both input arrays | +| `ARRAY_JOIN` | creates a flat string representation of all elements contained in the given array | +| `ARRAY_LENGTH` | return the length of the given array | +| `ARRAY_MAX` | returns the maximum value from the given array of primitive elements | +| `ARRAY_MIN` | returns the minimum value from the given array of primitive elements | +| `ARRAY_REMOVE` | removes all elements from the input array equal to the second argument | +| `ARRAY_SORT` | sort the given array | +| `ARRAY_UNION` | returns an array of all the distinct elements from the union of both input arrays | +| `AS` | stream or field name alias | +| `ASIN` | arcsine | +| `ASINH` | inverse hyperbolic sine | +| `ATAN` | arctangent | +| `ATANH` | inverse hyperbolic tangent | +| `AVG` | average function | +| `BETWEEN` | range operator, used with `AND` | +| `BY` | do something by certain conditions, used with `GROUP` or `ORDER` | +| `CEIL` | rounds a number UPWARDS to the nearest integer | +| `COS` | cosine | +| `COSH` | hyperbolic cosine | +| `COUNT` | count function | +| `CREATE` | create a stream / connector | +| `DATE` | prefix of date constant | +| `DAY` | interval unit | +| `DROP` | drop a stream | +| `EXP` | exponent | +| `FLOOR` | rounds a number DOWNWARDS to the nearest integer | +| `FROM` | specify where to select data from | +| `GROUP` | group values by certain conditions, used with `BY` | +| `HAVING` | filter select values by a condition | +| `HOPPING` | hopping window | +| `IFNULL` | if the first argument is `NULL` returns the second, else the first | +| `INSERT` | insert data into a stream, used with `INTO` | +| `INTERVAL` | prefix of interval constant | +| `INTO` | insert data into a stream, used with `INSERT` | +| `IS_ARRAY` | to determine if the given value is an array of values | +| `IS_BOOL` | to determine if the given value is a boolean | +| 
`IS_DATE` | to determine if the given value is a date value | +| `IS_FLOAT` | to determine if the given value is a float | +| `IS_INT` | to determine if the given value is an integer | +| `IS_NUM` | to determine if the given value is a number | +| `IS_STR` | to determine if the given value is a string | +| `IS_TIME` | to determine if the given value is a time value | +| `JOIN` | for joining two streams | +| `LEFT` | joining type, used with `JOIN` | +| `LEFT_TRIM` | trim spaces from the left end of a string | +| `LOG` | logarithm with base e | +| `LOG10` | logarithm with base 10 | +| `LOG2` | logarithm with base 2 | +| `MAX` | maximum function | +| `MIN` | minimum function | +| `MINUTE` | interval unit | +| `MONTH` | interval unit | +| `NOT` | logical not operator | +| `NULLIF` | returns `NULL` if the first argument is equal to the second, otherwise the first | +| `OR` | logical or operator | +| `ORDER` | sort values by certain conditions, used with `BY` | +| `OUTER` | joining type, used with `JOIN` | +| `REVERSE` | reverse a string | +| `RIGHT_TRIM` | trim spaces from the right end of a string | +| `ROUND` | rounds a number to the nearest integer | +| `SECOND` | interval unit | +| `SELECT` | query a stream | +| `SHOW` | show something to stdout | +| `SIGN` | return the sign of a numeric value as an INTEGER | +| `SIN` | sine | +| `SINH` | hyperbolic sine | +| `SLIDING` | sliding window | +| `SQRT` | square root | +| `STREAM` | specify a stream, used with `CREATE` | +| `STRLEN` | get the length of a string | +| `SUM` | sum function | +| `TAN` | tangent | +| `TANH` | hyperbolic tangent | +| `TIME` | prefix of the time constant | +| `TO_LOWER` | convert a string to lowercase | +| `TO_STR` | convert a value to string | +| `TO_UPPER` | convert a string to uppercase | +| `TRIM` | trim spaces from both ends of a string | +| `TUMBLING` | tumbling window | +| `VALUES` | specify inserted data, used with `INSERT INTO` | +| `WEEK` | interval unit | +| `WHERE` | filter selected values by a condition | +| `WITH` | specify properties when creating a stream | +| `WITHIN` | specify time window when joining two streams | +| `YEAR` | interval unit | + +## Operators + +| operator | description | +|----------|------------------------------| +| `=` | equal to | +| `<>` | not equal to | +| `<` | less than | +| `>` | greater than | +| `<=` | less than or equal to | +| `>=` | greater than or equal to | +| `+` | addition | +| `-` | subtraction | +| `*` | multiplication | +| `.` | access field of a stream | +| `[]` | access item of an array | +| `AND` | logical and operator | +| `OR` | logical or operator | +| `::` | type casting | +| `->` | JSON access(as JSON) by key | +| `->>` | JSON access(as text) by key | +| `#>` | JSON access(as JSON) by path | +| `#>>` | JSON access(as text) by path | + +## Scalar Functions + +| function | description | +|-------------------|------------------------------------------------------------------------------------------| +| `ABS` | absolute value | +| `ACOS` | arccosine | +| `ACOSH` | inverse hyperbolic cosine | +| `ARRAY_CONTAIN` | given an array, checks if a search value is contained in the array | +| `ARRAY_DISTINCT` | returns an array of all the distinct values | +| `ARRAY_EXCEPT` | `ARRAY_DISTINCT` except for those also present in the second array | +| `ARRAY_INTERSECT` | returns an array of all the distinct elements from the intersection of both input arrays | +| `ARRAY_JOIN` | creates a flat string representation of all elements contained in the given array | +| 
`ARRAY_LENGTH` | return the length of the given array | +| `ARRAY_MAX` | returns the maximum value from the given array of primitive elements | +| `ARRAY_MIN` | returns the minimum value from the given array of primitive elements | +| `ARRAY_REMOVE` | removes all elements from the input array equal to the second argument | +| `ARRAY_SORT` | sort the given array | +| `ARRAY_UNION` | returns an array of all the distinct elements from the union of both input arrays | +| `ASIN` | arcsine | +| `ASINH` | inverse hyperbolic sine | +| `ATAN` | arctangent | +| `ATANH` | inverse hyperbolic tangent | +| `CEIL` | rounds a number UPWARDS to the nearest integer | +| `COS` | cosine | +| `COSH` | hyperbolic cosine | +| `EXP` | exponent | +| `FLOOR` | rounds a number DOWNWARDS to the nearest integer | +| `IFNULL` | if the first argument is `NULL` returns the second, else the first | +| `NULLIF` | returns `NULL` if the first argument is equal to the second, otherwise the first | +| `IS_ARRAY` | to determine if the given value is an array of values | +| `IS_BOOL` | to determine if the given value is a boolean | +| `IS_DATE` | to determine if the given value is a date value | +| `IS_FLOAT` | to determine if the given value is a float | +| `IS_INT` | to determine if the given value is an integer | +| `IS_NUM` | to determine if the given value is a number | +| `IS_STR` | to determine if the given value is a string | +| `IS_TIME` | to determine if the given value is a time value | +| `LEFT_TRIM` | trim spaces from the left end of a string | +| `LOG` | logarithm with base e | +| `LOG10` | logarithm with base 10 | +| `LOG2` | logarithm with base 2 | +| `REVERSE` | reverse a string | +| `RIGHT_TRIM` | trim spaces from the right end of a string | +| `ROUND` | rounds a number to the nearest integer | +| `SIGN` | return the sign of a numeric value as an INTEGER | +| `SIN` | sine | +| `SINH` | hyperbolic sine | +| `SQRT` | square root | +| `STRLEN` | get the length of a string | +| `TAN` | tangent | +| `TANH` | hyperbolic tangent | +| `TO_LOWER` | convert a string to lowercase | +| `TO_STR` | convert a value to string | +| `TO_UPPER` | convert a string to uppercase | +| `TOPK` | topk aggregate function | +| `TOPKDISTINCT` | topkdistinct aggregate function | +| `TRIM` | trim spaces from both ends of a string | + +## Aggregate Functions + +| function | description | +|----------------|--------------------------------| +| `AVG` | average | +| `COUNT` | count | +| `MAX` | maximum | +| `MIN` | minimum | +| `SUM` | sum | +| `TOPK` | top k values as array | +| `TOPKDISTINCT` | distinct top k values as array | diff --git a/docs/zh/v0.17.0/reference/sql/functions/_index.md b/docs/zh/v0.17.0/reference/sql/functions/_index.md new file mode 100644 index 0000000..2817bb3 --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/functions/_index.md @@ -0,0 +1,5 @@ +--- +order: ["aggregation.md", "scalar.md"] +--- + +Functions diff --git a/docs/zh/v0.17.0/reference/sql/functions/aggregation.md b/docs/zh/v0.17.0/reference/sql/functions/aggregation.md new file mode 100644 index 0000000..2479c90 --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/functions/aggregation.md @@ -0,0 +1,37 @@ +Aggregate Functions +=================== + +Aggregate functions perform a calculation on a set of values and return a single value. + +```sql +COUNT(col) +COUNT(*) +``` + +Return the number of rows. +When `col` is specified, the count returned will be the number of rows. +When `*` is specified, the count returned will be the total number of rows. 
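+For example, `COUNT` is typically used together with `GROUP BY` in a materialized view (a sketch, assuming a `weather` stream with a `cityId` field):
+
+```sql
+CREATE VIEW city_record_counts AS SELECT cityId, COUNT(*) FROM weather GROUP BY cityId;
+```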
+ +```sql +AVG(col) +``` + +Return the average value of a given column. + +```sql +SUM(col) +``` + +Return the sum value of a given column. + +```sql +MAX(col) +``` + +Return the max value of a given column. + +```sql +MIN(col) +``` + +Return the min value of a given column. diff --git a/docs/zh/v0.17.0/reference/sql/functions/scalar.md b/docs/zh/v0.17.0/reference/sql/functions/scalar.md new file mode 100644 index 0000000..10ee4c2 --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/functions/scalar.md @@ -0,0 +1,327 @@ +Scalar Functions +================ + +Scalar functions operate on one or more values and then return a single value. They can be used wherever a value expression is valid. + +Scalar functions are divided into serval kinds. + +### Type Casting Functions + +Our SQL supports explicit type casting in the form of `CAST(expr AS type)` or `expr :: type`. Target type can be one of the follows: + +- `INTEGER` +- `FLOAT` +- `NUMERIC` +- `BOOLEAN` +- `BYTEA` +- `STRING` +- `DATE` +- `TIME` +- `TIMESTAMP` +- `INTERVAL` +- `JSONB` +- `[]` (array) + +### JSON Functions + +To use JSON data conveniently, we support the following functions: + +- ` -> `, which gets the corresponded field and return as JSON format. +- ` ->> `, which gets the corresponded field and return as text format. +- ` #> `, which gets the corresponded field in the specified path and return as JSON format. +- ` #>> `, which gets the corresponded field in the specified path and return as text format. + +### Array Accessing Functions + +To access fields of arrays, we support the following functions: + +- ` []`, ` [:]`, ` [:]` and ` [:]` + +### Trigonometric Functions + +All trigonometric functions perform a calculation, operate on a single numeric value and then return a single numeric value. + +For values outside the domain, `NaN` is returned. + +```sql +SIN(num_expr) +SINH(num_expr) +ASIN(num_expr) +ASINH(num_expr) +COS(num_expr) +COSH(num_expr) +ACOS(num_expr) +ACOSH(num_expr) +TAN(num_expr) +TANH(num_expr) +ATAN(num_expr) +ATANH(num_expr) +``` + +### Arithmetic Functions + +The following functions perform a calculation, operate on a single numeric value and then return a single numeric value. + +```sql +ABS(num_expr) +``` + +Absolute value. + +```sql +CEIL(num_expr) +``` +The function application `CEIL(n)` returns the least integer not less than `n`. + +```sql +FLOOR(num_expr) +``` + +The function application `FLOOR(n)` returns the greatest integer not greater than `n`. + +```sql +ROUND(num_expr) +``` +The function application `ROUND(n)` returns the nearest integer to `n` the even integer if `n` is equidistant between two integers. + +```sql +SQRT(num_expr) +``` + +The square root of a numeric value. + +```sql +LOG(num_expr) +LOG2(num_expr) +LOG10(num_expr) +EXP(num_expr) +``` + +```sql +SIGN(num_expr) +``` +The function application `SIGN(n)` returns the sign of a numeric value as an Integer. + +- returns `-1` if `n` is negative +- returns `0` if `n` is exact zero +- returns `1` if `n` is positive +- returns `null` if `n` is exact `null` + +### Predicate Functions + +Function applications of the form `IS_A(x)` where `A` is the name of a type returns `TRUE` if the argument `x` is of type `A`, otherwise `FALSE`. + +```sql +IS_INT(val_expr) +IS_FLOAT(val_expr) +IS_NUM(val_expr) +IS_BOOL(val_expr) +IS_STR(val_expr) +IS_ARRAY(val_expr) +IS_DATE(val_expr) +IS_TIME(val_expr) +``` + +### String Functions + +```sql +TO_STR(val_expr) +``` + +Convert a value expression to a readable string. 
+ +```sql +TO_LOWER(str) +``` +Convert a string to lower case, using simple case conversion. + +```sql +TO_UPPER(str) +``` + +Convert a string to upper case, using simple case conversion. + +```sql +TRIM(str) +``` + +Remove leading and trailing white space from a string. + +```sql +LEFT_TRIM(str) +``` + +Remove leading white space from a string. + +```sql +RIGHT_TRIM(str) +``` + +Remove trailing white space from a string. + +```sql +REVERSE(str) +``` + +Reverse the characters of a string. + +```sql +STRLEN(str) +``` + +Returns the number of characters in a string. + +```sql +TAKE(num_expr, str) +``` + +The function application `TAKE(n, s)` returns the prefix of the string of length `n`. + +```sql +TAKEEND(num_expr, str) +``` + +The function application `TAKEEND(n, s)` returns the suffix remaining after taking `n` characters from the end of the string. + +```sql +DROP(num_expr, str) +``` + +The function application `DROP(n, s)` returns the suffix of the string after the first `n` characters, or the empty string if n is greater than the length of the string. + +```sql +DROPEND(num_expr, str) +``` + +The function application `DROPEND(n, s)` returns the prefix remaining after dropping `n` characters from the end of the string. + +### Null Functions + +```sql +IFNULL(val_expr, val_expr) +``` + +The function application `IFNULL(x, y)` returns `y` if `x` is `NULL`, otherwise `x`. + +When the argument type is a complex type, for example, `ARRAY`, the contents of the complex type are not inspected. + +```sql +NULLIF(val_expr, val_expr) +``` + +The function application `NULLIF(x, y)` returns `NULL` if `x` is equal to `y`, otherwise `x`. + +When the argument type is a complex type, for example, `ARRAY`, the contents of the complex type are not inspected. + +### Time and Date Functions + +#### Time Format + +Formats are analogous to [strftime](https://man7.org/linux/man-pages/man3/strftime.3.html). + +| Format Name | Raw Format String | +| ----------------- | --------------------------- | +| simpleDateFormat | "%Y-%m-%d %H:%M:%S" | +| iso8061DateFormat | "%Y-%m-%dT%H:%M:%S%z" | +| webDateFormat | "%a, %d %b %Y %H:%M:%S GMT" | +| mailDateFormat | "%a, %d %b %Y %H:%M:%S %z" | + +```sql +DATETOSTRING(val_expr, str) +``` + +Formatting seconds since 1970-01-01 00:00:00 UTC to string in GMT with the second string argument as the given format name. + +```sql +STRINGTODATE(str, str) +``` + +Formatting string to seconds since 1970-01-01 00:00:00 UTC in GMT with the second string argument as the given format name. + +### Array Functions + +```sql +ARRAY_CONTAINS(arr_expr, val_expr) +``` + +Given an array, checks if the search value is contained in the array (of the same type). + +```sql +ARRAY_DISTINCT(arr_expr) +``` + +Returns an array of all the distinct values, including `NULL` if present, from the input array. The output array elements are in order of their first occurrence in the input. + +Returns `NULL` if the argument is `NULL`. + +```sql +ARRAY_EXCEPT(arr_expr, arr_expr) +``` + +Returns an array of all the distinct elements from an array, except for those also present in a second array. The order of entries in the first array is preserved but duplicates are removed. + +Returns `NULL` if either input is `NULL`. + +```sql +ARRAY_INTERSECT(arr_expr, arr_expr) +``` + +Returns an array of all the distinct elements from the intersection of both input arrays. If the first list contains duplicates, so will the result. 
If the element is found in both the first and the second list, the element from the first list will be used. + +Returns `NULL` if either input is `NULL`. + +```sql +ARRAY_UNION(arr_expr, arr_expr) +``` + +Returns the array union of the two arrays. Duplicates, and elements of the first list, are removed from the second list, but if the first list contains duplicates, so will the result. + +Returns `NULL` if either input is `NULL`. + +```sql +ARRAY_JOIN(arr_expr) +ARRAY_JOIN(arr_expr, str) +``` + +Creates a flat string representation of all the primitive elements contained in the given array. The elements in the resulting string are separated by the chosen delimiter, which is an optional parameter that falls back to a comma `,`. + +```sql +ARRAY_LENGTH(arr_expr) +``` + +Returns the length of a finite list. + +Returns `NULL` if the argument is `NULL`. + +```sql +ARRAY_MAX(arr_expr) +``` + +Returns the maximum value from within a given array of elements. + +Returns `NULL` if the argument is `NULL`. + +```sql +ARRAY_MIN(arr_expr) +``` + +Returns the minimum value from within a given array of elements. + +Returns `NULL` if the argument is `NULL`. + +```sql +ARRAY_REMOVE(arr_expr, val_expr) +``` + +Removes all elements from the input array equal to the second argument. + +Returns `NULL` if the first argument is `NULL`. + + +```sql +ARRAY_SORT(arr_expr) +``` + +Sort an array. Elements are arranged from lowest to highest, keeping duplicates in the order they appeared in the input. + +Returns `NULL` if the first argument is `NULL`. diff --git a/docs/zh/v0.17.0/reference/sql/sql-overview.md b/docs/zh/v0.17.0/reference/sql/sql-overview.md new file mode 100644 index 0000000..87d069d --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/sql-overview.md @@ -0,0 +1,196 @@ +# SQL Overview + +SQL is a domain-specific language used in programming and designed for managing +data held in a database management system. A standard for the specification of +SQL is maintained by the American National Standards Institute (ANSI). Also, +there are many variants and extensions to SQL to express more specific programs. + +The +[SQL grammar of HStreamDB](https://github.com/hstreamdb/hstream/blob/main/hstream-sql/etc/SQL-v1.cf) +is based on a subset of standard SQL with some extensions to support stream +operations. + +## Syntax + +SQL inputs are made up of a series of statements. Each statement is made up of a +series of tokens and ends in a semicolon (`;`). + +A token can be a keyword argument, an identifier, a literal, an operator, or a +special character. The details of the rules can be found in the +[BNFC grammar file](https://github.com/hstreamdb/hstream/blob/main/hstream-sql/etc/SQL-v1.cf). +Normally, tokens are separated by whitespace. + +The following examples are syntactically valid SQL statements: + +```sql +SELECT * FROM my_stream; + +CREATE STREAM abnormal_weather AS SELECT * FROM weather WHERE temperature > 30 AND humidity > 80 WITH (REPLICATE = 3); + +INSERT INTO weather (cityId, temperature, humidity) VALUES (11254469, 12, 65); +``` + +## Keywords + +Some tokens such as `SELECT`, `INSERT` and `WHERE` are reserved _keywords_, +which have specific meanings in SQL syntax. Keywords are case insensitive, which +means that `SELECT` and `select` are equivalent. A keyword can not be used as an +identifier. + +For a complete list of keywords, see the [appendix](appendix.md). + +## Identifiers + +Identifiers are tokens that represent user-defined objects such as streams, +fields, and other ones. 
For example, `my_stream` can be used as a stream name, +and `temperature` can represent a field in the stream. + +By now, identifiers only support C-style naming rules. It means that an +identifier name can only have letters (both uppercase and lowercase letters), +digits, and the underscore. Besides, the first letter of an identifier should be +either a letter or an underscore. + +By now, identifiers are case-sensitive, which means that `my_stream` and +`MY_STREAM` are different identifiers. + +## Expressions + +An expression is a value that can exist almost everywhere in a SQL query. It can +be both a constant whose value is known before execution (such as an integer or +a string literal) and a variable whose value is known during execution (such as +a field of a stream). + +### Integer + +Integers are in the form of `digits`, where `digits` are one or more +single-digit integers (0 through 9). Negatives such as `-1` are also supported. +**Note that scientific notation is not supported yet**. + +### Float + +Floats are in the form of `.`. Negative floats such as `-11.514` +are supported. Note that + +- **scientific notation is not supported yet**. +- **Forms such as `1.` and `.99` are not supported yet**. + +### Boolean + +A boolean value is either `TRUE` or `FALSE`. + +### String + +Strings are arbitrary character series surrounded by single quotes (`'`), such +as `'anyhow'`. + +### Date + +Dates represent a date exact to a day in the form of +`DATE '--'`, where ``, `` and `` are all +integer constants. Note that the leading `DATE` should not be omitted. + +Example: `DATE '2021-01-02'` + +### Time + +Time constants represent time exact to a second or a microsecond in the form of +`TIME '--'` or +`TIME '--.'`, where ``, ``, +`` and `` are all integer constants. Note that the leading +`TIME` should not be omitted. + +Example: `TIME '10:41:03'`, `TIME '01:02:03.456'` + +### Timestamp + +Timestamp constants represent values that contain both date and time parts. It +can also contain an optional timezone part for convenience. A timestamp is in +the form of `TIMESTAMP ''`. For more information, please refer to +[ISO 8601](https://en.wikipedia.org/wiki/ISO_8601). + +Example: `TIMESTAMP '2023-06-30T12:30:45+02:00'` + +### Interval + +Intervals represent a time section in the form of +`INTERVAL ` or. Note that the leading `INTERVAL` should +not be omitted. + +Example: `INTERVAL 5 SECOND`(5 seconds) + +### Array + +Arrays represent a list of values, where each one of them is a valid expression. +It is in the form of `[, ...]`. + +Example: `["aa", "bb", "cc"]`, `[1, 2]` + +### Column(Field) + +A column(or a field) represents a part of a value in a stream or materialized +view. It is similar to column of a table in traditional relational databases. A +column is in the form of `` or +`.`. When a column name is ambiguous(for +example it has the same name as a function application) the double quote `` " `` +can be used. + +Example: `temperature`, `stream_test.humidity`, `` "SUM(a)" `` + +### Subquery + +A subquery is a SQL clause start with `SELECT`, see +[here](./statements/select-stream.md). + +### Function or Operator Application + +An expression can also be formed by other expressions by applying functions or +operators on them. The details of function and operator can be found in the +following parts. + +Example: `SUM(stream_test.cnt)`, (`raw_stream::jsonb)->>'value'` + +## Operators and Functions + +Functions are special keywords that mean some computation, such as `SUM` and +`MIN`. 
And operators are infix functions composed of special characters, such as +`>=` and `<>`. + +For a complete list of functions and operators, see the [appendix](appendix.md). + +## Special Characters + +There are some special characters in the SQL syntax with particular meanings: + +- Parentheses (`()`) are used outside an expression for controlling the order of + evaluation or specifying a function application. +- Brackets (`[]`) are used with maps and arrays for accessing their + substructures, such as `some_map[temp]` and `some_array[1]`. **Note that it is + not supported yet**. +- Commas (`,`) are used for delineating a list of objects. +- The semicolons (`;`) represent the end of a SQL statement. +- The asterisk (`*`) represents "all fields", such as + `SELECT * FROM my_stream;`. +- The period (`.`) is used for accessing a field in a stream, such as + `my_stream.humidity`. +- The double quote (`` " ``) represents an "raw column name" in the `SELECT` + clause to distinguish a column name with functions from actual function + applications. For example, `SELECT SUM(a) FROM s;` means applying `SUM` + function on the column `a` from stream `s`. However if the stream `s` actually + contains a column called `SUM(a)` and you want to take it out, you can use + back quotes like `` SELECT "SUM(a)" FROM s; ``. + +## Comments + +A single-line comment begins with `//`: + +``` +// This is a comment +``` + +Also, C-style multi-line comments are supported: + +``` +/* This is another + comment +*/ +``` diff --git a/docs/zh/v0.17.0/reference/sql/sql-quick-reference.md b/docs/zh/v0.17.0/reference/sql/sql-quick-reference.md new file mode 100644 index 0000000..fd60f1b --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/sql-quick-reference.md @@ -0,0 +1,66 @@ +SQL quick reference +=================== + +## CREATE STREAM + +Create a new HStreamDB stream with the stream name given. +An exception will be thrown if the stream is already created. +See [CREATE STREAM](statements/create-stream.md). + +```sql +CREATE STREAM stream_name [AS select_query] [WITH (stream_option [, ...])]; +``` + +## CREATE VIEW + +Create a new view with the view name given. A view is a physical object like a stream and it is updated with time. +An exception will be thrown if the view is already created. The name of a view can either be the same as a stream. +See [CREATE VIEW](statements/create-view.md). + +```sql +CREATE VIEW view_name AS select_query; +``` + +## SELECT + +Get records from a materialized view or a stream. Note that `SELECT` from streams can only used as a part of `CREATE STREAM` or `CREATE VIEW`. When you want to get results in a command-line session, create a materialized view first and then `SELECT` from it. +See [SELECT (Stream)](statements/select-stream.md). + +```sql +SELECT <* | expression [ AS field_alias ] [, ...]> + FROM stream_ref + [ WHERE expression ] + [ GROUP BY field_name [, ...] ] + [ HAVING expression ]; +``` + +## INSERT + +Insert data into the specified stream. It can be a data record, a JSON value or binary data. +See [INSERT](statements/insert.md). + +```sql +INSERT INTO stream_name (field_name [, ...]) VALUES (field_value [, ...]); +INSERT INTO stream_name VALUES 'json_value'; +INSERT INTO stream_name VALUES "binary_value"; +``` + +## DROP + +Delete a given stream or view. There can be an optional `IF EXISTS` config to only delete the given category if it exists. 
+ +```sql +DROP STREAM stream_name [IF EXISTS]; +DROP VIEW view_name [IF EXISTS]; +``` + +## SHOW + +Show the information of all streams, queries, views or connectors. + +```sql +SHOW STREAMS; +SHOW QUERIES; +SHOW VIEWS; +SHOW CONNECTORS; +``` diff --git a/docs/zh/v0.17.0/reference/sql/statements/_index.md b/docs/zh/v0.17.0/reference/sql/statements/_index.md new file mode 100644 index 0000000..1f96acc --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/statements/_index.md @@ -0,0 +1,18 @@ +--- +order: + [ + 'create-stream.md', + 'create-view.md', + 'create-connector.md', + 'drop-stream.md', + 'drop-view.md', + 'drop-connector.md', + 'select-stream.md', + 'insert.md', + 'show.md', + 'pause.md', + 'resume.md', + ] +--- + +Statements diff --git a/docs/zh/v0.17.0/reference/sql/statements/create-connector.md b/docs/zh/v0.17.0/reference/sql/statements/create-connector.md new file mode 100644 index 0000000..164790f --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/statements/create-connector.md @@ -0,0 +1,37 @@ +CREATE CONNECTOR +================ + +Create a new connector for fetching data from or writing data to an external system. A connector can be either a source or a sink one. + + +## Synopsis + +Create source connector: + +```sql +CREATE SOURCE CONNECTOR connector_name FROM source_name WITH (connector_option [, ...]); +``` + +Create sink connector: + +```sql +CREATE SINK CONNECTOR connector_name TO sink_name WITH (connector_option [, ...]); +``` + +## Notes + +- `connector_name` is a valid identifier. +- `source_name` is a valid identifier(`mysql`, `postgresql` etc.). +- There is are some connector options in the `WITH` clause separated by commas. + +check [Connectors](https://hstream.io/docs/en/latest/io/connectors.html) to find the connectors and their configuration options . + +## Examples + +```sql +create source connector source01 from mysql with ("host" = "mysql-s1", "port" = 3306, "user" = "root", "password" = "password", "database" = "d1", "table" = "person", "stream" = "stream01"); +``` + +```sql +create sink connector sink01 to postgresql with ("host" = "pg-s1", "port" = 5432, "user" = "postgres", "password" = "postgres", "database" = "d1", "table" = "person", "stream" = "stream01"); +``` diff --git a/docs/zh/v0.17.0/reference/sql/statements/create-stream.md b/docs/zh/v0.17.0/reference/sql/statements/create-stream.md new file mode 100644 index 0000000..8aea808 --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/statements/create-stream.md @@ -0,0 +1,25 @@ +CREATE STREAM +============= + +Create a new hstream stream with the given name. An exception will be thrown if a stream with the same name already exists. + +## Synopsis + +```sql +CREATE STREAM stream_name [ AS select_query ] WITH ([ REPLICATE = INT, DURATION = INTERVAL ]); +``` + +## Notes + +- `stream_name` is a valid identifier. +- `select_query` is an optional `SELECT` (Stream) query. For more information, see `SELECT` section. When `` is specified, the created stream will be filled with records from the `SELECT` query continuously. Otherwise, the stream will only be created and kept empty. +- `WITH` clause contains some stream options. Only `REPLICATE` and `DURATION` options are supported now, which represents the replication factor and the retention time of the stream. If it is not specified, they will be set to default value. +- Sources in `select_query` can be both stream(s) and materialized view(s). 
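+For instance, stream options are supplied through the `WITH` clause; a minimal sketch (the `DURATION` interval literal here is assumed from the synopsis above):
+
+```sql
+CREATE STREAM foo WITH (REPLICATE = 3, DURATION = INTERVAL 1 DAY);
+```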
+ +## Examples + +```sql +CREATE STREAM foo; + +CREATE STREAM abnormal_weather AS SELECT * FROM weather WHERE temperature > 30 AND humidity > 80; +``` diff --git a/docs/zh/v0.17.0/reference/sql/statements/create-view.md b/docs/zh/v0.17.0/reference/sql/statements/create-view.md new file mode 100644 index 0000000..681a15a --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/statements/create-view.md @@ -0,0 +1,37 @@ +CREATE VIEW +=========== + +Create a new hstream view with the given name. An exception will be thrown if a view or stream with the same name already exists. + +A view is **NOT** just an alias but physically maintained in the memory and is updated incrementally. Thus queries on a view are really fast and do not require extra resources. + +## Synopsis + +```sql +CREATE VIEW view_name AS select_query; +``` +## Notes +- `view_name` is a valid identifier. +- `select_query` is a valid `SELECT` query. For more information, see `SELECT` section. There is no extra restrictions on `select_query` but we recommend using at least one aggregate function and a `GROUP BY` clause. Otherwise, the query may be a little weird and consumes more resources. See the following examples: + +``` +// CREATE VIEW v1 AS SELECT id, SUM(sales) FROM s GROUP BY id; +// what the view contains at time +// [t1] [t2] [t3] +// {"id":1, "SUM(sales)": 10} -> {"id":1, "SUM(sales)": 10} -> {"id":1, "SUM(sales)": 30} +// {"id":2, "SUM(sales)": 8} {"id":2, "SUM(sales)": 15} + +// CREATE VIEW AS SELECT id, sales FROM s; +// what the view contains at time +// [t1] [t2] [t3] +// {"id":1, "sales": 10} -> {"id":1, "sales": 10} -> {"id":1, "sales": 10} +// {"id":2, "sales": 8} {"id":1, "sales": 20} +// {"id":2, "sales": 8} +// {"id":2, "sales": 7} +``` + +## Examples + +```sql +CREATE VIEW foo AS SELECT a, SUM(a), COUNT(*) FROM s1 GROUP BY b; +``` diff --git a/docs/zh/v0.17.0/reference/sql/statements/drop-connector.md b/docs/zh/v0.17.0/reference/sql/statements/drop-connector.md new file mode 100644 index 0000000..0038bf7 --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/statements/drop-connector.md @@ -0,0 +1,20 @@ +DROP CONNECTOR +=========== + +Drop a connector with the given name. + +## Synopsis + +```sql +DROP CONNECTOR connector_name; +``` + +## Notes + +- `connector_name` is a valid identifier. + +## Examples + +```sql +DROP CONNECTOR foo; +``` diff --git a/docs/zh/v0.17.0/reference/sql/statements/drop-stream.md b/docs/zh/v0.17.0/reference/sql/statements/drop-stream.md new file mode 100644 index 0000000..a25b397 --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/statements/drop-stream.md @@ -0,0 +1,23 @@ +DROP STREAM +=========== + +Drop a stream with the given name. If `IF EXISTS` is present, the statement won't fail if the stream does not exist. + +## Synopsis + +```sql +DROP STREAM stream_name [ IF EXISTS ]; +``` + +## Notes + +- `stream_name` is a valid identifier. +- `IF EXISTS` annotation is optional. + +## Examples + +```sql +DROP STREAM foo; + +DROP STREAM foo IF EXISTS; +``` diff --git a/docs/zh/v0.17.0/reference/sql/statements/drop-view.md b/docs/zh/v0.17.0/reference/sql/statements/drop-view.md new file mode 100644 index 0000000..ed81a93 --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/statements/drop-view.md @@ -0,0 +1,23 @@ +DROP VIEW +=========== + +Drop a view with the given name. If `IF EXISTS` is present, the statement won't fail if the view does not exist. + +## Synopsis + +```sql +DROP VIEW view_name [ IF EXISTS ]; +``` + +## Notes + +- `view_name` is a valid identifier. 
+- `IF EXISTS` annotation is optional. + +## Examples + +```sql +DROP VIEW foo; + +DROP VIEW foo IF EXISTS; +``` diff --git a/docs/zh/v0.17.0/reference/sql/statements/insert.md b/docs/zh/v0.17.0/reference/sql/statements/insert.md new file mode 100644 index 0000000..bcb48b5 --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/statements/insert.md @@ -0,0 +1,30 @@ +INSERT +====== + +Insert a record into specified stream. + +## Synopsis + +```sql +INSERT INTO stream_name (field_name [, ...]) VALUES (field_value [, ...]); +INSERT INTO stream_name VALUES CAST ('json_value' AS JSONB); +INSERT INTO stream_name VALUES CAST ('binary_value' AS BYTEA); +INSERT INTO stream_name VALUES 'json_value' :: JSONB; +INSERT INTO stream_name VALUES 'binary_value' :: BYTEA; +``` + +## Notes + +- `field_value` represents the value of corresponding field, which is a [constant](../sql-overview.md#literals-constants). The correspondence between field type and inserted value is maintained by users themselves. +- `json_value` should be a valid JSON expression. And when inserting a JSON value, remember to put `'`s around it. +- `binary_value` can be any value in the form of a string. It will not be processed by HStreamDB and can only be fetched by certain client API. Remember to put `'`s around it. + +## Examples + +```sql +INSERT INTO weather (cityId, temperature, humidity) VALUES (11254469, 12, 65); +INSERT INTO foo VALUES CAST ('{"a": 1, "b": "abc"}' AS JSONB); +INSERT INTO foo VALUES '{"a": 1, "b": "abc"}' :: JSONB; +INSERT INTO bar VALUES CAST ('some binary value \x01\x02\x03' AS BYTEA); +INSERT INTO bar VALUES 'some binary value \x01\x02\x03' :: BYTEA; +``` diff --git a/docs/zh/v0.17.0/reference/sql/statements/pause.md b/docs/zh/v0.17.0/reference/sql/statements/pause.md new file mode 100644 index 0000000..f529522 --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/statements/pause.md @@ -0,0 +1,22 @@ +PAUSE +================ + +Pause a running task(e.g. connector). + +## Synopsis + +Pause a task: + +```sql +PAUSE name; +``` + +## Notes + +- `name` is a valid identifier. + +## Examples + +```sql +PAUSE CONNECTOR source01; +``` diff --git a/docs/zh/v0.17.0/reference/sql/statements/resume.md b/docs/zh/v0.17.0/reference/sql/statements/resume.md new file mode 100644 index 0000000..3acf9a4 --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/statements/resume.md @@ -0,0 +1,22 @@ +RESUME +================ + +Resume a paused task(e.g. connector). + +## Synopsis + +resume a paused task: + +```sql +RESUME name; +``` + +## Notes + +- `name` is a valid identifier. + +## Examples + +```sql +RESUME CONNECTOR source01; +``` diff --git a/docs/zh/v0.17.0/reference/sql/statements/select-stream.md b/docs/zh/v0.17.0/reference/sql/statements/select-stream.md new file mode 100644 index 0000000..a91978a --- /dev/null +++ b/docs/zh/v0.17.0/reference/sql/statements/select-stream.md @@ -0,0 +1,121 @@ +# SELECT (Stream) + +Get records from a materialized view or a stream. Note that `SELECT` from +streams can only be used as a part of `CREATE STREAM` or `CREATE VIEW`. + +::: tip +Unless when there are cases you would want to run an interactive query from the command +shell, you could add `EMIT CHANGES` at the end of the following examples. +::: + +## Synopsis + +```sql +SELECT <* | identifier.* | expression [ AS field_alias ] [, ...]> + FROM stream_ref + [ WHERE expression ] + [ GROUP BY field_name [, ...] 
] + [ HAVING expression ]; +``` + +## Notes + +### About `expression` + +`expression` can be any expression described +[here](../sql-overview.md#Expressions), such as `temperature`, +`weather.humidity`, `42`, `1 + 2`, `SUM(productions)`, `'COUNT(*)'` and +even subquery `SELECT * FROM stream_test WHERE a > 1`. In `WHERE` and `HAVING` +clauses, `expression` should have a value of boolean type. + +### About `stream_ref` + +`stream_ref` specifies a source stream or materialized view: + +``` + stream_ref ::= + | AS + | WITHIN Interval + | ( ) +``` + +It seems quite complex! Do not worry. In a word, a `stream_ref` is something you +can retrieve data from. A `stream_ref` can be an identifier, a join +of two `stream_ref`s, a `stream_ref` with a time window or a `stream_ref` with an +alias. We will describe them in detail. + +#### JOIN + +Fortunately, the `JOIN` in our SQL query is the same as the SQL standard, which +is used by most of your familiar databases such as MySQL and PostgreSQL. It can +be one of: + +- `CROSS JOIN`, which produces the Cartesian product of two streams and/or + materialized view(s). It is equivalent to `INNER JOIN ON TRUE`. +- `[INNER] JOIN`, which produces all data in the qualified Cartesian product by + the join condition. Note a join condition must be specified. +- `LEFT [OUTER] JOIN`, which produces all data in the qualified Cartesian + product by the join condition plus one copy of each row in the left-hand + `stream_ref` for which there was no right-hand row that passed the join + condition(extended with nulls on the right). Note a join condition must be + specified. +- `RIGHT [OUTER] JOIN`, which produces all data in the qualified Cartesian + product by the join condition plus one copy of each row in the right-hand + `stream_ref` for which there was no left-hand row that passed the join + condition(extended with nulls on the left). Note a join condition must be + specified. +- `FULL [OUTER] JOIN`, which produces all data in the qualified Cartesian + product by the join condition, plus one row for each unmatched left-hand row + (extended with nulls on the right), plus one row for each unmatched right-hand + row (extended with nulls on the left). Note a join condition must be + specified. + +A join condition can be any of + +- `ON `. The condition passes when the value of the expression is + `TRUE`. +- `USING(column[, ...])`. The specified column(s) is matched. +- `NATURAL`. The common columns of two `stream_ref`s are matched. It is + equivalent to `USING(common_columns)`. + +#### Time Windows + +A `stream_ref` can also have a time window. Currently, we support the following 3 +time-window functions: + +``` +Tumble( , ) +HOP( , , ) +SLIDE( , ) +``` + +Note that + +- `some_interval` represents a period of time. See + [Intervals](../sql-overview.md#intervals). 
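+For instance, a hopping-window aggregation might look like the following sketch (assuming a `weather` stream such as the one used in the examples below):
+
+```sql
+SELECT cityId, COUNT(*) FROM HOP(weather, INTERVAL 30 SECOND, INTERVAL 10 SECOND) GROUP BY cityId;
+```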
+
+## Examples
+
+- A simple query:
+
+```sql
+SELECT * FROM my_stream;
+```
+
+- Filtering rows:
+
+```sql
+SELECT temperature, humidity FROM weather WHERE temperature > 10 AND humidity < 75;
+```
+
+- Joining streams:
+
+```sql
+SELECT stream1.temperature, stream2.humidity FROM stream1 JOIN stream2 USING(humidity) WITHIN (INTERVAL 1 HOUR);
+```
+
+- Grouping records:
+
+```sql
+SELECT COUNT(*) FROM TUMBLE(weather, INTERVAL 10 SECOND) GROUP BY cityId;
+```
diff --git a/docs/zh/v0.17.0/reference/sql/statements/select-view.md b/docs/zh/v0.17.0/reference/sql/statements/select-view.md
new file mode 100644
index 0000000..149fd57
--- /dev/null
+++ b/docs/zh/v0.17.0/reference/sql/statements/select-view.md
@@ -0,0 +1,30 @@
+SELECT (View)
+=============
+
+Get record(s) from the specified view. The fields to select must already exist in the view.
+The query returns a static set of records and completes quickly.
+
+## Synopsis
+
+```sql
+SELECT <* | column_name [AS field_alias] [, ...]>
+  FROM view_name
+  [ WHERE search_condition ];
+```
+
+## Notes
+
+Selecting from a view is a very fast operation because it reads the materialized result of the view. Consequently, it has a more restricted syntax than selecting from a stream:
+
+- The most important difference between `SELECT` from a stream and from a view is that the former has an `EMIT CHANGES` clause and the latter does not.
+- The `SELECT` clause can only contain `*` or column names, with or without aliases. Other expressions such as constants, arithmetic expressions and aggregate/scalar functions are not allowed. The column names must appear in the `SELECT` clause of the query that created the corresponding view. If a column name contains a function name, wrap the raw column name in backquotes (`` ` ``). See [Special Characters](../sql-overview.md#special-characters).
+- The `FROM` clause can only contain **one** view name.
+
+## Examples
+
+```sql
+-- Assume that this query has been executed successfully before:
+-- CREATE VIEW my_view AS SELECT a, b, SUM(a), COUNT(*) AS cnt FROM foo GROUP BY b EMIT CHANGES;
+
+SELECT `SUM(a)`, cnt, a FROM my_view WHERE b = 1;
+```
diff --git a/docs/zh/v0.17.0/reference/sql/statements/show.md b/docs/zh/v0.17.0/reference/sql/statements/show.md
new file mode 100644
index 0000000..3e92a13
--- /dev/null
+++ b/docs/zh/v0.17.0/reference/sql/statements/show.md
@@ -0,0 +1,18 @@
+SHOW
+================
+
+Show resources (e.g. streams, connectors).
+
+## Synopsis
+
+Show resources:
+
+```sql
+SHOW <RESOURCES>;
+```
+
+## Examples
+
+```sql
+SHOW CONNECTORS;
+```
diff --git a/docs/zh/v0.17.0/release-notes.md b/docs/zh/v0.17.0/release-notes.md
new file mode 100644
index 0000000..f95fca9
--- /dev/null
+++ b/docs/zh/v0.17.0/release-notes.md
@@ -0,0 +1,647 @@
+# Release Notes
+
+## v0.16.0 [2023-07-07]
+
+### HServer
+
+- Add ReadStream and ReadSingleShardStream RPC to read data from a stream
+- Add a new RPC for getting the tail recordId of a specific shard
+- Add validation when looking up a resource
+- Add readShardStream RPC for grpc-haskell
+- Add `meta`, `lookup`, `query`, `connector` subcommands for hadmin
+- Add command for cli to get hstream version
+- Add benchmark for logdevice LogGroup creation
+- Add dockerfile for arm64 build
+- Add readShard command in hstream cli
+- Add stats command for connector, query and view in hadmin cli
+- Improve readShardStream RPC to accept a max number of records to read and an until offset
+- Improve read-shard cli command to support specifying a max number of records to read and an until offset
+- Improve sql cli help info
+- Improve dockerfile to speed up build
+- Improve error messages in case of cli errors
+- Improve the output of cli
+- Improve: add more logs in the streaming fetch handler
+- Improve: delete resource-related stats in deleteHandlers
+- Improve: change some connector log levels
+- Refactor data structures of inflight recordIds in subscription
+- Refactor: replace SubscriptionOnDifferentNode exception with WrongServer exception
+- Fix hs-grpc memory leak and core dump problem
+- Fix error handling in streaming fetch handler
+- Fix checking the waitingConsumer list when invalidating a consumer
+- Fix redundant recordIds deletion
+- Fix: remove stream when deleting a query
+- Fix: check whether a query exists before creating a new query
+- Fix: stop related threads after a subscription is deleted
+- Fix a bug that can cause CheckedRecordIds to pile up
+- Fix: check meta store in listConsumer
+- Fix: a subscription created by a query could never be acked
+- Fix getSubscription with a non-existent checkpoint logId
+
+### SQL && Processing Engine
+
+- Add `BETWEEN`, `NOT` operators
+- Add `HAVING` clause in views
+- Add `INSERT INTO SELECT` statement
+- Add extra label for JSON data
+- Add syntax tests
+- Add planner tests
+- Improve syntax for quotes
+- Improve: remove duplicate aggregates
+- Improve restore with refined AST
+- Refactor: remove the `_view` postfix of a view
+- Refactor create connector syntax
+- Fix alias problems in aggregates and `GROUP BY` statements
+- Fix: refine string literals
+- Fix grammar conflicts
+- Fix `IFNULL` operator not working
+- Fix a runtime error caused by a `GROUP BY` with no aggregate
+- Fix batched messages getting stuck
+- Fix incorrect view name and aggregate result
+- Fix cast operation
+- Fix JSON-related operations not working
+- Fix: mark the state as TERMINATED if a source is missing on resuming
+
+### Connector
+- Add sink-las connector
+- Add sink-elasticsearch connector
+- Add connection and primary-key checking for sink-jdbc
+- Add retry for sink connectors
+- Add Batch Receiver for sinks
+- Add full-featured JSON-schema for source-generator
+- Replace Subscription with StreamShardReader
+- Fix source-debezium offsets
+
+## v0.15.0 [2023-04-28]
+
+### HServer
+
+- Add support for automatic recovery of computing tasks (query, connector) on other nodes when a node in the cluster fails
+- Add support for reading data from a given timestamp
+- Add support for reconnecting nodes that were previously determined to have failed in the cluster
+- Add a new RPC for
reading stream shards +- Add metrics for query, view, connector +- Add support for fetching logs of connectors +- Add retry read from hstore when the subscription do resend +- Improve the storage of checkpoints for subscriptions +- Improve read performance of hstore reader +- Improve error handling of RPC methods +- Improve the process of nodes restart +- Improve requests validation in handlers +- Imporve the timestamp of records +- Improve the deletion of queries +- Refactor logging modules +- Fix the load distribution logic in case of cluster members change + +### SQL && Processing Engine + +- The v1 engine is used by default +- Add states saving and restoration of a query +- Add validation for select statements with group by clause +- Add retention time option for ``create stream`` statement +- Add a window_end column for aggregated results based on time window +- Add time window columns to the result stream when using time windows +- Improve the syntax of time windows in SQL +- Improve the syntax of time interval in SQL +- Improve the process of creating the result stream of a query +- Fix `as` in `join` clause +- Fix creating a view without a group by clause +- Fix an issue which can cause incomplete aggregated columns +- Fix alias of an aggregation expr not work +- Fix aggregation queries on views +- Fix errors when joining multiple streams (3 or more) +- Disable subqueries temporarily + +## v0.14.0 [2023-02-28] + +- HServer now uses the in-house Haskell GRPC framework by default +- Add deployment support for CentOS 7 +- Add stats for failed record delivery in subscriptions +- Remove `pushQuery` RPC from the protocol +- Fix the issue causing client stalls when multiple clients consume the same + subscription, and one fails to acknowledge +- Fix possible memory leaks caused by STM +- Fix cluster bootstrap issue causing incorrect status display +- Fix the issue that allows duplicate consumer names on the same subscription +- Fix the issue that allows readers to be created on non-existent shards +- Fix the issue causing the system to stall with the io check command + +## v0.13.0 [2023-01-18] + +- hserver is built with ghc 9.2 by default now +- Add support for getting the IP of the proxied client +- Add support for overloading the client's `user-agent` by setting `proxy-agent` +- Fix the statistics of retransmission and response metrics of subscriptions +- Fix some issues of the processing engine +- CLI: add `service-url` option + +## v0.12.0 [2022-12-29] + +- Add a new RPC interface for getting information about clients connected to the + subscription (including IP, type and version of client SDK, etc.) 
+- Add a new RPC interface for getting the progress of consumption on a + subscription +- Add a new RPC interface for listing the current `ShardReader`s +- Add TLS support for `advertised-listener`s +- Add support for file-based metadata storage, mainly for simplifying deployment + in local development and testing environments +- Add support for configuring the number of copies of the internal stream that + stores consumption progress +- Fix the problem that the consumption progress of subscriptions was not saved + correctly in some cases +- Improve the CLI tool: + - simplify some command options + - improve cluster interaction + - add retry for requests + - improve delete commands +- Switch to a new planner implementation for HStream SQL + - Improve stability and performance + - Improve the support for subqueries in the FROM clause + - add a new `EXPLAIN` statement for viewing logical execution plans + - more modular design for easy extension and optimization + +## v0.11.0 [2022-11-25] + +- Add support for getting the creation time of streams and subscriptions +- Add `subscription` subcommand in hstream CLI +- [**Breaking change**]Remove the compression option on the hserver side(should + use end-to-end compression instead) +- Remove logid cache +- Unify resource naming rules and improve the corresponding resource naming + checks +- [**Breaking change**]Rename hserver's startup parameters `host` and `address` + to `bind-address` and `advertised-address` +- Fix routing validation for some RPC requests +- Fix a possible failure when saving the progress of a subscription +- Fix incorrect results of `JOIN .. ON` +- Fix the write operation cannot be retried after got a timeout error + +## v0.10.0 [2022-10-28] + +### Highlights + +#### End-to-end compression + +In this release we have introduced a new feature called end-to-end compression, +which means data will be compressed in batches at the client side when it is +written, and the compressed data will be stored directly by HStore. In addition, +the client side can automatically decompress the data when it is consumed, and +the whole process is not perceptible to the user. + +In high-throughput scenarios, enabling end-to-end data compression can +significantly alleviate network bandwidth bottlenecks and improve read and write +performance.Our benchmark shows more than 4x throughput improvement in this +scenario, at the cost of increased CPU consumption on the client side. + +#### HStream SQL Enhancements + +In this release we have introduced many enhancements for HStream SQL, see +[here](#hstream-sql) for details. + +#### HServer based on a new gRPC library + +In this release we replaced the gRPC-haskell library used by HServer with a new +self-developed gRPC library, which brings not only better performance but also +improved long-term stability. + +#### Rqlite Based MetaStore + +In this release we have refactored the MetaStore component of HStreamDB to make +it more scalable and easier to use. We also **experimentally** support the use +of Rqlite instead of Zookeeper as the default MetaStore implementation, which +will make the deployment and maintenance of HStreamDB much easier. Now HServer, +HStore and HStream IO all use a unified MetaStore to store metadata. 
+ +### HServer + +#### New Features + +- Add [e2e compression](#end-to-end-compression) + +#### Enhancements + +- Refactor the server module with a new grpc library +- Adpate to the new metastore and add support for rqlite +- Improve the mechanism of cluster resources allocation +- Improve the cluster startup and initialization process +- Improve thread usage and scheduling for the gossip module + +#### Bug fixes + +- Fix a shard can be assigned to an invalid consumer +- Fix memory leak caused by the gossip module +- Add existence check for dependent streams when creating a view +- Fix an issue where new nodes could fail when joining a cluster +- Fix may overflow while decoding batchedRecord +- Check metadata first before initializing sub when recving fetch request to + avoid inconsistency +- Fix max-record-size option validation + +### HStream SQL + +- Full support of subqueries. A subquery can replace almost any expression now. +- Refinement of data types. It supports new types such as date, time, array and + JSON. It also supports explicit type casting and JSON-related operators. +- Adjustment of time windows. Now every source stream can have its own time + window rather than a global one. +- More general queries on materialized views. Now any SQL clauses applicable to + a stream can be performed on a materialized view, including nested subqueries + and time windows. +- Optimized JOIN clause. It supports standard JOINs such as CROSS, INNER, OUTER + and NATURAL. It also allows JOIN between streams and materialized views. + +### HStream IO + +- Add MongoDB source and sink +- Adapt to the new metastore + +### Java Client + +[hstream-java v0.10.0](https://github.com/hstreamdb/hstreamdb-java/releases/tag/v0.10.0) +has been released: + +#### New Features + +- Add support for e2e compression: zstd, gzip +- Add `StreamBuilder ` + +#### Enhancements + +- Use `directExecutor` as default executor for `grpcChannel` + +#### Bug fixes + +- Fix `BufferedProducer` memory is not released in time +- Fix missing `RecordId` in `Reader`'s results +- Fix dependency conflicts when using hstreamdb-java via maven + +### Go Client + +[hstream-go v0.3.0](https://github.com/hstreamdb/hstreamdb-go/releases/tag/v0.3.0) +has been released: + +- Add support for TLS +- Add support for e2e compression: zstd, gzip +- Improve tests + +### Python Client + +[hstream-py v0.3.0](https://github.com/hstreamdb/hstreamdb-py/releases/tag/v0.3.0) +has been released: + +- Add support for e2e compression: gzip +- Add support for hrecord in BufferedProducer + +### Rust Client + +Add a new [rust client](https://github.com/hstreamdb/hstreamdb-rust) + +### HStream CLI + +- Add support for TLS +- Add -e, --execute options for non-interactive execution of SQL statements +- Add support for keeping the history of entered commands +- Improve error messages +- Add stream subcommands + +### Other Tools + +- Add a new tool [hdt](https://github.com/hstreamdb/deployment-tool) for + deployment + +## v0.9.0 [2022-07-29] + +### HStreamDB + +#### Highlights + +- [Shards in Streams](#shards-in-streams) +- [HStream IO](#hstream-io) +- [New Stream Processing Engine](#new-stream-processing-engine) +- [Gossip-based HServer Clusters](#gossip-based-hserver-clusters) +- [Advertised Listeners](#advertised-listeners) +- [Improved HStream CLI](#improved-hstream-cli) +- [Monitoring with Grafana](#monitoring-with-grafana) +- [Deployment on K8s with Helm](#deployment-on-k8s-with-helm) + +#### Shards in Streams + +We have extended the sharding model in v0.8, which 
provides direct access and +management of the underlying shards of a stream, allowing a finer-grained +control of data distribution and stream scaling. Each shard will be assigned a +range of hashes in the stream, and every record whose hash of `partitionKey` +falls in the range will be stored in that shard. + +Currently, HStreamDB supports: + +- set the initial number of shards when creating a stream +- distribute written records among shards of the stream with `partitionKey`s +- direct access to records from any shard of the specified position +- check the shards and their key range in a stream + +In future releases, HStreamDB will support dynamic scaling of streams through +shard splitting and merging + +#### HStream IO + +HStream IO is the built-in data integration framework for HStreamDB, composed of +source connectors, sink connectors and the IO runtime. It allows interconnection +with various external systems and empowers more instantaneous unleashing of the +value of data with the facilitation of efficient data flow throughout the data +stack. + +In particular, this release provides connectors listed below: + +- Source connectors: + - [source-mysql](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/sink_mysql_spec.md) + - [source-postgresql](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/source_postgresql_spec.md) + - [source-sqlserver](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/source_sqlserver_spec.md) +- Sink connectors: + - [sink-mysql](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/sink_mysql_spec.md) + - [sink-postgresql](https://github.com/hstreamdb/hstream-connectors/blob/main/docs/specs/sink_postgresql_spec.md) + +You can refer to [the documentation](./ingest-and-distribute/overview.md) to learn more about +HStream IO. + +#### New Stream Processing Engine + +We have re-implemented the stream processing engine in an interactive and +differential style, which reduces the latency and improves the throughput +magnificently. The new engine also supports **multi-way join**, **sub-queries**, +and **more** general materialized views. + +The feature is still experimental. For try-outs, please refer to +[the SQL guides](./process/sql.md). + +#### Gossip-based HServer Clusters + +We refactor the hserver cluster with gossip-based membership and failure +detection based on [SWIM](https://ieeexplore.ieee.org/document/1028914), +replacing the ZooKeeper-based implementation in the previous version. The new +mechanism will improve the scalability of the cluster and as well as reduce +dependencies on external systems. + +#### Advertised Listeners + +The deployment and usage in production could involve a complex network setting. +For example, if the server cluster is hosted internally, it would require an +external IP address for clients to connect to the cluster. The use of docker and +cloud-hosting can make the situation even more complicated. To ensure that +clients from different networks can interact with the cluster, HStreamDB v0.9 +provides configurations for advertised listeners. With advertised listeners +configured, servers can return the corresponding address for different clients, +according to the port to which the client sent the request. + +#### Improved HStream CLI + +To make CLI more unified and more straightforward, we have migrated the old +HStream SQL Shell and some other node management functionalities to the new +HStream CLI. 
HStream CLI currently supports operations such as starting an +interacting SQL shell, sending bootstrap initiation and checking server node +status. You can refer to [the CLI documentation](./reference/cli.md) for +details. + +#### Monitoring with Grafana + +We provide a basic monitoring solution based on Prometheus and Grafana. Metrics +collected by HStreamDB will be stored in Prometheus by the exporter and +displayed on the Grafana board. + +#### Deployment on K8s with Helm + +We provide a helm chart to support deploying HStreamDB on k8s using Helm. You +can refer to [the documentation](./deploy/deploy-helm.md) for +details. + +### Java Client + +The +[Java Client v0.9.0](https://github.com/hstreamdb/hstreamdb-java/releases/tag/v0.9.0) +has been released, with support for HStreamDB v0.9. + +### Golang Client + +The +[Go Client v0.2.0](https://github.com/hstreamdb/hstreamdb-go/releases/tag/v0.2.0) +has been released, with support for HStreamDB v0.9. + +### Python Client + +The +[Python Client v0.2.0](https://github.com/hstreamdb/hstreamdb-py/releases/tag/v0.2.0) +has been released, with support for HStreamDB v0.9. + +## v0.8.0 [2022-04-29] + +### HServer + +#### New Features + +- Add [mutual TLS support](./security/overview.md) +- Add `maxUnackedRecords` option in Subscription: The option controls the + maximum number of unacknowledged records allowed. When the amount of unacked + records reaches the maximum setting, the server will stop sending records to + consumers, which can avoid the accumulation of unacked records impacting the + server's and consumers' performance. We suggest users adjust the option based + on the consumption performance of their application. +- Add `backlogDuration` option in Streams: the option determines how long + HStreamDB will store the data in the stream. The data will be deleted and + become inaccessible when it exceeds the time set. +- Add `maxRecordSize` option in Streams: Users can use the option to control the + maximum size of a record batch in the stream when creating a stream. If the + record size exceeds the value, the server will return an error. +- Add more metrics for HStream Server. +- Add compression configuration for HStream Server. + +#### Enhancements + +- [breaking changes] Simplify protocol, refactored codes and improve the + performance of the subscription +- Optimise the implementation and improve the performance of resending +- Improve the reading performance for the HStrore client. 
+- Improve how duplicated acknowledges are handled in the subscription +- Improve subscription deletion +- Improve stream deletion +- Improve the consistent hashing algorithm of the cluster +- Improve the handling of internal exceptions for the HStream Server +- Optimise the setup steps of the server +- Improve the implementation of the stats module + +#### Bug fixes + +- Fix several memory leaks caused by grpc-haskell +- Fix several zookeeper client issues +- Fix the problem that the checkpoint store already exists during server startup +- Fix the inconsistent handling of the default key during the lookupStream + process +- Fix the problem of stream writing error when the initialisation of hstore + loggroup is incompleted +- Fix the problem that hstore client writes incorrect data +- Fix an error in allocating to idle consumers on subscriptions +- Fix the memory allocation problem of hstore client's `appendBatchBS` function +- Fix the problem of losing retransmitted data due to the unavailability of the + original consumer +- Fix the problem of data distribution caused by wrong workload sorting + +### Java Client + +#### New Features + +- Add TLS support +- Add `FlowControlSetting` setting for `BufferedProducer` +- Add `maxUnackedRecords` setting for subscription +- Add `backlogDurantion` setting for stream +- Add force delete support for subscription +- Add force delete support for stream + +#### Enhancements + +- [Breaking change] Improve `RecordId` as opaque `String` +- Improve the performance of `BufferedProducer` +- Improve `Responder` with batched acknowledges for better performance +- Improve `BufferedProducerBuilder` to use `BatchSetting` with unified + `recordCountLimit`, `bytesCountLimit`, `ageLimit` settings +- Improve the description of API in javadoc + +#### Bug fixes + +- Fix `streamingFetch` is not canceled when `Consumer` is closed +- Fix missing handling for grpc exceptions in `Consumer` +- Fix the incorrect computation of accumulated record size in `BufferedProducer` + +### Go Client + +- hstream-go v0.1.0 has been released. For a more detailed introduction and + usage, please check the + [Github repository](https://github.com/hstreamdb/hstreamdb-go). + +### Admin Server + +- a new admin server has been released, see + [Github repository](https://github.com/hstreamdb/http-services) + +### Tools + +- Add [bench tools](https://github.com/hstreamdb/bench) +- [dev-deploy] Support limiting resources of containers +- [dev-deploy] Add configuration to restart containers +- [dev-deploy] Support uploading all configuration files in deploying +- [dev-deploy] Support deployments with Prometheus Integration + +## v0.7.0 [2022-01-28] + +### Features + +#### Add transparent sharding support + +HStreamDB has already supported the storage and management of large-scale data +streams. With the newly added cluster support in the last release, we decided to +improve a single stream's scalability and reading/writing performance with a +transparent sharding strategy. In HStreamDB v0.7, every stream is spread across +multiple server nodes, but it appears to users that a stream with partitions is +managed as an entity. Therefore, users do not need to specify the number of +shards or any sharding logic in advance. + +In the current implementation, each record in a stream should contain an +ordering key to specify a logical partition, and the HStream server will be +responsible for mapping these logical partitions to physical partitions when +storing data. 
+ +#### Redesign load balancing with the consistent hashing algorithm + +We have adapted our load balancing with a consistent hashing algorithm in this +new release. Both write and read requests are currently allocated by the +ordering key of the record carried in the request. + +In the previous release, our load balancing was based on the hardware usage of +the nodes. The main problem with this was that it relied heavily on a leader +node to collect it. At the same time, this policy requires the node to +communicate with the leader to obtain the allocation results. Overall the past +implementation was too complex and inefficient. Therefore, we have +re-implemented the load balancer, which simplifies the core algorithm and copes +well with redistribution when cluster members change. + +#### Add HStream admin tool + +We have provided a new admin tool to facilitate the maintenance and management +of HStreamDB. HAdmin can be used to monitor and manage the various resources of +HStreamDB, including Stream, Subscription and Server nodes. The HStream Metrics, +previously embedded in the HStream SQL Shell, have been migrated to the new +HAdmin. In short, HAdmin is for HStreamDB operators, and SQL Shell is for +HStreamDB end-users. + +#### Deployment and usage + +- Support quick deployment via the script, see: + [Manual Deployment with Docker](./deploy/deploy-docker.md) +- Support config HStreamDB with a configuration file, see: + [HStreamDB Configuration](./reference/config.md) +- Support one-step docker-compose for quick-start: + [Quick Start With Docker Compose](./start/quickstart-with-docker.md) + +**To make use of HStreamDB v0.7, please use +[hstreamdb-java v0.7.0](https://github.com/hstreamdb/hstreamdb-java) and above** + +## v0.6.0 [2021-11-04] + +### Features + +#### Add HServer cluster support + +As a cloud-native distributed streaming database, HStreamDB has adopted a +separate architecture for computing and storage from the beginning of design, to +support the independent horizontal expansion of the computing layer and storage +layer. In the previous version of HStreamDB, the storage layer HStore already +has the ability to scale horizontally. In this release, the computing layer +HServer will also support the cluster mode so that the HServer node of the +computing layer can be expanded according to the client request and the scale of +the computing task. + +HStreamDB's computing node HServer is designed to be stateless as a whole, so it +is very suitable for rapid horizontal expansion. The HServer cluster mode of +v0.6 mainly includes the following features: + +- Automatic node health detection and failure recovery +- Scheduling and balancing client requests or computing tasks according to the + node load conditions +- Support dynamic joining and exiting of nodes + +#### Add shared-subscription mode + +In the previous version, one subscription only allowed one client to consume +simultaneously, which limited the client's consumption capacity in the scenarios +with a large amount of data. Therefore, in order to support the expansion of the +client's consumption capacity, HStreamDB v0.6 adds a shared-subscription mode, +which allows multiple clients to consume in parallel on one subscription. + +All consumers included in the same subscription form a Consumer Group, and +HServer will distribute data to multiple consumers in the consumer group through +a round-robin manner. 
The consumer group members can be dynamically changed at +any time, and the client can join or exit the current consumer group at any +time. + +HStreamDB currently supports the "at least once" consumption semantics. After +the client consumes each data, it needs to reply to the ACK. If the Ack of a +certain piece of data is not received within the timeout, HServer will +automatically re-deliver the data to the available consumers. + +Members in the same consumer group share the consumption progress. HStream will +maintain the consumption progress according to the condition of the client's +Ack. The client can resume consumption from the previous location at any time. + +It should be noted that the order of data is not maintained in the shared +subscription mode of v0.6. Subsequent shared subscriptions will support a +key-based distribution mode, which can support the orderly delivery of data with +the same key. + +#### Add statistical function + +HStreamDB v0.6 also adds a basic data statistics function to support the +statistics of key indicators such as stream write rate and consumption rate. +Users can view the corresponding statistical indicators through HStream CLI, as +shown in the figure below. + +![](./statistics.png) + +#### Add REST API for data writing + +HStreamDB v0.6 adds a REST API for writing data to HStreamDB. diff --git a/docs/zh/v0.17.0/security/_index.md b/docs/zh/v0.17.0/security/_index.md new file mode 100644 index 0000000..2c92852 --- /dev/null +++ b/docs/zh/v0.17.0/security/_index.md @@ -0,0 +1,6 @@ +--- +order: ["overview.md", "encryption.md", "authentication.md"] +collapsed: false +--- + +安全 diff --git a/docs/zh/v0.17.0/security/authentication.md b/docs/zh/v0.17.0/security/authentication.md new file mode 100644 index 0000000..cc807f4 --- /dev/null +++ b/docs/zh/v0.17.0/security/authentication.md @@ -0,0 +1,71 @@ +# 认证 + +在开启TLS之后,客户端就可以验证连接的服务端的合法性,同时可以保证中间数据处于加密状态,但服务端无法验证客户的合法性,所以认证机制提供了一个服务端可以对可信的客户端进行验证的功能。 + +认证功能还提供了另一个功能,它可以赋予每个客户端一个角色名,hstream在这个基础上实现授权等功能。 + +hstream目前只支持基于TLS的认证功能,这是TLS的一个扩展, +为了启用TLS认证功能,你需要为一个角色创建密钥和证书, +然后把这个角色的密钥和证书交给可信的客户端, +这些客户端就可以通过这个绑定了相应角色的密钥和证书连接服务端。 + +## 创建一个可信角色 +生成密钥: +```shell +openssl genrsa -out role01.key.pem 2048 +``` + +把密钥转换成PKCS 8格式(Java客户端需要这种格式的密钥) +```shell +openssl pkcs8 -topk8 -inform PEM -outform PEM \ + -in role01.key.pem -out role01.key-pk8.pem -nocrypt +``` + +生成证书请求(填写Common Name时,需要填写角色名): +```shell +openssl req -config openssl.cnf \ + -key role01.key.pem -new -sha256 -out role01.csr.pem +``` + +生成签名证书: +```shell +openssl ca -config openssl.cnf -extensions usr_cert \ + -days 1000 -notext -md sha256 \ + -in role01.csr.pem -out signed.role01.cert.pem +``` + +## 配置 +对于hstream服务端,你可以通过设置``tls-ca-path``来开启TLS认证功能,如: +```yaml +# TLS options +# +# enable tls, which requires tls-key-path and tls-cert-path options +enable-tls: true +# +# key file path for tls, can be generated by openssl +tls-key-path: /path/to/the/server.key.pem +# +# the signed certificate by CA for the key(tls-key-path) +tls-cert-path: /path/to/the/signed.server.cert.pem +# +# optional for tls, if tls-ca-path is not empty, then enable TLS authentication, +# in the handshake phase, +# the server will request and verify the client's certificate. 
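+# (补充说明)tls-ca-path 指向的 CA 证书必须能够校验客户端出示的角色证书,
+# 即签发 role01 等角色证书时所使用的那个 CA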
+tls-ca-path: /path/to/the/ca.cert.pem +``` + +Java客户端示例: +```java +HStreamClient.builder() + .serviceUrl(serviceUrl) + // enable tls + .enableTLS() + .tlsCaPath("/path/to/ca.pem") + + // for authentication + .enableTlsAuthentication() + .tlsKeyPath("path/to/role01.key-pk8.pem") + .tlsCertPath("path/to/signed.role01.cert.pem") + + .build() +``` diff --git a/docs/zh/v0.17.0/security/encryption.md b/docs/zh/v0.17.0/security/encryption.md new file mode 100644 index 0000000..bfa7489 --- /dev/null +++ b/docs/zh/v0.17.0/security/encryption.md @@ -0,0 +1,95 @@ +# 数据加密 + +hstream已经使用TLS支持了客户端和服务端之间的数据加密功能, +在本章节,不会过多介绍TLS的细节, +而是集中在开启TLS所需的一些步骤和配置上。 + +## 步骤 + +如果你还没有一个CA,你可以本地创建一个, +TLS需要服务端有一个私钥和一个对应签名证书, +openssl可以很好地生成key和签证证书, +在那之后,你需要在服务端和客户端配置上对应的生成文件路径。 + +### 创建CA + +创建或选择一个目录,用于存储私钥和证书: +```shell +mkdir tls +cd tls +``` + +创建数据库文件和序列号文件: +```shell +touch index.txt +echo 1000 > serial +``` + +获取openssl.cnf模板文件(注意:**这个模板文件主要是用来测试和开发,请不要直接在生产环境使用**) +```shell +wget https://raw.githubusercontent.com/hstreamdb/hstream/main/conf/openssl.cnf +``` + +生成CA密钥文件: +```shell +openssl genrsa -aes256 -out ca.key.pem 4096 +``` + +生成CA证书文件: +```shell +openssl req -config openssl.cnf -key ca.key.pem \ + -new -x509 -days 7300 -sha256 -extensions v3_ca \ + -out ca.cert.pem +``` + +### 为服务端创建密钥对和签名证书 + +这里,我们只为一个服务器生成密钥和证书, +你应该为所有有不同主机名的服务端创建密钥和证书, +或者创建一个在SAN中包含所有主机名(域名或IP地址)的证书。 + +生成服务端密钥: +```shell +openssl genrsa -out server01.key.pem 2048 +``` + +生成一个服务端证书请求, +你输入Common Name时,应该输入正确的主机名(如localhost): +```shell +openssl req -config openssl.cnf \ + -key server01.key.pem -new -sha256 -out server01.csr.pem +``` + +generate server certificate with generated CA: +```shell +openssl ca -config openssl.cnf -extensions server_cert \ + -days 1000 -notext -md sha256 \ + -in server01.csr.pem -out signed.server01.cert.pem +``` + +### 配置服务端和服务端 +服务端配置: +```yaml +# TLS options +# +# enable tls, which requires tls-key-path and tls-cert-path options +enable-tls: true + +# +# key file path for tls, can be generated by openssl +tls-key-path: /path/to/the/server01.key.pem + +# the signed certificate by CA for the key(tls-key-path) +tls-cert-path: /path/to/the/signed.server01.cert.pem +``` + +客户端示例: +```java +HStreamClient.builder() + .serviceUrl(serviceUrl) + // optional, enable tls + .enableTls() + .tlsCaPath("/path/to/ca.cert.pem") + + .build() +``` diff --git a/docs/zh/v0.17.0/security/overview.md b/docs/zh/v0.17.0/security/overview.md new file mode 100644 index 0000000..6f46874 --- /dev/null +++ b/docs/zh/v0.17.0/security/overview.md @@ -0,0 +1,7 @@ +# 概览 + +考虑到性能和便利,hstream不会默认开启安全功能特性(如加密、认证等),但如果客户端连接服务端的网络是不可信的,那么应该启用这个功能。 + +hstream已经支持的安全特性: ++ 数据加密:可以避免客户端和服务端之间传输的数据被中间人监听和篡改。 ++ 认证:为服务端提供认证客户端合法性的机制,并给授权功能提供统一的接口。 diff --git a/docs/zh/v0.17.0/start/_index.md b/docs/zh/v0.17.0/start/_index.md new file mode 100644 index 0000000..ed9ee03 --- /dev/null +++ b/docs/zh/v0.17.0/start/_index.md @@ -0,0 +1,9 @@ +--- +order: + - try-out-hstream-platform.md + - quickstart-with-docker.md + - hstream-console.md +collapsed: false +--- + +Get started diff --git a/docs/zh/v0.17.0/start/hstream-console-screenshot.png b/docs/zh/v0.17.0/start/hstream-console-screenshot.png new file mode 100644 index 0000000..9490fa4 Binary files /dev/null and b/docs/zh/v0.17.0/start/hstream-console-screenshot.png differ diff --git a/docs/zh/v0.17.0/start/hstream-console.md b/docs/zh/v0.17.0/start/hstream-console.md new file mode 100644 index 0000000..66d99ea --- /dev/null +++ b/docs/zh/v0.17.0/start/hstream-console.md @@ -0,0 +1,33 @@ +# 在 
HStream Console 上开始 + +HStream Console 是 HStreamDB 的基于网络的管理工具。它提供了一个图形用户界面,用于管理 HStreamDB 集群。 +通过 HStream Console ,您可以轻松创建和管理流(streams),以及编写 SQL 查询以实时处理数据。除了操作 HStreamDB, +HStream Console 还为集群中的每个资源提供了指标,帮助您监视集群的状态。 + +![HStream Console 概览](./hstream-console-screenshot.png) + +## 特点 + +### 直接管理 HStreamDB 资源 + +HStream Console 提供了一个图形用户界面,用于直接管理 HStreamDB 的资源,包括流、订阅和查询。 +您可以轻松在集群中创建和删除资源、向流中写入数据以及编写 SQL 查询以处理数据。 + +它还可以帮助您搜索集群中的资源,并提供每个资源的详细视图。 + +### 监视集群中的资源 + +在每个资源视图中,HStream Console 提供了一个度量面板,用于实时监视资源状态。借助度量面板, +您可以直观地可视化资源状态,并轻松找出集群的瓶颈所在。 + +### 数据同步 + +借助 HStream Console 中的连接器,您可以实现在 HStreamDB 和其他数据源(如 MySQL、PostgreSQL 和 Elasticsearch)之间同步数据的能力。 +请查看 [HStream IO 概览](../ingest-and-distribute/overview.md) 以了解有关连接器的更多信息。 + +## 下一步操作 + +要了解有关 HStreamDB 资源的更多信息,请点击以下链接: + +- [流 (Streams)](../write/stream.md) +- [订阅 (Subscriptions)](../receive/subscription.md) diff --git a/docs/zh/v0.17.0/start/quickstart-with-docker.md b/docs/zh/v0.17.0/start/quickstart-with-docker.md new file mode 100644 index 0000000..1541633 --- /dev/null +++ b/docs/zh/v0.17.0/start/quickstart-with-docker.md @@ -0,0 +1,244 @@ +# 使用 Docker-Compose 快速开始 + +## 前提条件 + +启动 HStream 需要一个内核版本不小于 Linux 4.14 的操作系统。 + +::: tip +如果遇到无法使用 4.14 或以上版本 Linux 内核的情况, +可以给 HStore 添加一个 `--enable-dscp-reflection=false` 选项。 +::: + +## 安装 + +### 安装 docker + +::: tip +如果您已经有一安装好的 Docker,可以跳过这一步 +::: + +浏览查阅 [Install Docker Engine](https://docs.docker.com/engine/install/),然后 +安装到您的操作系统上。安装时,请注意检查您的设备是否满足所有的前置条件。 + +确认 Docker daemon 正在运行: + +```sh +docker version +``` + +::: tip +在 Linux,Docker 需要 root 权限。当然,你也可以以非 root 用户的方式运行 +Docker,详情可以参考 [Post-installation steps for Linux][non-root-docker]。 +::: + +### 安装 docker-compose + +::: tip +如果您已经有一安装好的 Docker Compose,可以跳过这一步 +::: + +浏览查阅 [Install Docker Compose](https://docs.docker.com/compose/install/),然 +后安装到您的操作系统上。安装时,请注意检查您的设备是否满足所有的前置条件。 + +```sh +docker-compose -v +``` + +## 启动 HStreamDB 服务 + +::: warning +请不要在生产环境中使用以下配置 +::: + +创建一个 quick-start.yaml, 可以直接[下载][quick-start.yaml]或者复制以下内容: + +<<< @/../assets/quick-start.yaml.template{yaml-vue} + +在同一个文件夹中运行: + +```sh +docker-compose -f quick-start.yaml up +``` + +如果出现如下信息,表明现在已经有了一个运行中的 HServer: + +```txt +hserver_1 | [INFO][2021-11-22T09:15:18+0000][app/server.hs:137:3][thread#67]************************ +hserver_1 | [INFO][2021-11-22T09:15:18+0000][app/server.hs:145:3][thread#67]Server started on port 6570 +hserver_1 | [INFO][2021-11-22T09:15:18+0000][app/server.hs:146:3][thread#67]************************* +``` + +::: tip +当然,你也可以选择在后台启动: +```sh +docker-compose -f quick-start.yaml up -d +``` +::: + +:::tip +可以通过以下命令展示 logs: +```sh +docker-compose -f quick-start.yaml logs -f hserver +``` +::: + +## 使用 HStream CLI 连接 HStreamDB + +可以直接使用 `hstream` 命令行接口(CLI)来管理 HStreamDB,该接口包含在 `hstreamdb/hstream` 镜像中。 + +使用 Docker 启动 `hstreamdb/hstream` 实例: + +```sh-vue +docker run -it --rm --name some-hstream-cli --network host hstreamdb/hstream:{{ $version() }} bash +``` + +## 创建 stream + +使用 `hstream stream create` 命令来创建 stream。现在我们将创建一个包含2个 shard 的 stream。 + +```sh +hstream stream create demo --shards 2 +``` + +```sh ++-------------+---------+----------------+-------------+ +| Stream Name | Replica | Retention Time | Shard Count | ++-------------+---------+----------------+-------------+ +| demo | 1 | 604800 seconds | 2 | ++-------------+---------+----------------+-------------+ +``` + +## 向 stream 中写入数据 + +`hstream stream append` 命令会启动一个交互式 shell,可以通过它来向 stream 中写入数据 +```sh +hstream stream append demo --separator "@" +``` +-- `--separator` 选项可以指定 key 
分隔符,默认为 “@”。通过分隔符,可以为每条 record 设置一个 key。具有相同 key 的 record +会被写入到 stream 的同一个 shard 中。 + +```sh +key1@{"temperature": 22, "humidity": 80} +key1@{"temperature": 32, "humidity": 21, "tag": "test1"} +hello world! +``` +这里我们写入了 3 条数据。前两条是 json 格式,且被关联到 key1 上,第三条没有设置 key + +如需更多信息,可以使用 `hstream stream append -h`。 + +## 从 stream 中读取数据 + +要从特定的 stream 中读取数据,可以使用 `hstream stream read-stream` 命令。 + +```sh +hstream stream read-stream demo +``` + +```sh +timestamp: "1692774821444", id: 1928822601796943-8589934593-0, key: "key1", record: {"humidity":80.0,"temperature":22.0} +timestamp: "1692774844649", id: 1928822601796943-8589934594-0, key: "key1", record: {"humidity":21.0,"tag":"test1","temperature":32.0} +timestamp: "1692774851017", id: 1928822601796943-8589934595-0, key: "", record: hello world! +``` + +`read-stream` 命令 可以设置读取偏移量,可以有以下三种类型: + +- `earliest`:寻找到 stream 的第一条记录。 +- `latest`:寻找到 stream 的最后一条记录。 +- `timestamp`:寻找到指定创建时间戳的记录。 + +例如: + +```sh +hstream stream read-stream demo --from 1692774844649 --total 1 +``` + +```sh +timestamp: "1692774844649", id: 1928822601796943-8589934594-0, key: "key1", record: {"humidity":21.0,"tag":"test1","temperature":32.0} +``` + +## 启动 HStreamDB 的 SQL 命令行界面 + +```sh-vue +docker run -it --rm --name some-hstream-cli --network host hstreamdb/hstream:{{ $version() }} hstream --port 6570 sql +``` + +如果所有的步骤都正确运行,您将会进入到命令行界面,并且能看见一下帮助信息: + +```txt + __ _________________ _________ __ ___ + / / / / ___/_ __/ __ \/ ____/ | / |/ / + / /_/ /\__ \ / / / /_/ / __/ / /| | / /|_/ / + / __ /___/ // / / _, _/ /___/ ___ |/ / / / + /_/ /_//____//_/ /_/ |_/_____/_/ |_/_/ /_/ + +Command + :h To show these help info + :q To exit command line interface + :help [sql_operation] To show full usage of sql statement + +SQL STATEMENTS: + To create a simplest stream: + CREATE STREAM stream_name; + + To create a query select all fields from a stream: + SELECT * FROM stream_name EMIT CHANGES; + + To insert values to a stream: + INSERT INTO stream_name (field1, field2) VALUES (1, 2); + +> +``` + +## 对这个 stream 执行一个持久的查询操作 + +现在,我们可以通过 `SELECT` 在这个 stream 上执行一个持久的查询。 + +这个查询的结果将被直接展现在 CLI 中。 + +以下查询任务会输出所有 `demo` stream 中具有 humidity 大于 70 的数据。 + +```sql +SELECT * FROM demo WHERE humidity > 70 EMIT CHANGES; +``` + +现在看起来好像无事发生。这是因为从这个任务执行开始,还没有数据被写入到 demo 中。 +接下来,我们会写入一些数据,然后符合条件的数据就会被以上任务输出。 + +## 启动另一个 CLI 窗口 + +我们可以利用这个新的 CLI 来插入数据: + +```sh +docker exec -it some-hstream-cli hstream --port 6570 sql +``` + +## 向 stream 中写入数据 + +输入并运行以下所有 `INSERT` 语句,然后关注我们之前创建的 CLI 窗口。 + +```sql +INSERT INTO demo (temperature, humidity) VALUES (22, 80); +INSERT INTO demo (temperature, humidity) VALUES (15, 20); +INSERT INTO demo (temperature, humidity) VALUES (31, 76); +INSERT INTO demo (temperature, humidity) VALUES ( 5, 45); +INSERT INTO demo (temperature, humidity) VALUES (27, 82); +INSERT INTO demo (temperature, humidity) VALUES (28, 86); +``` + +不出意外的话,你将看到以下的结果。 + +```json +{"humidity":{"$numberLong":"80"},"temperature":{"$numberLong":"22"}} +{"humidity":{"$numberLong":"76"},"temperature":{"$numberLong":"31"}} +{"humidity":{"$numberLong":"82"},"temperature":{"$numberLong":"27"}} +{"humidity":{"$numberLong":"86"},"temperature":{"$numberLong":"28"}} +``` + +[non-root-docker]: https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user +[quick-start.yaml]: https://raw.githubusercontent.com/hstreamdb/docs-next/main/assets/quick-start.yaml + +## 启动 HStreamDB CONSOLE + +Console 是 HStreamDB 的图形管理面板。使用 Console 你可以方便的管理绝大部分 HStreamDB 的资源,执行数据读写,执行 SQL 查询等。 + +在浏览器中输入 
http://localhost:5177 即可打开 Console 面板,更多信息请参考 [在 HStream Console 上开始](./hstream-console.md) diff --git a/docs/zh/v0.17.0/start/try-out-hstream-platform.md b/docs/zh/v0.17.0/start/try-out-hstream-platform.md new file mode 100644 index 0000000..7d0a7d0 --- /dev/null +++ b/docs/zh/v0.17.0/start/try-out-hstream-platform.md @@ -0,0 +1,85 @@ +# 开始使用 HStream Platform + +This page guides you on how to try out the HStream Platform quickly from scratch. +You will learn how to create a stream, write records to the stream, and query records from the stream. + +## Apply for a Trial + +Before starting, you need to apply for a trial account for the HStream Platform. +If you already have an account, you can skip this step. + +### Create a new account + + + +::: info +By creating an account, you agree to the [Terms of Service](https://www.emqx.com/en/policy/terms-of-use) and [Privacy Policy](https://www.emqx.com/en/policy/privacy-policy). +::: + +To create a new account, please fill in the required information on the form provided on the [Sign Up](https://account.hstream.io/signup) page, all fields are shown below: + +- **Username**: Your username. +- **Email**: Your email address. This email address will be used for the HStream Platform login. +- **Password**: Your password. The password must be at least eight characters long. +- **Company (Optional)**: Your company name. + +After completing the necessary fields, click the **Sign Up** button to proceed with creating your new account. In case of a successful account creation, you will be redirected to the login page. + +### Log in to the HStream Platform + +To log in to the HStream Platform after creating an account, please fill in the required information on the form provided on the [Log In](https://account.hstream.io/login) page, all fields are shown below: + +- **Email**: Your email address. +- **Password**: Your password. + +Once you have successfully logged in, you will be redirected to the home of HStream Platform. + +## Create a stream + +To create a new stream, follow the steps below: + +1. Head to the **Streams** page and locate the **New stream** button. +2. Once clicked, you will be directed to the **New stream** page. +3. Here, simply provide a name for the stream and leave the other fields as default. +4. Finally, click on the **Create** button to finalize the stream creation process. + +The stream will be created immediately, and you will see the stream listed on the **Streams** page. + +::: tip +For more information about how to create a stream, see [Create a Stream](../platform/stream-in-platform.md#create-a-stream). +::: + +## Write records to the stream + +After creating a stream, you can write records to the stream. Go to the stream details page by clicking the stream name in the table and +then click the **Write records** button. A drawer will appear, and you can write records to the stream in the drawer. + +In this example, we will write the following record to the stream: + +```json +{ "name": "Alice", "age": 18 } +``` + +Please fill it in the **Value** Field and click the **Produce** button. + +If the record is written successfully, you will see a success message and the response +of the request. + +Next, we can query this record from the stream. + +::: tip +For more information about how to write records to a stream, see [Write Records to Streams](../platform/write-in-platform.md). +::: + +## Get records from the stream + +After writing records to the stream, you can get records from the stream. 
Go back
+to the stream page, click the **Records** tab, and you will see an empty table.
+
+Click the **Get records** button, and the record written in the previous step will be displayed.
+
+## Next steps
+
+- Explore the [stream in detail](../platform/stream-in-platform.md#view-stream-details).
+- [Create a subscription](../platform/subscription-in-platform.md#create-a-subscription) to consume records from the stream.
+- [Query records](../platform/write-in-platform.md#query-records) from streams.
diff --git a/docs/zh/v0.17.0/statistics.png b/docs/zh/v0.17.0/statistics.png
new file mode 100644
index 0000000..adaf375
Binary files /dev/null and b/docs/zh/v0.17.0/statistics.png differ
diff --git a/docs/zh/v0.17.0/write/_index.md b/docs/zh/v0.17.0/write/_index.md
new file mode 100644
index 0000000..d91d614
--- /dev/null
+++ b/docs/zh/v0.17.0/write/_index.md
@@ -0,0 +1,6 @@
+---
+order: ["stream.md", "shards.md", "write.md"]
+collapsed: false
+---
+
+Write data
diff --git a/docs/zh/v0.17.0/write/shards.md b/docs/zh/v0.17.0/write/shards.md
new file mode 100644
index 0000000..5724a95
--- /dev/null
+++ b/docs/zh/v0.17.0/write/shards.md
@@ -0,0 +1,39 @@
+# Manage Shards of the Stream
+
+## Sharding in HStreamDB
+
+A stream is a logical concept for producers and consumers; under the hood, the
+data passing through a stream is stored in its shards in an append-only
+fashion.
+
+A shard is essentially the primary storage unit, containing all the records
+associated with some partition keys. Every stream contains multiple shards
+spread across multiple server nodes. Since we believe that a stream is in
+itself a sufficiently concise and powerful abstraction, the sharding logic is
+kept minimally visible to the user: during writing or consumption, each stream
+appears to be managed as a single entity as far as the user is concerned.
+
+However, for cases where users need finer-grained control and more
+flexibility, we offer interfaces to inspect the shards of a stream, as well as
+other interfaces that work with shards, such as the Reader.
+
+## Specify the Number of Shards When Creating a Stream
+
+To decide the number of shards a stream should have, set the `shardCount`
+attribute when creating a [stream](./stream.md#attributes-of-a-stream).
+
+## List Shards
+
+To list all the shards of one stream:
+ +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/ListShardsExample.java [Java] + +<<< @/../examples/go/examples/ExampleListShards.go [Go] + +@snippet examples/py/snippets/guides.py common list-shards + +::: diff --git a/docs/zh/v0.17.0/write/stream.md b/docs/zh/v0.17.0/write/stream.md new file mode 100644 index 0000000..a6b4e10 --- /dev/null +++ b/docs/zh/v0.17.0/write/stream.md @@ -0,0 +1,81 @@ +# 创建和管理 Stream + +## 命名资源准则 + +一个 HStream 资源的名称可以唯一地识别一个 HStream 资源,如一个 stream、 subscription 或 reader。 +资源名称必须符合以下要求: + +- 以一个字母开头 +- 长度必须不超过 255 个字符 +- 只包含以下字符。字母`[A-Za-z]`,数字`[0-9]`。 + 破折号`-`,下划线`_`。 + +\*用于资源名称作为 SQL 语句的一部分的情况。例如在 [HStream SQL Shell](../reference/cli.md#hstream-sql) 中或者用 SQL 创建 IO 任务时, +将会出现资源名称无法被正确解析的情况(如与关键词冲突),此时需要用户用双引号 `"`括住资源名称。这个限制或将会在日后的版本中被改进移除。 + +## Stream 的属性 + +- Replication factor + + 为了容错性和更高的可用性,每个 Stream 都可以在集群中的节点之间进行复制。一个常 + 用的生产环境 Replication factor 配置是为 3,也就是说,你的数据总是有三个副本, + 这在出错或你想对 Server 进行维护时将会很有帮助。这种复制是以 Stream 为单位上进 + 行的。 + +- Backlog Retention + + 该配置控制 HStreamDB 的 Stream 中的 records 被写入后保留的时间。当超过 + retention 保留的时间后,HStreamDB 将会清理这些 records,不管它是否被消费过。 + + - 默认值=7 天 + - 最小值=1 秒 + - 最大值=21 天 + +## 创建一个 stream + +在你写入 records 或者 创建一个订阅之前先创建一个 stream。 + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/CreateStreamExample.java [Java] + +<<< @/../examples/go/examples/ExampleCreateStream.go [Go] + +@snippet examples/py/snippets/guides.py common create-stream + +::: + +## 删除一个 Stream + +只有当一个 Stream 没有所属的订阅时才允许被删除,除非传一个强制标删除的 flag 。 + +## 强制删除一个 Stream + +如果你需要删除一个有订阅的 stream 时,请启用强制删除。在强制删除一个 stream 后, +原来 stream 的订阅仍然可以从 backlog 中读取数据。这些订阅的 stream 名字会变成 +`__deleted_stream__`。同时,我们并不允许在被删除的 stream 上创建新的订阅,也不允 +许向该 stream 写入新的 record。 + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/DeleteStreamExample.java [Java] + +<<< @/../examples/go/examples/ExampleDeleteStream.go [Go] + +@snippet examples/py/snippets/guides.py common delete-stream + +::: + +## 列出所有 stream 信息 + +可以如下拿到所有 HStream 中的 stream: + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/ListStreamsExample.java [Java] + +<<< @/../examples/go/examples/ExampleListStreams.go [Go] + +@snippet examples/py/snippets/guides.py common list-streams + +::: diff --git a/docs/zh/v0.17.0/write/write.md b/docs/zh/v0.17.0/write/write.md new file mode 100644 index 0000000..b9e90cb --- /dev/null +++ b/docs/zh/v0.17.0/write/write.md @@ -0,0 +1,84 @@ +# 向 HStreamDB 中的 Stream 写入 Records + +本文档提供了关于如何通过 hstreamdb-java 等客户端向 HStreamDB 中的 Stream 写入数据的相关教程。 + +同时还可参考其他的相关教程: + +- 如何[创建和管理 Stream](./stream.md). +- 如何[通过 Subscription 消费写入 Stream 中的 Records](../receive/consume.md). 
+ +为了向 HStreamDB 写数据,我们需要将消息打包成 HStream Record,以及一个创建和发送 +消息到服务器的 Producer。 + +## HStream Record + +Stream 中的所有数据都是以 HStream Record 的形式存在,HStreamDB 支持以下两种 +HStream Record: + +- **HRecord**: 可以看作是一段 JSON 数据,就像一些 NoSQL 数据库中的 document。 +- **Raw Record**: 二进制数据。 + +## 端到端压缩 + +为了降低传输开销,最大化带宽利用率,HStreamDB 支持对写入的 HStream Record 进行压缩。 +用户在创建 `BufferedProducer` 时可以设置压缩算法。当前可选的压缩算法有 +`gzip` 和 `zstd`。客户端从 HStreamDB 中消费数据时会自动完成解压缩操作。 + +## 写入 HStream Record + +有两种方法可以把 records 写入 HStreamDB。从简单易用的角度,你可以从 +`client.newProducer()` 的`Producer` 入手。这个 `Producer` 没有提供任何配置项,它 +只会即刻将收到的每个 record 并行发送到 HServer,这意味着它并不能保证这些 records +的顺序。在生产环境中, `client.newBufferedProducer()` 中的 `BufferedProducer` 将 +是更好的选择,`BufferedProducer` 将按顺序缓存打包 records 成一个 batch,并将该 +batch 发送到服务器。每一条 record 被写入 stream 时,HServer 将为该 record 生成一 +个相应的 record ID,并将其发回给客户端。这个 record ID 在 stream 中是唯一的。 + +## 使用 Producer + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/WriteDataSimpleExample.java [Java] + +<<< @/../examples/go/examples/ExampleWriteProducer.go [Go] + +@snippet examples/py/snippets/guides.py common append-records + +::: + +## 使用 BufferedProducer + +在几乎所有情况下,我们更推荐使用 `BufferedProducer`。不仅因为它能提供更大的吞吐 +量,它还提供了更加灵活的配置去调整,用户可以根据需求去在吞吐量和时延之间做出调整 +。你可以配置 `BufferedProducer` 的以下两个设置来控制和设置触发器和缓存区大小。通 +过 `BatchSetting`,你可以根据 batch 的最大 record 数、batch 的总字节数和 batch +存在的最大时限来决定何时发送。通过配置 `FlowControlSetting`,你可以为所有的缓存 +的 records 设置缓存大小和策略。下面的代码示例展示了如何使用 BatchSetting 来设置 +响应的 trigger,以通知 producers 何时应该刷新,以及 `FlowControlSetting` 来限制 +`BufferedProducer` 中的 buffer 的最大字节数。 + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/WriteDataBufferedExample.java [Java] + +<<< @/../examples/go/examples/ExampleWriteBatchProducer.go [Go] + +@snippet examples/py/snippets/guides.py common buffered-append-records + +::: + +## 使用分区键(Partition Key) + +具有相同分区键的 records 可以在 BufferedProducer 中被保证能有序地写入。HStreamDB +的另一个重要功能,分区,也使用这些分区键来决定 records 将被分配到哪个分区, +以此提高写/读性能。更详细的解释请看[管理 Stream 的分区](./shards.md)。 + +参考下面的例子,你可以很容易地写入带有分区键的 records。 + +::: code-group + +<<< @/../examples/java/app/src/main/java/docs/code/examples/WriteDataWithKeyExample.java [Java] + +<<< @/../examples/go/examples/ExampleWriteBatchProducerMultiKey.go [Go] + +:::