diff --git a/.github/ISSUE_TEMPLATE/help_support.md b/.github/ISSUE_TEMPLATE/help_support.md index 3b7cc7c25ae..51669998cac 100644 --- a/.github/ISSUE_TEMPLATE/help_support.md +++ b/.github/ISSUE_TEMPLATE/help_support.md @@ -1,11 +1,11 @@ --- name: QA/Help/Support -about: Please use [Discussion](https://github.com/uber/cadence/discussions) or [StackOverflow](https://stackoverflow.com/questions/tagged/cadence-workflow) for QA/Help/Support +about: Please use [Discussion](https://github.com/cadence-workflow/cadence/discussions) or [StackOverflow](https://stackoverflow.com/questions/tagged/cadence-workflow) for QA/Help/Support title: '' labels: '' assignees: '' --- -Please use [Discussion](https://github.com/uber/cadence/discussions) or [StackOverflow](https://stackoverflow.com/questions/tagged/cadence-workflow) for QA/Help/Support. -Do NOT use issue for this. \ No newline at end of file +Please use [Discussion](https://github.com/cadence-workflow/cadence/discussions) or [StackOverflow](https://stackoverflow.com/questions/tagged/cadence-workflow) for QA/Help/Support. +Do NOT use issue for this. diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md index 28e669bd9fd..5030737666b 100644 --- a/.github/pull_request_template.md +++ b/.github/pull_request_template.md @@ -16,5 +16,5 @@ **Release notes** - + **Documentation Changes** diff --git a/CHANGELOG.md b/CHANGELOG.md index 4452007c711..4d1e2376011 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,34 +4,34 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). -You can find a list of previous releases on the [github releases](https://github.com/uber/cadence/releases) page. +You can find a list of previous releases on the [github releases](https://github.com/cadence-workflow/cadence/releases) page. 
## [Unreleased] -Global ratelimiter, see [detailed doc](https://github.com/uber/cadence/blob/master/common/quotas/global/doc.go) +Global ratelimiter, see [detailed doc](https://github.com/cadence-workflow/cadence/blob/master/common/quotas/global/doc.go) ## [1.2.13] - 2024-09-25 -See [Release Note](https://github.com/uber/cadence/releases/tag/v1.2.13) for details +See [Release Note](https://github.com/cadence-workflow/cadence/releases/tag/v1.2.13) for details ## [1.2.12] - 2024-08-19 -See [Release Note](https://github.com/uber/cadence/releases/tag/v1.2.12) for details +See [Release Note](https://github.com/cadence-workflow/cadence/releases/tag/v1.2.12) for details ## [1.2.11] - 2024-07-10 -See [Release Note](https://github.com/uber/cadence/releases/tag/v1.2.11) for details +See [Release Note](https://github.com/cadence-workflow/cadence/releases/tag/v1.2.11) for details ## [1.2.10] - 2024-06-04 -See [Release Note](https://github.com/uber/cadence/releases/tag/v1.2.10) for details +See [Release Note](https://github.com/cadence-workflow/cadence/releases/tag/v1.2.10) for details ## [1.2.9] - 2024-05-01 -See [Release Note](https://github.com/uber/cadence/releases/tag/v1.2.9) for details +See [Release Note](https://github.com/cadence-workflow/cadence/releases/tag/v1.2.9) for details ## [1.2.8] - 2024-03-26 -See [Release Note](https://github.com/uber/cadence/releases/tag/v1.2.8) for details +See [Release Note](https://github.com/cadence-workflow/cadence/releases/tag/v1.2.8) for details ## [1.2.7] - 2024-02-09 -See [Release Note](https://github.com/uber/cadence/releases/tag/v1.2.7) for details +See [Release Note](https://github.com/cadence-workflow/cadence/releases/tag/v1.2.7) for details ### Upgrade notes Cadence repo now has multiple submodules, -the split and explanation in [PR](https://github.com/uber/cadence/pull/5609). +the split and explanation in [PR](https://github.com/cadence-workflow/cadence/pull/5609). In principle, "plugins" are "optional" and we should not be forcing all optional dependencies on all users of any of Cadence. Splitting dependencies into choose-your-own-adventure submodules is simply good library design for the ecosystem, and it's something we should be doing more of. @@ -275,10 +275,10 @@ Disable isolation for sticky tasklist (#5319) Change default value of AsyncTaskDispatchTimeout (#5320) ## [1.0.0] - 2023-04-26 -See [Release Note](https://github.com/uber/cadence/releases/tag/v1.0.0) +See [Release Note](https://github.com/cadence-workflow/cadence/releases/tag/v1.0.0) ## [0.23.1] - 2021-11-18 -See [Release Note](https://github.com/uber/cadence/releases/tag/v0.23.1) +See [Release Note](https://github.com/cadence-workflow/cadence/releases/tag/v0.23.1) ## [0.21.3] - 2021-07-17 ### Added diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 16b83424888..fba34d212a2 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -2,7 +2,7 @@ This doc is intended for contributors to `cadence` server (hopefully that's you!) -Join our Slack channel(invite link in the [home page](https://github.com/uber/cadence#cadence)) #development if you need help. +Join our Slack channel(invite link in the [home page](https://github.com/cadence-workflow/cadence#cadence)) #development if you need help. >Note: All contributors need to fill out the [Uber Contributor License Agreement](http://t.uber.com/cla) before we can merge in any of your changes ## Development Environment @@ -13,7 +13,7 @@ Below are the instructions of how to set up a development Environment. * Golang. Install on OS X with. 
``` brew install go -``` +``` * Make sure set PATH to include bin path of GO so that other executables like thriftrw can be found. ```bash # check it first @@ -22,9 +22,9 @@ echo $GOPATH PATH=$PATH:$GOPATH/bin # to confirm, run echo $PATH -``` +``` -* Download golang dependencies. +* Download golang dependencies. ```bash go mod download ``` @@ -35,7 +35,7 @@ After check out and go to the Cadence repo, compile the `cadence` service and he git submodule update --init --recursive make bins -``` +``` You should be able to get all the binaries of this repo: * cadence-server: the server binary @@ -46,18 +46,18 @@ You should be able to get all the binaries of this repo: * cadence-bench: the benchmark test binary -:warning: Note: +:warning: Note: If running into any compiling issue ->1. For proto/thrift errors, run `git submodule update --init --recursive` to fix +>1. For proto/thrift errors, run `git submodule update --init --recursive` to fix >2. Make sure you upgrade to the latest stable version of Golang. ->3. Check if this document is outdated by comparing with the building steps in [Dockerfile](https://github.com/uber/cadence/blob/master/Dockerfile) +>3. Check if this document is outdated by comparing with the building steps in [Dockerfile](https://github.com/cadence-workflow/cadence/blob/master/Dockerfile) ### 2. Setup Dependency NOTE: you may skip this section if you have installed the dependencies in any other ways, for example, using homebrew. Cadence's core data model can be running with different persistence storages, including Cassandra,MySQL and Postgres. -Please refer to [persistence documentation](https://github.com/uber/cadence/blob/master/docs/persistence.md) if you want to learn more. +Please refer to [persistence documentation](https://github.com/cadence-workflow/cadence/blob/master/docs/persistence.md) if you want to learn more. Cadence's visibility data model can be running with either Cassandra/MySQL/Postgres database, or ElasticSearch+Kafka. The latter provides [advanced visibility feature](./docs/visibility-on-elasticsearch.md) We recommend to use [docker-compose](https://docs.docker.com/compose/) to start those dependencies: @@ -66,52 +66,52 @@ We recommend to use [docker-compose](https://docs.docker.com/compose/) to start ``` docker-compose -f ./docker/dev/cassandra.yml up ``` -You will use `CTRL+C` to stop it. Then `docker-compose -f ./docker/dev/cassandra.yml down` to clean up the resources. +You will use `CTRL+C` to stop it. Then `docker-compose -f ./docker/dev/cassandra.yml down` to clean up the resources. Or to run in the background ``` -docker-compose -f ./docker/dev/cassandra.yml up -d +docker-compose -f ./docker/dev/cassandra.yml up -d ``` Also use `docker-compose -f ./docker/dev/cassandra.yml down` to stop and clean up the resources. * Alternatively, use `./docker/dev/mysql.yml` for MySQL dependency. 
(MySQL has been updated from 5.7 to 8.0) -* Alternatively, use `./docker/dev/postgres.yml` for PostgreSQL dependency +* Alternatively, use `./docker/dev/postgres.yml` for PostgreSQL dependency * Alternatively, use `./docker/dev/cassandra-esv7-kafka.yml` for Cassandra, ElasticSearch(v7) and Kafka/ZooKeeper dependencies * Alternatively, use `./docker/dev/mysql-esv7-kafka.yml` for MySQL, ElasticSearch(v7) and Kafka/ZooKeeper dependencies * Alternatively, use `./docker/dev/cassandra-opensearch-kafka.yml` for Cassandra, OpenSearch(compatible with ElasticSearch v7) and Kafka/ZooKeeper dependencies * Alternatively, use `./docker/dev/mongo-esv7-kafka.yml` for MongoDB, ElasticSearch(v7) and Kafka/ZooKeeper dependencies -### 3. Schema installation -Based on the above dependency setup, you also need to install the schemas. +### 3. Schema installation +Based on the above dependency setup, you also need to install the schemas. * If you use `cassandra.yml` then run `make install-schema` to install Cassandra schemas * If you use `cassandra-esv7-kafka.yml` then run `make install-schema && make install-schema-es-v7` to install Cassandra & ElasticSearch schemas -* If you use `cassandra-opensearch-kafka.yml` then run `make install-schema && make install-schema-es-opensearch` to install Cassandra & OpenSearch schemas +* If you use `cassandra-opensearch-kafka.yml` then run `make install-schema && make install-schema-es-opensearch` to install Cassandra & OpenSearch schemas * If you use `mysql.yml` then run `install-schema-mysql` to install MySQL schemas * If you use `postgres.yml` then run `install-schema-postgres` to install Postgres schemas * `mysql-esv7-kafka.yml` can be used for single MySQL + ElasticSearch or multiple MySQL + ElasticSearch mode * for single MySQL: run `install-schema-mysql && make install-schema-es-v7` - * for multiple MySQL: run `make install-schema-multiple-mysql` which will install schemas for 4 mysql databases and ElasticSearch + * for multiple MySQL: run `make install-schema-multiple-mysql` which will install schemas for 4 mysql databases and ElasticSearch -:warning: Note: ->If you use `cassandra-esv7-kafka.yml` and start server before `make install-schema-es-v7`, ElasticSearch may create a wrong index on demand. +:warning: Note: +>If you use `cassandra-esv7-kafka.yml` and start server before `make install-schema-es-v7`, ElasticSearch may create a wrong index on demand. You will have to delete the wrong index and then run the `make install-schema-es-v7` again. To delete the wrong index: ``` curl -X DELETE "http://127.0.0.1:9200/cadence-visibility-dev" ``` -### 4. Run +### 4. Run Once you have done all above, try running the local binaries: -Then you will be able to run a basic local Cadence server for development. +Then you will be able to run a basic local Cadence server for development. 
- * If you use `cassandra.yml`, then run `./cadence-server start`, which will load `config/development.yaml` as config
+ * If you use `cassandra.yml`, then run `./cadence-server start`, which will load `config/development.yaml` as config
* If you use `mysql.yml` then run `./cadence-server --zone mysql start`, which will load `config/development.yaml` + `config/development_mysql.yaml` as config
- * If you use `postgres.yml` then run `./cadence-server --zone postgres start` , which will load `config/development.yaml` + `config/development_postgres.yaml` as config
+ * If you use `postgres.yml` then run `./cadence-server --zone postgres start`, which will load `config/development.yaml` + `config/development_postgres.yaml` as config
* If you use `cassandra-esv7-kafka.yml` then run `./cadence-server --zone es_v7 start`, which will load `config/development.yaml` + `config/development_es_v7.yaml` as config
* If you use `cassandra-opensearch-kafka.yml` then run `./cadence-server --zone es_opensearch start` , which will load `config/development.yaml` + `config/development_es_opensearch.yaml` as config
- * If you use `mysql-esv7-kafka.yaml`
+ * If you use `mysql-esv7-kafka.yaml`
* To run with multiple MySQL : `./cadence-server --zone multiple_mysql start`, which will load `config/development.yaml` + `config/development_multiple_mysql.yaml` as config
Then register a domain:
@@ -120,19 +120,19 @@ Then register a domain:
```
### Sample Repo
-The sample code is available in the [Sample repo]https://github.com/uber-common/cadence-samples
+The sample code is available in the [Sample repo](https://github.com/cadence-workflow/cadence-samples)
-Then run a helloworld from [Go Client Sample](https://github.com/uber-common/cadence-samples/) or [Java Client Sample](https://github.com/uber/cadence-java-samples)
+Then run a helloworld from [Go Client Sample](https://github.com/cadence-workflow/cadence-samples) or [Java Client Sample](https://github.com/cadence-workflow/cadence-java-samples)
```
make bins
```
-will build all the samples.
+will build all the samples.
-Then
+Then
+```
+./bin/helloworld -m worker &
```
-./bin/helloworld -m worker &
-```
will start a worker for helloworld workflow. You will see like:
```
@@ -143,8 +143,8 @@ $./bin/helloworld -m worker &
2021-09-24T21:07:03.250-0700 INFO common/sample_helper.go:161 Domain successfully registered. {"Domain": "samples-domain"}
2021-09-24T21:07:03.291-0700 INFO internal/internal_worker.go:833 Started Workflow Worker {"Domain": "samples-domain", "TaskList": "helloWorldGroup", "WorkerID": "16520@IT-USA-25920@helloWorldGroup"}
2021-09-24T21:07:03.300-0700 INFO internal/internal_worker.go:858 Started Activity Worker {"Domain": "samples-domain", "TaskList": "helloWorldGroup", "WorkerID": "16520@IT-USA-25920@helloWorldGroup"}
-```
-Then
+```
+Then
```
./bin/helloworld
```
@@ -162,16 +162,16 @@ $./bin/helloworld
2021-09-24T21:07:06.513-0700 INFO helloworld/helloworld_workflow.go:55 Workflow completed. {"Domain": "samples-domain", "TaskList": "helloWorldGroup", "WorkerID": "16520@IT-USA-25920@helloWorldGroup", "WorkflowType": "helloWorldWorkflow", "WorkflowID": "helloworld_75cf142b-c0de-407e-9115-1d33e9b7551a", "RunID": "98a229b8-8fdd-4d1f-bf41-df00fb06f441", "Result": "Hello Cadence!"}
```
-See [instructions](service/worker/README.md) for setting up replication(XDC).
+See [instructions](service/worker/README.md) for setting up replication (XDC).
## Issues to start with
Take a look at the list of issues labeled with
-[good first issue](https://github.com/uber/cadence/labels/good%20first%20issue).
+[good first issue](https://github.com/cadence-workflow/cadence/labels/good%20first%20issue).
These issues are a great way to start contributing to Cadence.
-Later when you are more familiar with Cadence, look at issues with
-[up-for-grabs](https://github.com/uber/cadence/labels/up-for-grabs).
+Later when you are more familiar with Cadence, look at issues with
+[up-for-grabs](https://github.com/cadence-workflow/cadence/labels/up-for-grabs).
@@ -204,9 +204,9 @@ This will run all the tests excluding end-to-end integration test in host/ packa
make test
```
:warning: Note:
-> You will see some test failures because of errors connecting to MySQL/Postgres if only Cassandra is up. This is okay if you don't write any code related to persistence layer.
-
-To run all end-to-end integration tests in **host/** package:
+> You will see some test failures because of errors connecting to MySQL/Postgres if only Cassandra is up. This is okay if you don't write any code related to the persistence layer.
+
+To run all end-to-end integration tests in the **host/** package:
```bash
make test_e2e
```
@@ -232,7 +232,7 @@ You have a few options for choosing when to submit:
* You can open a PR with an initial prototype with "Draft" option or with "WIP"(work in progress) in the title. This is useful when want to get some early feedback.
-* PR is supposed to be or near production ready. You should have fixed all styling, adding new tests if possible, and verified the change doesn't break any existing tests.
+* The PR is supposed to be production ready or near production ready. You should have fixed all styling issues, added new tests if possible, and verified the change doesn't break any existing tests.
* For small changes where the approach seems obvious, you can open a PR with what you believe to be production-ready or near-production-ready code. As you get more experience with how we develop code, you'll find that more PRs will begin falling into this category.
@@ -242,19 +242,19 @@ Overcommit adds some requirements to your commit messages. At Uber, we follow the
[Chris Beams](http://chris.beams.io/posts/git-commit/) guide to writing git commit messages. Read it, follow it, learn it, love it.
-All commit messages are from the titles of your pull requests. So make sure follow the rules when titling them.
-Please don't use very generic titles like "bug fixes".
+All commit messages are from the titles of your pull requests. So make sure to follow the rules when titling them.
+Please don't use very generic titles like "bug fixes".
All PR titles should start with UPPER case.
Examples:
-- [Make sync activity retry multiple times before fetch history from remote](https://github.com/uber/cadence/pull/1379)
-- [Enable archival config per domain](https://github.com/uber/cadence/pull/1351)
+- [Make sync activity retry multiple times before fetch history from remote](https://github.com/cadence-workflow/cadence/pull/1379)
+- [Enable archival config per domain](https://github.com/cadence-workflow/cadence/pull/1351)
#### Code Format and Licence headers checking
-The project has strict rule about Golang format. You have to run
+The project has a strict rule about Golang format. You have to run
```bash
make fmt
```
@@ -277,12 +277,11 @@ No one is expected to write perfect code on the first try.
That's why we have co Also, don't be embarrassed when your review points out syntax errors, stray whitespace, typos, and missing docstrings! That's why we have reviews. These properties are meant to guide you in your final scan. ### Addressing feedback -If someone leaves line comments on your PR without leaving a top-level "looks good to me" (LGTM), it means they feel you should address their line comments before merging. +If someone leaves line comments on your PR without leaving a top-level "looks good to me" (LGTM), it means they feel you should address their line comments before merging. You should respond to all unresolved comments whenever you push a new revision or before you merge. -Also, as you gain confidence in Go, you'll find that some of the nitpicky style feedback you get does not make for obviously better code. Don't be afraid to stick to your guns and push back. Much of coding style is subjective. +Also, as you gain confidence in Go, you'll find that some of the nitpicky style feedback you get does not make for obviously better code. Don't be afraid to stick to your guns and push back. Much of coding style is subjective. ### Merging External contributors: you don't need to worry about this section. We'll merge your PR as soon as you've addressed all review feedback(you will get at least one approval) and pipeline runs are all successful(meaning all tests are passing). - diff --git a/PROPOSALS.md b/PROPOSALS.md index e122c175272..7439c7e5bdd 100644 --- a/PROPOSALS.md +++ b/PROPOSALS.md @@ -6,7 +6,7 @@ The design process for changes to Cadence is modeled on the [proposal process us ## Process -- [Create an issue](https://github.com/uber/cadence/issues/new) describing the proposal. +- [Create an issue](https://github.com/cadence-workflow/cadence/issues/new) describing the proposal. - Like any GitHub issue, a proposal issue is followed by an initial discussion about the suggestion. For Proposal issues: @@ -22,7 +22,7 @@ The design process for changes to Cadence is modeled on the [proposal process us - The design doc should only allow edit access to authors of the document. - Do not create the document from a corporate G Suite account. If you want to edit from a corporate G Suite account then first create the document from a personal Google account and grant edit access to your corporate G Suite account. - Comment access should also be accessible by the public. Make sure this is the case by clicking "Share" and "Get shareable link" ensuring you select "Anyone with the link can comment". - + - Once comments and revisions on the design doc wind down, there is a final discussion about the proposal. - The goal of the final discussion is to reach agreement on the next step: (1) accept or (2) decline. 
diff --git a/README.md b/README.md index 5308539dae4..0208f167738 100644 --- a/README.md +++ b/README.md @@ -1,33 +1,33 @@ # Cadence [![Build Status](https://badge.buildkite.com/159887afd42000f11126f85237317d4090de97b26c287ebc40.svg?theme=github&branch=master)](https://buildkite.com/uberopensource/cadence-server) -[![Coverage](https://codecov.io/gh/uber/cadence/graph/badge.svg?token=7SD244ImNF)](https://codecov.io/gh/uber/cadence) +[![Coverage](https://codecov.io/gh/cadence-workflow/cadence/graph/badge.svg?token=7SD244ImNF)](https://codecov.io/gh/cadence-workflow/cadence) [![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](http://t.uber.com/cadence-slack) -[![Github release](https://img.shields.io/github/v/release/uber/cadence.svg)](https://GitHub.com/uber/cadence/releases) -[![License](https://img.shields.io/github/license/uber/cadence.svg)](http://www.apache.org/licenses/LICENSE-2.0) +[![Github release](https://img.shields.io/github/v/release/cadence-workflow/cadence.svg)](https://github.com/cadence-workflow/cadence/releases) +[![License](https://img.shields.io/github/license/cadence-workflow/cadence.svg)](http://www.apache.org/licenses/LICENSE-2.0) -[![GitHub stars](https://img.shields.io/github/stars/uber/cadence.svg?style=social&label=Star&maxAge=2592000)](https://GitHub.com/uber/cadence/stargazers/) -[![GitHub forks](https://img.shields.io/github/forks/uber/cadence.svg?style=social&label=Fork&maxAge=2592000)](https://GitHub.com/uber/cadence/network/) +[![GitHub stars](https://img.shields.io/github/stars/cadence-workflow/cadence.svg?style=social&label=Star&maxAge=2592000)](https://github.com/cadence-workflow/cadence/stargazers/) +[![GitHub forks](https://img.shields.io/github/forks/cadence-workflow/cadence.svg?style=social&label=Fork&maxAge=2592000)](https://github.com/cadence-workflow/cadence/network/) This repo contains the source code of the Cadence server and other tooling including CLI, schema tools, bench and canary. You can implement your workflows with one of our client libraries. -The [Go](https://github.com/uber-go/cadence-client) and [Java](https://github.com/uber-java/cadence-client) libraries are officially maintained by the Cadence team, +The [Go](https://github.com/cadence-workflow/cadence-go-client) and [Java](https://github.com/cadence-workflow/cadence-java-client) libraries are officially maintained by the Cadence team, while the [Python](https://github.com/firdaus/cadence-python) and [Ruby](https://github.com/coinbase/cadence-ruby) client libraries are developed by the community. You can also use [iWF](https://github.com/indeedeng/iwf) as a DSL framework on top of Cadence. See Maxim's talk at [Data@Scale Conference](https://atscaleconference.com/videos/cadence-microservice-architecture-beyond-requestreply) for an architectural overview of Cadence. -Visit [cadenceworkflow.io](https://cadenceworkflow.io) to learn more about Cadence. Join us in [Cadence Documentation](https://github.com/uber/cadence-docs) project. Feel free to raise an Issue or Pull Request there. +Visit [cadenceworkflow.io](https://cadenceworkflow.io) to learn more about Cadence. Join us in [Cadence Documentation](https://github.com/cadence-workflow/Cadence-Docs) project. Feel free to raise an Issue or Pull Request there. 
### Community -* [Github Discussion](https://github.com/uber/cadence/discussions) +* [Github Discussion](https://github.com/cadence-workflow/cadence/discussions) * Best for Q&A, support/help, general discusion, and annoucement * [StackOverflow](https://stackoverflow.com/questions/tagged/cadence-workflow) * Best for Q&A and general discusion -* [Github Issues](https://github.com/uber/cadence/issues) +* [Github Issues](https://github.com/cadence-workflow/cadence/issues) * Best for reporting bugs and feature requests * [Slack](http://t.uber.com/cadence-slack) * Best for contributing/development discussion @@ -43,7 +43,7 @@ Please visit our [documentation](https://cadenceworkflow.io/docs/operation-guide ### Run the Samples -Try out the sample recipes for [Go](https://github.com/uber-common/cadence-samples) or [Java](https://github.com/uber/cadence-java-samples) to get started. +Try out the sample recipes for [Go](https://github.com/cadence-workflow/cadence-samples) or [Java](https://github.com/cadence-workflow/cadence-java-samples) to get started. ### Use [Cadence CLI](https://cadenceworkflow.io/docs/cli/) @@ -51,7 +51,7 @@ Cadence CLI can be used to operate workflows, tasklist, domain and even the clus You can use the following ways to install Cadence CLI: * Use brew to install CLI: `brew install cadence-workflow` - * Follow the [instructions](https://github.com/uber/cadence/discussions/4457) if you need to install older versions of CLI via homebrew. Usually this is only needed when you are running a server of a too old version. + * Follow the [instructions](https://github.com/cadence-workflow/cadence/discussions/4457) if you need to install older versions of CLI via homebrew. Usually this is only needed when you are running a server of a too old version. * Use docker image for CLI: `docker run --rm ubercadence/cli:` or `docker run --rm ubercadence/cli:master ` . Be sure to update your image when you want to try new features: `docker pull ubercadence/cli:master ` * Build the CLI binary yourself, check out the repo and run `make cadence` to build all tools. See [CONTRIBUTING](CONTRIBUTING.md) for prerequisite of make command. * Build the CLI image yourself, see [instructions](docker/README.md#diy-building-an-image-for-any-tag-or-branch) @@ -62,7 +62,7 @@ Please read the [documentation](https://cadenceworkflow.io/docs/cli/#documentati ### Use Cadence Web -Try out [Cadence Web UI](https://github.com/uber/cadence-web) to view your workflows on Cadence. +Try out [Cadence Web UI](https://github.com/cadence-workflow/cadence-web) to view your workflows on Cadence. (This is already available at localhost:8088 if you run Cadence with docker compose) @@ -91,7 +91,7 @@ The easiest way to get the schema tool is via homebrew. `brew install cadence-workflow` also includes `cadence-sql-tool` and `cadence-cassandra-tool`. * The schema files are located at `/usr/local/etc/cadence/schema/`. * To upgrade, make sure you remove the old ElasticSearch schema first: `mv /usr/local/etc/cadence/schema/elasticsearch /usr/local/etc/cadence/schema/elasticsearch.old && brew upgrade cadence-workflow`. Otherwise ElasticSearch schemas may not be able to get updated. - * Follow the [instructions](https://github.com/uber/cadence/discussions/4457) if you need to install older versions of schema tools via homebrew. + * Follow the [instructions](https://github.com/cadence-workflow/cadence/discussions/4457) if you need to install older versions of schema tools via homebrew. 
However, easier way is to use new versions of schema tools with old versions of schemas. All you need is to check out the older version of schemas from this repo. Run `git checkout v0.21.3` to get the v0.21.3 schemas in [the schema folder](/schema). @@ -102,4 +102,4 @@ The easiest way to get the schema tool is via homebrew. ## License -MIT License, please see [LICENSE](https://github.com/uber/cadence/blob/master/LICENSE) for details. +MIT License, please see [LICENSE](https://github.com/cadence-workflow/cadence/blob/master/LICENSE) for details. diff --git a/canary/README.md b/canary/README.md index 16b46efdc80..1976ac8a073 100644 --- a/canary/README.md +++ b/canary/README.md @@ -8,8 +8,8 @@ Setup Canary test suite is running against a Cadence server/cluster. See [documentation](https://cadenceworkflow.io/docs/operation-guide/setup/) for Cadence server cluster setup. -Note that some tests require features like [Advanced Visibility]((https://cadenceworkflow.io/docs/concepts/search-workflows/).) and [History Archival](https://cadenceworkflow.io/docs/concepts/archival/). - +Note that some tests require features like [Advanced Visibility]((https://cadenceworkflow.io/docs/concepts/search-workflows/).) and [History Archival](https://cadenceworkflow.io/docs/concepts/archival/). + For local server env you can run it through: - Docker: Instructions for running Cadence server through docker can be found in `docker/README.md`. Either `docker-compose-es-v7.yml` or `docker-compose-es.yml` can be used to start the server. - Build from source: Please check [CONTRIBUTING](/CONTRIBUTING.md) for how to build and run Cadence server from source. Please also make sure Kafka and ElasticSearch are running before starting the server with `./cadence-server --zone es start`. If ElasticSearch v7 is used, change the value for `--zone` flag to `es_v7`. @@ -18,7 +18,7 @@ For local server env you can run it through: Different ways of start the canary: -### 1. Use docker image `ubercadence/cadence-canary:master` +### 1. Use docker image `ubercadence/cadence-canary:master` You can [pre-built docker-compose file](../docker/docker-compose-canary.yml) to run against local server In the `docker/` directory, run: @@ -29,12 +29,12 @@ docker-compose -f docker-compose-canary.yml up This will start the canary worker and also the cron canary. You can modify [the canary worker config](../docker/config/canary/development.yaml) to run against a prod server cluster: -* Use a different mode to start canary worker only for testing -* Update the config to use Thrift/gRPC for communication -* Use a different image than `master` tag. See [docker hub](https://hub.docker.com/repository/docker/ubercadence/cadence-canary) for all the images. -Similar to server/CLI images, the `master` image will be built and published automatically by Github on every commit onto the `master` branch. +* Use a different mode to start canary worker only for testing +* Update the config to use Thrift/gRPC for communication +* Use a different image than `master` tag. See [docker hub](https://hub.docker.com/repository/docker/ubercadence/cadence-canary) for all the images. +Similar to server/CLI images, the `master` image will be built and published automatically by Github on every commit onto the `master` branch. -### 2. Build & Run +### 2. 
Build & Run In the project root, build cadence canary binary: ``` @@ -45,13 +45,13 @@ Then start canary worker & cron: ``` ./cadence-canary start ``` -This is essentially the same as +This is essentially the same as ``` ./cadence-canary start -mode all ``` -By default, it will load [the configuration in `config/canary/development.yaml`](../config/canary/development.yaml). -Run `./cadence-canary -h` for details to understand the start options of how to change the loading directory if needed. +By default, it will load [the configuration in `config/canary/development.yaml`](../config/canary/development.yaml). +Run `./cadence-canary -h` for details to understand the start options of how to change the loading directory if needed. To start the worker only for manual testing certain cases: ``` @@ -60,23 +60,23 @@ To start the worker only for manual testing certain cases: ### 3. Monitoring -In production, it's recommended to monitor the result of this canary. You can use [the workflow success metric](https://github.com/uber/cadence/blob/9336ed963ca1b5e0df7206312aa5236433e04fd9/service/history/execution/context_util.go#L138) -emitted by cadence history service `workflow_success`. To monitor all the canary test cases, use `workflowType` of `workflow.sanity`. +In production, it's recommended to monitor the result of this canary. You can use [the workflow success metric](https://github.com/cadence-workflow/cadence/blob/9336ed963ca1b5e0df7206312aa5236433e04fd9/service/history/execution/context_util.go#L138) +emitted by cadence history service `workflow_success`. To monitor all the canary test cases, use `workflowType` of `workflow.sanity`. Configurations ---------------------- Canary workers configuration contains two parts: - **Canary**: this part controls which domains canary workers are responsible for what tests the sanity workflow will exclude. -```yaml +```yaml canary: - domains: ["cadence-canary"] # it will start workers on all those domains(also try to register if not exists) - excludes: ["workflow.searchAttributes", "workflow.batch", "workflow.archival.visibility", "workflow.archival.history"] # it will exclude the three test cases. If archival is not enabled, you should exclude "workflow.archival.visibility" and"workflow.archival.history". If advanced visibility is not enabled, you should exclude "workflow.searchAttributes" and "workflow.batch". Otherwise canary will fail on those test cases. - cron: + domains: ["cadence-canary"] # it will start workers on all those domains(also try to register if not exists) + excludes: ["workflow.searchAttributes", "workflow.batch", "workflow.archival.visibility", "workflow.archival.history"] # it will exclude the three test cases. If archival is not enabled, you should exclude "workflow.archival.visibility" and"workflow.archival.history". If advanced visibility is not enabled, you should exclude "workflow.searchAttributes" and "workflow.batch". Otherwise canary will fail on those test cases. + cron: cronSchedule: "@every 30s" #the schedule of cron canary, default to "@every 30s" cronExecutionTimeout: 18m #the timeout of each run of the cron execution, default to 18 minutes startJobTimeout: 9m #the timeout of each run of the sanity test suite, default to 9 minutes -``` -An exception here is `HistoryArchival` and `VisibilityArchival` test cases will always use `canary-archival-domain` domain. +``` +An exception here is `HistoryArchival` and `VisibilityArchival` test cases will always use `canary-archival-domain` domain. 
- **Cadence**: this control how canary worker should talk to Cadence server, which includes the server's service name and address. ```yaml @@ -87,15 +87,15 @@ cadence: #tlsCaFile: "path/to/file" # give file path to TLS CA file if TLS is enabled on the Cadence server #metrics: ... # optional detailed client side metrics like workflow latency. But for monitoring, simply use server side metrics `workflow_success` is enough. ``` -- **Metrics**: metrics configuration. Similar to server metric emitter, only M3/Statsd/Prometheus is supported. -- **Log**: logging configuration. Similar to server logging configuration. +- **Metrics**: metrics configuration. Similar to server metric emitter, only M3/Statsd/Prometheus is supported. +- **Log**: logging configuration. Similar to server logging configuration. Canary Test Cases & Starter ---------------------- -### Cron Canary (periodically running the Sanity/starter suite) +### Cron Canary (periodically running the Sanity/starter suite) -The Cron workflow is not a test case. It's a top-level workflow to kick off the Sanity suite(described below) periodically. +The Cron workflow is not a test case. It's a top-level workflow to kick off the Sanity suite(described below) periodically. To start the cron canary: ``` ./cadence-canary start -mode cronCanary @@ -106,23 +106,23 @@ For local development, you can also start the cron canary workflows along with t ./cadence-canary start -m all ``` -The Cron Schedule is from the Configuration. +The Cron Schedule is from the Configuration. However, changing the schedule requires you manually terminate the existing cron workflow to take into effect. -It can be [improved](https://github.com/uber/cadence/issues/4469) in the future. +It can be [improved](https://github.com/cadence-workflow/cadence/issues/4469) in the future. -The workflowID is fixed: `"cadence.canary.cron"` +The workflowID is fixed: `"cadence.canary.cron"` -### Sanity suite (Starter for all test cases) -The sanity workflow is test suite workflow. It will kick off a bunch of childWorkflows for all the test to verify that Cadence server is operating correctly. +### Sanity suite (Starter for all test cases) +The sanity workflow is test suite workflow. It will kick off a bunch of childWorkflows for all the test to verify that Cadence server is operating correctly. An error result of the sanity workflow indicates at least one of the test case fails. You can start the sanity workflow as one-off run: ``` cadence --do workflow start --tl canary-task-queue --et 1200 --wt workflow.sanity -i 0 -``` +``` -Or using the Cron Canary mentioned above to manage it. +Or using the Cron Canary mentioned above to manage it. Then observe the progress: @@ -132,21 +132,21 @@ cadence --do cadence-canary workflow ob -w <...workflowID form the start command NOTE 1: * tasklist(tl) is fixed to `canary-task-queue` -* execution timeout(et) is recommended to 20 minutes(`1200` seconds) but you can adjust it +* execution timeout(et) is recommended to 20 minutes(`1200` seconds) but you can adjust it * the only required input is the scheduled unix timestamp, and `0` will uses the workflow starting time - -NOTE 2: This is the workflow that you should monitor for alerting. -You can use [the workflow success metric](https://github.com/uber/cadence/blob/9336ed963ca1b5e0df7206312aa5236433e04fd9/service/history/execution/context_util.go#L138) -emitted by cadence history service `workflow_success`. To monitor all the canary test cases use `workflowType` of `workflow.sanity`. 
-
-NOTE 3: This is [the list of the test cases](./sanity.go) that it will start all supported test cases by default if no excludes are configured.
-You can find [the workflow names of the tests cases in this file](./const.go) if you want to manually start certain test cases.
+NOTE 2: This is the workflow that you should monitor for alerting.
+You can use [the workflow success metric](https://github.com/cadence-workflow/cadence/blob/9336ed963ca1b5e0df7206312aa5236433e04fd9/service/history/execution/context_util.go#L138)
+emitted by cadence history service `workflow_success`. To monitor all the canary test cases, use `workflowType` of `workflow.sanity`.
+
+
+NOTE 3: This is [the list of the test cases](./sanity.go); it will start all supported test cases by default if no excludes are configured.
+You can find [the workflow names of the test cases in this file](./const.go) if you want to manually start certain test cases.
### Echo
-Echo workflow tests the very basic workflow functionality. It executes an activity to return some output and verifies it as the workflow result.
+Echo workflow tests the very basic workflow functionality. It executes an activity to return some output and verifies it as the workflow result.
To manually start an `Echo` test case:
```
@@ -157,10 +157,10 @@ Then observe the progress:
```
cadence --do cadence-canary workflow ob -w <...workflowID form the start command output>
```
-You can use these command for all other test cases listed below.
+You can use these commands for all other test cases listed below.
### Signal
-Signal workflow tests the signal feature.
+Signal workflow tests the signal feature.
To manually start one run of this test case:
```
@@ -168,7 +168,7 @@ cadence --do <> workflow start --tl canary-task-queue --et 10 --wt workflow.sign
```
### Visibility
-Visibility workflow tests the basic visibility feature. No advanced visibility needed, but advanced visibility should also support it.
+Visibility workflow tests the basic visibility feature. No advanced visibility is needed, but advanced visibility should also support it.
To manually start one run of this test case:
```
@@ -176,7 +176,7 @@ cadence --do <> workflow start --tl canary-task-queue --et 10 --wt workflow.visi
```
### SearchAttributes
-SearchAttributes workflow tests the advanced visibility feature. Make sure advanced visibility feature is configured on the server. Otherwise, it should be excluded from the sanity test suite/case.
+SearchAttributes workflow tests the advanced visibility feature. Make sure the advanced visibility feature is configured on the server. Otherwise, it should be excluded from the sanity test suite/case.
To manually start one run of this test case:
```
@@ -184,7 +184,7 @@ cadence --do <> workflow start --tl canary-task-queue --et 10 --wt workflow.sear
```
### ConcurrentExec
-ConcurrentExec workflow tests executing activities concurrently.
+ConcurrentExec workflow tests executing activities concurrently.
To manually start one run of this test case:
```
@@ -192,7 +192,7 @@ cadence --do <> workflow start --tl canary-task-queue --et 10 --wt workflow.conc
```
### Query
-Query workflow tests the Query feature.
+Query workflow tests the Query feature.
To manually start one run of this test case:
```
@@ -200,7 +200,7 @@ cadence --do <> workflow start --tl canary-task-queue --et 10 --wt workflow.quer
```
### Timeout
-Timeout workflow make sure the activity timeout is enforced.
+Timeout workflow makes sure the activity timeout is enforced.
To manually start one run of this test case:
```
@@ -208,7 +208,7 @@ cadence --do <> workflow start --tl canary-task-queue --et 10 --wt workflow.time
```
### LocalActivity
-LocalActivity workflow tests the local activity feature.
+LocalActivity workflow tests the local activity feature.
To manually start one run of this test case:
```
@@ -216,7 +216,7 @@ cadence --do <> workflow start --tl canary-task-queue --et 10 --wt workflow.loca
```
### Cancellation
-Cancellation workflowt tests cancellation feature.
+Cancellation workflow tests the cancellation feature.
To manually start one run of this test case:
```
@@ -224,7 +224,7 @@ cadence --do <> workflow start --tl canary-task-queue --et 10 --wt workflow.canc
```
### Retry
-Retry workflow tests activity retry policy.
+Retry workflow tests the activity retry policy.
To manually start one run of this test case:
```
@@ -232,7 +232,7 @@ cadence --do <> workflow start --tl canary-task-queue --et 10 --wt workflow.retr
```
### Reset
-Reset workflow tests reset feature.
+Reset workflow tests the reset feature.
To manually start one run of this test case:
```
@@ -240,7 +240,7 @@ cadence --do <> workflow start --tl canary-task-queue --et 10 --wt workflow.rese
```
### HistoryArchival
-HistoryArchival tests history archival feature. Make sure history archival feature is configured on the server. Otherwise, it should be excluded from the sanity test suite/case.
+HistoryArchival tests the history archival feature. Make sure the history archival feature is configured on the server. Otherwise, it should be excluded from the sanity test suite/case.
This test case always uses `canary-archival-domain` domain.
@@ -248,7 +248,7 @@ To manually start one run of this test case:
```
cadence --do canary-archival-domain workflow start --tl canary-task-queue --et 10 --wt workflow.timeout -i 0
```
-
+
### VisibilityArchival
VisibilityArchival tests visibility archival feature. Make sure visibility feature is configured on the server. Otherwise, it should be excluded from the sanity test suite/case.
@@ -259,28 +259,10 @@ To manually start one run of this test case:
cadence --do canary-archival-domain workflow start --tl canary-task-queue --et 10 --wt workflow.timeout -i 0
```
-### Batch
+### Batch
Batch workflow tests the batch job feature. Make sure advanced visibility feature is configured on the server. Otherwise, it should be excluded from the sanity test suite/case.
To manually start one run of this test case:
```
cadence --do <> workflow start --tl canary-task-queue --et 10 --wt workflow.batch -i 0
```
-
-### Cross Cluster
-
-Executes the 'cross-cluster' feature which allows child workflows to be launched in different clusters and different domains. The test itself will launch a workflow in a domain equivalent to the current canary domain suffixed with `-cross-cluster` (and self-register it as necessary).
-
-This test case is launched by the 'sanity' cron workflow if enabled.
-
-To enable this feature, its necessary to enable it in config with the
-`crossClusterTestMode` config key set to `test-all`. Eg:
-
-```yaml
-canary:
- domains: ["cadence-canary"]
- crossClusterTestMode: "test-all"
- canaryDomainClusters: ["cluster0", "cluster1", "cluster2"]
-```
-
-The canary test will fail the target domain over to a different cluster and back again with some small probability each iteration. This ensures that both the cross-domain and the cross-cluster parts are excercised.
diff --git a/common/elasticsearch/esql/cadenceDevReadme.md b/common/elasticsearch/esql/cadenceDevReadme.md
index 10812610942..bdbc36b28cc 100644
--- a/common/elasticsearch/esql/cadenceDevReadme.md
+++ b/common/elasticsearch/esql/cadenceDevReadme.md
@@ -1,7 +1,7 @@
# ESQL Cadence Usage
## Motivation
-Currently [Cadence](https://github.com/uber/cadence) is using [elasticsql](https://github.com/cch123/elasticsql) to translate sql query. However it only support up to ES V2.x while Cadence is using ES V6.x. Beyond that, Cadence has some specific requirements that not supported by elasticsql yet.
+Currently [Cadence](https://github.com/cadence-workflow/cadence) is using [elasticsql](https://github.com/cch123/elasticsql) to translate SQL queries. However, it only supports up to ES V2.x while Cadence is using ES V6.x. Beyond that, Cadence has some specific requirements that are not supported by elasticsql yet.
Current Cadence query request processing steps are listed below:
- generate SQL from query
@@ -53,8 +53,8 @@ if err == nil {
## Testing
To setup local testing environment:
-- start cassandra service locally. Please refer to [Cadence](https://github.com/uber/cadence) readme.
+- start cassandra service locally. Please refer to the [Cadence](https://github.com/cadence-workflow/cadence) readme.
- start zookeeper and kafka service locally. Here is a [referecne](https://kafka.apache.org/quickstart).
- start elasticsearch and kibana service locally.
- start a cadence worker by `./bin/helloworld -m worker` under cadence directory.
-- start cadence service locally. Please refer to [Cadence](https://github.com/uber/cadence) readme.
\ No newline at end of file
+- start cadence service locally. Please refer to the [Cadence](https://github.com/cadence-workflow/cadence) readme.
diff --git a/docker/README.md b/docker/README.md
index a1bb3d7990a..7d8cde85856 100644
--- a/docker/README.md
+++ b/docker/README.md
@@ -1,7 +1,7 @@
-Quickstart for development with local Cadence server
+Quickstart for development with local Cadence server
====================================
-**Prerequisite**: [Docker + Docker compose](https://docs.docker.com/engine/installation/)
+**Prerequisite**: [Docker + Docker compose](https://docs.docker.com/engine/installation/)
Following steps will bring up the docker container running cadence server along with all its dependencies (cassandra, prometheus, grafana). Exposes cadence
@@ -17,17 +17,16 @@ docker-compose up
To update your `master-auto-setup` image to the latest version
```
docker pull ubercadence/server:master-auto-setup
-
```
-* View Cadence-Web at http://localhost:8088
+* View Cadence-Web at http://localhost:8088
* View metrics at http://localhost:3000
Using different docker-compose files
-----------------------
By default `docker-compose up` will run with `docker-compose.yml` in this folder.
-This compose file is running with Cassandra, with basic visibility,
-using Prometheus for emitting metric, with Grafana access.
+This compose file is running with Cassandra, with basic visibility,
+using Prometheus for emitting metrics, with Grafana access.
We also provide several other compose files for different features/modes:
@@ -51,11 +50,11 @@ Run canary and bench(load test)
After a local cadence server started, use the below command to run canary ro bench test
```
docker-compose -f docker-compose-bench.yml up
-```
-and
+```
+and
```
docker-compose -f docker-compose-canary.yml up
-```
+```
Using a released image
---------------------
You may want to use more stable version from our release process.
With every tagged release of the cadence server, there is also a corresponding docker image that's uploaded to docker hub. In addition, the release will also -contain a **docker.tar.gz** file (docker-compose startup scripts). +contain a **docker.tar.gz** file (docker-compose startup scripts). -Go [here](https://github.com/uber/cadence/releases/latest) to download a latest **docker.tar.gz** +Go [here](https://github.com/cadence-workflow/cadence/releases/latest) to download a latest **docker.tar.gz** Execute the following commands to start a pre-built image along with all dependencies. @@ -81,11 +80,11 @@ docker-compose up DIY: Building an image for any tag or branch ----------------------------------------- Replace **YOUR_TAG** and **YOUR_CHECKOUT_BRANCH_OR_TAG** in the below command to build: -You can checkout a [release tag](https://github.com/uber/cadence/tags) (e.g. v0.21.3) or any branch you are interested. +You can checkout a [release tag](https://github.com/cadence-workflow/cadence/tags) (e.g. v0.21.3) or any branch you are interested. ``` cd $GOPATH/src/github.com/uber/cadence -git checkout YOUR_CHECKOUT_BRANCH_OR_TAG +git checkout YOUR_CHECKOUT_BRANCH_OR_TAG docker build . -t ubercadence/:YOUR_TAG ``` @@ -93,7 +92,7 @@ You can specify `--build-arg TARGET=` to build different binaries. There are three targets supported: * server. Default target if not specified. This will build a regular server binary. * auto-setup. The image will setup all the DB/ElasticSearch schema during startup. -* cli. This image is for [CLI](https://cadenceworkflow.io/docs/cli/). +* cli. This image is for [CLI](https://cadenceworkflow.io/docs/cli/). For example of auto-setup images: ``` @@ -114,7 +113,8 @@ DIY: Troubleshooting docker builds Note that Docker has been making changes to its build system, and the new system is currently missing some capabilities that the old one had, and makes major changes to how you control it. When searching for workarounds, make sure you are looking at modern answers, and consider specifically searching for -"buildkit" solutions. +"buildkit" solutions. + You can also disable buildkit explicitly with `DOCKER_BUILDKIT=0 docker build ...`. For output limiting (e.g. `[output clipped ...]` messages), or for anything that requires changing buildkit environment @@ -155,7 +155,7 @@ docker run -e CASSANDRA_SEEDS=10.x.x.x -- csv of cassandra serv -e CASSANDRA_USER= -- Cassandra username -e CASSANDRA_PASSWORD= -- Cassandra password -e KEYSPACE= -- Cassandra keyspace - -e VISIBILITY_KEYSPACE= -- Cassandra visibility keyspace, if using basic visibility + -e VISIBILITY_KEYSPACE= -- Cassandra visibility keyspace, if using basic visibility -e KAFKA_SEEDS=10.x.x.x -- Kafka broker seed, if using ElasticSearch + Kafka for advanced visibility feature -e CASSANDRA_PROTO_VERSION= -- Cassandra protocol version -e ES_SEEDS=10.x.x.x -- ElasticSearch seed , if using ElasticSearch + Kafka for advanced visibility feature @@ -167,9 +167,9 @@ docker run -e CASSANDRA_SEEDS=10.x.x.x -- csv of cassandra serv -e DYNAMIC_CONFIG_FILE_PATH= -- Dynamic config file to be watched, default to /etc/cadence/config/dynamicconfig/development.yaml, but you can choose /etc/cadence/config/dynamicconfig/development_es.yaml if using ElasticSearch ubercadence/server: ``` -Note that each env variable has a default value, so you don't have to specify it if the default works for you. +Note that each env variable has a default value, so you don't have to specify it if the default works for you. 
For more options to configure the docker, please refer to `config_template.yaml`. -For ``, use `auto-setup` images only for first initial setup, and use regular ones for production deployment. See the above explanation about `auto-setup`. +For ``, use `auto-setup` images only for first initial setup, and use regular ones for production deployment. See the above explanation about `auto-setup`. -When upgrading, follow the release instrusctions if version upgrades require some configuration or schema changes. +When upgrading, follow the release instrusctions if version upgrades require some configuration or schema changes. diff --git a/docs/design/1533-host-specific-tasklist.md b/docs/design/1533-host-specific-tasklist.md index e0e9e2e0d85..49f07cdc2fc 100755 --- a/docs/design/1533-host-specific-tasklist.md +++ b/docs/design/1533-host-specific-tasklist.md @@ -4,7 +4,7 @@ Author: Yichao Yang (@yycptt) Last updated: July 2019 -Discussion at +Discussion at ## Abstract @@ -51,7 +51,7 @@ The SessionOptions struct contains two fields: `ExecutionTimeout`, which specifi CreateSession() will return an error if the context passed in already contains an open session. If all the workers are currently busy and unable to handle new sessions, the framework will keep retrying until the CreationTimeout you specified in the SessionOptions has passed before returning an error. When executing an activity within a session, a user might get three types of errors: -1. Those returned from user activities. The session will not be marked as failed in this case, so the user can return whatever error they want and apply their business logic as necessary. If a user wants to end a session due to the error returned from the activity, use the CompleteSession() API below. +1. Those returned from user activities. The session will not be marked as failed in this case, so the user can return whatever error they want and apply their business logic as necessary. If a user wants to end a session due to the error returned from the activity, use the CompleteSession() API below. 2. A special `ErrSessionFailed` error: this error means the session has failed due to worker failure and the session is marked as failed in the background. In this case, no activities can be executed using this context. The user can choose how to handle the failure. They can create a new session to retry or end the workflow with an error. 3. Cancelled error: If a session activity has been scheduled before worker failure is detected, it will be cancelled afterwards and a cancelled error will be returned. @@ -65,7 +65,7 @@ This API is used to complete a session. It releases the resources reserved on th sessionInfoPtr := workflow.GetSessionInfo(sessionCtx Context) ``` -This API returns session metadata stored in the context. If the context passed in doesn’t contain any session metadata, this API will return a nil pointer. For now, the only exported fields in sessionInfo are: sessionID, which is a unique identifier for a session, and hostname. +This API returns session metadata stored in the context. If the context passed in doesn’t contain any session metadata, this API will return a nil pointer. For now, the only exported fields in sessionInfo are: sessionID, which is a unique identifier for a session, and hostname. ```go sessionCtx, err := workflow.RecreateSession(ctx Context, recreateToken []byte, so *SessionOptions) @@ -155,7 +155,7 @@ When a user calls the CreateSession() or RecreateSession() API, a structure that 2. 
**Hostname**: the hostname of the worker that is responsible for executing the session. (Exported) 3. **ResourceID**: the resource consumed by the session. (Will Be Exported) - + 4. **tasklist**: the resource specific tasklist used by this session. (Not Exported) 5. **sessionState**: the state of the session (Not Exported). It can take one of the three values below: @@ -178,7 +178,7 @@ When scheduling an activity, the workflow worker needs to check the context to s CreateSession() and CompleteSession() really do are **schedule special activities** and get some information from the worker which executes these activities. -* For CreateSession(), a special **session creation activity** will be scheduled on a global tasklist which is only used for this type of activity. During the execution of that activity, a signal containing the resource specific tasklist name will be sent back to the workflow (with other information like hostname). Once the signal is received by the worker, the creation is considered successful and the tasklist name will be stored in the session context. The creation activity also performs periodic heartbeat throughout the whole lifetime of the session. As a result, if the activity worker is down, the workflow can be notified and set the session state to “Failed”. +* For CreateSession(), a special **session creation activity** will be scheduled on a global tasklist which is only used for this type of activity. During the execution of that activity, a signal containing the resource specific tasklist name will be sent back to the workflow (with other information like hostname). Once the signal is received by the worker, the creation is considered successful and the tasklist name will be stored in the session context. The creation activity also performs periodic heartbeat throughout the whole lifetime of the session. As a result, if the activity worker is down, the workflow can be notified and set the session state to “Failed”. * For RecreateSession(), the same thing happens. The only difference is that the session creation activity will be **scheduled on the resource specific task list instead of a global one**. diff --git a/docs/design/2215-synchronous-request-reply.md b/docs/design/2215-synchronous-request-reply.md index a4904d4640f..f93dc754d6f 100644 --- a/docs/design/2215-synchronous-request-reply.md +++ b/docs/design/2215-synchronous-request-reply.md @@ -4,7 +4,7 @@ Authors: Maxim Fateev (@mfateev), Andrew Dawson (@andrewjdawson2016) Last updated: July 16, 2019 -Discussion at [https://github.com/uber/cadence/issues/2215](https://github.com/uber/cadence/issues/2215) +Discussion at [https://github.com/cadence-workflow/cadence/issues/2215](https://github.com/cadence-workflow/cadence/issues/2215) ## Abstract @@ -72,7 +72,7 @@ The next diagram covers situation when query is received when another decision t ![image alt text](2215-synchronous-request-reply_2.png) -The current decision state diagram +The current decision state diagram ![image alt text](2215-synchronous-request-reply_3.png) @@ -99,4 +99,3 @@ This section is not fully designed yet. 1. Design update operation. There is an outstanding design issue for how update works in a multiple DC model. 2. Agree on the client model for synchronous request reply primitives. 
- diff --git a/docs/design/2290-cadence-ndc.md b/docs/design/2290-cadence-ndc.md index e03909cae58..5ff7ab5f4a4 100644 --- a/docs/design/2290-cadence-ndc.md +++ b/docs/design/2290-cadence-ndc.md @@ -4,7 +4,7 @@ Author: Cadence Team Last updated: Dec 2019 -Reference: [#2290](https://github.com/uber/cadence/issues/2290) +Reference: [#2290](https://github.com/cadence-workflow/cadence/issues/2290) ## Abstract @@ -360,4 +360,3 @@ T = 1: conflict resolution happens, workflow mutable state is rebuilt and histor T = 2: task A is loaded. At this time, due to the rebuilt of workflow mutable state (conflict resolution), task A is no longer relevant (task A's corresponding event belongs to non-current branch). Task processing logic will verify both the event ID and version of the task against corresponding workflow mutable state, then discard task A. - diff --git a/docs/design/graceful-domain-failover/3051-graceful-domain-failover.md b/docs/design/graceful-domain-failover/3051-graceful-domain-failover.md index 1debafaa6cb..7210ec98735 100644 --- a/docs/design/graceful-domain-failover/3051-graceful-domain-failover.md +++ b/docs/design/graceful-domain-failover/3051-graceful-domain-failover.md @@ -4,7 +4,7 @@ Author: Cadence Team Last updated: Mar 2020 -Reference: [#3051](https://github.com/uber/cadence/issues/3051) +Reference: [#3051](https://github.com/cadence-workflow/cadence/issues/3051) ## Abstract @@ -237,6 +237,3 @@ The purpose of the start workflow task is to regenerate the timer tasks and tran The purpose of the mutable state and history event is to record the workflow start event for deduplication and generate replication tasks to other clusters to sync on the workflow data. The generated history events will be replicated to all clusters. This process is required because all clusters should have the same workflow data. - - - diff --git a/docs/design/workflow-shadowing/2547-workflow-shadowing.md b/docs/design/workflow-shadowing/2547-workflow-shadowing.md index e67e6051751..6b9dbf0f7e4 100644 --- a/docs/design/workflow-shadowing/2547-workflow-shadowing.md +++ b/docs/design/workflow-shadowing/2547-workflow-shadowing.md @@ -4,11 +4,11 @@ Author: Yu Xia (@yux0), Yichao Yang (@yycptt) Last updated: Apr 2021 -Reference: [#2547](https://github.com/uber/cadence/issues/2547) +Reference: [#2547](https://github.com/cadence-workflow/cadence/issues/2547) -## Abstract +## Abstract -Cadence client libraries use workflow history and workflow definition to determine the current workflow state which is used by the workflow (decision) task for deciding the next step in the workflow. Workflow histories are immutable and persisted in the Cadence server, while workflow definitions are controlled by users via worker deployments. +Cadence client libraries use workflow history and workflow definition to determine the current workflow state which is used by the workflow (decision) task for deciding the next step in the workflow. Workflow histories are immutable and persisted in the Cadence server, while workflow definitions are controlled by users via worker deployments. A non-deterministic error may happen when re-building the workflow state using a version of the workflow definition that is different than the one generating the workflow history. Typically caused by a non-backward compatible workflow definition change. For example: @@ -17,7 +17,7 @@ A non-deterministic error may happen when re-building the workflow state using a | V1 | 1. Start workflow
2. Start activity A
3. Complete workflow | 1. Workflow started
2. Activity A started | | V2 | 1. Start workflow
2. Start activity B
3. Complete workflow | 1. Workflow started
2. Activity B started | -When a Cadence client uses V1 history and V2 definition to build the workflow state, it will expect information of activity A in the workflow state as it sees an activity started event for it. However, it's unable to find Activity A since another activity is specified in V2 definition. This will lead to a non-deterministic error during workflow (decision) task processing. +When a Cadence client uses V1 history and V2 definition to build the workflow state, it will expect information of activity A in the workflow state as it sees an activity started event for it. However, it's unable to find Activity A since another activity is specified in V2 definition. This will lead to a non-deterministic error during workflow (decision) task processing. Depending on the non-deterministic error handling policy, the workflow will fail immediately or get blocked until manual operation is involved. This type of error has caused several incidents for our customers in the past. What make the situation worse is that understanding and mitigating the non-deterministic error is usually time-consuming. Blindly reverting the bad deployment will only make the situation worse as histories generated by the new workflow definition are also not compatible with the old one. The right mitigation requires not only manual operations, but also a deep understanding of why the non-deterministic error is happening, which will greatly increase the time needed to mitigate the issue. @@ -32,29 +32,29 @@ Depending on the non-deterministic error handling policy, the workflow will fail ## Proposal -The proposal is creating a new worker mode called shadowing which reuses the existing workflow replay test framework available in both Golang and Java clients to only execute replay tests using the new workflow definition and workflow histories from production environments. +The proposal is creating a new worker mode called shadowing which reuses the existing workflow replay test framework available in both Golang and Java clients to only execute replay tests using the new workflow definition and workflow histories from production environments. -When the user worker is running in the shadowing mode, the normal activity and decision worker component will not be started, instead a new replay worker will be started to replay production workflow histories. More specifically, this replay worker will call the `ScanWorkflowExecution` API to get a list of workflow visibility records and then for each workflow, get the workflow history from Cadence server by calling `GetWorkflowExecutionHistory` API and replay it against the new workflow definition using the client side replay test framework. Upon replay failure, metrics and log messages will be emitted to notify the user that there are non-deterministic changes in the new workflow definition. +When the user worker is running in the shadowing mode, the normal activity and decision worker component will not be started, instead a new replay worker will be started to replay production workflow histories. More specifically, this replay worker will call the `ScanWorkflowExecution` API to get a list of workflow visibility records and then for each workflow, get the workflow history from Cadence server by calling `GetWorkflowExecutionHistory` API and replay it against the new workflow definition using the client side replay test framework. 
Upon replay failure, metrics and log messages will be emitted to notify the user that there are non-deterministic changes in the new workflow definition. -Note that as long as the worker can talk to the Cadence server to fetch visibility records and workflow history, it can be run in the shadowing mode. This means users have the flexibility to run it in the local environment during development to facilitate a rapid development cycle or/and make it part of the deployment process to run on a dedicated host in staging/preprod environment to get a better test coverage and catch rare cases. +Note that as long as the worker can talk to the Cadence server to fetch visibility records and workflow history, it can be run in the shadowing mode. This means users have the flexibility to run it in the local environment during development to facilitate a rapid development cycle or/and make it part of the deployment process to run on a dedicated host in staging/preprod environment to get a better test coverage and catch rare cases. Users will also have the option to specify what kind of workflows they would like to shadow. As an example, if the user doesn’t use the query feature, then shadowing for the closed workflow may not necessary. Or if a user only changed the definition for one workflow type, only workflow histories of that type need to be shadowed and checked. Please check the Implementation section for the detailed options users have. The major advantages of this approach are: - **Simplicity**, very little code change required for Cadence server - **Can be run in both local and staging/preprod/canary environments** -- **Generic approach that fits our internal deployment process, CI/CD pipeline and also open source customers**. +- **Generic approach that fits our internal deployment process, CI/CD pipeline and also open source customers**. The downsides however are: -- **Requires some effort on the user end** to setup the release pipeline and shadowing environment and we need to provide guidelines for them. -- The **load on Cadence server and our ElasticSearch cluster will increase** due to the added `GetWorkflowExecution` and `ScanWorkflowExecutions` call. But this increase can be easily controlled by the server side rate limiter. Check the “Open Questions” section for an estimate for the load. +- **Requires some effort on the user end** to setup the release pipeline and shadowing environment and we need to provide guidelines for them. +- The **load on Cadence server and our ElasticSearch cluster will increase** due to the added `GetWorkflowExecution` and `ScanWorkflowExecutions` call. But this increase can be easily controlled by the server side rate limiter. Check the “Open Questions” section for an estimate for the load. - The dedicated **shadow worker will become idle after the replay test is done**. ![deployment pipeline](2547-deployment-pipeline.png) ## Other Considered Options -### [History Generation with task mocking](https://github.com/uber-go/cadence-client/issues/1050) +### [History Generation with task mocking](https://github.com/cadence-workflow/cadence-go-client/issues/1050) The idea of this approach is similar to the local replay test framework. But instead of asking customers to get an existing workflow history from production, it will ask customers to define possible workflow inputs, activity input/output, signals, etc. and then generate multiple workflow histories based on different combinations of values based on a certain workflow version. 
Random errors will also be injected to cover failure cases. Upon workflow code change, those generated workflows can be replayed against the new workflow definition to see if there are any non-deterministic changes. @@ -69,30 +69,30 @@ This approach reuses existing history archival code paths to dump workflow histo The approach **doesn’t require too many changes on server end** as most code paths for history/visibility archival can be reused. We only need to figure out when and how to sample the workflow for archiving, and the destination for archiving. It also **avoids the idle worker problem** and **won’t incur any load increase on Cadence server** as the testing framework will talk directly to the storage for storing archived history and visibility records. There are also several downsides associated with this approach: -- We need to decide who owns and manages the blob storage. - - If Cadence owns it, GDPR constraints might be a concern as some workflow histories will contain PII data and can’t be archived for too long. **A set of APIs for managing the storage** will also be necessary. +- We need to decide who owns and manages the blob storage. + - If Cadence owns it, GDPR constraints might be a concern as some workflow histories will contain PII data and can’t be archived for too long. **A set of APIs for managing the storage** will also be necessary. - Register user callbacks for managing storage. - If users own it, it will **increase their burden for managing it**. E.g. regularly clean up the unneeded workflows (may due to GDPR constraints), update query to replay only recent workflows to ensure a reasonable test time. - Extra work for setting additional history and visibility archival systems. -- The integration test may take a long time to replay all the archived workflows **making it not ideal to run locally**. +- The integration test may take a long time to replay all the archived workflows **making it not ideal to run locally**. - Since the test is long running and not part of the deployment process, **users may tend to skip the test** if they believe the change they made is backward-compatible and causing issues when deployed to production. - From an open source perspective, users need to have an archival system setup, which we only support S3 and GCP for now. If those two don't work for the customer, they may have to **implement their own archiver**. On the client side, we can provide a blob storage interface, but customers still need to **implement the logic for fetching history and/or visibility records for blob storage**. -- Finally since this approach takes blob storage as an external dependency, it **won’t be functioning when the dependencies are down**. +- Finally since this approach takes blob storage as an external dependency, it **won’t be functioning when the dependencies are down**. ### Replication based replay test -The final approach is: reusing the replication stack to dispatch decision tasks to workers at the standby cluster and perform the replay check. However this approach only works for global domain users, so it’s not a valid option. +The final approach is: reusing the replication stack to dispatch decision tasks to workers at the standby cluster and perform the replay check. However this approach only works for global domain users, so it’s not a valid option. ## Implementation ### Worker Implementation -In the proposal of workflow shadowing, the worker does shadowing traffic in 3 steps. 
+In the proposal of workflow shadowing, the worker does shadowing traffic in 3 steps. 1. Get a list of workflow visibility records. 2. For each record, get the workflow history. 3. Run the replay test on the workflow history. -These steps can be implemented in native code or in Cadence workflow. +These steps can be implemented in native code or in Cadence workflow. For the local development environment, we will provide a testing framework for shadowing workflow from production, so that users can validate their change sooner and speed up their development process. This kind of integration test can also be easily integrated with existing release pipelines to enforce a check upon landing code. @@ -102,7 +102,7 @@ For staging/production environment, we incline to use a Cadence workflow to cont 3. **Maintenance**: no special worker code to support the shadowing mode. 4. **Visibility**: shadowing progress and result can easily be viewed using the Cadence Web UI and metric dashboards. -As metrics will be emitted during the shadowing processing, it can also be integrated with release pipelines for checking workflow compatibility before deploying to production environments. The disadvantage is that users are required to start a Cadence server to run the shadow workflow. +As metrics will be emitted during the shadowing processing, it can also be integrated with release pipelines for checking workflow compatibility before deploying to production environments. The disadvantage is that users are required to start a Cadence server to run the shadow workflow. ### Shadow Worker Options @@ -110,7 +110,7 @@ The following options will be added to the Worker Option struct in both Golang a - **EnableShadowWorker**: a boolean flag indicates whether the worker should run in shadow mode. - **ShadowMode**: - **Normal**: stop shadowing after a scan for all workflows, this will be the default selected value. A new iteration will be triggered upon worker restart or deployment. - - **Continuous**: keep shadowing until the exit condition is met. Since open workflows will keep making progress, two replays might lead to different results. This mode is useful for user’s whose workflow will block on a long timer. + - **Continuous**: keep shadowing until the exit condition is met. Since open workflows will keep making progress, two replays might lead to different results. This mode is useful for user’s whose workflow will block on a long timer. - **Domain**: user domain name. - **TaskList**: the name of the task list the shadowing activity worker should poll from. - **ShadowWorkflowQuery**: a visibility query for all the workflows that need to be replayed. If specified the following three options will be ignored. @@ -118,7 +118,7 @@ The following options will be added to the Worker Option struct in both Golang a - **ShadowWorkflowStartTimeFilter**: a time range for the workflow start time. - **ShadowWorkflowStatus**: a list of workflow status. Workflow will be checked only if its close status is in the list. Option for open workflow will also be provided and will be the default. - **SamplingRate**: a float indicating the percentage of workflows (returned by workflow scan) that should be replayed. -- **Concurrency**: concurrency of replay checks, which is the number of parallel replay activities in the shadow workflow. +- **Concurrency**: concurrency of replay checks, which is the number of parallel replay activities in the shadow workflow. 
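Because the option names above are part of this proposal rather than a released API, the Go sketch below is only an illustration of how they could be grouped on the worker options; the field names, types, and `ShadowMode` values mirror the list above and should not be read as the final client interface.

```go
package sample

import "time"

// ShadowMode mirrors the two modes described above.
type ShadowMode int

const (
	// ShadowModeNormal stops after one scan over all matching workflows.
	ShadowModeNormal ShadowMode = iota
	// ShadowModeContinuous keeps shadowing until an exit condition is met.
	ShadowModeContinuous
)

// TimeFilter bounds the start time of the workflows to be shadowed.
type TimeFilter struct {
	MinTimestamp time.Time
	MaxTimestamp time.Time
}

// ShadowOptions groups the proposed shadowing knobs; illustrative only.
type ShadowOptions struct {
	EnableShadowWorker bool       // run the worker in shadow mode instead of normal mode
	Mode               ShadowMode // Normal (default) or Continuous
	Domain             string     // user domain to shadow
	TaskList           string     // tasklist the shadowing activity worker polls from

	// WorkflowQuery is a visibility query selecting the workflows to replay.
	// When set, the more specific filters below are ignored.
	WorkflowQuery           string
	WorkflowStartTimeFilter TimeFilter
	WorkflowStatus          []string // workflow close status list; open workflows by default

	SamplingRate float64 // fraction of scanned workflows that get replayed
	Concurrency  int     // number of parallel replay activities
}
```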
### Local Shadow Test Options @@ -128,12 +128,12 @@ The options provided by the local test framework will be similar to the shadow w We decide to keep the workflow code at the server side due to the following reasons: 1. The shadowing logic for Java and Go clients are identical. By keeping the shadowing workflow definition at the server side, we only need to implement it once, and don’t need to worry about keeping two workflow definitions in sync. -2. Since Cadence server owns the workflow code, we can update the workflow definition at any time and even make non-backward compatible changes without having to ask customers to update their client dependency. +2. Since Cadence server owns the workflow code, we can update the workflow definition at any time and even make non-backward compatible changes without having to ask customers to update their client dependency. The downsides are: 1. Cadence team needs to provide workers to process the decision task generated by the shadowing workflow. 2. Shadowing workflow from all the customers will live in one domain, so they may affect each other’s execution as the total amount of resources allocated to a given domain is limited. - + However, considering the fact that the load generated by shadowing workflow is small and it’s unlikely many users will run the shadowing workflow at the same time, those downsides should not be a concern. ### Metrics @@ -143,4 +143,3 @@ The following metrics will be emitted in the shadow workflow: 2. Latency for each replay test 3. Start/Complete/Continue_as_new of shadow workflow 4. Latency for each shadow workflow - diff --git a/docs/flow.md b/docs/flow.md index 3c78ac33fcd..ff3889fc9b2 100644 --- a/docs/flow.md +++ b/docs/flow.md @@ -36,7 +36,7 @@ flowchart TD ### Cassandra Notes: -In order to leverage [Cassandra Light Weight Transactions](https://www.yugabyte.com/blog/apache-cassandra-lightweight-transactions-secondary-indexes-tunable-consistency/), Cadence stores multiple types of records in the same `executions` table. ([ref](https://github.com/uber/cadence/blob/51758676ce9d9609e736c64f94dc387ed2c75b7c/schema/cassandra/cadence/schema.cql#L338)) +In order to leverage [Cassandra Light Weight Transactions](https://www.yugabyte.com/blog/apache-cassandra-lightweight-transactions-secondary-indexes-tunable-consistency/), Cadence stores multiple types of records in the same `executions` table. ([ref](https://github.com/cadence-workflow/cadence/blob/51758676ce9d9609e736c64f94dc387ed2c75b7c/schema/cassandra/cadence/schema.cql#L338)) - **Shards Store**: Shard records are stored in `executions` table with type=0. - **Executions Store**: Execution records are stored in `executions` table with type=1 - **Tasks Store**: Task records are stored in `executions` table with type={2,3,4,5} diff --git a/docs/non-deterministic-error.md b/docs/non-deterministic-error.md index 612d49e70df..b9c969c089c 100644 --- a/docs/non-deterministic-error.md +++ b/docs/non-deterministic-error.md @@ -7,7 +7,7 @@ This article is for Cadence developers to understand and address issues with non ## Some Internals of Cadence workflow -A Cadence workflow can be viewed as a long running process on a distributed operating system(OS). The process’s state and dispatching is owned by Cadence server, and customers’ workers provide CPU/memory resources to execute the process’s code. For most of the time, this process(workflow) is owned by a worker and running like in other normal OS. 
But because this is a distributed OS, the workflow ownership can be transferred to other workers and continue to run from the previous state. Unlike other OS, this is not restarting from the beginning. This is how workflow is fault tolerant to certain host failures. +A Cadence workflow can be viewed as a long running process on a distributed operating system(OS). The process’s state and dispatching is owned by Cadence server, and customers’ workers provide CPU/memory resources to execute the process’s code. For most of the time, this process(workflow) is owned by a worker and running like in other normal OS. But because this is a distributed OS, the workflow ownership can be transferred to other workers and continue to run from the previous state. Unlike other OS, this is not restarting from the beginning. This is how workflow is fault tolerant to certain host failures. ![image alt text](images/non-deterministic-err.1.png) @@ -15,53 +15,53 @@ Non-deterministic issues arise during this workflow ownership transfer. Cadence ![image alt text](images/non-deterministic-err.2.png) -Even if a workflow ownership doesn’t change, history replay is also required under some circumstances: +Even if a workflow ownership doesn’t change, history replay is also required under some circumstances: -* A workflow stack(of an execution) lives in worker’s memory, a worker can own many workflow executions. So a worker can run out of memory, it has to kick off some workflow executions (with LRU) and rebuild them when necessary. +* A workflow stack(of an execution) lives in worker’s memory, a worker can own many workflow executions. So a worker can run out of memory, it has to kick off some workflow executions (with LRU) and rebuild them when necessary. * Sometimes the stack can be stale because of some errors. -In Cadence, Workflow ownership is called "stickiness". +In Cadence, Workflow ownership is called "stickiness". -Worker memory cache for workflow stacks is called "sticky cache". +Worker memory cache for workflow stacks is called "sticky cache". -## Cadence History Protocol +## Cadence History Protocol -History replay process is under some basic rules/concepts. +History replay process is under some basic rules/concepts. In Cadence, we use "close" to describe the opposite status of “open”. For activity/decision, close includes complete,timeout,fail. For workflow, close means complete/timeout/fail/terminate/cancel/continueAsNew. ### Decision -* Decision is the key of history replay. It drives the progress of a workflow. +* Decision is the key of history replay. It drives the progress of a workflow. -* The decision state machine must be **Scheduled->Started->Closed** +* The decision state machine must be **Scheduled->Started->Closed** * The first decision task is triggered by server. Starting from that, the rest of the decision tasks are triggered by some events of the workflow itself -- when those events mean something the workflow could be waiting for. For example, Signaled, ActivityClosed, ChildWorkflowClosed, TimerFired events. Events like ActivityStarted won’t trigger decisions. * When a decision is started(internally called in flight), there cannot be any other events written into history before a decision is closed. Those events will be put into a buffer until the decision is closed -- flush buffer will write the events into history. -* **In executing mode**, a decision task will try to complete with some entities: Activities/Timers/Childworkflows/etc Scheduled. 
+* **In executing mode**, a decision task will try to complete with some entities: Activities/Timers/Childworkflows/etc Scheduled. -* **In history replay mode**, decisions use all of those above to rebuild a stack. **Activities/Timers/ChildWorkflows/etc will not be re-executed during history replay.** +* **In history replay mode**, decisions use all of those above to rebuild a stack. **Activities/Timers/ChildWorkflows/etc will not be re-executed during history replay.** -### Activity +### Activity * State machine is **Scheduled->Started->Closed** * Activity is scheduled by DecisionCompleted -* Activity started by worker. Normally, ActivityStartedEvent can be put at any place in history except for in between DecisionStarted and DecisionClose. +* Activity started by worker. Normally, ActivityStartedEvent can be put at any place in history except for in between DecisionStarted and DecisionClose. -* But Activity with RetryPolicy is a special case. Cadence will only write down Started event when Activity is finally closed. +* But Activity with RetryPolicy is a special case. Cadence will only write down Started event when Activity is finally closed. -* Activity completed/failed by worker, or timed out by server -- they all consider activity closed, and it will trigger a decision task if no decision task is ongoing. +* Activity completed/failed by worker, or timed out by server -- they all consider activity closed, and it will trigger a decision task if no decision task is ongoing. * Like in the above, only ActivityClose events could trigger a decision. -### Local activity +### Local activity -* Local activity is executed within decision is processing in flight. +* Local activity is executed within decision is processing in flight. * Local activity is only recorded with DecisionCompleted -- no state machine needed. @@ -85,11 +85,11 @@ In Cadence, we use "close" to describe the opposite status of “open”. For ac * ChildWorkflow is initiated by DecisionCompleted -* ChildWorkflow is started by server(returning runID). It could trigger a decision. +* ChildWorkflow is started by server(returning runID). It could trigger a decision. -* ChildWorkflow close events can be"canceled/failed/completed". It could trigger a decision. +* ChildWorkflow close events can be"canceled/failed/completed". It could trigger a decision. -### SignalExternal/RequestCancel +### SignalExternal/RequestCancel * State machine is **Initiated->Closed** @@ -97,11 +97,11 @@ In Cadence, we use "close" to describe the opposite status of “open”. For ac * Closed(completed/failed) by server, it could trigger a decision. -### More explanation of BufferedEvents +### More explanation of BufferedEvents -When a decision is in flight, if something like a signal comes in, Cadence has to put it into buffer. That’s because for the next decision, SDK always processes unhandled events starting from last decision completed. There cannot be any other events to record between decision started and close event. [This may cause some issues](https://github.com/uber/cadence/issues/2934) if you are sending a signal to self within a local activities. +When a decision is in flight, if something like a signal comes in, Cadence has to put it into buffer. That’s because for the next decision, SDK always processes unhandled events starting from last decision completed. There cannot be any other events to record between decision started and close event. 
[This may cause some issues](https://github.com/cadence-workflow/cadence/issues/2934) if you are sending a signal to self within a local activities. -## An Example of history protocol +## An Example of history protocol To understand the protocol, consider an example of the following workflow code: ```go @@ -125,7 +125,7 @@ func Workflow(ctx workflow.Context) error { } ``` -The workflow will execute activityA, then wait for 1 minute, then execute activityB, finally waits for 1 hour to complete. +The workflow will execute activityA, then wait for 1 minute, then execute activityB, finally waits for 1 hour to complete. The history will be as follows if everything runs smoothly (no errors, timeouts, retries, etc): @@ -135,7 +135,7 @@ ID:2 DecisionTaskScheduled : first decision triggered by server ID:3 DecisionTaskStarted ID:4 DecisionTaskCompleted ID:5 ActivityTaskScheduled : activityA is scheduled by decision -ID:6 ActivityTaskStarted : started by worker +ID:6 ActivityTaskStarted : started by worker ID:7 ActivityTaskCompleted : completed with result of var a ID:8 DecisionTaskScheduled : triggered by ActivityCompleted ID:9 DecisionTaskStarted @@ -145,7 +145,7 @@ ID:12 TimerFired : fired after 1 minute ID:13 DecisionTaskScheduled : triggered by TimerFired ID:14 DecisionTaskStarted ID:15 DecisionTaskCompleted -ID:16 ActivityTaskScheduled: activityB scheduled by decision with param a +ID:16 ActivityTaskScheduled: activityB scheduled by decision with param a ID:17 ActivityTaskStarted : started by worker ID:18 ActivityTaskCompleted : completed with nil as error ID:19 DecisionTaskScheduled : triggered by ActivityCompleted @@ -155,7 +155,7 @@ ID:22 TimerStarted : decision scheduled a timer for 1 hour ID:23 TimerFired : fired after 1 hour ID:24 DecisionTaskScheduled : triggered by TimerFired ID:25 DecisionTaskStarted -ID:26 DecisionTaskCompleted +ID:26 DecisionTaskCompleted ID:27 WorkflowCompleted : completed by decision ``` @@ -163,7 +163,7 @@ ID:27 WorkflowCompleted : completed by decision ### Missing decision -[Error message](https://github.com/uber-go/cadence-client/blob/e5081b085b0333bac23f198e57959681e0aee987/internal/internal_task_handlers.go#L1206): +[Error message](https://github.com/cadence-workflow/cadence-go-client/blob/e5081b085b0333bac23f198e57959681e0aee987/internal/internal_task_handlers.go#L1206): ```go fmt.Errorf("nondeterministic workflow: missing replay decision for %s", util.HistoryEventToString(e)) @@ -173,11 +173,11 @@ This means after replay code, the decision is scheduled less than history events ```go workflow.Sleep(time.Hour) ``` -and restart worker, then it will run into this error. Because in the history, the workflow has a timer event that is supposed to fire in one hour. However, during replay, there is no logic to schedule that timer. +and restart worker, then it will run into this error. Because in the history, the workflow has a timer event that is supposed to fire in one hour. However, during replay, there is no logic to schedule that timer. 
-### Extra decision +### Extra decision -[Error message](https://github.com/uber-go/cadence-client/blob/e5081b085b0333bac23f198e57959681e0aee987/internal/internal_task_handlers.go#L1210): +[Error message](https://github.com/cadence-workflow/cadence-go-client/blob/e5081b085b0333bac23f198e57959681e0aee987/internal/internal_task_handlers.go#L1210): ```go fmt.Errorf("nondeterministic workflow: extra replay decision for %s", util.DecisionToString(d)) @@ -187,7 +187,7 @@ This is basically the opposite of the previous case, which means that during rep ```go err = workflow.ExecuteActivity(ctx, activityB, a).Get(ctx, nil) ``` -to +to ```go fb := workflow.ExecuteActivity(ctx, activityB, a) @@ -202,23 +202,23 @@ if err != nil { } ``` -And restart worker, then it will run into this error. Because in the history, the workflow has scheduled only activityB after the one minute timer, however, during replay, there are two activities scheduled in a decision( in parallel). +And restart worker, then it will run into this error. Because in the history, the workflow has scheduled only activityB after the one minute timer, however, during replay, there are two activities scheduled in a decision( in parallel). ### Decision mismatch -[Error message](https://github.com/uber-go/cadence-client/blob/e5081b085b0333bac23f198e57959681e0aee987/internal/internal_task_handlers.go#L1214): +[Error message](https://github.com/cadence-workflow/cadence-go-client/blob/e5081b085b0333bac23f198e57959681e0aee987/internal/internal_task_handlers.go#L1214): ```go fmt.Errorf("nondeterministic workflow: history event is %s, replay decision is %s",util.HistoryEventToString(e), util.DecisionToString(d)) ``` -This means after replay code, the decision scheduled is different than the one in history. Using the previous history as an example, when the workflow is waiting at the one hour timer(event ID 22), +This means after replay code, the decision scheduled is different than the one in history. Using the previous history as an example, when the workflow is waiting at the one hour timer(event ID 22), if we change the line of : ```go err = workflow.ExecuteActivity(ctx, activityB, a).Get(ctx, nil) ``` -to +to ```go err = workflow.ExecuteActivity(ctx, activityC, a).Get(ctx, nil) ``` @@ -226,13 +226,13 @@ And restart worker, then it will run into this error. Because in the history, th ### Decision State Machine Panic -[Error message](https://github.com/uber-go/cadence-client/blob/e5081b085b0333bac23f198e57959681e0aee987/internal/internal_decision_state_machine.go#L693): +[Error message](https://github.com/cadence-workflow/cadence-go-client/blob/e5081b085b0333bac23f198e57959681e0aee987/internal/internal_decision_state_machine.go#L693): ```go fmt.Sprintf("unknown decision %v, possible causes are nondeterministic workflow definition code"+" or incompatible change in the workflow definition", id) ``` -This usually means workflow history is corrupted due to some bug. For example, the same activity can be scheduled and differentiated by activityID. So ActivityIDs for different activities are supposed to be unique in workflow history. If however we have an ActivityID collision, replay will run into this error. +This usually means workflow history is corrupted due to some bug. For example, the same activity can be scheduled and differentiated by activityID. So ActivityIDs for different activities are supposed to be unique in workflow history. If however we have an ActivityID collision, replay will run into this error. 
## What can cause non-deterministic errors @@ -240,7 +240,7 @@ This usually means workflow history is corrupted due to some bug. For example, t * Changing signature of activities -* Changing duration of timer +* Changing duration of timer * Using time.Now() instead of workflow.Now() @@ -248,7 +248,7 @@ This usually means workflow history is corrupted due to some bug. For example, t * Use golang builtin channel instead of "workflow.Channel" for inter goroutines communication -* time.Sleep() will not work, even though it doesn’t cause non-deterministic errors, but changing to workflow.Sleep() will. +* time.Sleep() will not work, even though it doesn’t cause non-deterministic errors, but changing to workflow.Sleep() will. For those needs, see "How to address non-deterministic issues" in the next section. @@ -264,13 +264,13 @@ For those needs, see "How to address non-deterministic issues" in the next secti ## Find the Non-Deterministic Code -Workflow logic can be complicated and changes to workflow code can be non-trivial and non-isolated. In case you are not able to pinpoint the exact code change that introduces the workflow logic and non-deterministic error, you can download the workflow history with non-deterministic error and [replay it locally](https://github.com/uber-go/cadence-client/blob/master/worker/worker.go#L96). This is [an example](https://github.com/uber-common/cadence-samples/blob/03293b934579e0353e08e75c2f46a84a5a7b2df0/cmd/samples/recipes/helloworld/replay_test.go#L39) of using this utility. +Workflow logic can be complicated and changes to workflow code can be non-trivial and non-isolated. In case you are not able to pinpoint the exact code change that introduces the workflow logic and non-deterministic error, you can download the workflow history with non-deterministic error and [replay it locally](https://github.com/cadence-workflow/cadence-go-client/blob/master/worker/worker.go#L96). This is [an example](https://github.com/cadence-workflow/cadence-samples/blob/03293b934579e0353e08e75c2f46a84a5a7b2df0/cmd/samples/recipes/helloworld/replay_test.go#L39) of using this utility. -You can do this with different versions of your workflow code to see the difference in behavior. +You can do this with different versions of your workflow code to see the difference in behavior. However, it could be hard if you have too many versions or your code is too complicated to debug. In this case, you can run replay in debug mode to help you to step into your workflow logic. -To do this you first change [this code](https://github.com/uber-go/cadence-client/blob/cc25a04f6f74c54ea9ae330741f63ae6df15f4df/internal/internal_event_handlers.go#L429) into +To do this you first change [this code](https://github.com/cadence-workflow/cadence-go-client/blob/cc25a04f6f74c54ea9ae330741f63ae6df15f4df/internal/internal_event_handlers.go#L429) into ```go func (wc *workflowEnvironmentImpl) GenerateSequence() int32 { @@ -285,11 +285,11 @@ func (wc *workflowEnvironmentImpl) GenerateSequence() int32 { THE_ID_WITH_ERROR is the ActivityID/TimerID of activities/timers of decision runs into non-deterministic error during your replay with current code. When you pause the replay thread in the fmt.Println("PAUSE HERE IN DEBUG MODE") , trace back in the stack you will see the position of your workflow code that run into non-deterministic error. -Currently this debugging experience is not very ideal. [This proposal will help](https://github.com/uber/cadence/issues/2801). 
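As a concrete starting point, the replay utility linked above can be wrapped in an ordinary Go unit test. This is only a sketch under a few assumptions: it uses the Go client's `worker.NewWorkflowReplayer` API, it registers the `Workflow` function from the example earlier in this document, and it expects a history downloaded from production into a local `history.json` file (the file name and test name are placeholders).

```go
package sample

import (
	"testing"

	"go.uber.org/cadence/worker"
	"go.uber.org/zap/zaptest"
)

// TestReplayProductionHistory replays a downloaded workflow history against
// the current workflow code and fails if replay hits a non-deterministic
// change.
func TestReplayProductionHistory(t *testing.T) {
	replayer := worker.NewWorkflowReplayer()

	// Register the workflow under test with the same name it is registered
	// with in production workers. Workflow is the example workflow above.
	replayer.RegisterWorkflow(Workflow)

	// history.json is a placeholder for a history exported from production,
	// for example with the Cadence CLI or web UI.
	err := replayer.ReplayWorkflowHistoryFromJSONFile(zaptest.NewLogger(t), "history.json")
	if err != nil {
		t.Fatalf("replay failed, likely a non-deterministic change: %v", err)
	}
}
```

Running this test against histories from both the old and the new binary is usually the fastest way to pinpoint which code change introduced the non-determinism.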
+Currently this debugging experience is not very ideal. [This proposal will help](https://github.com/cadence-workflow/cadence/issues/2801). ## Automatic history replay and non-deterministic error detection -We have ideas on how non-deterministic errors can be detected automatically and safe rollout can be achieved. See [this issue](https://github.com/uber/cadence/issues/2547). +We have ideas on how non-deterministic errors can be detected automatically and safe rollout can be achieved. See [this issue](https://github.com/cadence-workflow/cadence/issues/2547). # What to do with non-deterministic errors @@ -299,17 +299,17 @@ We have ideas on how non-deterministic errors can be detected automatically and This is a worker option to decide what to do with non-deterministic workflows. Default is **NonDeterministicWorkflowPolicyBlockWorkflow** -Another option is **NonDeterministicWorkflowPolicyFailWorkflow** which will fail the workflow immediately. You may want that if it is Okay to fail the trouble workflows for now(you can reopen later with reset) so that workers won’t keep on retrying the workflows. +Another option is **NonDeterministicWorkflowPolicyFailWorkflow** which will fail the workflow immediately. You may want that if it is Okay to fail the trouble workflows for now(you can reopen later with reset) so that workers won’t keep on retrying the workflows. ### DisableStickyExecution -It defaults to false which means all workflows will stay in stickyCache unless there is memory pressure which causes them to be evicted. This is the desired behavior in production as it saves replay efforts. However it could hide potential non-deterministic errors exactly because of this reason. +It defaults to false which means all workflows will stay in stickyCache unless there is memory pressure which causes them to be evicted. This is the desired behavior in production as it saves replay efforts. However it could hide potential non-deterministic errors exactly because of this reason. -When troubleshooting it might be helpful to let a small number of workers run with stickiness disabled, so that it always replays the whole history when execution decision tasks. +When troubleshooting it might be helpful to let a small number of workers run with stickiness disabled, so that it always replays the whole history when execution decision tasks. ## GetVersion() & SideEffect() -If you know some code change will cause non-deterministic errors, then use +If you know some code change will cause non-deterministic errors, then use [workflow.GetVersion()](https://cadenceworkflow.io/docs/07_goclient/14_workflow_versioning) API to prevent it. This API will let the workflow that has finished the changing code to go with old code path, but let the workflows that hasn’t to go with new code path. @@ -321,20 +321,20 @@ if v == workflow.DefaultVersion{ workflow.Sleep(time.Hour) }else{ // run with new code - // do nothing as we are deleting the timer + // do nothing as we are deleting the timer } ``` -Another similar API to prevent non-deterministic error is [workflow.SideEffect()](https://cadenceworkflow.io/docs/07_goclient/10_side_effect). +Another similar API to prevent non-deterministic error is [workflow.SideEffect()](https://cadenceworkflow.io/docs/07_goclient/10_side_effect). ## BinaryChecksum & workflow.Now() -What if Non-deterministic code change has been deployed without GetVersions()? +What if Non-deterministic code change has been deployed without GetVersions()? 
-Sometimes we may forget to use GetVersions(), or misuse it. This could be a serious problem because after deployment, we probably cannot rollback: because some workflows has run with new code but some workflows has stuck. Rollback will save the stuck workflows but also stuck other workflows. +Sometimes we may forget to use GetVersions(), or misuse it. This could be a serious problem because after deployment, we probably cannot rollback: because some workflows has run with new code but some workflows has stuck. Rollback will save the stuck workflows but also stuck other workflows. -The best way is to use BinaryChecksum and Now() to let workflow diverge at the breaking changes. **workflow.GetInfo().[BinaryChecksum](https://github.com/uber-go/cadence-client/issues/925)** is the checksum of the binary that made that decision. **workflow.[Now](https://github.com/uber-go/cadence-client/issues/926)()** is timestamp that the decision is made. For better experience, you should integrate with binaryChecksum is in a format of "**Your GIT_REF**" by **worker.SetBinaryChecksum()** API. +The best way is to use BinaryChecksum and Now() to let workflow diverge at the breaking changes. **workflow.GetInfo().[BinaryChecksum](https://github.com/cadence-workflow/cadence-go-client/issues/925)** is the checksum of the binary that made that decision. **workflow.[Now](https://github.com/cadence-workflow/cadence-go-client/issues/926)()** is timestamp that the decision is made. For better experience, you should integrate with binaryChecksum is in a format of "**Your GIT_REF**" by **worker.SetBinaryChecksum()** API. Use the "extra decision" as an example. After deploy the code change, then there are workflow W1 stuck because of extra decision, however workflow W2 has started the two in-parallel activities. If we rollback the code, W1 will be fixed but W2 will be stuck. We can fix it by changing the code to: @@ -361,35 +361,35 @@ if *(workflow.GetInfo(ctx).BinaryChecksum)== "BINARY_BEFORE_BAD_DEPLOYMENT" || w ``` -BINARY_BEFORE_BAD_DEPLOYMENT is the previous binary checksum, DeploymentStartTime that deployment start time. This will tell the workflows that has finished the decision with previous binary, or the decision is old enough that is not finished with new code, then it should go with the old code. We need DeploymentStartTime because W1 could be started by different binary checksums. +BINARY_BEFORE_BAD_DEPLOYMENT is the previous binary checksum, DeploymentStartTime that deployment start time. This will tell the workflows that has finished the decision with previous binary, or the decision is old enough that is not finished with new code, then it should go with the old code. We need DeploymentStartTime because W1 could be started by different binary checksums. ## [Reset workflow](https://cadenceworkflow.io/docs/08_cli#restart-reset-workflow) -The last solution is to reset the workflows. A process in real OS can only move forward but never go back to previous state. However, a Cadence workflow can go back to previous state since we have stored the history as a list. +The last solution is to reset the workflows. A process in real OS can only move forward but never go back to previous state. However, a Cadence workflow can go back to previous state since we have stored the history as a list. -Internally reset a workflow will use history as a tree. It takes a history as base, and fork a new branch from it. So that you will reset many times without losing the history(until history is deleted after retention). 
+Internally reset a workflow will use history as a tree. It takes a history as base, and fork a new branch from it. So that you will reset many times without losing the history(until history is deleted after retention). -Reset will start a new run with new runID like continueAsNew. +Reset will start a new run with new runID like continueAsNew. -After forking from the base, reset will also collect all the signals along the chain of continueAsNew from the base history. This is an important feature as we can consider signals are external events that we don’t want to lose. +After forking from the base, reset will also collect all the signals along the chain of continueAsNew from the base history. This is an important feature as we can consider signals are external events that we don’t want to lose. However, reset will schedule and execute some activities that has done before. So after reset, you may see some activities are re-executed. Same applies for timer/ChildWorkflows. If you don’t want to re-execute, you can emit a signal to self to identify that this activity/timer/childWorkflow is done -- since reset will collect signals after resetting point. -Note that reset with [child workflows](https://github.com/uber/cadence/issues/2951) is not fully supported yet. +Note that reset with [child workflows](https://github.com/cadence-workflow/cadence/issues/2951) is not fully supported yet. ## Primitive Reset CLI Command -This reset command is for resetting one workflow. We may use this command to manually resetting particular workflow for experiment or mitigation. +This reset command is for resetting one workflow. We may use this command to manually resetting particular workflow for experiment or mitigation. -It takes workflowID/runID for base history. +It takes workflowID/runID for base history. -It takes either eventID or eventType for forking point. +It takes either eventID or eventType for forking point. -EventID has to be decisionCloseEventID as we designed reset must be done by decision boundary. +EventID has to be decisionCloseEventID as we designed reset must be done by decision boundary. ResetType support these: LastDecisionCompleted, LastContinuedAsNew, BadBinary ,FirstDecisionCompleted. -If the workflowID has an open run, you need to be aware of [this race condition](https://github.com/uber/cadence/issues/2930) when resetting it. +If the workflowID has an open run, you need to be aware of [this race condition](https://github.com/cadence-workflow/cadence/issues/2930) when resetting it. ## Batch Reset CLI Command @@ -409,7 +409,7 @@ Reset-batch is for resetting a list of workflows. Usually reset-batch is more us 2. Where to reset -* ResetType +* ResetType * ResetBadBinaryChecksum @@ -417,29 +417,28 @@ Other arguments: Parallism will decide how fast you want to reset. -To be safe, you may use DryRun option for only printing some logs before actually executing it. +To be safe, you may use DryRun option for only printing some logs before actually executing it. For example, in the case of "Decision State Machine Panic", we might have to reset the workflows by command: *$nohup cadence --do samples-domain --env prod wf reset-batch --reason "fix outage" --query “WorkflowType=’SampleWorkflow’ AND CloseTime=missing” --dry-run --reset-type --non_deterministic_only --skip_base_not_current &> reset.log &* -For reset type, you may try LastDecisionCompleted. Then try FirstDecisionCompleted. We should also provide [firstPanicDecision](https://github.com/uber/cadence/issues/2952) resetType . 
+For reset type, you may try LastDecisionCompleted. Then try FirstDecisionCompleted. We should also provide [firstPanicDecision](https://github.com/cadence-workflow/cadence/issues/2952) resetType . ## AutoReset Workflow -Cadence also provides a command to reset all progress made by any binary given a binaryChecksum. +Cadence also provides a command to reset all progress made by any binary given a binaryChecksum. -The way it works is to store the first decision completed ID as an **auto-reset point** for any binaryChecksum. Then when a customer mark a binary checksum is bad, the badBinaryChecksum will be stored in domainConfig. Whenever an open workflow make any progress, it will reset the workflow to the auto-reset point. +The way it works is to store the first decision completed ID as an **auto-reset point** for any binaryChecksum. Then when a customer mark a binary checksum is bad, the badBinaryChecksum will be stored in domainConfig. Whenever an open workflow make any progress, it will reset the workflow to the auto-reset point. There are some limitations: -1. There are only a limited number(20 by default) of auto-reset points for each workflow. Beyond that the auto-reset points will be rotated. +1. There are only a limited number(20 by default) of auto-reset points for each workflow. Beyond that the auto-reset points will be rotated. -2. It only applies to open workflows. +2. It only applies to open workflows. 3. It only reset when the open workflow make a decision respond -4. [It could be much improved by this proposal, ](https://github.com/uber/cadence/issues/2810) - -However, you can use reset batch command to achieve the same to both open/closed workflows and without waiting for making decision respond. +4. [It could be much improved by this proposal, ](https://github.com/cadence-workflow/cadence/issues/2810) +However, you can use reset batch command to achieve the same to both open/closed workflows and without waiting for making decision respond. diff --git a/docs/persistence.md b/docs/persistence.md index f2188959560..9ce3a4aeb05 100644 --- a/docs/persistence.md +++ b/docs/persistence.md @@ -1,10 +1,10 @@ # Overview Cadence has a well defined API interface at the persistence layer. Any database that supports multi-row transactions on a single shard or partition can be made to work with cadence. This includes cassandra, dynamoDB, auroraDB, MySQL, -Postgres and may others. There are currently three supported database implementations at the persistence layer - +Postgres and may others. There are currently three supported database implementations at the persistence layer - cassandra and MySQL/Postgres. This doc shows how to run cadence with cassandra and MySQL(Postgres is mostly the same). It also describes the steps involved in adding support for a new database at the persistence layer. - + # Getting started on mac ## Cassandra ### Start cassandra @@ -23,8 +23,8 @@ make install-schema ``` cd $GOPATH/github.com/uber/cadence ./cadence-server start --services=frontend,matching,history,worker -``` - +``` + ## MySQL ### Start MySQL server ``` @@ -36,8 +36,8 @@ brew services start mysql cd $GOPATH/github.com/uber/cadence make install-schema-mysql ``` -When run tests and CLI command locally, Cadence by default uses a user `uber` with password `uber`, with privileges of creating databases. -You can use the following command to create user(role) and grant access. 
+When run tests and CLI command locally, Cadence by default uses a user `uber` with password `uber`, with privileges of creating databases. +You can use the following command to create user(role) and grant access. In the mysql shell: ``` > CREATE USER 'uber'@'%' IDENTIFIED BY 'uber'; @@ -64,7 +64,7 @@ postgres=# CREATE USER postgres WITH PASSWORD 'cadence'; CREATE ROLE postgres=# ALTER USER postgres WITH SUPERUSER; ALTER ROLE -``` +``` ### Install cadence schema ``` cd $GOPATH/github.com/uber/cadence @@ -81,16 +81,16 @@ cp config/development_postgres.yaml config/development.yaml # Configuration ## Common to all persistence implementations There are two major sub-subsystems within cadence that need persistence - cadence-core and visibility. cadence-core is -the workflow engine that uses persistence to store state tied to domains, workflows, workflow histories, task lists -etc. cadence-core powers almost all of the cadence APIs. cadence-core could be further broken down into multiple -subs-systems that have slightly different persistence workload characteristics. But for the purpose of simplicity, we -don't expose these sub-systems today but this may change in future. Visibility is the sub-system that powers workflow -search. This includes APIs such as ListOpenWorkflows and ListClosedWorkflows. Today, it is possible to run a cadence -server with cadence-core backed by one database and cadence-visibility backed by another kind of database.To get the full -feature set of visibility, the recommendation is to use elastic search as the persistence layer. However, it is also possible -to run visibility with limited feature set against Cassandra or MySQL today. The top level persistence configuration looks +the workflow engine that uses persistence to store state tied to domains, workflows, workflow histories, task lists +etc. cadence-core powers almost all of the cadence APIs. cadence-core could be further broken down into multiple +subs-systems that have slightly different persistence workload characteristics. But for the purpose of simplicity, we +don't expose these sub-systems today but this may change in future. Visibility is the sub-system that powers workflow +search. This includes APIs such as ListOpenWorkflows and ListClosedWorkflows. Today, it is possible to run a cadence +server with cadence-core backed by one database and cadence-visibility backed by another kind of database.To get the full +feature set of visibility, the recommendation is to use elastic search as the persistence layer. However, it is also possible +to run visibility with limited feature set against Cassandra or MySQL today. The top level persistence configuration looks like the following: - + ``` persistence: @@ -107,7 +107,7 @@ persistence: ``` ## Note on numHistoryShards -Internally, cadence uses shards to distribute workflow ownership across different hosts. Shards are necessary for the +Internally, cadence uses shards to distribute workflow ownership across different hosts. Shards are necessary for the horizontal scalability of cadence service. The number of shards for a cadence cluster is picked at cluster provisioning time and cannot be changed after that. One way to think about shards is the following - if you have a cluster with N shards, then cadence cluster can be of size 1 to N. But beyond N, you won't be able to add more hosts to scale. 
In future, @@ -122,7 +122,7 @@ persistence: datastore1: nosql: pluginName: "cassandra" - hosts: "127.0.0.1,127.0.0.2" -- CSV of cassandra hosts to connect to + hosts: "127.0.0.1,127.0.0.2" -- CSV of cassandra hosts to connect to User: "user-name" Password: "password" keyspace: "cadence" -- Name of the cassandra keyspace @@ -131,11 +131,11 @@ persistence: ``` ## MySQL/Postgres -The default isolation level for MySQL/Postgres is READ-COMMITTED. +The default isolation level for MySQL/Postgres is READ-COMMITTED. -Note that for MySQL 5.6 and below only, the isolation level needs to be +Note that for MySQL 5.6 and below only, the isolation level needs to be specified explicitly in the config via connectAttributes. - + ``` persistence: ... @@ -146,7 +146,7 @@ persistence: databaseName: "cadence" -- name of the database to connect to connectAddr: "127.0.0.1:3306" -- connection address, could be ip address or domain socket connectProtocol: "tcp" -- connection protocol, tcp or anything that SQL Data Source Name accepts - user: "uber" + user: "uber" password: "uber" maxConns: 20 -- max number of connections to sql server from one host (optional) maxIdleConns: 20 -- max number of idle conns to sql server from one host (optional) @@ -156,9 +156,9 @@ persistence: ``` ## Multiple SQL(MySQL/Postgres) databases -To run Cadence clusters in a much larger scale using SQL database, multiple databases can be used as a sharded SQL database cluster. +To run Cadence clusters in a much larger scale using SQL database, multiple databases can be used as a sharded SQL database cluster. -Set `useMultipleDatabases` to `true` and specify all databases' user/password/address using `multipleDatabasesConfig`: +Set `useMultipleDatabases` to `true` and specify all databases' user/password/address using `multipleDatabasesConfig`: ```yaml persistence: ... @@ -170,7 +170,7 @@ persistence: maxConnLifetime: "1h" -- max connection lifetime before it is discarded (optional) useMultipleDatabases: true -- this enabled the multiple SQL databases as sharded SQL cluster nShards: 4 -- the number of shards -- in this mode, it needs to be greater than one and equalt to the length of multipleDatabasesConfig - multipleDatabasesConfig: -- each entry will represent a shard of the cluster + multipleDatabasesConfig: -- each entry will represent a shard of the cluster - user: "root" password: "cadence" connectAddr: "127.0.0.1:3306" @@ -186,7 +186,7 @@ persistence: - user: "root" password: "cadence" connectAddr: "127.0.0.1:3306" - databaseName: "cadence3" + databaseName: "cadence3" ``` @@ -195,8 +195,8 @@ How Cadence implement the sharding: * Workflow execution and historyShard records are sharded based on historyShardID(which is calculated `historyShardID =hash(workflowID) % numHistoryShards` ), `dbShardID = historyShardID % numDBShards` * Workflow History is sharded based on history treeID(a treeID usually is the runID unless it has reset. In case of reset, it will share the same tree as the base run). In that case, `dbShardID = hash(treeID) % numDBShards` * Workflow tasks(for workflow/activity workers) is sharded based on domainID + tasklistName. `dbShardID = hash(domainID + tasklistName ) % numDBShards` -* Workflow visibility is sharded based on domainID like we said above. `dbShardID = hash(domainID ) % numDBShards` - * However, due to potential scalability issue, Cadence requires advanced visibility to run with multiple SQL database mode. +* Workflow visibility is sharded based on domainID like we said above. 
`dbShardID = hash(domainID ) % numDBShards`
+ * However, due to potential scalability issues, Cadence requires advanced visibility when running in multiple SQL database mode.
* Internal domain records is using single shard, it’s only writing when register/update domain, and read is protected by domainCache `dbShardID = DefaultShardID(0)`
* Internal queue records is using single shard. Similarly, the read/write is low enough that it’s okay to not sharded. `dbShardID = DefaultShardID(0)`
@@ -207,19 +207,19 @@ As there are many shared concepts and functionalities in SQL database, we abstra
This interface is tied to a specific schema i.e. the way data is laid out across tables and the table names
themselves are fixed. However, you get the flexibility wrt how you store the data within a table (i.e. column names and
-types are not fixed). The API interface can be found [here](https://github.com/uber/cadence/blob/master/common/persistence/sql/plugins/interfaces.go).
+types are not fixed). The API interface can be found [here](https://github.com/cadence-workflow/cadence/blob/master/common/persistence/sql/plugins/interfaces.go).
It's basically a CRUD API for every table in the schema. A sample schema definition for mysql that uses this interface
-can be found [here](https://github.com/uber/cadence/blob/master/schema/mysql/v8/cadence/schema.sql)
+can be found [here](https://github.com/cadence-workflow/cadence/blob/master/schema/mysql/v8/cadence/schema.sql)
-Any database that supports this interface can be plugged in with cadence server.
-We have implemented Postgres within the repo, and also here is [**an example**](https://github.com/longquanzheng/cadence-extensions/tree/master/cadence-sqlite) to implement any database externally.
+Any database that supports this interface can be plugged into the cadence server.
+We have implemented Postgres within the repo, and here is [**an example**](https://github.com/longquanzheng/cadence-extensions/tree/master/cadence-sqlite) of implementing a database externally.
## For other Non-SQL Database
For databases that don't support SQL operations like explicit transaction(with pessimistic locking), Cadence requires at least supporting:
 1. Multi-row single shard conditional write(also called LightWeight transaction by Cassandra terminology)
- 2. Strong consistency Read/Write operations
-
-This NoSQL persistence API interface can be found [here](https://github.com/uber/cadence/blob/master/common/persistence/nosql/nosqlplugin/interfaces.go).
-Currently this is only implemented with Cassandra. DynamoDB and MongoDB are in progress.
\ No newline at end of file
+ 2. Strong consistency Read/Write operations
+
+This NoSQL persistence API interface can be found [here](https://github.com/cadence-workflow/cadence/blob/master/common/persistence/nosql/nosqlplugin/interfaces.go).
+Currently this is only implemented for Cassandra. DynamoDB and MongoDB are in progress.
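+
+Whichever interface a plugin implements, it is selected in the persistence configuration by its plugin name. As a rough, hypothetical sketch for an external SQL plugin (assuming it registers itself under the name "sqlite"; the name and all connection values below are illustrative, not an officially supported configuration):
+```yaml
+persistence:
+  defaultStore: datastore1
+  datastores:
+    datastore1:
+      sql:
+        pluginName: "sqlite"        # name the external plugin registers itself under (illustrative)
+        databaseName: "cadence"
+        connectAddr: "cadence.db"   # how the address is interpreted is up to the plugin
+```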
diff --git a/docs/roadmap.md b/docs/roadmap.md index c01479b6b3a..c1847df6fd8 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -4,23 +4,23 @@ The following is a high-level quarterly roadmap of the [Cadence](https://cadence ## Q3 2019 -* [Resource-Specific Tasklist (a.k.a session)](https://github.com/uber/cadence/blob/master/docs/design/1533-host-specific-tasklist.md) -* [Visibility on Elastic Search](https://github.com/uber/cadence/blob/master/docs/visibility-on-elasticsearch.md) +* [Resource-Specific Tasklist (a.k.a session)](https://github.com/cadence-workflow/cadence/blob/master/docs/design/1533-host-specific-tasklist.md) +* [Visibility on Elastic Search](https://github.com/cadence-workflow/cadence/blob/master/docs/visibility-on-elasticsearch.md) * Scalable tasklist * MySQL support ## Q4 2019 -* [Multi-DC support for Cadence replication](https://github.com/uber/cadence/blob/master/docs/design/2290-cadence-ndc.md) +* [Multi-DC support for Cadence replication](https://github.com/cadence-workflow/cadence/blob/master/docs/design/2290-cadence-ndc.md) * Workflow history archival * Workflow visibility archival -* [Synchronous Request Reply](https://github.com/uber/cadence/blob/master/docs/design/2215-synchronous-request-reply.md) +* [Synchronous Request Reply](https://github.com/cadence-workflow/cadence/blob/master/docs/design/2215-synchronous-request-reply.md) * Postgres SQL support ## Q1 2020 * Service availability and reliability improvements -* [Domain level AuthN and AuthZ support](https://github.com/uber/cadence/issues/2833) +* [Domain level AuthN and AuthZ support](https://github.com/cadence-workflow/cadence/issues/2833) * Graceful domain failover design * UI bug fixes and performance improvements diff --git a/docs/scalable_tasklist.md b/docs/scalable_tasklist.md index 35ea3f62aeb..919d924d9cb 100644 --- a/docs/scalable_tasklist.md +++ b/docs/scalable_tasklist.md @@ -17,11 +17,11 @@ The partitions are organized in a tree-structure. The number of child nodes is c # Configuration The number of partitions of a tasklist is configured by 2 dynamicconfigs: -1. [matching.numTasklistReadPartitions](https://github.com/uber/cadence/blob/v1.2.13/common/dynamicconfig/constants.go#L3350) -2. [matching.numTasklistWritePartitions](https://github.com/uber/cadence/blob/v1.2.13/common/dynamicconfig/constants.go#L3344) +1. [matching.numTasklistReadPartitions](https://github.com/cadence-workflow/cadence/blob/v1.2.13/common/dynamicconfig/constants.go#L3350) +2. [matching.numTasklistWritePartitions](https://github.com/cadence-workflow/cadence/blob/v1.2.13/common/dynamicconfig/constants.go#L3344) We're migrating this configuration from dynamicconfig to database. The following dynamicconfig is used to control where to read the number of partitions from: -- [matching.enableGetNumberOfPartitionsFromCache](https://github.com/uber/cadence/blob/v1.2.15-prerelease02/common/dynamicconfig/constants.go#L4008) +- [matching.enableGetNumberOfPartitionsFromCache](https://github.com/cadence-workflow/cadence/blob/v1.2.15-prerelease02/common/dynamicconfig/constants.go#L4008) To update the number of partitions, use the following CLI command: ``` cadence admin tasklist update-partition -h @@ -33,13 +33,13 @@ cadence admin tasklist describe -h The tree-structure and forwarding mechanism is configured by these dynamicconfigs: -1. [matching.forwarderMaxChildrenPerNode](https://github.com/uber/cadence/blob/v1.2.13/common/dynamicconfig/constants.go#L3374) -2. 
[matching.forwarderMaxOutstandingPolls](https://github.com/uber/cadence/blob/v1.2.13/common/dynamicconfig/constants.go#L3356)
-3. [matching.forwarderMaxOutstandingTasks](https://github.com/uber/cadence/blob/v1.2.13/common/dynamicconfig/constants.go#L3362)
-4. [matching.forwarderMaxRatePerSecond](https://github.com/uber/cadence/blob/v1.2.13/common/dynamicconfig/constants.go#L3368)
+1. [matching.forwarderMaxChildrenPerNode](https://github.com/cadence-workflow/cadence/blob/v1.2.13/common/dynamicconfig/constants.go#L3374)
+2. [matching.forwarderMaxOutstandingPolls](https://github.com/cadence-workflow/cadence/blob/v1.2.13/common/dynamicconfig/constants.go#L3356)
+3. [matching.forwarderMaxOutstandingTasks](https://github.com/cadence-workflow/cadence/blob/v1.2.13/common/dynamicconfig/constants.go#L3362)
+4. [matching.forwarderMaxRatePerSecond](https://github.com/cadence-workflow/cadence/blob/v1.2.13/common/dynamicconfig/constants.go#L3368)
# Selection Algorithms
-The selection algorithms are implemented as a LoadBalancer in [client/matching package](https://github.com/uber/cadence/blob/v1.2.13/client/matching/loadbalancer.go#L37).
+The selection algorithms are implemented as a LoadBalancer in the [client/matching package](https://github.com/cadence-workflow/cadence/blob/v1.2.13/client/matching/loadbalancer.go#L37).
## Random Selection
This is the first algorithm and it's been widely adopted in production. It's completely stateless and uses a shared nothing architecture. The probabilistic model of discrete uniform distribution guarantees the fairness of the distribution of requests. And the utilization is improved by the tree-structure. For example, as shown in the diagram, if a task is produced to partition-5, but a poller is assigned to partition-3, we don't want the poller to wait at partition-3 for 60s and retry the poll request. And the retry has a 5/6 probability of not hitting partition-5. With the tree-structure and forwarding mechanism, the poller request and task are forwarded to root partition. So an idle poller waiting at partition-3 is utilized in this case.
diff --git a/docs/visibility-on-elasticsearch.md b/docs/visibility-on-elasticsearch.md
index 4d184c8134e..1442b0ff2f8 100644
--- a/docs/visibility-on-elasticsearch.md
+++ b/docs/visibility-on-elasticsearch.md
@@ -1,17 +1,17 @@
# Overview
-Cadence visibility APIs allow users to list open or closed workflows with filters such as WorkflowType or WorkflowID.
-With Cassandra, there are issues around scalability and performance, for example:
+Cadence visibility APIs allow users to list open or closed workflows with filters such as WorkflowType or WorkflowID.
+With Cassandra, there are issues around scalability and performance, for example:
- list large amount of workflows may kill Cassandra node.
- data is partitioned by domain, which means writing large amount of workflow to one domain will cause Cassandra nodes hotspots.
- query with filter is slow for large data. (With MySQL, there might be similar issues but not tested)
-In addition, Cadence want to support users to perform flexible query with multiple filters on even custom info.
-That's why Cadence add support for enhanced visibility features on top of ElasticSearch (Note as ES below), which includes:
+In addition, Cadence wants to let users perform flexible queries with multiple filters, even on custom info.
+That's why Cadence added support for enhanced visibility features on top of ElasticSearch (noted as ES below), which includes:
- new visibility APIs to List/Scan/Count workflows with SQL like query
- search attributes feature to support users provide custom info
-
+
# Quick Start
## Local Cadence Docker Setup
1. Increase docker memory to higher 6GB. Docker -> Preference -> advanced -> memory limit
@@ -19,9 +19,9 @@ That's why Cadence add support for enhanced visibility features on top of Elasti
3. Start cadence docker which contains Kafka, Zookeeper and ElasticSearch. Run `docker-compose -f docker-compose-es.yml up`
4. From docker output log, make sure ES and cadence started correctly. If encounter disk space not enough, try `docker system prune -a --volumes`
5. Register local domain and start using it. `cadence --do samples-domain d re`
-
-## CLI Search Attributes Support
+
+## CLI Search Attributes Support
Make sure Cadence CLI version is 0.6.4+
@@ -31,9 +31,9 @@ Make sure Cadence CLI version is 0.6.4+
cadence --do samples-domain wf list -q 'WorkflowType = "main.Workflow" and (WorkflowID = "1645a588-4772-4dab-b276-5f9db108b3a8" or RunID = "be66519b-5f09-40cd-b2e8-20e4106244dc")'
cadence --do samples-domain wf list -q 'WorkflowType = "main.Workflow" StartTime > "2019-06-07T16:46:34-08:00" and CloseTime = missing'
```
-To list only open workflows, add `CloseTime = missing` to the end of query.
+To list only open workflows, add `CloseTime = missing` to the end of the query.
-### start workflow with search attributes
+### start workflow with search attributes
```
cadence --do samples-domain workflow start --tl helloWorldGroup --wt main.Workflow --et 60 --dt 10 -i '"input arg"' -search_attr_key 'CustomIntField | CustomKeywordField | CustomStringField | CustomBoolField | CustomDatetimeField' -search_attr_value '5 | keyword1 | my test | true | 2019-06-07T16:16:36-08:00'
@@ -41,19 +41,19 @@ cadence --do samples-domain workflow start --tl helloWorldGroup --wt main.Workfl
Note: start workflow with search attributes but without ES will succeed as normal, but will not be searchable and will not be shown in list result.
-### search workflow with new list API
+### search workflow with new list API
```
cadence --do samples-domain wf list -q '(CustomKeywordField = "keyword1" and CustomIntField >= 5) or CustomKeywordField = "keyword2"' -psa
cadence --do samples-domain wf list -q 'CustomKeywordField in ("keyword2", "keyword1") and CustomIntField >= 5 and CloseTime between "2018-06-07T16:16:36-08:00" and "2019-06-07T16:46:34-08:00" order by CustomDatetimeField desc' -psa
```
-(Search attributes can be updated inside workflow, see example [here](https://github.com/uber-common/cadence-samples/tree/master/cmd/samples/recipes/searchattributes).
+(Search attributes can be updated inside a workflow; see the example [here](https://github.com/cadence-workflow/cadence-samples/tree/master/cmd/samples/recipes/searchattributes).)
# Details
## Dependencies
- Zookeeper - for Kafka to start
-- Kafka - message queue for visibility data
+- Kafka - message queue for visibility data
- ElasticSearch v6+ - for data search (early ES version may not support some queries)
## Configuration
@@ -71,9 +71,9 @@ persistence:
indices:
visibility: cadence-visibility-dev
```
-This part is used to config advanced visibility store to ElasticSearch.
- - `url` is for Cadence to discover ES
- - `indices/visibility` is ElasticSearch index name for the deployment.
+This part is used to configure the advanced visibility store to use ElasticSearch.
+ - `url` is the address Cadence uses to discover ES
+ - `indices/visibility` is the ElasticSearch index name for the deployment.
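+
+For reference, an expanded sketch of that configuration section is shown below; the datastore name, scheme, host and index name are illustrative values only:
+```yaml
+persistence:
+  advancedVisibilityStore: es-visibility
+  datastores:
+    es-visibility:
+      elasticsearch:
+        url:
+          scheme: "http"
+          host: "127.0.0.1:9200"
+        indices:
+          visibility: cadence-visibility-dev
+```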
Optional TLS Support can be enabled by setting the TLS config as follows:
```yaml
@@ -93,7 +93,7 @@ elasticsearch:
sslmode: false
```
-Also need to add a kafka topic to visibility, as shown below.
+You also need to add a Kafka topic for visibility, as shown below.
```
kafka:
...
@@ -102,12 +102,11 @@ kafka:
topic: cadence-visibility-dev
dlq-topic: cadence-visibility-dev-dlq
...
-```
+```
There are dynamic configs to control ElasticSearch visibility features:
-- `system.advancedVisibilityWritingMode` is an int property to control how to write visibility to data store.
-`"off"` means do not write to advanced data store,
-`"on"` means only write to advanced data store,
+- `system.advancedVisibilityWritingMode` is a property that controls how visibility data is written to the data store.
+`"off"` means do not write to the advanced data store,
+`"on"` means only write to the advanced data store,
`"dual"` means write to both DB (Cassandra or MySQL) and advanced data store
- `system.enableReadVisibilityFromES` is a boolean property to control whether Cadence List APIs should use ES as source or not.
-
diff --git a/service/worker/README.md b/service/worker/README.md
index 8ff74c26aec..85f2d5194fa 100644
--- a/service/worker/README.md
+++ b/service/worker/README.md
@@ -34,7 +34,7 @@ make install-schema-xdc
```
cadence --do sample domain register --ac cluster0 --cl cluster0 cluster1 cluster2
```
-Then run a helloworld from [Go Client Sample](https://github.com/uber-common/cadence-samples/) or [Java Client Sample](https://github.com/uber/cadence-java-samples)
+Then run a helloworld from [Go Client Sample](https://github.com/cadence-workflow/cadence-samples/) or [Java Client Sample](https://github.com/cadence-workflow/cadence-java-samples)
4. Failover a domain between clusters:
@@ -51,15 +51,15 @@ Failback to cluster0:
cadence --do sample samples-domain update --ac cluster0
```
-## Multiple region setup
-In a multiple region setup, use another set of config instead.
+## Multiple region setup
+In a multiple region setup, use a different set of configs instead.
```
./cadence-server --zone cross_region_cluster0 start
./cadence-server --zone cross_region_cluster1 start
./cadence-server --zone cross_region_cluster2 start
```
-Right now the only difference is at clusterGroupMetadata.clusterRedirectionPolicy.
+Right now the only difference is in `clusterGroupMetadata.clusterRedirectionPolicy`.
In multiple region setup, network communication overhead between clusters is high so should use "selected-apis-forwarding".
workflow/activity workers need to be connected to each cluster to keep high availability.
@@ -68,4 +68,4 @@ Archiver
Archiver is used to handle archival of workflow execution histories. It does this by hosting a cadence client worker
and running an archival system workflow. The archival client gets used to initiate archival through signal sending. The archiver
-shards work across several workflows.
+shards work across several workflows.
diff --git a/tools/cassandra/README.md b/tools/cassandra/README.md
index 82faa8d0966..82fa3c05c4e 100644
--- a/tools/cassandra/README.md
+++ b/tools/cassandra/README.md
@@ -2,7 +2,7 @@ This package contains the tooling for cadence cassandra operations.
## For localhost development
-```
+```
make install-schema
```
> NOTE: See [CONTRIBUTING](/CONTRIBUTING.md) for prerequisite of make command.
@@ -12,9 +12,9 @@ make install-schema ### Get the Cassandra Schema tool * Use brew to install CLI: `brew install cadence-workflow` which includes `cadence-cassandra-tool` * The schema files are located at `/usr/local/etc/cadence/schema/`. - * Follow the [instructions](https://github.com/uber/cadence/discussions/4457) if you need to install older versions of schema tools via homebrew. - However, easier way is to use new versions of schema tools with old versions of schemas. - All you need is to check out the older version of schemas from this repo. Run `git checkout v0.21.3` to get the v0.21.3 schemas in [the schema folder](/schema). + * Follow the [instructions](https://github.com/cadence-workflow/cadence/discussions/4457) if you need to install older versions of schema tools via homebrew. + However, easier way is to use new versions of schema tools with old versions of schemas. + All you need is to check out the older version of schemas from this repo. Run `git checkout v0.21.3` to get the v0.21.3 schemas in [the schema folder](/schema). * Or build yourself, with `make cadence-cassandra-tool`. See [CONTRIBUTING](/CONTRIBUTING.md) for prerequisite of make command. > Note: The binaries can also be found in the `ubercadence/server` docker images @@ -46,4 +46,3 @@ You can only upgrade to a new version after the initial setup done above. ./cadence-cassandra-tool -ep 127.0.0.1 -k cadence_visibility update-schema -d ./schema/cassandra/visibility/versioned -v x.x -y -- executes a dryrun of upgrade to version x.x ./cadence-cassandra-tool -ep 127.0.0.1 -k cadence_visibility update-schema -d ./schema/cassandra/visibility/versioned -v x.x -- actually executes the upgrade to version x.x ``` - diff --git a/tools/sql/README.md b/tools/sql/README.md index 6dc974bbc58..f3a2758454e 100644 --- a/tools/sql/README.md +++ b/tools/sql/README.md @@ -1,10 +1,10 @@ ## Using the SQL schema tool - + This package contains the tooling for cadence sql operations. The tooling itself is agnostic of the storage engine behind the sql interface. So, this same tool can be used against, say, OracleDB and MySQLDB ## For localhost development -``` +``` SQL_USER=$USERNAME SQL_PASSWORD=$PASSWD make install-schema-mysql ``` > NOTE: See [CONTRIBUTING](/CONTRIBUTING.md) for prerequisite of make command. @@ -14,8 +14,8 @@ SQL_USER=$USERNAME SQL_PASSWORD=$PASSWD make install-schema-mysql ### Get the SQL Schema tool * Use brew to install CLI: `brew install cadence-workflow` which includes `cadence-sql-tool` * The schema files are located at `/usr/local/etc/cadence/schema/`. - * Follow the [instructions](https://github.com/uber/cadence/discussions/4457) if you need to install older versions of schema tools via homebrew. - However, easier way is to use new versions of schema tools with old versions of schemas. + * Follow the [instructions](https://github.com/cadence-workflow/cadence/discussions/4457) if you need to install older versions of schema tools via homebrew. + However, easier way is to use new versions of schema tools with old versions of schemas. All you need is to check out the older version of schemas from this repo. Run `git checkout v0.21.3` to get the v0.21.3 schemas in [the schema folder](/schema). * Or build yourself, with `make cadence-sql-tool`. See [CONTRIBUTING](/CONTRIBUTING.md) for prerequisite of make command. @@ -47,4 +47,3 @@ You can only upgrade to a new version after the initial setup done above. 
./cadence-sql-tool --ep $SQL_HOST_ADDR -p $port --plugin mysql --db cadence_visibility update-schema -d ./schema/mysql/v8/visibility/versioned -v x.x --dryrun -- executes a dryrun of upgrade to version x.x ./cadence-sql-tool --ep $SQL_HOST_ADDR -p $port --plugin mysql --db cadence_visibility update-schema -d ./schema/mysql/v8/visibility/versioned -v x.x -- actually executes the upgrade to version x.x ``` -