CI Pipeline
For our CI pipeline we've chosen Concourse. In their own words, Concourse is a "pipeline-based continuous thing-doer". If you're unfamiliar with it, it lets you define pipelines in yaml. Pipelines are composed of resources and jobs that act on those resources. Resources can be anything from git repositories to docker images and s3 buckets. Jobs are a series of steps, like getting or uploading a resource, or running a generic shell script. Every step runs inside a container, and containers can run on both linux and windows. In fact, Concourse actually uses Garden to run these containers. For more information, please have a look at the docs and the examples pages.
On the Garden team, we manage our own Concourse deployment using BOSH, and you can look at the dashboard over at garden.ci.cf-app.com.
As mentioned, Concourse uses yaml to define a pipeline, and our yaml definitions can be found in the garden-ci GitHub repository. It contains the pipeline definitions as well as all the BOSH deployments of Garden that the pipeline runs tests against. The yaml is under the ci directory and has the following structure:
ci/
├── pipelines/ -> definitions for pipelines
│   ├── common -> resources or jobs used in more than one pipeline
│   ├── garden-godoc -> definitions for the garden-godoc pipeline
│   └── main -> definitions for the main pipeline
├── scripts/ -> scripts used by tasks
├── tasks/ -> task yaml definitions
└── vars/ -> yaml files containing [variables](https://concourse-ci.org/vars.html) used when reconfiguring the pipeline
Additionally, the structure of a pipeline directory (like ci/pipelines/main) can be found here.
The tasks and scripts directories do not contain all of the task definitions that the pipelines use. Other repositories contain task and script definitions in the ci directory in the root of the repository. Here is a list of them:
- garden-runc-release
- grootfs
- garden
- groot
- garden-performance-acceptance-tests
- idmapper
- cpu-entitlement-plugin
For example, the guardian job, which runs the tests for guardian, has its task definition and script in garden-runc-release.
If you want to change any of those task definitions and test them locally before pushing them and running them in the web ui, you can use fly, the Concourse cli. You can download fly from the homepage of Concourse or from their release page on GitHub. Here is how to run the aforementioned guardian task:
fly -t garden-ci execute --inputs-from main/garden --image garden-ci-image -p -c "$HOME/workspace/garden-runc-release/ci/tasks/guardian.yml" -i gr-release-develop="$HOME/workspace/garden-runc-release"
You have to provide all inputs declared in the task yaml, some of which might be outputs of other tasks. For more information, please check out the fly help page.
The scripts/remote-fly script in garden-runc-release makes this even easier:
"$HOME/workspace/garden-runc-release/scripts/remote-fly" "$HOME/workspace/garden-runc-release/ci/tasks/guardian.yml"
All Concourse jobs in Garden CI run on top of the cfgarden/garden-ci docker image. This image contains all dependencies of our tests and the scripts used to run them. If you need to add something to that image, please do the following:
- modify the garden-ci Dockerfile and build a new version of the image by running make garden-ci
- tag the image with the desired new version (check the latest version here)
- push the new version of the image
- change the version of the concourse resource and reconfigure the pipeline
- run garden-runc-release/scripts/test -a to make sure nothing is broken (repeat the steps above if it is)
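The build, tag and push steps above can be sketched as a small shell helper. This is only an illustration, not a script from our repositories: the function name image_release_commands is hypothetical, and it assumes that make garden-ci produces a local image called garden-ci, which may not match the actual Makefile. It prints the commands instead of running them, so you can review them first.

```shell
# Hypothetical helper: print the commands needed to publish a new
# cfgarden/garden-ci image. Nothing is executed here.
# Assumption: `make garden-ci` leaves a local image named "garden-ci".
image_release_commands() {
  version="$1"
  if [ -z "$version" ]; then
    echo "usage: image_release_commands <new-version>" >&2
    return 1
  fi
  echo "make garden-ci"
  echo "docker tag garden-ci cfgarden/garden-ci:${version}"
  echo "docker push cfgarden/garden-ci:${version}"
}
```

Calling image_release_commands with the new version prints three lines that you can copy into a terminal once you are happy with them. Bumping the version of the concourse resource and reconfiguring the pipeline still has to be done separately.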
To apply your changes to the pipeline you simply need to run the scripts/reconfigure-pipeline script in the garden-ci repository. Note that you need LastPass access to the garden shared vault in order to do this. Also keep in mind that in order to change tasks or scripts you need to push to the repository containing them.
To upgrade the Concourse version, check out the desired version tag in the concourse-bosh-deployment submodule. Then just run the deploy script on the bosh deployment in eden. If you run into problems, check out this wiki page for some troubleshooting suggestions.
The main pipeline has several groups (tabs) which do different things. They are:
- gating - this is the part of the pipeline that gets triggered after a push to garden-runc-release, so make sure everything goes green after committing a new change. It will run all the tests for individual components like guardian and garden. If those pass, it will create a release candidate and deploy this version with different features enabled. Go to this page for a complete list.
- non-gating - this is a collection of jobs that do various things, most notably running benchmarks against a CF deployment (how long a cf push and a cf scale take) and uploading new releases of stemcells.
- periodics - the periodics are the same tests that run against the different garden environments in the gating pipeline. They run every 40 minutes and are there to catch flakes.
- release - running the garden-runc-shipit job will create a new release on GitHub, upload the new bosh release to bosh.io and advance the master branch. Similarly, running cpu-entitlement-plugin-shipit will release the cpu-entitlement-plugin.
- dependachore - deploys dependachore, a Google Cloud Function that moves the dependabot PR stories generated in the Garden tracker to their own section in the icebox and converts them to chores.
- groot - runs groot tests for both linux and windows (also has a periodic). It is separate since it is not a part of garden-runc-release.
- cpu-entitlements-plugin - runs tests for the cpu-entitlement-plugin
There is only one other pipeline: garden-godoc. Since Garden is a client meant as the main way for users to interact with us, it needs to have its godocs refreshed whenever we make a change.
As already mentioned above, garden-runc-release is released by running the garden-runc-shipit job. By default this will result in a new patch version release. If you want to bump the major or minor version, you should run garden-runc-bump-major-version or garden-runc-bump-minor-version respectively. The version is a semver concourse resource backed by a file in an s3 bucket.
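To make the difference between the three kinds of bump concrete, here is a minimal sketch of the semantic versioning arithmetic. It is not how the semver resource is implemented, and the function name bump_version is made up for this example. It only handles plain MAJOR.MINOR.PATCH versions without pre-release suffixes.

```shell
# Hypothetical helper: compute the next version for a given bump type.
# Usage: bump_version <major.minor.patch> [major|minor|patch]
bump_version() {
  version="$1"
  part="${2:-patch}"        # a patch bump is the default, as with shipit

  major="${version%%.*}"    # everything before the first dot
  rest="${version#*.}"      # everything after the first dot
  minor="${rest%%.*}"
  patch="${rest#*.}"

  case "$part" in
    major) major=$((major + 1)); minor=0; patch=0 ;;
    minor) minor=$((minor + 1)); patch=0 ;;
    patch) patch=$((patch + 1)) ;;
    *) echo "unknown bump type: $part" >&2; return 1 ;;
  esac

  echo "${major}.${minor}.${patch}"
}
```

For example, bump_version 1.2.3 prints 1.2.4, bump_version 1.2.3 minor prints 1.3.0 and bump_version 1.2.3 major prints 2.0.0. Note that bumping a higher component resets the lower ones to zero.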
When you hit shipit the following steps take place:
- A new bosh release is created from the latest release candidate
- The bosh release tarball is uploaded to an s3 bucket (so that it is available on https://bosh.io/releases/github.com/cloudfoundry/garden-runc-release)
- A new github draft release is created with the all-in-one garden binary attached
- The new bosh release yml is committed and pushed to master
- The garden-runc-merge-master job merges the release commit back into develop
If everything is successful, you need to add some release notes to the draft github release and make it public. If something fails, I have bad news for you. You have to determine what part of the release process succeeded and what failed and decide what to do - either retrigger and hope nothing corrupt was pushed to s3/github or sacrifice a patch version and try again.
Given the number of tests running in our pipeline, it often happens that a couple of tests fail sporadically. We call this a snowflake, or just flake for short. Sometimes these flakes indicate problems in the code, but more often they are test problems or simply instabilities of the environments. We tend to ignore problems that occur just once or twice or are related to external factors (e.g. docker hub is down), but if a problem persists we look into it in order to make our code and tests more stable.
But how do you know if a test failure is a one-off or a regular flake? We have built a tool to search for particular test failures in the Concourse history. It is called flake-hunter. Here is how you use it:
concourse-flake-hunter -c https://garden.ci.cf-app.com/ -n main search <regexp>
The tool will keep listing matching build failures until you press Ctrl+C or the history is exhausted. This way you can determine how a flake behaves over time.
For more information on our experiences with flakes, check out this blog post by one of Garden's former team members.
Another nasty type of test failure is when a test hangs indefinitely. This is hard to spot, since it looks the same as any other running job, but it blocks the pipeline. In order to spot these problems we run all garden-runc-release tests with a tool called slowmobius. Slowmobius watches the tests and, if they take too long, fails the job and posts to slack. The timeout as well as the slowmobius slack icon can be configured here. The pipeline needs to be reloaded for the changes to take effect.
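The general idea of failing a hanging test run after a deadline can be sketched with the timeout command from GNU coreutils. This is not how slowmobius works internally, and the function name run_with_deadline is hypothetical. It also only prints a message where slowmobius would post to slack.

```shell
# Hypothetical watchdog: run a command, but give up after a deadline.
# Usage: run_with_deadline <seconds> <command> [args...]
# Assumption: GNU coreutils `timeout` is installed; it exits with
# status 124 when the command had to be killed for running too long.
run_with_deadline() {
  seconds="$1"
  shift
  status=0
  timeout "$seconds" "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    echo "timed out after ${seconds}s: $*" >&2
  fi
  return "$status"
}
```

A command that finishes in time keeps its own exit status, so a wrapped test script still fails or passes as usual. Only a run that exceeds the deadline is reported as a timeout.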