chaos engineering section in user guides: FIS experiment for devs; FI…
…S experiment for architects; outages for infra team (WIP); FIS on webapp template (pending UI PR)
tinyg210 committed Nov 7, 2023
1 parent 5f0f672 commit 5bcdc6c
Showing 9 changed files with 655 additions and 0 deletions.
22 changes: 22 additions & 0 deletions content/en/user-guide/chaos-engineering/_index.md
@@ -0,0 +1,22 @@
---
title: "Chaos Engineering"
linkTitle: "Chaos Engineering"
weight: 11
description: >
Chaos Engineering with LocalStack enables you to build resilient systems early on in the development phase.
cascade:
type: docs
---

## Introduction

Chaos engineering with LocalStack is a proactive approach to building resilient systems by introducing
controlled disruptions. The practice varies in its application: for software developers, it might
mean testing application behavior and error handling; for architects, ensuring the robustness of the system design; and for
operations teams, examining the reliability of infrastructure provisioning. By integrating chaos experiments early
in the development cycle, teams can uncover and address potential weaknesses, producing systems that withstand
turbulent conditions. The subchapters of this section walk through some of these scenarios using examples:

- **Software behavior and error handling** using Fault Injection Simulator experiments.
- **Robust architecture** as a result of Route53 failover tested with FIS experiments.
- **Infrastructure provisioning reliability** when faced with outages and anomalies, as part of automated provisioning processes.
280 changes: 280 additions & 0 deletions content/en/user-guide/chaos-engineering/fis-experiments/index.md
@@ -0,0 +1,280 @@
---
title: "Fault Injection Simulator Experiments"
linkTitle: "Fault Injection Simulator Experiments"
weight: 1
description: Perform controlled experiments on your AWS infrastructure, allowing you to simulate faults and observe their impact to build more resilient applications.
---

## Introduction

AWS Fault Injection Simulator (FIS) is a service that facilitates controlled chaos engineering experiments on AWS
infrastructure to identify weaknesses and enhance system resilience. It provides a framework for injecting failures
and monitoring their effects, enabling developers to proactively prepare for real-world outages.

## Getting started

This guide is designed for users new to the Fault Injection Simulator and assumes basic knowledge of the AWS CLI and our
[`awslocal`](https://github.com/localstack/awscli-local) wrapper script. For more details on the FIS service, please
refer to the dedicated [documentation page](/user-guide/aws/fis/).


This example uses AWS Fault Injection Simulator (FIS) to cause a controlled outage of a DynamoDB database in order to
test software behavior and error handling. This kind of test helps ensure that the software handles
database downtime gracefully, for example by queuing requests to prevent data loss, so that the system
can keep operating despite partial failures. You can follow along with the full solution
in this GitHub [repository](https://github.com/localstack-samples/samples-chaos-engineering/tree/main/FIS-experiments).

Start LocalStack using the `docker-compose.yml` file from the repository and make sure you provide your API key as an environment
variable:

{{< command >}}
$ export LOCALSTACK_API_KEY=<YOUR_LOCALSTACK_API_KEY>
$ docker compose up
{{< /command >}}

{{< figure src="fis-experiment-1.png" >}}

The resources are created when LocalStack starts.
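
Before moving on, it can help to confirm that LocalStack is reachable. The following is a minimal sketch, assuming the default edge port `4566` and LocalStack's `/_localstack/health` endpoint; the function name `wait_for_localstack` and the retry settings are illustrative. It only confirms that the gateway responds, not that the sample's resources have finished deploying.

```shell
# Poll the LocalStack health endpoint until it responds or attempts run out.
# Assumptions: default edge port 4566; function name and retry count are illustrative.
wait_for_localstack() {
  url="${1:-http://localhost:4566/_localstack/health}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error responses
    if curl -sf "$url" > /dev/null; then
      echo "LocalStack is up"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "LocalStack did not become ready in time" >&2
  return 1
}
```

Calling `wait_for_localstack` in a second terminal blocks until the container answers, which is convenient in scripts that run the later steps automatically.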

## Creating an experiment template

Before creating any FIS experiments, let's make sure the system works as expected by creating an entity and persisting it.
We'll call the API Gateway endpoint with a POST request via cURL:

```bash
$ curl --location 'http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi' \
--header 'Content-Type: application/json' \
--data '{
"id": "prod-2004",
"name": "Ultimate Gadget",
"price": "49.99",
"description": "The Ultimate Gadget is the perfect tool for tech enthusiasts looking for the next level in gadgetry. Compact, powerful, and loaded with features."
}
'

Product added/updated successfully.
```

We create a file called `experiment-ddb.json` containing the FIS experiment definition. This JSON configuration is passed
to the `CreateExperimentTemplate` API of the FIS service in the next step.

```bash
$ cat experiment-ddb.json
{
"actions": {
"Test action 1": {
"actionId": "localstack:generic:api-error",
"parameters": {
"service": "dynamodb",
"api": "all",
"percentage": "100",
"exception": "DynamoDbException",
"errorCode": "500"
}
}
},
"description": "Template for interfering with the DynamoDB service",
"stopConditions": [{
"source": "none"
}],
"roleArn": "arn:aws:iam:000000000000:role/ExperimentRole"
}
```

With this template definition we are targeting all APIs of the DynamoDB service. Specific operations, such as `PutItem` or `GetItem`, can also
be specified, but in this case we want to cut off the database completely. This configuration results in a 100% failure rate
for all DynamoDB API calls, each returning an HTTP 500 status code with a `DynamoDbException`.
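
To limit the fault to a single operation instead of the whole service, the template parameters could be adjusted. The sketch below assumes that an operation name such as `PutItem` goes in the `api` parameter in place of `all`, and that `percentage` accepts values below 100; the helper `make_fis_template` is hypothetical and simply prints a template in the same shape as `experiment-ddb.json`.

```shell
# Print an FIS experiment template that fails a share of calls
# to one API operation of a service.
# Assumption: an operation name such as PutItem is accepted in "api".
make_fis_template() {
  service="$1"     # e.g. dynamodb
  api="$2"         # e.g. PutItem, or all
  percentage="$3"  # e.g. 50
  cat <<EOF
{
  "actions": {
    "Test action 1": {
      "actionId": "localstack:generic:api-error",
      "parameters": {
        "service": "${service}",
        "api": "${api}",
        "percentage": "${percentage}",
        "exception": "DynamoDbException",
        "errorCode": "500"
      }
    }
  },
  "description": "Template for interfering with ${service} ${api}",
  "stopConditions": [{ "source": "none" }],
  "roleArn": "arn:aws:iam:000000000000:role/ExperimentRole"
}
EOF
}
```

For example, `make_fis_template dynamodb PutItem 50 > experiment-ddb-putitem.json` would write a template intended to fail roughly half of the `PutItem` calls while leaving reads untouched.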

```bash
$ awslocal fis create-experiment-template --cli-input-json file://experiment-ddb.json
{
"experimentTemplate": {
"id": "895591e8-11e6-44c4-adc3-86592010562b",
"description": "Template for interfering with the DynamoDB service",
"actions": {
"Test action 1": {
"actionId": "localstack:generic:api-error",
"parameters": {
"service": "dynamodb",
"api": "all",
"percentage": "100",
"exception": "DynamoDbException",
"errorCode": "500"
}
}
},
"stopConditions": [
{
"source": "none"
}
],
"creationTime": 1699308754.415716,
"lastUpdateTime": 1699308754.415716,
"roleArn": "arn:aws:iam:000000000000:role/ExperimentRole"
}
}
```

We take note of the template ID for the next command.
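
Rather than copying the ID by hand, it can be captured in a shell variable. This is a sketch using the AWS CLI's global `--query` and `--output` options, which `awslocal` passes through; the function name `create_template` is illustrative. Note that it creates a new template each time it is called, so it is meant as a replacement for the command above, not an addition to it.

```shell
# Create the experiment template and print only its ID.
# --query takes a JMESPath expression; --output text drops the JSON quoting.
create_template() {
  awslocal fis create-experiment-template \
    --cli-input-json "file://$1" \
    --query 'experimentTemplate.id' \
    --output text
}
```

With this in place, `TEMPLATE_ID=$(create_template experiment-ddb.json)` stores the ID for later commands.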

## Starting the experiment

Based on the experiment template that was just created, a new experiment can be started, using the template ID.

```bash
$ awslocal fis start-experiment --experiment-template-id 895591e8-11e6-44c4-adc3-86592010562b
{
"experiment": {
"id": "1b1238fd-316d-4956-93e7-5ada677a6f69",
"experimentTemplateId": "895591e8-11e6-44c4-adc3-86592010562b",
"roleArn": "arn:aws:iam:000000000000:role/ExperimentRole",
"state": {
"status": "running"
},
"actions": {
"Test action 1": {
"actionId": "localstack:generic:api-error",
"parameters": {
"service": "dynamodb",
"api": "all",
"percentage": "100",
"exception": "DynamoDbException",
"errorCode": "500"
}
}
},
"stopConditions": [
{
"source": "none"
}
],
"creationTime": 1699308823.74327,
"startTime": 1699308823.74327
}
}
```

## The outage

Now that the experiment has started, the database is inaccessible, meaning the user can neither retrieve nor add
products. The API Gateway returns an Internal Server Error. This is problematic: as anyone who has worked
with enterprise applications can tell you, downtime and data loss are two things that must be avoided.
Luckily, the issue has been caught early in the development phase, so the developer can add proper error handling and a mechanism
that prevents data loss if the database is unavailable. This is not limited to DynamoDB; an outage can be
simulated for any storage resource.
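
One way to observe the outage from the outside is to assert on the HTTP status code that the endpoint returns. The sketch below uses curl's `-w '%{http_code}'` write-out option; the helper name `expect_status` is illustrative, and the product in the usage comment is made-up test data.

```shell
# Send a request with curl and compare the HTTP status code to an expected value.
# Usage: expect_status EXPECTED_CODE [curl arguments...]
expect_status() {
  expected="$1"
  shift
  # -o /dev/null discards the body; -w prints only the status code
  actual="$(curl -s -o /dev/null -w '%{http_code}' "$@")" || true
  if [ "$actual" = "$expected" ]; then
    echo "OK: got HTTP $actual"
  else
    echo "FAIL: expected HTTP $expected, got $actual" >&2
    return 1
  fi
}

# Example (same endpoint as the earlier POST request, illustrative payload):
#   expect_status 500 \
#     --header 'Content-Type: application/json' \
#     --data '{"id": "prod-3001", "name": "Test", "price": "1.00", "description": "test"}' \
#     'http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi'
```

While the experiment is running and before any error handling is added, the example call would be expected to report a 500, matching the Internal Server Error described above.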

## The solution

![fis-experiment-2](fis-experiment-2.png)

One possible solution is to deploy an SNS topic, an SQS queue, and a Lambda function that picks up the queued item and retries the
`PutItem` operation on the database. If DynamoDB is still unavailable, the item is re-queued. With this in place, a request sent
during the outage is queued instead of lost:

```bash
$ curl --location 'http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi' \
--header 'Content-Type: application/json' \
--data '{
"id": "prod-1003",
"name": "Super Widget",
"price": "29.99",
"description": "A versatile widget that can be used for a variety of purposes. Durable, reliable, and affordable."
}
'

A DynamoDB error occurred. Message sent to queue.
```

If we check the logs, we can see that the `DynamoDbException` is handled gracefully:

```bash
2023-11-06T22:21:40.789 DEBUG --- [ asgi_gw_2] l.services.fis.handler : FIS handler called with configs: {'dynamodb': {None: [(100, 'DynamoDbException', '500')]}}
2023-11-06T22:21:40.789 INFO --- [ asgi_gw_2] localstack.request.aws : AWS dynamodb.PutItem => 500 (DynamoDbException)
2023-11-06T22:21:40.834 DEBUG --- [ asgi_gw_4] l.services.sns.publisher : Topic 'arn:aws:sns:us-east-1:000000000000:ProductEventsTopic' publishing '5520d37a-fc21-4a73-b1bf-f9b9afce5908' to subscribed
'arn:aws:sqs:us-east-1:000000000000:ProductEventsQueue' with protocol 'sqs' (subscription 'arn:aws:sns:us-east-1:000000000000:ProductEventsTopic:0a4abf8c-744a-404a-9ff9-f132e25d1b30')
```

The item now sits in the queue until the outage is over.
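
To confirm that the message is waiting, the approximate depth of the queue can be inspected. This is a sketch, assuming the queue name `ProductEventsQueue` seen in the log output above; `queue_depth` is an illustrative helper. Because the Lambda function keeps receiving and re-queuing the message during the outage, it may show up as not visible (in flight) rather than as available.

```shell
# Print the approximate number of available and in-flight messages of a queue.
queue_depth() {
  # Resolve the queue URL from its name first
  queue_url="$(awslocal sqs get-queue-url \
    --queue-name "$1" \
    --query 'QueueUrl' --output text)" || return 1
  awslocal sqs get-queue-attributes \
    --queue-url "$queue_url" \
    --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
    --query 'Attributes' --output json
}
```

Running `queue_depth ProductEventsQueue` while the experiment is active should show at least one message across the two counters.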

## Stopping the experiment

We can stop the experiment by using the following command:

```bash
$ awslocal fis stop-experiment --id 1b1238fd-316d-4956-93e7-5ada677a6f69
{
"experiment": {
"id": "1b1238fd-316d-4956-93e7-5ada677a6f69",
"experimentTemplateId": "895591e8-11e6-44c4-adc3-86592010562b",
"roleArn": "arn:aws:iam:000000000000:role/ExperimentRole",
"state": {
"status": "stopped"
},
"actions": {
"Test action 1": {
"actionId": "localstack:generic:api-error",
"parameters": {
"service": "dynamodb",
"api": "all",
"percentage": "100",
"exception": "DynamoDbException",
"errorCode": "500"
},
"startTime": 1699308823.750742,
"endTime": 1699309736.259625
}
},
"stopConditions": [
{
"source": "none"
}
],
"creationTime": 1699308823.74327,
"startTime": 1699308823.74327,
"endTime": 1699309736.259646
}
}
```
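
For automated tests, starting and stopping can be wrapped around an arbitrary command so that the experiment never outlives the test. This is a sketch, assuming `awslocal` is on the `PATH`; `with_experiment` is a hypothetical helper and not part of LocalStack.

```shell
# Run a command while an FIS experiment is active, then stop the experiment
# whether or not the command succeeded.
# Usage: with_experiment TEMPLATE_ID command [args...]
with_experiment() {
  template_id="$1"
  shift
  experiment_id="$(awslocal fis start-experiment \
    --experiment-template-id "$template_id" \
    --query 'experiment.id' --output text)" || return 1

  # Run the command under test and remember its exit code
  rc=0
  "$@" || rc=$?

  awslocal fis stop-experiment --id "$experiment_id" > /dev/null
  return "$rc"
}
```

For instance, `with_experiment 895591e8-11e6-44c4-adc3-86592010562b ./run-checks.sh` would run a (hypothetical) check script during the outage and return that script's exit code after the experiment has been stopped.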

The experiment ID passed to `stop-experiment` comes from the output of the earlier `start-experiment` command.
Now that the experiment has stopped, the product that initially failed to reach the database has been written to it.
We can verify this by scanning the table:

```bash
$ awslocal dynamodb scan --table-name Products
{
"Items": [
{
"name": {
"S": "Super Widget"
},
"description": {
"S": "A versatile widget that can be used for a variety of purposes. Durable, reliable, and affordable."
},
"id": {
"S": "prod-1003"
},
"price": {
"N": "29.99"
}
},
{
"name": {
"S": "Ultimate Gadget"
},
"description": {
"S": "The Ultimate Gadget is the perfect tool for tech enthusiasts looking for the next level in gadgetry. Compact, powerful, and loaded with features."
},
"id": {
"S": "prod-2004"
},
"price": {
"N": "49.99"
}
}
],
"Count": 2,
"ScannedCount": 2,
"ConsumedCapacity": null
}
```
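
A full scan is fine for a small demo table. To check for one specific product, a `get-item` lookup is an alternative. This is a sketch, assuming `id` is the table's partition key of type string (the scan output suggests this but does not confirm it); `product_exists` is a hypothetical helper.

```shell
# Succeed if a product with the given ID exists in the Products table.
# Assumption: "id" (string) is the table's partition key.
product_exists() {
  item="$(awslocal dynamodb get-item \
    --table-name Products \
    --key "{\"id\": {\"S\": \"$1\"}}" \
    --query 'Item.id.S' --output text)" || return 1
  # In text mode the CLI prints "None" when the query matches nothing
  [ "$item" = "$1" ]
}
```

A call such as `product_exists prod-1003 && echo "prod-1003 was persisted"` can then serve as the final assertion of an automated chaos test.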
8 changes: 8 additions & 0 deletions content/en/user-guide/chaos-engineering/fis-webapp/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
---
title: "WebApp Fault Injection Simulator"
linkTitle: "WebApp Fault Injection Simulator"
weight: 1
description: WebApp Fault Injection Simulator
---

## Introduction
14 changes: 14 additions & 0 deletions content/en/user-guide/chaos-engineering/outages-extension/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
---
title: "Outages Extension"
linkTitle: "Outages Extension"
weight: 1
description: Outages Extension
---

## Introduction

Outages Extension

## Getting started

<Demo coming>