re-write fis experiments doc
HarshCasper committed Nov 15, 2023
1 parent 8d63a37 commit bdec2c5
Showing 4 changed files with 140 additions and 107 deletions.
216 changes: 117 additions & 99 deletions content/en/user-guide/chaos-engineering/fis-experiments/index.md
@@ -7,59 +7,58 @@ description: Conduct experiments on your AWS infrastructure to simulate faults a

## Introduction

AWS Fault Injection Simulator (FIS) is a service that facilitates controlled chaos engineering experiments on AWS
infrastructure to identify weaknesses and enhance system resilience. It provides a framework for injecting failures
and monitoring their effects, enabling developers to proactively prepare for real-world outages.
Fault Injection Simulator (FIS) is a service designed for conducting controlled chaos engineering tests on AWS infrastructure. Its purpose is to uncover vulnerabilities and improve system robustness. FIS offers a means to deliberately introduce failures and observe their impacts, helping developers to better equip their systems against actual outages. To read about the FIS service, refer to the dedicated [FIS documentation](https://docs.localstack.cloud/user-guide/aws/fis/).

## Getting started

This guide is designed for users new to the Fault Injection Simulator and assumes basic knowledge of the AWS CLI and our
[`awslocal`](https://github.com/localstack/awscli-local) wrapper script. To read extensively about the FIS service, please
refer to the dedicated [documentation page](https://docs.localstack.cloud/user-guide/aws/fis/).
[`awslocal`](https://github.com/localstack/awscli-local) wrapper script. In this example, we will use FIS to create controlled outages in a DynamoDB database. The aim is to test the software's behavior and error-handling capabilities.

For this particular example, we'll be using a [sample application repository](https://github.com/localstack-samples/samples-chaos-engineering/tree/main/FIS-experiments). Clone the repository, and follow the instructions below to get started.

In this example of utilizing AWS Fault Injection Simulator (FIS) to cause controlled outages to a DynamoDB database we will
demonstrate testing software behavior and error handling. This kind of test helps to ensure that the software can handle
database downtime gracefully by implementing strategies such as queuing requests to prevent data loss. This proactive error
handling ensures that the system can maintain its operations despite partial failures. You can follow along with the full solution
in this GitHub [repository]().
### Prerequisites

Start LocalStack using the `docker-compose.yml` file from the repository and make sure you provide your API key as an environment
variable:
The general prerequisites for this guide are:

- LocalStack Pro with [LocalStack API key](https://docs.localstack.cloud/getting-started/api-key/)
- [AWS CLI](https://docs.localstack.cloud/user-guide/integrations/aws-cli/) with the [`awslocal` wrapper](https://docs.localstack.cloud/user-guide/integrations/aws-cli/#localstack-aws-cli-awslocal)
- [Docker](https://docs.docker.com/get-docker/) and [Docker Compose](https://docs.docker.com/compose/install/)

Start LocalStack using the `docker-compose.yml` file from the repository. Make sure to set your API key as an environment variable before doing so. The cloud resources are created automatically when LocalStack starts.

{{< command >}}
$ export LOCALSTACK_API_KEY=<YOUR_LOCALSTACK_API_KEY>
$ docker compose up
{{< /command >}}

{{< figure src="fis-experiment-1.png" >}}
### Application Architecture

The resources will be created upon the LocalStack start.
The following diagram shows the architecture that this application builds and deploys:

## Creating an experiment template
{{< figure src="fis-experiment-1.png" width="800">}}

Before creating any FIS experiments, let's make sure our system works as expected by creating an entity and persist it.
We'll call the API Gateway endpoint for the POST method via cURL:
### Creating an experiment template

```bash
$ curl --location 'http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi' \
--header 'Content-Type: application/json' \
--data '{
"id": "prod-2004",
"name": "Ultimate Gadget",
"price": "49.99",
"description": "The Ultimate Gadget is the perfect tool for tech enthusiasts looking for the next level in gadgetry. Compact, powerful, and loaded with features."
}
'
Before starting any FIS experiments, it's important to verify that our application is functioning correctly. Start by creating an entity and saving it. To do this, use `cURL` to call the API Gateway endpoint for the POST method:

{{< command >}}
$ curl --location 'http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi' \
--header 'Content-Type: application/json' \
--data '{
"id": "prod-2004",
"name": "Ultimate Gadget",
"price": "49.99",
"description": "The Ultimate Gadget is the perfect tool for tech enthusiasts looking for the next level in gadgetry. Compact, powerful, and loaded with features."
}'
<disable-copy>
Product added/updated successfully.
```
</disable-copy>
{{< /command >}}

We create a file containing the FIS experiment called `experiment-ddb.json`. This has a JSON configuration that will be utilized
during the subsequent invocation of the `CreateExperimentTemplate` API in the FIS resource.
You can use the file named `experiment-ddb.json` that contains the FIS experiment configuration. This file will be used in the upcoming call to the [`CreateExperimentTemplate`](https://docs.aws.amazon.com/fis/latest/APIReference/API_CreateExperimentTemplate.html) API within the FIS resource.

```bash
$ cat experiment-ddb.json
$ cat experiment-ddb.json
{
"actions": {
"Test action 1": {
@@ -81,12 +80,13 @@ during the subsequent invocation of the `CreateExperimentTemplate` API in the FI
}
```

With this template definition we are targeting all APIs of the DynamoDB resource. Specific operations, such as `PutItem` or `GetItem` can also
be specified, but in this case, we just want to cut off the database completely. This configuration will result in a 100% failure rate
for all API calls, each accompanied by an HTTP 500 status code, with a DynamoDbException.
This template is designed to target all APIs of the DynamoDB resource. While it's possible to specify particular operations like `PutItem` or `GetItem`, the objective here is to entirely disconnect the database.

```bash
As a result, this configuration will cause all API calls to fail with a 100% failure rate, each resulting in an HTTP 500 status code and a `DynamoDbException`.
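
To make the effect of such a template concrete, here is a toy Python model of percentage-based fault injection. It is only a sketch: the configuration shape is copied from the FIS handler debug log shown later in this guide, while the function name `injected_fault`, the lookup order, and the reading of `None` as "every operation" are assumptions made for illustration, not LocalStack's actual implementation.

```python
import random

# Shape copied from the LocalStack debug log:
# {service: {operation or None: [(percentage, exception, status_code)]}}
# Assumption: None as the operation key means "every operation".
FAULTS = {"dynamodb": {None: [(100, "DynamoDbException", "500")]}}


def injected_fault(service, operation, faults=FAULTS, rng=random.random):
    """Return (exception, status_code) if the call should fail, else None."""
    per_service = faults.get(service, {})
    # Hypothetical precedence: operation-specific rules before the catch-all.
    rules = per_service.get(operation) or per_service.get(None) or []
    for percentage, exception, status_code in rules:
        # rng() is in [0, 1), so a 100% rule matches on every call.
        if rng() * 100 < percentage:
            return exception, status_code
    return None


print(injected_fault("dynamodb", "PutItem"))  # always fails at 100%
print(injected_fault("sqs", "SendMessage"))   # None: SQS is not targeted
```

Lowering the percentage in such a rule would model intermittent failures rather than a full outage.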

{{< command >}}
$ awslocal fis create-experiment-template --cli-input-json file://experiment-ddb.json
<disable-copy>
{
"experimentTemplate": {
"id": "895591e8-11e6-44c4-adc3-86592010562b",
@@ -113,16 +113,18 @@ $ awslocal fis create-experiment-template --cli-input-json file://experiment-ddb
"roleArn": "arn:aws:iam:000000000000:role/ExperimentRole"
}
}
```
</disable-copy>
{{< /command >}}

We take note of the template ID for the next command.
Take note of the `id` field in the response. This is the ID of the experiment template that will be used in the next step.
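
If you script these steps, the template ID can be read from the JSON response instead of being copied by hand. The following Python sketch assumes the response has the `experimentTemplate.id` structure shown above; the helper name `start_experiment_command` is hypothetical, and the sample response is trimmed to the one field that matters.

```python
import json
import shlex


def start_experiment_command(create_template_output):
    """Build the start-experiment command from a create-template response."""
    response = json.loads(create_template_output)
    template_id = response["experimentTemplate"]["id"]
    return [
        "awslocal", "fis", "start-experiment",
        "--experiment-template-id", template_id,
    ]


# Trimmed sample response; a real one contains more fields.
sample = '{"experimentTemplate": {"id": "895591e8-11e6-44c4-adc3-86592010562b"}}'
print(shlex.join(start_experiment_command(sample)))
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting issues.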

## Starting the experiment
### Starting the experiment

Based on the experiment template that was just created, a new experiment can be started, using the template ID.
After creating the experiment template, you can start a new experiment using the template's ID.

```bash
$ awslocal fis start-experiment --experiment-template-id 895591e8-11e6-44c4-adc3-86592010562b
{{< command >}}
$ awslocal fis start-experiment --experiment-template-id <EXPERIMENT_TEMPLATE_ID>
<disable-copy>
{
"experiment": {
"id": "1b1238fd-316d-4956-93e7-5ada677a6f69",
@@ -152,40 +154,40 @@ $ awslocal fis start-experiment --experiment-template-id 895591e8-11e6-44c4-adc3
"startTime": 1699308823.74327
}
}
```
</disable-copy>
{{< /command >}}

## The outage
Replace the `<EXPERIMENT_TEMPLATE_ID>` placeholder with the ID of the experiment template that was created in the previous step.

Now that the experiment is started, the database will be inaccessible, meaning the user can't retrieve and can't add any new
products. The API Gateway will return an Internal Server Error. This is obviously problematic, as anyone who has ever worked
with enterprise applications can tell you, downtime and data loss are two things crucial to avoid.
Luckily, this potential issue has been caught early enough in the development phase, that the developer can include proper error handling and a mechanism
that prevents data loss in case of an outage of the database. This of course is not limited to DynamoDB, an outage can be
simulated for any storage resource.
### Simulating an outage

## The solution
Once the experiment starts, the database becomes inaccessible. This means users cannot retrieve or add new products, resulting in the API Gateway returning an Internal Server Error. Downtime and data loss are critical issues to avoid in enterprise applications.

![fis-experiment-2](fis-experiment-2.png)
Fortunately, encountering this issue early in the development phase allows developers to implement effective error handling and develop mechanisms to prevent data loss during a database outage.

The potential solution could be deploying an SNS topic, an SQS queue and a Lambda function that will pick up the queued element and retry the
`PutItem` operation on the database. In case DynamoDB is still unavailable, the item will be re-queued.
It's important to note that this approach is not limited to DynamoDB; outages can be simulated for any storage resource.

```bash
### Setting up a solution

{{< figure src="fis-experiment-2.png" width="800">}}

A possible solution involves setting up an SNS topic, an SQS queue, and a Lambda function. The Lambda function will be responsible for retrieving queued items and attempting to re-execute the `PutItem` operation on the database. If DynamoDB remains unavailable, the item will be placed back in the queue for a later retry.
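
The queue-and-retry idea can be sketched in plain Python without any AWS services. This is a simplified in-memory model, not the sample application's code: `FlakyTable` stands in for DynamoDB, a `deque` stands in for the SNS topic and SQS queue, and `drain_queue` plays the role of the retry Lambda. All of these names are hypothetical.

```python
from collections import deque


class FlakyTable:
    """Stand-in for the DynamoDB table; fails while `outage` is True."""

    def __init__(self):
        self.outage = True
        self.items = {}

    def put_item(self, item):
        if self.outage:
            raise RuntimeError("DynamoDbException (simulated)")
        self.items[item["id"]] = item


def handle_request(table, queue, item):
    """API handler: write directly, or queue the item if the write fails."""
    try:
        table.put_item(item)
        return "Product added/updated successfully."
    except RuntimeError:
        queue.append(item)
        return "A DynamoDB error occurred. Message sent to queue."


def drain_queue(table, queue):
    """Retry worker: one pass over the queue, re-queueing failed items."""
    for _ in range(len(queue)):
        item = queue.popleft()
        try:
            table.put_item(item)
        except RuntimeError:
            queue.append(item)  # still down, retry on the next pass


table, queue = FlakyTable(), deque()
print(handle_request(table, queue, {"id": "prod-1003", "name": "Super Widget"}))
drain_queue(table, queue)   # outage continues: the item stays queued
table.outage = False        # the experiment has been stopped
drain_queue(table, queue)   # the retry now succeeds
print(sorted(table.items), len(queue))
```

The key property is that a failed write is never dropped: the item either lands in the table or goes back on the queue.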

{{< command >}}
$ curl --location 'http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi' \
--header 'Content-Type: application/json' \
--data '{
"id": "prod-1003",
"name": "Super Widget",
"price": "29.99",
"description": "A versatile widget that can be used for a variety of purposes. Durable, reliable, and affordable."
}
'

A DynamoDB error occurred. Message sent to queue.⏎

```
--header 'Content-Type: application/json' \
--data '{
"id": "prod-1003",
"name": "Super Widget",
"price": "29.99",
"description": "A versatile widget that can be used for a variety of purposes. Durable, reliable, and affordable."
}'
<disable-copy>
A DynamoDB error occurred. Message sent to queue.
</disable-copy>
{{< /command >}}

If we check the logs, we can see that the `DynamoDbException` is handled gracefully:
Reviewing the logs shows that the `DynamoDbException` was handled gracefully:

```bash
2023-11-06T22:21:40.789 DEBUG --- [ asgi_gw_2] l.services.fis.handler : FIS handler called with configs: {'dynamodb': {None: [(100, 'DynamoDbException', '500')]}}
@@ -194,14 +196,15 @@ If we check the logs, we can see that the `DynamoDbException` is handled gracefu
'arn:aws:sqs:us-east-1:000000000000:ProductEventsQueue' with protocol 'sqs' (subscription 'arn:aws:sns:us-east-1:000000000000:ProductEventsTopic:0a4abf8c-744a-404a-9ff9-f132e25d1b30')
```

Now this element sits in the queue, until the outage is over.
This element will remain in the queue until the outage is resolved.

## Stopping the experiment
### Stopping the experiment

We can stop the experiment by using the following command:
To stop the experiment, use the following command:

```bash
$ awslocal fis stop-experiment --id 1b1238fd-316d-4956-93e7-5ada677a6f69
{{< command >}}
$ awslocal fis stop-experiment --id <EXPERIMENT_ID>
<disable-copy>
{
"experiment": {
"id": "1b1238fd-316d-4956-93e7-5ada677a6f69",
@@ -234,14 +237,16 @@ $ awslocal fis stop-experiment --id 1b1238fd-316d-4956-93e7-5ada677a6f69
"endTime": 1699309736.259646
}
}
```
</disable-copy>
{{< /command >}}

The experiment ID comes from the prior used `start-experiment` command.
The experiment has been stopped, meaning that the Product that initially has not reached the database, has finally reached
the destination. We can verify that by scanning the database:
Replace the `<EXPERIMENT_ID>` placeholder with the experiment ID returned by the `start-experiment` command.

```bash
The experiment has been terminated, allowing the Product that initially failed to reach the database to finally be stored successfully. This can be confirmed by scanning the database.
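
For an automated check, the scan output can be parsed rather than inspected by eye. The sketch below assumes the table's key attribute is a string named `id`, as in the request payloads; the actual key schema is not shown in this guide. The helper `stored_ids` and the sample output are illustrative only.

```python
import json


def stored_ids(scan_output, key="id"):
    """Collect the string-typed `key` attribute from every scanned item."""
    items = json.loads(scan_output).get("Items", [])
    return {item[key]["S"] for item in items if key in item and "S" in item[key]}


# Hand-written sample in DynamoDB's attribute-value JSON format.
sample = json.dumps({
    "Items": [{"id": {"S": "prod-2004"}}, {"id": {"S": "prod-1003"}}],
    "Count": 2,
    "ScannedCount": 2,
    "ConsumedCapacity": None,
})
print("prod-1003" in stored_ids(sample))
```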

{{< command >}}
$ awslocal dynamodb scan --table-name Products
<disable-copy>
{
"Items": [
{
@@ -277,11 +282,12 @@ $ awslocal dynamodb scan --table-name Products
"ScannedCount": 2,
"ConsumedCapacity": null
}
```
</disable-copy>
{{< /command >}}

## Adding latency
### Configuring the latency

The LocalStack FIS service is also capable of adding latency by using the following experiment template:
The LocalStack FIS service can also introduce latency using the following experiment template:

```bash
{
@@ -302,11 +308,11 @@ The LocalStack FIS service is also capable of adding latency by using the follow
"roleArn": "arn:aws:iam:000000000000:role/ExperimentRole"
}
```
Save this template as `latency-experiment.json` and use it to create an experiment template through the FIS service:

Let's add this experiment definition to a JSON file and create an experiment template via the FIS service:

```bash
{{< command >}}
$ awslocal fis create-experiment-template --cli-input-json file://latency-experiment.json
<disable-copy>
{
"experimentTemplate": {
"id": "966f5632-4e2c-4567-b99c-436c333e523f",
@@ -329,23 +335,35 @@ $ awslocal fis create-experiment-template --cli-input-json file://latency-experi
"roleArn": "arn:aws:iam:000000000000:role/ExperimentRole"
}
}
</disable-copy>
$ awslocal fis start-experiment --experiment-template-id <EXPERIMENT_TEMPLATE_ID>
{{< /command >}}

$ awslocal fis start-experiment --experiment-template-id 966f5632-4e2c-4567-b99c-436c333e523f
```
Replace the `<EXPERIMENT_TEMPLATE_ID>` placeholder with the ID of the experiment template that was created in the previous step.

With the experiment active, we can try using the same sample stack to better understand what happens when there's a 4 second delay on
each service call:
While the experiment is active, you can use the same sample stack to observe and understand the effects of a 4-second delay on each service call.
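
One thing a latency experiment typically exposes is whether callers enforce a deadline. The following Python sketch illustrates this with a simulated slow call; it does not contact LocalStack, the function names are hypothetical, and the delays are scaled down (0.4 s instead of 4 s) so that it runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def slow_call(delay_seconds):
    """Stand-in for a service call delayed by the latency experiment."""
    time.sleep(delay_seconds)
    return "ok"


def call_with_deadline(delay_seconds, deadline_seconds):
    """Return the result, or 'timed out' if the deadline passes first."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_call, delay_seconds)
        try:
            return future.result(timeout=deadline_seconds)
        except TimeoutError:
            return "timed out"


print(call_with_deadline(0.4, 0.1))   # deadline shorter than the delay
print(call_with_deadline(0.05, 1.0))  # deadline longer than the delay
```

Note that the worker thread still runs to completion after the deadline passes; the deadline only bounds how long the caller waits for the result.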

```bash
curl --location 'http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi' \
--header 'Content-Type: application/json' \
--data '{
"id": "prod-1088",
"name": "Super Widget",
"price": "29.99",
"description": "A versatile widget that can be used for a variety of purposes. Durable, reliable, and affordable."
}
'
An error occurred (InternalError) when calling the GetResources operation (reached max retries: 4): Failing as per Fault Injection Simulator configuration⏎
```
{{< command >}}
$ curl --location 'http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi' \
--header 'Content-Type: application/json' \
--data '{
"id": "prod-1088",
"name": "Super Widget",
"price": "29.99",
"description": "A versatile widget that can be used for a variety of purposes. Durable, reliable, and affordable."
}'
<disable-copy>
An error occurred (InternalError) when calling the GetResources operation (reached max retries: 4): Failing as per Fault Injection Simulator configuration
</disable-copy>
{{< /command >}}

## Web Application

The LocalStack Web Application provides a dashboard for conducting FIS experiments in user stacks. This control panel offers several FIS experiment options, including:

- **500 Internal Error**: This experiment randomly terminates incoming requests, returning an 'internal error' with a response code of 500.
- **Service Unavailable**: This test causes all calls to specified services to receive a 503 'service unavailable' response.
- **AWS Region Unavailable**: This experiment simulates regional outages and failovers by disabling entire AWS regions.
- **Latency**: This test introduces specified latency to every API call, useful for simulating network latency or degraded network performance.

{{< figure src="FIS-Dashboard.png" width="900" >}}
@@ -11,7 +11,11 @@ description: Use LocalStack Outages Extension to mimic service outages by testin

## Getting started

This guide is designed for users who are new to Outages Extension. For this particular example, we'll be using a Terraform configuration file from a [sample application repository](https://github.com/localstack-samples/samples-chaos-engineering/tree/main/extension-outages). We'll simulate partial outages by interrupting specific services, such as halting an ECS instance creation or disrupting a database service. By closely watching Terraform's responses and the status of AWS resources, you'll learn how Terraform manages these disruptions.
This guide is designed for users who are new to Outages Extension. We'll simulate partial outages by interrupting specific services, such as halting an ECS instance creation or disrupting a database service. By closely watching Terraform's responses and the status of AWS resources, you'll learn how Terraform manages these disruptions.

For this particular example, we'll be using a Terraform configuration file from a [sample application repository](https://github.com/localstack-samples/samples-chaos-engineering/tree/main/extension-outages). Clone the repository, and follow the instructions below to get started.

### Prerequisites

The general prerequisites for this guide are:

@@ -20,6 +24,13 @@ The general prerequisites for this guide are:
- [Docker](https://docs.docker.com/get-docker/) and [Docker Compose](https://docs.docker.com/compose/install/)
- [Terraform](https://www.terraform.io/downloads.html) and [`tflocal` wrapper](https://docs.localstack.cloud/user-guide/integrations/terraform/#tflocal-wrapper-script).

Start LocalStack using the `docker-compose.yml` file from the repository. Make sure to set your API key as an environment variable before doing so.

{{< command >}}
$ export LOCALSTACK_API_KEY=<YOUR_LOCALSTACK_API_KEY>
$ docker compose up
{{< /command >}}

### Installing the extension

To install the LocalStack Outages Extension, first set up your LocalStack API key in your environment. Once the API key is configured, use the command below to install the extension: