This project is part of Udacity's "Site Reliability Engineer" Nanodegree program.
The project demonstrates how to deploy highly available infrastructure to AWS using Terraform.
The first step is to deploy the infrastructure that Prometheus and Grafana will run on.
Next, the deployed servers are used to build an SLO/SLI dashboard.
The code deploys highly available infrastructure to AWS across multiple zones using Terraform.
Besides the monitoring stack deployed on AWS EKS, the setup includes an RDS database cluster with a replica in the alternate zone.
- Enable the VPC to have IPs in multiple availability zones
- Configure replication of the secondary database (RDS-S) from the primary database
- Add a load balancer (ALB) along with the VPC for zone 2
- Tackle the "Reference to undefined provider" warning on `main.tf` line 41, in module "vpc_west" (`aws = aws.usw1`); see the provider-alias sketch below
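A minimal sketch of the provider-alias fix for that warning, assuming the zone1 `main.tf` needs an aliased us-west-1 provider handed to the `vpc_west` module (the module path and names here are assumptions, not the starter code's actual contents):

```hcl
# Hypothetical sketch: declare the aliased provider in the root module and pass it
# to the module, so that "aws = aws.usw1" no longer references an undefined provider.
provider "aws" {
  alias  = "usw1"
  region = "us-west-1"
}

module "vpc_west" {
  source = "./modules/vpc" # assumed module path

  providers = {
    aws = aws.usw1 # maps the module's default aws provider to the aliased one
  }
}
```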
- AWS CLI Configuration basics
- AWS CLI Configuration and credential file settings
- AWS CLI Environment variables to configure the AWS CLI
- Google SRE Book
- Building Secure and Reliable Systems - a book by Google
- SLI/SLO article
- AWS DR strategies
- The Role of SREs in Observability
- Benefits of Observability for Site Reliability Engineers
Clone the appropriate Git repo with the starter code. There will be two folders, `zone1` and `zone2`; this is where you will run the code from in your AWS CloudShell terminal.
Remote:
Open your AWS console and ensure it is set to the us-east-1 region.
Open CloudShell by clicking the shell icon in the toolbar at the top, near the search box.
Locally:
Set up your AWS credentials from the Udacity AWS Gateway locally:
- https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
- Set your region to `us-east-1`
Based on Set Up the AWS CLI, proceed as follows:
- Install the AWS CLI and verify it with `aws --version`
- Create a new profile with `aws configure --profile <profile_name>` and set:
  - AWS Access Key ID
  - AWS Secret Access Key
  - Default Region (e.g. us-east-1)
  - Default Output Format (i.e. json)
- List the available profiles with `aws configure list-profiles`
- Switch the profile depending on your OS:
  - Linux and macOS -> `export AWS_PROFILE=admin`
  - Windows Command Prompt -> `setx AWS_PROFILE admin`
  - PowerShell -> `$Env:AWS_PROFILE="admin"`
- Show the currently used profile with `aws configure list`
- Verify the currently active profile with `aws sts get-caller-identity`
- As an example, list the S3 buckets accessible from a profile with `aws s3 ls --profile <profile_name>`
- Remove unwanted profiles (or add them manually) by editing the config files:
  - `vi ~/.aws/credentials`
  - `vi ~/.aws/config`
Restore the image:

```
aws ec2 create-restore-image-task --object-key ami-0ec6fdfb365e5fc00.bin --bucket udacity-srend --name "udacity-<your_name>"
```

Copy the AMI to us-east-2 and us-west-1:

```
aws ec2 copy-image --source-image-id <your-ami-id-from-above> --source-region us-east-1 --region us-east-2 --name "udacity-<your_name>"
aws ec2 copy-image --source-image-id <your-ami-id-from-above> --source-region us-east-1 --region us-west-1 --name "udacity-<your_name>"
```

- Make note of the AMI IDs output by the two commands above. You'll need to put them in the `ec2.tf` file for `zone1` (for us-east-2) and in the `ec2.tf` file for `zone2` (for us-west-1), respectively. A sketch of where the AMI ID lands is shown below.
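For reference, here is a minimal sketch of where the copied AMI ID typically ends up in `ec2.tf`; the resource name and the other arguments are assumptions about the starter code, not its actual contents:

```hcl
# Hypothetical ec2.tf fragment -- only the "ami" value needs to be replaced
# with the copy-image output for the region this folder targets.
resource "aws_instance" "ubuntu_web" {
  ami           = "ami-xxxxxxxxxxxxxxxxx" # AMI ID from the copy-image command
  instance_type = "t3.micro"              # assumed instance size
  key_name      = "udacity"               # assumed key pair name
}
```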
Close your CloudShell. Change your region to us-east-2.
From the AWS console create an S3 bucket in us-east-2, e.g. `s3-udacity-terraform-us-east-2` - click next until created.
- Update `_config.tf` in the `zone1` folder with your S3 bucket name, e.g. `s3-udacity-terraform-us-east-2`
- NOTE: S3 bucket names MUST be globally unique!
Change your region to us-west-1.
From the AWS console create an S3 bucket in us-west-1, e.g. `s3-udacity-terraform-us-west-1` - click next until created.
- Update `_config.tf` in the `zone2` folder with your S3 bucket name, e.g. `s3-udacity-terraform-us-west-1` (a minimal backend sketch follows this list)
- NOTE: S3 bucket names MUST be globally unique!
- Do this in BOTH us-east-2 and us-west-1
- Name the key `udacity`
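For orientation, here is a minimal sketch of what the S3 backend block in `_config.tf` typically looks like; the attribute values (bucket name and state key) are assumptions and should be made to match your starter code and the bucket you just created:

```hcl
# Hypothetical zone1/_config.tf backend sketch -- use your own globally unique bucket name.
terraform {
  backend "s3" {
    bucket = "s3-udacity-terraform-us-east-2" # bucket created in us-east-2
    key    = "terraform.tfstate"              # state object key
    region = "us-east-2"                      # us-west-1 and its bucket for the zone2 folder
  }
}
```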
Set up your CloudShell. Open CloudShell in the us-east-2 region and install the following:
- `helm`

```
export VERIFY_CHECKSUM=false
curl -sSL https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
```

- `terraform`

```
wget https://releases.hashicorp.com/terraform/1.0.7/terraform_1.0.7_linux_amd64.zip
unzip terraform_1.0.7_linux_amd64.zip
mkdir ~/bin
mv terraform ~/bin
export TF_PLUGIN_CACHE_DIR="/tmp"
```

- `kubectl`

```
curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.21.2/2021-07-05/bin/linux/amd64/kubectl
chmod +x ./kubectl
mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$PATH:$HOME/bin
echo 'export PATH=$PATH:$HOME/bin' >> ~/.bashrc
```
- Clone the starter code from the Git repo into a folder in CloudShell
- `cd` into the `zone1` folder
- Run `terraform init`
- Run `terraform apply` or `terraform apply -auto-approve`

NOTE: The first time you run `terraform apply` you may see errors about the Kubernetes namespace or an RDS error. Running it again AND performing the step below (No. 8) should clear up those errors.
- Delete the `~/.kube/config` file locally, otherwise you will see `Tried to insert into contexts, which is a <class 'NoneType'> not a <class 'list'>`
- Run `aws eks --region us-east-2 update-kubeconfig --name udacity-cluster`
- Get the <cluster_name> from the command above and use it in the next step
- Change the Kubernetes context to the new AWS cluster using the <cluster_name> from above via `kubectl config use-context <cluster_name>`
  - e.g. `arn:aws:eks:us-east-2:139802095464:cluster/udacity-cluster`
- Confirm with `kubectl get pods --all-namespaces`
- Then run `kubectl create namespace monitoring`
9.1. Copy the public IP address of your Ubuntu-Web EC2 instance.
- Log in to the AWS console and copy the public IP address of your Ubuntu-Web EC2 instance.
- Ensure you are in the us-east-2 region.

9.2. Set the public IP of your Ubuntu Web instance for Prometheus.
- Edit the `prometheus-additional.yaml` file and replace the `<public_ip>` entries with the public IP of your Ubuntu Web instance. Save the file.

Optional: Transfer `prometheus-additional.yaml` to CloudShell via Git (in case you want to install Prometheus and Grafana from CloudShell):
- `git commit` (locally, after setting the IP in the previous step)
- `git push origin master` (updates the prometheus-additional.yaml file with the actual IP)
- `git pull` (from CloudShell, in order to run commands from there with the updated IP)
Install via Helm
- Change directories to your project directory (`cd ../..`) and run:

```
kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml --namespace monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack -f "values.yaml" --namespace monitoring
```

- If the `helm install` above doesn't work out, try the install without `values.yaml`:

```
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring
```
Via Port forwarding

```
kubectl -n monitoring port-forward svc/prometheus-grafana 8888:80
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 8889:9090
```

- Point your local web browser to http://localhost:8888 for Grafana access and http://localhost:8889 for Prometheus access.
Via Load Balancer
- Get the DNS name of the load balancer provisioned for Grafana access.
- You can find this by opening your AWS console, going to EC2 -> Load Balancers, and selecting the provisioned load balancer.
- Its DNS name is listed there; copy and paste it into your web browser to access Grafana.
Login
- Log in to Grafana with `admin` as the username and `prom-operator` as the password.
- Install Postman from here.
- See the additional instructions for importing the collection and environment files.
Open Postman and load the files `SRE-Project-postman-collection.json` and `SRE-Project.postman_environment.json`.
- At the top level of the project in Postman, create the `public-ip`, `email`, and `token` variables in the Postman file with the public IP you gathered above, and click Save. You can choose whatever you like for the email; see the next step for the token.
- Run the `Initialize the Database` and `Register a User` tasks in Postman by clicking the "Send" button at the top. The register task will output a token. Use this token to set the `token` variable (under "Auth").
- Run `Create Event` for 100 iterations by clicking the top-level `SRE Project` folder on the left-hand side, selecting just `Create Event`, and clicking the Run icon in the toolbar.
- Run `Get All Events` for 100 iterations by clicking the top-level `SRE Project` folder on the left-hand side, selecting just `Get All Events`, and clicking the Run icon in the toolbar.
- Optional: Run the Postman runners to generate some traffic. Use 100 iterations.
- Create an SLO/SLI document such as the template here. Fill in the SLI column with a description of what the combination of category and SLO represents. You'll implement these 4 categories in 4 panels in Grafana using Prometheus queries later on. https://tableconvert.com is a good tool for creating tables in Markdown; it is recommended because Markdown tables can get hard to read in a pure text editor.
- Create a document that details the infrastructure. This is an exercise to identify assets for failover. You will also define basic DR steps for your infrastructure. Your organization has provided you with a requirements document for the infrastructure. Please see this document for a template to use.
- Open Grafana in your web browser
  - Create a new dashboard with 4 panels. The Prometheus datasource you can pull data from should already be added. The Flask exporter exposes metrics for the EC2 instances provisioned during the install. Please note: while making a panel display its information in a sensible unit (percentage, milliseconds, etc.) is good, it is not strictly a requirement. The backend query and data representation are more important. The same goes for colors and the type of graph displayed.
  - Create the 4 SLO/SLI panels as defined in the SLO/SLI document. The 4 panel categories will be availability (availability), remaining error budget (error budget), successful requests per second (throughput), and the time within which 90% of requests finish (latency). See https://github.com/rycus86/prometheus_flask_exporter for more information on potential metrics to use.
  - NOTE: You will not see the goal SLO numbers in your dashboard, and that is fine. The application doesn't have enough traffic or run time to generate 99% availability or a meaningful error budget.
  - Please submit the Prometheus queries you use for your dashboards in the `prometheus_queries.md` file linked here.
  - Please take a screenshot of your created dashboard and include it as part of your submission for the project.
- Deploy the infrastructure to zone1
  - You will need to make sure the infrastructure is highly available. Please see the `requirements.md` document here for details on the requirements for making the infrastructure HA. You will modify your code to meet those requirements. Note that not all regions have the same number of availability zones; you will need to look up the AZs for `us-east-2`. You will get errors when first running the code that you will have to fix!
  - For the application load balancer, please note the technical requirements (a hedged ALB sketch follows this list):
    - It will attach to the Ubuntu VMs on port 80.
    - It should listen on port 80.
  - Make the appropriate changes to your code, `cd` into your `zone1` folder, then run `terraform init` and `terraform apply`.
  - Please take a screenshot of a successful Terraform run and include that as part of your submission for the project.
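For orientation only, here is a minimal sketch of the kind of ALB wiring those requirements describe; every resource name, module output, and instance reference below is an assumption to adapt to the starter modules, not the project's actual code:

```hcl
# Hypothetical names throughout -- adapt to the variables/outputs in the starter code.
data "aws_availability_zones" "available" {} # look up the AZs that actually exist in us-east-2

resource "aws_lb" "web" {
  name               = "udacity-web-alb"
  load_balancer_type = "application"
  subnets            = module.vpc_east.public_subnet_ids # assumed module output
}

resource "aws_lb_target_group" "web" {
  name     = "udacity-web-tg"
  port     = 80                     # traffic is forwarded to the Ubuntu VMs on port 80
  protocol = "HTTP"
  vpc_id   = module.vpc_east.vpc_id # assumed module output
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.web.arn
  port              = 80            # the ALB listens on port 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}

resource "aws_lb_target_group_attachment" "ubuntu_web" {
  target_group_arn = aws_lb_target_group.web.arn
  target_id        = aws_instance.ubuntu_web.id # assumed EC2 resource name
  port             = 80
}
```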
- Deploy the infrastructure to zone2 (DR)
  - You will need to make sure the infrastructure is highly available. Please see the `requirements.md` document here for details on the requirements for making the infrastructure HA. You will modify your code to meet those requirements. Note that not all regions have the same number of availability zones; you will need to look up the AZs for `us-west-1`. You will get errors when first running the code that you will have to fix in the `zone1` `main.tf` file.
  - You will need to update the bucket name in the `_data.tf` file under the `zone2` folder to reflect the name of the bucket you provisioned in `us-east-2` earlier (a remote-state sketch follows this list).
  - For the application load balancer, please note the technical requirements:
    - It will attach to the Ubuntu VMs on port 80.
    - It should listen on port 80.
  - HINT: the VPC for us-west-1 is actually provisioned in the `zone1` folder, so you'll need to reference the subnet and VPC IDs from that module's output. Here is the code block you'll need to utilize for the ALB:

    ```
    subnet_id = data.terraform_remote_state.vpc.outputs.public_subnet_ids
    vpc_id    = data.terraform_remote_state.vpc.outputs.vpc_id
    ```

  - Make the appropriate changes to your code, `cd` into your `zone2` folder, then run `terraform init` and `terraform apply`.
  - Please take a screenshot of a successful Terraform run and include that as part of your submission for the project.
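As a hedged illustration of the `_data.tf` remote-state lookup mentioned above (the backend attribute values are assumptions; use your actual bucket name and state key):

```hcl
# Hypothetical zone2/_data.tf sketch: reads the zone1 state from the us-east-2 bucket
# so that the us-west-1 VPC/subnet outputs created there can be referenced here.
data "terraform_remote_state" "vpc" {
  backend = "s3"

  config = {
    bucket = "s3-udacity-terraform-us-east-2" # the bucket you created earlier
    key    = "terraform.tfstate"              # must match the zone1 backend key
    region = "us-east-2"
  }
}
```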
- Implement basic SQL replication and establish backups. NOTE: The RDS configuration is completed under the `zone1` folder. Due to the way it was implemented in Terraform, the RDS instances for BOTH regions are managed in the same Terraform project.
  - You will need to make sure the cluster is highly available. Please see the `requirements.md` document here for details on the requirements for making the cluster HA. You will modify your code to meet those requirements. Additionally, you will need to set the following for the RDS instances (a hedged replication sketch follows this list):
    - Set up the source name and region for your RDS instance in your secondary zone.
    - You will need to add multiple availability zones for the RDS module. The starter code only contains one zone for each RDS instance in each region.
    - The code for the `rds-s` cluster is commented out in the `rds.tf` file under the `zone-1` folder. You will need to fix the `rds-s` module and then uncomment this code for it to work.
  - Please take a screenshot of a successful Terraform run and include that as part of your submission for the project.
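A minimal sketch of what the secondary (replica) cluster could look like, assuming an Aurora-style `aws_rds_cluster`; the resource names, identifiers, AZ lists, and credentials below are assumptions, and in the project the equivalent arguments live inside the starter `rds` module:

```hcl
# Hypothetical sketch only -- the starter code wraps this logic in its rds module.
provider "aws" {
  alias  = "usw1"
  region = "us-west-1"
}

# Primary cluster in us-east-2, spread across multiple availability zones.
resource "aws_rds_cluster" "rds_p" {
  cluster_identifier = "udacity-rds-p"
  engine             = "aurora-mysql"
  availability_zones = ["us-east-2a", "us-east-2b", "us-east-2c"]
  master_username    = "admin"
  master_password    = "change-me-example" # placeholder; use a secret in real code
}

# Secondary (replica) cluster in us-west-1, pointing at the primary as its source.
# Note: us-west-1 exposes only two AZs per account, and their names vary by account.
resource "aws_rds_cluster" "rds_s" {
  provider                      = aws.usw1
  cluster_identifier            = "udacity-rds-s"
  engine                        = "aurora-mysql"
  availability_zones            = ["us-west-1a", "us-west-1b"]
  replication_source_identifier = aws_rds_cluster.rds_p.arn # source name (ARN)
  source_region                 = "us-east-2"               # source region
}
```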
- Destroy it all.
  - Delete the RDS clusters manually: first the primary, then the secondary.
  - Destroy zone2 first, then zone1, using `terraform destroy`.
  - Please take a screenshot of the final output from Terraform showing the destroyed resources and include that as part of your submission for the project.
If you want to take your project even further, going above and beyond, here are 3 standout suggestions:
- Perform a failover of your application load balancer to your secondary region using Route 53 DNS (a hedged sketch follows this list)
- Fail over the RDS instance to the secondary region so that it becomes the primary target and the first region becomes the replica
- Create an additional AWS module to provision another piece of infrastructure not discussed in the project
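One possible shape for the Route 53 failover routing, as a sketch under assumed names (the hosted zone variable, record name, health check, and the references to the ALB DNS names are illustrative and not part of the starter code):

```hcl
# Hypothetical Route 53 failover records between the primary and DR ALBs.
variable "hosted_zone_id" {}         # assumed existing Route 53 hosted zone
variable "secondary_alb_dns_name" {} # DNS name of the zone2 (us-west-1) ALB

resource "aws_route53_health_check" "primary" {
  fqdn              = aws_lb.web.dns_name # primary (us-east-2) ALB from the earlier sketch
  port              = 80
  type              = "HTTP"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "primary" {
  zone_id         = var.hosted_zone_id
  name            = "app.example.com"
  type            = "CNAME"
  ttl             = 60
  set_identifier  = "primary"
  records         = [aws_lb.web.dns_name]
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "secondary" {
  zone_id        = var.hosted_zone_id
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "secondary"
  records        = [var.secondary_alb_dns_name]

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```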