diff --git a/.github/workflows/markdown-link-check.yaml b/.github/workflows/markdown-link-check.yaml index ddaec1eb3..b444b5a02 100644 --- a/.github/workflows/markdown-link-check.yaml +++ b/.github/workflows/markdown-link-check.yaml @@ -20,9 +20,9 @@ jobs: - uses: actions/checkout@v3 - uses: actions/setup-node@v3 with: - node-version: '16.x' + node-version: 20 - name: install markdown-link-check - run: npm install -g markdown-link-check@3.10.2 + run: npm install -g markdown-link-check@3.12.2 - name: markdown-link-check version run: npm list -g markdown-link-check - name: Run markdown-link-check on MD files diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 795a06e72..30ccbe350 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -7,7 +7,7 @@ repos: - id: detect-aws-credentials args: ['--allow-missing-credentials'] - repo: https://github.com/tcort/markdown-link-check - rev: v3.11.2 + rev: v3.12.2 hooks: - id: markdown-link-check args: ['-q', '-c .github/workflows/linkcheck.json'] diff --git a/docs/en/guides/choosing-a-tracing-agent.md b/docs/en/guides/choosing-a-tracing-agent.md index a6e114c8c..e7a768993 100644 --- a/docs/en/guides/choosing-a-tracing-agent.md +++ b/docs/en/guides/choosing-a-tracing-agent.md @@ -17,16 +17,18 @@ Any transaction tracing solution requires an agent and an integration into the u The SDKs included with X-Ray are part of a tightly integrated instrumentation solution offered by AWS. ADOT is part of a broader industry solution in which X-Ray is only one of many tracing solutions. You can implement end-to-end tracing in X-Ray using either approach, but it’s important to understand the differences in order to determine the most useful approach for you. -!!! success +:::info We recommend instrumenting your application with the AWS Distro for OpenTelemetry if you need the following: * The ability to send traces to multiple different tracing backends without having to re-instrument your code. For example, if you wish to shift from using the X-Ray console to [Zipkin](https://zipkin.io), then only configuration of the collector would change, leaving your application code untouched. * Support for a large number of library instrumentations for each language, maintained by the OpenTelemetry community. +::: -!!! success +:::info We recommend choosing an X-Ray SDK for instrumenting your application if you need the following: * A tightly integrated single-vendor solution. * Integration with X-Ray centralized sampling rules, including the ability to configure sampling rules from the X-Ray console and automatically use them across multiple hosts, when using Node.js, Python, Ruby, or .NET +::: \ No newline at end of file diff --git a/docs/en/index.md b/docs/en/index.md index 26fffb274..81bd5f798 100644 --- a/docs/en/index.md +++ b/docs/en/index.md @@ -27,7 +27,7 @@ This site is organized into four categories: 1. [Best practices for specific AWS tools (though these are largely fungible to other vendor products as well)](https://aws-observability.github.io/observability-best-practices/tools/cloudwatch_agent/) 1. [Curated recipes for observability with AWS](https://aws-observability.github.io/observability-best-practices/recipes/) -!!! success +:::info This site is based on real world use cases that AWS and our customers have solved for. Observability is at the heart of modern application development, and a critical consideration when operating distributed systems, such as microservices, or complex applications with many external integrations. 
We consider it to be a leading indicator of a healthy workload, and we are pleased to share our experiences with you here! diff --git a/docs/en/recipes/infra.md b/docs/en/recipes/infra.md index cda2d7c5a..a4542f8cf 100644 --- a/docs/en/recipes/infra.md +++ b/docs/en/recipes/infra.md @@ -28,7 +28,6 @@ [alb-docs]: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-monitoring.html [nlb-docs]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-monitoring.html [vpcfl]: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html -[vpcf-ws]: https://amazon-es-vpc-flowlogs.workshop.aws/en/ [eks-cp]: https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html [lambda-docs]: https://docs.aws.amazon.com/lambda/latest/operatorguide/monitoring-observability.html [rds]: rds.md diff --git a/docs/ja/recipes/infra.md b/docs/ja/recipes/infra.md index 81af21eaf..8203a5256 100644 --- a/docs/ja/recipes/infra.md +++ b/docs/ja/recipes/infra.md @@ -27,7 +27,6 @@ [alb-docs]: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-monitoring.html [nlb-docs]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-monitoring.html [vpcfl]: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html -[vpcf-ws]: https://amazon-es-vpc-flowlogs.workshop.aws/en/ [eks-cp]: https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html [lambda-docs]: https://docs.aws.amazon.com/lambda/latest/operatorguide/monitoring-observability.html [rds]: rds.md diff --git a/docusaurus/observability-best-practices/.gitignore b/docusaurus/observability-best-practices/.gitignore new file mode 100644 index 000000000..b2d6de306 --- /dev/null +++ b/docusaurus/observability-best-practices/.gitignore @@ -0,0 +1,20 @@ +# Dependencies +/node_modules + +# Production +/build + +# Generated files +.docusaurus +.cache-loader + +# Misc +.DS_Store +.env.local +.env.development.local +.env.test.local +.env.production.local + +npm-debug.log* +yarn-debug.log* +yarn-error.log* diff --git a/docusaurus/observability-best-practices/README.md b/docusaurus/observability-best-practices/README.md new file mode 100644 index 000000000..b2cb48a21 --- /dev/null +++ b/docusaurus/observability-best-practices/README.md @@ -0,0 +1,62 @@ +# Observability Best Practices + +## Welcome + +This is the source for the [AWS Observability Best Practices site](https://aws-observability.github.io/observability-best-practices/). Everyone is welcome to contribute here, not just AWS employees! + +## How to run/develop this site + +This website is built using [Docusaurus](https://docusaurus.io/), a modern static website generator. +You need to install [npm](https://docs.npmjs.com/downloading-and-installing-node-js-and-npm) as a prerequisite. + +### Installation + +``` +$ (yarn | npm ) install +``` + +### Local Development + +``` +$ yarn start [or] npm run start +``` + +This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server. + +### Build + +``` +$ yarn build [or] npm run build +``` + +This command generates static content into the `build` directory and can be served using any static contents hosting service. 
+ +### Deployment + +Using SSH with yarn: +``` +$ USE_SSH=true yarn deploy +``` + +Using SSH with npm: +``` +$ USE_SSH=true npm run deploy +``` + +Not using SSH: + +``` +$ GIT_USER=<Your GitHub username> yarn deploy +``` + +If you are using GitHub pages for hosting, this command is a convenient way to build the website and push to the `gh-pages` branch. + + + +## Security + +See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. + +## License + +This library is licensed under the MIT-0 License. See the LICENSE file. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/babel.config.js b/docusaurus/observability-best-practices/babel.config.js new file mode 100644 index 000000000..e00595dae --- /dev/null +++ b/docusaurus/observability-best-practices/babel.config.js @@ -0,0 +1,3 @@ +module.exports = { + presets: [require.resolve('@docusaurus/core/lib/babel/preset')], +}; diff --git a/docusaurus/observability-best-practices/blog/2019-05-28-first-blog-post.md b/docusaurus/observability-best-practices/blog/2019-05-28-first-blog-post.md new file mode 100644 index 000000000..02f3f81bd --- /dev/null +++ b/docusaurus/observability-best-practices/blog/2019-05-28-first-blog-post.md @@ -0,0 +1,12 @@ +--- +slug: first-blog-post +title: First Blog Post +authors: + name: Gao Wei + title: Docusaurus Core Team + url: https://github.com/wgao19 + image_url: https://github.com/wgao19.png +tags: [hola, docusaurus] +--- + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet diff --git a/docusaurus/observability-best-practices/blog/2019-05-29-long-blog-post.md b/docusaurus/observability-best-practices/blog/2019-05-29-long-blog-post.md new file mode 100644 index 000000000..26ffb1b1f --- /dev/null +++ b/docusaurus/observability-best-practices/blog/2019-05-29-long-blog-post.md @@ -0,0 +1,44 @@ +--- +slug: long-blog-post +title: Long Blog Post +authors: endi +tags: [hello, docusaurus] +--- + +This is the summary of a very long blog post, + +Use a `<!--truncate-->` comment to limit blog post size in the list view. + +<!--truncate--> + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. 
Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet + +Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque elementum dignissim ultricies. Fusce rhoncus ipsum tempor eros aliquam consequat. Lorem ipsum dolor sit amet diff --git a/docusaurus/observability-best-practices/blog/2021-08-26-welcome/docusaurus-plushie-banner.jpeg b/docusaurus/observability-best-practices/blog/2021-08-26-welcome/docusaurus-plushie-banner.jpeg new file mode 100644 index 000000000..11bda0928 Binary files /dev/null and b/docusaurus/observability-best-practices/blog/2021-08-26-welcome/docusaurus-plushie-banner.jpeg differ diff --git a/docusaurus/observability-best-practices/blog/2021-08-26-welcome/index.md b/docusaurus/observability-best-practices/blog/2021-08-26-welcome/index.md new file mode 100644 index 000000000..9455168f1 --- /dev/null +++ b/docusaurus/observability-best-practices/blog/2021-08-26-welcome/index.md @@ -0,0 +1,25 @@ +--- +slug: welcome +title: Welcome +authors: [slorber, yangshun] +tags: [facebook, hello, docusaurus] +--- + +[Docusaurus blogging features](https://docusaurus.io/docs/blog) are powered by the [blog plugin](https://docusaurus.io/docs/api/plugins/@docusaurus/plugin-content-blog). + +Simply add Markdown files (or folders) to the `blog` directory. + +Regular blog authors can be added to `authors.yml`. + +The blog post date can be extracted from filenames, such as: + +- `2019-05-30-welcome.md` +- `2019-05-30-welcome/index.md` + +A blog post folder can be convenient to co-locate blog post images: + +![Docusaurus Plushie](./docusaurus-plushie-banner.jpeg) + +The blog supports tags as well! + +**And if you don't want a blog**: just delete this directory, and use `blog: false` in your Docusaurus config. 
diff --git a/docusaurus/observability-best-practices/blog/authors.yml b/docusaurus/observability-best-practices/blog/authors.yml new file mode 100644 index 000000000..bcb299156 --- /dev/null +++ b/docusaurus/observability-best-practices/blog/authors.yml @@ -0,0 +1,17 @@ +endi: + name: Endilie Yacop Sucipto + title: Maintainer of Docusaurus + url: https://github.com/endiliey + image_url: https://github.com/endiliey.png + +yangshun: + name: Yangshun Tay + title: Front End Engineer @ Facebook + url: https://github.com/yangshun + image_url: https://github.com/yangshun.png + +slorber: + name: Sébastien Lorber + title: Docusaurus maintainer + url: https://sebastienlorber.com + image_url: https://github.com/slorber.png diff --git a/docusaurus/observability-best-practices/blog/tags.yml b/docusaurus/observability-best-practices/blog/tags.yml new file mode 100644 index 000000000..f71dd7393 --- /dev/null +++ b/docusaurus/observability-best-practices/blog/tags.yml @@ -0,0 +1,16 @@ +facebook: + label: Facebook + permalink: /facebook + description: Facebook tag description +hello: + label: Hello + permalink: /hello + description: Hello tag description +docusaurus: + label: Docusaurus + permalink: /docusaurus + description: Docusaurus tag description +hola: + label: Hola + permalink: /hola + description: Hola tag description diff --git a/docusaurus/observability-best-practices/docs/contributors.md b/docusaurus/observability-best-practices/docs/contributors.md new file mode 100644 index 000000000..ac095acc1 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/contributors.md @@ -0,0 +1,27 @@ +# Contributors + +The content on this site is maintained by Solution Architects, AWS Observability service team members and other volunteers from across the organization. Our goal is to improve the discovery of relevant best practices on how to set up and use AWS services and open source projects in the observability space. + +Recipes and content contributions in general so far are from the following +people: + +| Authors | Authors | Authors | Authors | +| ------------------- | --------------------------- | ----------------- | ------------------ | +| Alolita Sharma | Aly Shah Imtiaz | Helen Ashton | Elamaran Shanmugam | +| Dinesh Boddula | Imaya Kumar Jagannathan | Dieter Adant | Eric Hsueh | +| Jason Derrett | Kevin Lewin | Mahesh Biradar | Michael Hausenblas | +| Munish Dabra | Rich McDonough | Rob Sable | Rodrigue Koffi | +| Sheetal Joshi | Tomasz Wrzonski | Tyler Lynch | Vijayan Sarathy | +| Vikram Venkataraman | Yiming Peng | Arun Chandapillai | Alex Livingstone | +| Kiran Prakash | Bobby Hallahan | Toshal Dudhwala | Franklin Aguinaldo | +| Nirmal Mehta | Lucas Vieira Souza da Silva | William Armiros | Abhi Khanna | +| Arvind Raghunathan | Doyita Mitra | Rahul Popat | Taiki Hibira | +| Siva Guruvareddiar | | | | + + + +Note that all recipes published on this site are available via the +[MIT-0][mit0] license, a modification to the usual MIT license +that removes the requirement for attribution. + +[mit0]: https://github.com/aws/mit-0 \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/faq/adot.md b/docusaurus/observability-best-practices/docs/faq/adot.md new file mode 100644 index 000000000..c0915448f --- /dev/null +++ b/docusaurus/observability-best-practices/docs/faq/adot.md @@ -0,0 +1,26 @@ +# AWS Distro for Open Telemetry (ADOT) - FAQ + +1. **Can I use the ADOT collector to ingest metrics into AMP? 
+ **Yes, this functionality was introduced with the GA launch for metrics support in May 2022 and you can use the ADOT collector from EC2, via our EKS add-on, via our ECS side-car integration, and/or via our Lambda layers. +1. **Can I use the ADOT collector to collect logs and ingest them into Amazon CloudWatch or Amazon OpenSearch?** + Not yet, but we’re working on stabilizing logs upstream in OpenTelemetry. When the time comes, potentially later in 2023 or early 2024, we will support logs in ADOT; see also the [public roadmap entry](https://github.com/aws-observability/aws-otel-community/issues/11). +1. **Where can I find resource usage and performance details on the ADOT collector?** + We have a [Performance Report](https://aws-observability.github.io/aws-otel-collector/benchmark/report) online that we keep up to date as we release collectors. +1. **Is it possible to use ADOT with Apache Kafka?** + Yes, support for the Kafka exporter and receiver was added in the ADOT collector v0.28.0. For more details, please check the [ADOT collector documentation](https://aws-otel.github.io/docs/components/kafka-receiver-exporter). +1. **How can I configure the ADOT collector?** + The ADOT collector is configured using YAML configuration files that are stored locally. Besides that, it is possible to use configuration stored in other locations, like S3 buckets. All the supported mechanisms to configure the ADOT collector are described in detail in the [ADOT collector documentation](https://aws-otel.github.io/docs/components/confmap-providers); a minimal configuration sketch follows this list. +1. **Can I do advanced sampling in the ADOT collector?** + We’re working on it; please subscribe to the public [roadmap entry](https://github.com/aws-observability/aws-otel-collector/issues/1135) to keep up to date. +1. **Any tips on how to scale the ADOT collector?** + Yes! See the upstream OpenTelemetry docs on [Scaling the Collector](https://opentelemetry.io/docs/collector/scaling/). +1. **I have a fleet of ADOT collectors, how can I manage them?** + This is an area of active development and we expect that it will mature in 2023; see the upstream OpenTelemetry docs on [Management](https://opentelemetry.io/docs/collector/management/) for more details, specifically on the [Open Agent Management Protocol (OpAMP)](https://opentelemetry.io/docs/collector/management/#opamp). +1. **How do you monitor the health and performance of the ADOT collector?** + 1. [Monitoring the collector](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/monitoring.md) - default metrics exposed on port 8080 that can be scraped by the Prometheus receiver + 2. Using the [Node Exporter](https://prometheus.io/docs/guides/node-exporter/) - running node exporter also provides several performance and health metrics about the node, pod, and operating system the collector is running in. + 3. [Kube-state-metrics (KSM)](https://github.com/kubernetes/kube-state-metrics) - KSM can also produce interesting events about the collector. + 4. [Prometheus `up` metric](https://github.com/open-telemetry/opentelemetry-collector/pull/2918) + 5. A simple Grafana dashboard to get started: [https://grafana.com/grafana/dashboards/12553](https://grafana.com/grafana/dashboards/12553) +1. **Product FAQ** - [https://aws.amazon.com/otel/faqs/](https://aws.amazon.com/otel/faqs/) 
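To make the configuration question above concrete, here is a minimal collector configuration sketch. The receiver port, region, and Amazon Managed Service for Prometheus workspace endpoint are placeholder assumptions; verify component names and options against the ADOT collector documentation linked in this list.

```yaml
# Minimal ADOT collector configuration sketch (region and workspace ID are placeholders)
extensions:
  sigv4auth:
    region: us-east-1            # assumed region

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # default OTLP gRPC port

processors:
  batch: {}                      # batch telemetry before export

exporters:
  awsxray:
    region: us-east-1            # traces go to AWS X-Ray
  prometheusremotewrite:
    # placeholder workspace endpoint
    endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write
    auth:
      authenticator: sigv4auth

service:
  extensions: [sigv4auth]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```

Because the backends are chosen in the `exporters` and `pipelines` sections, switching tracing or metrics destinations is a configuration-only change, which is the behavior described in the tracing-agent guidance earlier in this change set.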
+ diff --git a/docusaurus/observability-best-practices/docs/faq/amg.md b/docusaurus/observability-best-practices/docs/faq/amg.md new file mode 100644 index 000000000..3c3ad71d9 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/faq/amg.md @@ -0,0 +1,58 @@ +# Amazon Managed Grafana - FAQ + +**Why should I choose Amazon Managed Grafana?** + +**[High Availability](https://docs.aws.amazon.com/grafana/latest/userguide/disaster-recovery-resiliency.html)**: Amazon Managed Grafana workspaces are highly available with multi-AZ replication. Amazon Managed Grafana also continuously monitors the health of workspaces and replaces unhealthy nodes, without impacting access to the workspaces. Amazon Managed Grafana manages the availability of compute and database nodes so customers don’t have to manage the infrastructure resources required for administration & maintenance. + +**[Data Security](https://docs.aws.amazon.com/grafana/latest/userguide/security.html)**: Amazon Managed Grafana encrypts the data at rest without any special configuration, third-party tools, or additional cost. [Data in transit](https://docs.aws.amazon.com/grafana/latest/userguide/infrastructure-security.html) is also encrypted via TLS. + +**Which AWS regions are supported?** + +The current list of supported Regions is available in the [Supported Regions section of the documentation](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html#AMG-supported-Regions). + +**We have multiple AWS accounts in multiple regions in our Organization, does Amazon Managed Grafana work for these scenarios?** + +Amazon Managed Grafana integrates with [AWS Organizations](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_introduction.html) to discover AWS accounts and resources in Organizational Units (OUs). With AWS Organizations customers can [centrally manage data source configuration and permission settings](https://docs.aws.amazon.com/grafana/latest/userguide/AMG-and-Organizations.html) for multiple AWS accounts. + +**What data sources are supported in Amazon Managed Grafana?** + +Data sources are storage backends that customers can query in Grafana to build dashboards in Amazon Managed Grafana. Amazon Managed Grafana supports [30+ built-in data sources](https://docs.aws.amazon.com/grafana/latest/userguide/AMG-data-sources-builtin.html) including AWS native services like Amazon CloudWatch, Amazon OpenSearch Service, AWS IoT SiteWise, AWS IoT TwinMaker, Amazon Managed Service for Prometheus, Amazon Timestream, Amazon Athena, Amazon Redshift, AWS X-Ray and many others. Additionally, [15+ other data sources](https://docs.aws.amazon.com/grafana/latest/userguide/AMG-data-sources-enterprise.html) are also available for upgraded workspaces in Grafana Enterprise. + +**Data sources of my workloads are in private VPCs. How do I connect them to Amazon Managed Grafana securely?** + +Private [data sources within a VPC](https://docs.aws.amazon.com/grafana/latest/userguide/AMG-configure-vpc.html) can be connected to Amazon Managed Grafana through AWS PrivateLink to keep the traffic secure. 
Further access control to Amazon Managed Grafana service from the [VPC endpoints](https://docs.aws.amazon.com/grafana/latest/userguide/AMG-configure-nac.html) can be restricted by attaching an [IAM resource policy](https://docs.aws.amazon.com/grafana/latest/userguide/VPC-endpoints.html#controlling-vpc) for [Amazon VPC endpoints](https://docs.aws.amazon.com/whitepapers/latest/aws-privatelink/what-are-vpc-endpoints.html). + +**What User Authentication mechanism is available in Amazon Managed Grafana?** + +In Amazon Managed Grafana workspace, [users are authenticated to the Grafana console](https://docs.aws.amazon.com/grafana/latest/userguide/authentication-in-AMG.html) by single sign-on using any IDP that supports Security Assertion Markup Language 2.0 (SAML 2.0) or AWS IAM Identity Center (successor to AWS Single Sign-On). + +> Related blog: [Fine-grained access control in Amazon Managed Grafana using Grafana Teams](https://aws.amazon.com/blogs/mt/fine-grained-access-control-in-amazon-managed-grafana-using-grafana-teams/) + +**What kind of automation support is available for Amazon Managed Grafana?** + +Amazon Managed Grafana is [integrated with AWS CloudFormation](https://docs.aws.amazon.com/grafana/latest/userguide/creating-resources-with-cloudformation.html), which helps customers in modeling and setting up AWS resources so that customers can spend less time creating and managing resources and infrastructure in AWS. With [AWS CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) customers can reuse templates to set up Amazon Managed Grafana resources consistently and repeatedly. Amazon Managed Grafana also has [API](https://docs.aws.amazon.com/grafana/latest/APIReference/Welcome.html)available which supports customers in automating through [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html) or integrating with software/products. Amazon Managed Grafana workspaces has [HTTP APIs](https://docs.aws.amazon.com/grafana/latest/userguide/Using-Grafana-APIs.html) for automation and integration support. + +> Related blog: [Announcing Private VPC data source support for Amazon Managed Grafana](https://aws.amazon.com/blogs/mt/announcing-private-vpc-data-source-support-for-amazon-managed-grafana/) + +**My Organization uses Terraform for automation. Does Amazon Managed Grafana support Terraform?** +Yes, [Amazon Managed Grafana supports](https://aws-observability.github.io/observability-best-practices/recipes/recipes/amg-automation-tf/) Terraform for [automation](https://registry.terraform.io/modules/terraform-aws-modules/managed-service-grafana/aws/latest) + +> Example: [Reference implementation for Terraform support](https://github.com/aws-observability/terraform-aws-observability-accelerator/tree/main/examples/managed-grafana-workspace) + +**I am using commonly used Dashboards in my current Grafana setup. Is there a way to use them on Amazon Managed Grafana rather than re-creating again?** + +Amazon Managed Grafana supports [HTTP APIs](https://docs.aws.amazon.com/grafana/latest/userguide/Using-Grafana-APIs.html) that allow you to easily automate deployment and management of Dashboards, users and much more. You can use these APIs in your GitOps/CICD processes to automate management of these resources. 
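As a hedged illustration of the CloudFormation and infrastructure-as-code integration described above, the sketch below creates a workspace that authenticates through IAM Identity Center. The workspace name and data source list are assumptions, and the property set should be checked against the current AWS::Grafana::Workspace reference before use.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Sketch of an Amazon Managed Grafana workspace managed as code
Resources:
  ObservabilityWorkspace:
    Type: AWS::Grafana::Workspace
    Properties:
      Name: team-observability            # assumed workspace name
      AccountAccessType: CURRENT_ACCOUNT  # this account only (ORGANIZATION is the multi-account option)
      AuthenticationProviders:
        - AWS_SSO                         # IAM Identity Center single sign-on
      PermissionType: SERVICE_MANAGED     # let the service manage data source permissions
      DataSources:
        - CLOUDWATCH
        - PROMETHEUS
Outputs:
  WorkspaceId:
    Value: !Ref ObservabilityWorkspace
```

The same workspace definition can be expressed with the Terraform module referenced above; the choice comes down to which automation toolchain the organization already uses.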
+ +**Does Amazon Managed Grafana support Alerts?** + +[Amazon Managed Grafana alerting](https://docs.aws.amazon.com/grafana/latest/userguide/alerts-overview.html) provides customers with robust and actionable alerts that help learn about problems in the systems in near real time, minimizing disruption to services. Grafana includes access to an updated alerting system, Grafana alerting, that centralizes alerting information in a single, searchable view. + +**My Organization requires all actions be recorded for audits. Can Amazon Managed Grafana events be recorded?** + +Amazon Managed Grafana is integrated with [AWS CloudTrail](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html), which provides a record of actions taken by a user, a role, or an AWS service in Amazon Managed Grafana. CloudTrail captures all [API calls for Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/logging-using-cloudtrail.html)as events. The calls that are captured include calls from the Amazon Managed Grafana console and code calls to the Amazon Managed Grafana API operations. + +**What more information is available?** + +For additional information on Amazon Managed Grafana customers can read the AWS [Documentation](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html), go through the AWS Observability Workshop on [Amazon Managed Grafana](https://catalog.workshops.aws/observability/en-US/aws-managed-oss/amg) and also check the [product page](https://aws.amazon.com/grafana/) to know the [features](https://aws.amazon.com/grafana/features/?nc=sn&loc=2), [pricing](https://aws.amazon.com/grafana/pricing/?nc=sn&loc=3) details, latest [blog posts](https://aws.amazon.com/grafana/resources/?nc=sn&loc=4&msg-blogs.sort-by=item.additionalFields.createdDate&msg-blogs.sort-order=desc#Latest_blog_posts) and [videos](https://aws.amazon.com/grafana/resources/?nc=sn&loc=4&msg-blogs.sort-by=item.additionalFields.createdDate&msg-blogs.sort-order=desc#Videos). + +**Product FAQ** [https://aws.amazon.com/grafana/faqs/](https://aws.amazon.com/grafana/faqs/) diff --git a/docusaurus/observability-best-practices/docs/faq/amp.md b/docusaurus/observability-best-practices/docs/faq/amp.md new file mode 100644 index 000000000..85544f62e --- /dev/null +++ b/docusaurus/observability-best-practices/docs/faq/amp.md @@ -0,0 +1,28 @@ +# Amazon Managed Service for Prometheus - FAQ + +1. **Which AWS Regions are supported currently and is it possible to collect metrics from other regions?** See our [documentation](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html) for updated list of Regions that we support. We plan to support all commercial regions in 2023. Please let us know which regions you would like so that we can better prioritize our existing Product Feature Requests (PFRs). You can always collect data from any regions and send it to a specific region that we support. Here’s a blog for more details: [Cross-region metrics collection for Amazon Managed Service for Prometheus](https://aws.amazon.com/blogs/opensource/set-up-cross-region-metrics-collection-for-amazon-managed-service-for-prometheus-workspaces/). +1. 
**How long does it take to see metering and/or billing in Cost Explorer or [CloudWatch as AWS billing charges](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/gs_monitor_estimated_charges_with_cloudwatch.html)?** + We meter blocks of ingested metric samples as soon as they are uploaded to S3 every 2 hours. It can take up to 3 hours to see metering and charges reported for Amazon Managed Service for Prometheus. +1. **As far as I can see the Prometheus Service is only able to scrape metrics from a cluster (EKS/ECS). Is that correct?** + We apologize for the lack of documentation for other compute environments. You can use a Prometheus server to scrape [Prometheus metrics from EC2](https://aws.amazon.com/blogs/opensource/using-amazon-managed-service-for-prometheus-to-monitor-ec2-environments/) and any other compute environment where you can install a Prometheus server today, as long as you configure remote write and set up the [AWS SigV4 proxy](https://github.com/awslabs/aws-sigv4-proxy). The [EC2 blog](https://aws.amazon.com/blogs/opensource/using-amazon-managed-service-for-prometheus-to-monitor-ec2-environments/) has a section “Running aws-sigv4-proxy” that shows you how to run it. We do need to add more documentation to help our customers simplify how to run AWS SigV4 on other compute environments. +1. **How would one connect this service to Grafana? Is there some documentation about this?** + We use the default [Prometheus data source available in Grafana](https://grafana.com/docs/grafana/latest/datasources/prometheus/) to query Amazon Managed Service for Prometheus using PromQL. Here’s some documentation and a blog that will help you get started: + 1. [Service docs](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-onboard-query.html) + 1. [Grafana setup on EC2](https://aws.amazon.com/blogs/opensource/setting-up-grafana-on-ec2-to-query-metrics-from-amazon-managed-service-for-prometheus/) +1. **What are some of the best practices to reduce the number of samples being sent to Amazon Managed Service for Prometheus?** + To reduce the number of samples being ingested into Amazon Managed Service for Prometheus, customers can extend their scrape interval (e.g., change from 30s to 1min) or decrease the number of series they are scraping. Changing the scrape interval will have a more dramatic impact on the number of samples than decreasing the number of series, with doubling the scrape interval halving the volume of samples ingested. +1. **How to send CloudWatch metrics to Amazon Managed Service for Prometheus?** + We recommend utilizing [CloudWatch metric streams to send CloudWatch metrics to Amazon Managed Service for Prometheus](https://aws-observability.github.io/observability-best-practices/recipes/recipes/lambda-cw-metrics-go-amp/). Some possible shortcomings of this integration are: + 1. A Lambda function is required to call the Amazon Managed Service for Prometheus APIs, + 1. No ability to enrich CloudWatch metrics with metadata (e.g., with AWS tags) before ingesting them to Amazon Managed Service for Prometheus, + 1. Metrics can only be filtered by namespace (not granular enough). As an alternative, customers can utilize Prometheus exporters to send CloudWatch metrics data to Amazon Managed Service for Prometheus: (1) CloudWatch Exporter: Java-based scraping that uses the CW ListMetrics and GetMetricStatistics (GMS) APIs. 
+ + [**Yet Another CloudWatch Exporter (YACE)**](https://github.com/nerdswords/yet-another-cloudwatch-exporter) is another option to get metrics from CloudWatch into Amazon Managed Service for Prometheus. This is a Go based tool that uses the CW ListMetrics, GetMetricData (GMD), and GetMetricStatistics (GMS) APIs. Some disadvantages in using this could be that you will have to deploy the agent and have to manage the life-cycle yourself which has to be done thoughtfully. + +1. **What version of Prometheus are you compatible with? + **Amazon Managed Service for Prometheus is compatible with [Prometheus 2.x](https://github.com/prometheus/prometheus/blob/main/RELEASE.md). Amazon Managed Service for Prometheus is based on the open source [CNCF Cortex project](https://cortexmetrics.io/) as its data plane. Cortex strives to be 100% API compatible with Prometheus (under /prometheus/* and /api/prom/*). Amazon Managed Service for Prometheus supports Prometheus-compatible PromQL queries and Remote write metric ingestion and the Prometheus data model for existing metric types including Gauge, Counters, Summary, and Histogram. We do not currently expose [all Cortex APIs](https://cortexmetrics.io/docs/api/). The list of compatible APIs we support can be [found here](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-APIReference.html). Customers can work with their account team to open new or influence existing Product Features Requests (PFRs) if we are missing any features required from Amazon Managed Service for Prometheus. +1. **What collector do you recommend for ingesting metrics into Amazon Managed Service for Prometheus? Should I utilize Prometheus in Agent mode? + **We support the usage of Prometheus servers inclusive of agent mode, the OpenTelemetry agent, and the AWS Distro for OpenTelemetry agent as agents that customers can use to send metrics data to Amazon Managed Service for Prometheus. The AWS Distro for OpenTelemetry is a downstream distribution of the OpenTelemetry project packaged and secured by AWS. Any of the three should be fine, and you’re welcome to pick whichever best suits your individual team’s needs and preferences. +1. **How does Amazon Managed Service for Prometheus’s performance scale with the size of a workspace?** + Currently, Amazon Managed Service for Prometheus supports up to 200M active time series in a single workspace. When we announce a new max limit, we’re ensuring that the performance and reliability properties of the service continue to be maintained across ingest and query. Queries across the same size dataset should not see a performance degradation regardless of the number of active series in a workspace. +1. **Product FAQ** [https://aws.amazon.com/prometheus/faqs/](.) diff --git a/docusaurus/observability-best-practices/docs/faq/cloudwatch.md b/docusaurus/observability-best-practices/docs/faq/cloudwatch.md new file mode 100644 index 000000000..9b9b17a4f --- /dev/null +++ b/docusaurus/observability-best-practices/docs/faq/cloudwatch.md @@ -0,0 +1,183 @@ +# Amazon CloudWatch - FAQ + +**Why should I choose Amazon CloudWatch?** + +Amazon CloudWatch is an AWS cloud native service which provides unified observability on a single platform for monitoring AWS cloud resources and the applications you run on AWS. 
Amazon CloudWatch can be used to collect monitoring and operational data in the form of [logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html), track [metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) and [events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html), and set [alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html). It also provides a [unified view](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) of AWS resources, applications, and services that run on AWS and [on-premises servers](https://aws.amazon.com/blogs/mt/how-to-monitor-hybrid-environment-with-aws-services/). Amazon CloudWatch helps you gain system-wide visibility into resource utilization, application performance, and operational health of your workloads. Amazon CloudWatch provides [actionable insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Insights-Sections.html) for AWS, hybrid, and on-premises applications and infrastructure resources. [Cross-account observability](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Unified-Cross-Account.html) is an addition to CloudWatch’s unified observability capability. + +**Which AWS Services are natively integrated with Amazon CloudWatch and Amazon CloudWatch Logs?** + +Amazon CloudWatch natively integrates with more than 70 AWS services, allowing customers to collect infrastructure metrics for simplified monitoring and scalability with no additional setup. Please check the documentation for a complete list of supported [AWS services that publish CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/aws-services-cloudwatch-metrics.html). Currently, more than 30 AWS services publish logs to CloudWatch. Please check the documentation for a complete list of supported [AWS services that publish logs to CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/aws-services-sending-logs.html). + +**Where do I get the list of all the metrics that AWS services publish to Amazon CloudWatch?** + +The list of all the [AWS services that publish metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/aws-services-cloudwatch-metrics.html) to Amazon CloudWatch is in the AWS documentation. + +**Where do I get started with collecting & monitoring metrics in Amazon CloudWatch?** + +[Amazon CloudWatch collects metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) from various AWS services, which can be viewed through the [AWS Management Console, AWS CLI, or an API](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/viewing_metrics_with_cloudwatch.html). Amazon CloudWatch collects [available metrics](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html) for Amazon EC2 Instances. For additional custom metrics, customers can make use of the unified CloudWatch agent to collect and monitor them. + +> Related AWS Observability Workshop: [Metrics](https://catalog.workshops.aws/observability/en-US/aws-native/metrics) + +**My Amazon EC2 Instance requires a very granular level of monitoring, what do I do?** + +By default, Amazon EC2 sends metric data to CloudWatch in 5-minute periods as Basic Monitoring for an instance. 
To send metric data for your instance to CloudWatch in 1-minute periods, [detailed monitoring](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch-new.html) can be enabled on the instance. + +**I want to publish own metrics for my application. Is there an option?** + +Customers can also publish their own [custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) to CloudWatch using the API or CLI through standard resolution of 1 minute granularity or high resolution granularity down to 1 sec interval. + +The CloudWatch agent also supports collecting custom metrics from EC2 instances in specialized scenarios like [Network performance metrics for EC2 instances](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-network-performance.html) running on Linux that use the Elastic Network Adapter (ENA), [NVIDIA GPU metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-NVIDIA-GPU.html) from Linux servers and Process metrics using procstat plugin from [individual processes](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-procstat-process-metrics.html) on Linux & Windows servers. + +> Related AWS Observability Workshop: [Public custom metrics](https://catalog.workshops.aws/observability/en-US/aws-native/metrics/publishmetrics) + +**What more support is available for collecting custom metrics through Amazon CloudWatch agent?** + +Custom metrics from applications or services can be retrieved using the unified CloudWatch agent with support for [StatsD](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-custom-metrics-statsd.html)or [collectd](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-custom-metrics-collectd.html)protocols. StatsD is a popular open-source solution that can gather metrics from a wide variety of applications. StatsD is especially useful for instrumenting own metrics, which supports both Linux and Windows based servers. collectd protocol is a popular open-source solution supported only on Linux Servers with plugins that can gather system statistics for a wide variety of applications. + +**My workload contains lot of ephemeral resources and generates logs in high-cardinality, what is the recommended approach collecting and measuring the metrics and logs?** + +[CloudWatch embedded metric format](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html) enables customers to ingest complex high-cardinality application data in the form of logs and to generate actionable metrics from ephemeral resources such as Lambda functions and containers. By doing so, customers can embed custom metrics alongside detailed log event data without having to instrument or maintain separate code, while gaining powerful analytical capabilities on your log data and CloudWatch can automatically extract the custom metrics to help visualize the data and set alarm on them for real-time incident detection. 
+ +> Related AWS Observability Workshop: [Embedded Metric Format](https://catalog.workshops.aws/observability/en-US/aws-native/metrics/emf) + +**Where do I get started for collecting & monitoring logs to Amazon CloudWatch?** + +[Amazon CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) helps customers monitor and troubleshoot systems and applications in near real time using existing system, application and custom log files. Customers can install the [unified CloudWatch Agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_GettingStarted.html) to collect [logs from Amazon EC2 Instances and on-premise servers](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) to CloudWatch. + +> Related AWS Observability Workshop: [Log Insights](https://catalog.workshops.aws/observability/en-US/aws-native/logs/logsinsights) + +**What is CloudWatch agent and why should I use that?** + +The [Unified CloudWatch Agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) is an open-source software under the MIT license which supports most operating systems utilizing x86-64 and ARM64 architectures. The CloudWatch Agent helps collect system-level metrics from Amazon EC2 Instances & on-premise servers in a hybrid environment across operating systems, retrieve custom metrics from applications or services and collect logs from Amazon EC2 instances and on-premises servers. + +**I’ve all scales of installation required in my environment, so how can the CloudWatch agent be installed normally and using automation?** + +On all the supported operating systems including Linux and Windows Servers, customers can download and [install the CloudWatch agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-EC2-Instance.html) using the [command line](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/installing-cloudwatch-agent-commandline.html), using AWS [Systems Manager](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/installing-cloudwatch-agent-ssm.html), or using an AWS [CloudFormation template](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent-New-Instances-CloudFormation.html). You can also install the [CloudWatch agent on on-premise servers](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-premise.html) for monitoring. + +**We have multiple AWS accounts in multiple regions in our Organization, does Amazon CloudWatch work for these scenarios.** + +Amazon CloudWatch provides [cross-account observability](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Unified-Cross-Account.html)which helps customers monitor and troubleshoot health of resources and applications that span multiple accounts within a region. Amazon CloudWatch also provides a [cross-account, cross-region dashboard](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Cross-Account-Cross-Region.html). With this functionality customers can gain visibility and insights of their multi-account, multi-region resources and workloads. 
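The cross-account setup described above is commonly expressed as code. The sketch below shows the two CloudFormation pieces involved: a sink in the central monitoring account and a link in a source account. In reality the two resources are deployed in different accounts, and the sink also needs a policy allowing the source accounts to attach to it; they are shown side by side here only for illustration, and the names and sink ARN are placeholders.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Sketch of CloudWatch cross-account observability resources (deploy in separate accounts)
Resources:
  # Monitoring account: the sink that source accounts link to
  MonitoringSink:
    Type: AWS::Oam::Sink
    Properties:
      Name: central-observability-sink        # assumed sink name

  # Source account: the link that shares telemetry with the sink
  SourceAccountLink:
    Type: AWS::Oam::Link
    Properties:
      LabelTemplate: "$AccountName"           # how the source account is labeled in the monitoring account
      ResourceTypes:
        - AWS::CloudWatch::Metric
        - AWS::Logs::LogGroup
        - AWS::XRay::Trace
      SinkIdentifier: arn:aws:oam:us-east-1:111122223333:sink/EXAMPLE   # placeholder sink ARN
```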
+ +**What kind of automation support is available for Amazon CloudWatch?** + +Apart from accessing Amazon CloudWatch through the AWS Management Console customers can also access the service via API, [AWS command-line interface (CLI)](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) and [AWS SDKs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/sdk-general-information-section.html). [CloudWatch API](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/Welcome.html) for metrics & dashboards help in automating through [AWS CLI](https://docs.aws.amazon.com/AmazonCloudWatch/latest/cli/Welcome.html)or integrating with software/products so that you can spend less time managing or administering the resources and applications. [CloudWatch API](https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/Welcome.html) for logs along with [AWS CLI](https://docs.aws.amazon.com/cli/latest/reference/logs/index.html) are also available separately. [Code examples for CloudWatch using AWS SDKs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/service_code_examples.html) are available for customers for additional reference. + +**I want to get started with monitoring resources quickly, what is the recommended approach?** + +Automatic Dashboards in CloudWatch are available in all AWS public regions which provides an aggregated view of the health and performance of all AWS resources. This helps customers quickly get started with monitoring, resource-based view of metrics and alarms, and easily drill-down to understand the root cause of performance issues. [Automatic Dashboards](https://docs.aws.amazon.com/prescriptive-guidance/latest/implementing-logging-monitoring-cloudwatch/cloudwatch-dashboards-visualizations.html) are pre-built with AWS service recommended best practices, remain resource aware, and dynamically update to reflect the latest state of important performance metrics. + +Related AWS Observability Workshop: [Automatic Dashboards](https://catalog.workshops.aws/observability/en-US/aws-native/dashboards/autogen-dashboard) + +**I want to customize what I want to monitor in CloudWatch, what is the recommended approach?** + +With [Custom Dashboard](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create_dashboard.html) customers can create as many additional dashboards as they want with different widgets and customize it accordingly. When creating a custom dashboard, there are a variety of widget types that are available to pick and choose for customization. + +Related AWS Observability Workshop: [Dashboarding](https://catalog.workshops.aws/observability/en-US/aws-native/ec2-monitoring/dashboarding) + +**I’ve built few custom dashboards , is there a way to share it?** + +Yes, [sharing of CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-dashboard-sharing.html) is possible. There are three ways to share. Sharing a single dashboard publicly by allowing anyone with access to the link to view the dashboard. Sharing a single dashboard privately by specifying the email addresses of the people who are allowed to view the dashboard. Sharing all of the CloudWatch dashboards in the account by specifying a third-party single sign-on (SSO) provider for dashboard access. 
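Connecting the automation and dashboard questions above, a custom dashboard can also be kept in version control and deployed with CloudFormation rather than built by hand in the console. The following is only a sketch; the dashboard name, region, and instance ID are placeholder assumptions.

```yaml
Resources:
  ServiceDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: payments-service   # assumed dashboard name
      # The dashboard body is a JSON document embedded as a string
      DashboardBody: |
        {
          "widgets": [
            {
              "type": "metric",
              "x": 0, "y": 0, "width": 12, "height": 6,
              "properties": {
                "title": "EC2 CPU utilization",
                "region": "us-east-1",
                "metrics": [
                  [ "AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0" ]
                ],
                "stat": "Average",
                "period": 300
              }
            }
          ]
        }
```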
+ +> Related AWS Observability Workshop: [Sharing CloudWatch Dashboards](https://catalog.workshops.aws/observability/en-US/aws-native/dashboards/sharingdashboard) + +**I want to improve the observability of my application including the aws resources underneath, how can I accomplish?** + +[Amazon CloudWatch Application Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-application-insights.html) facilitates observability for your applications along with the underlying AWS resources like SQL Server database, .Net based web (IIS) stack, application servers, OS, load balancers, queues, etc. It helps customers identify and set up key metrics and logs across application resources & technology stack. By doing so, it reduces mean time to repair (MTTR) & troubleshoot application issues faster. + +> Additional details in FAQ: [AWS resource & custom metrics monitoring](https://aws.amazon.com/cloudwatch/faqs/#AWS_resource_.26_custom_metrics_monitoring) + +**My Organization is open-source centric, does Amazon CloudWatch support monitoring & observability through open-source technologies.** + +For collecting metrics and traces, [AWS Distro for OpenTelemetry (ADOT) Collector](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-open-telemetry.html) along with the CloudWatch agent can be installed side-by-side on Amazon EC2 Instance and OpenTelemetry SDKs can be used to collect application traces & metrics from your workloads running on Amazon EC2 Instances. + +To support OpenTelemetry metrics in Amazon CloudWatch, [AWS EMF Exporter for OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/awsemfexporter) converts OpenTelemetry format metrics to CloudWatch Embedded Metric Format(EMF) which enables applications integrated in OpenTelemetry metrics to be able to send high-cardinality [application metrics to CloudWatch](https://aws-otel.github.io/docs/getting-started/adot-eks-add-on/config-cloudwatch). + +For logs, Fluent Bit helps create an easy extension point for streaming [logs from Amazon EC2](https://docs.fluentbit.io/manual/pipeline/outputs/cloudwatch) to AWS services including Amazon CloudWatch for log retention and analytics. The newly-launched [Fluent Bit plugin](https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit#new-higher-performance-core-fluent-bit-plugin) can route logs to Amazon CloudWatch. + +For Dashboards, Amazon Managed Grafana can be added with [Amazon CloudWatch as a data source](https://docs.aws.amazon.com/grafana/latest/userguide/using-amazon-cloudwatch-in-AMG.html) by using the AWS data source configuration option in the Grafana workspace console. This feature simplifies adding CloudWatch as a data source by discovering existing CloudWatch accounts and manage the configuration of the authentication credentials that are required to access CloudWatch. + +**Our workload is already built to collect metrics using Prometheus from the environment. Can I continue using the same methodology.** + +Customers can choose to have an all open-source setup for their observability needs. For which, AWS Distro for OpenTelemetry (ADOT) Collector can be configured to scrape from a Prometheus-instrumented application and send the metrics to Prometheus Server or Amazon Managed Prometheus. 
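For the open-source path just described, a Prometheus server (or a collector with an equivalent receiver) needs little more than a scrape job and a remote-write section. A minimal `prometheus.yml` sketch follows; the job name, target, workspace URL, and region are placeholder assumptions.

```yaml
# Sketch of prometheus.yml: scrape a local target and remote-write to an
# Amazon Managed Service for Prometheus workspace (placeholders throughout)
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: my-application            # assumed job name
    static_configs:
      - targets: ['localhost:8080']     # assumed app exposing /metrics

remote_write:
  - url: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write
    sigv4:
      region: us-east-1                 # sign requests with AWS SigV4 credentials
```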
+ +The CloudWatch agent on EC2 instances can be installed & configured with [Prometheus to scrape metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-PrometheusEC2.html) for monitoring in CloudWatch. This can be helpful to customers who prefer container workloads on EC2 and require custom metrics that are compatible with open source Prometheus monitoring. + +CloudWatch [Container Insights monitoring for Prometheus](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights-Prometheus.html) automates the discovery of Prometheus metrics from containerized systems and workloads. Discovering Prometheus metrics is supported for Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS) and Kubernetes clusters running on Amazon EC2 instances. + +**My workloads contain microservices compute, especially EKS/Kubernetes related containers, how do I use Amazon CloudWatch to gain insights into the environment?** + +Customers can use [CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html) to collect, aggregate, and summarize metrics & logs from containerized applications and microservices running on [Amazon Elastic Kubernetes Service (Amazon EKS)](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-EKS.html) or Kubernetes platforms on Amazon EC2. [Container Insights](https://aws.amazon.com/cloudwatch/faqs/#Container_Monitoring) also supports collecting metrics from clusters deployed on Fargate for Amazon EKS. CloudWatch automatically [collects metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics.html) for many resources, such as CPU, memory, disk & network and also [provides diagnostic information](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-reference.html), such as container restart failures, to help isolate issues and resolve them quickly. + +> Related AWS Observability Workshop: [Container Insights on EKS](https://catalog.workshops.aws/observability/en-US/aws-native/insights/containerinsights/eks) + +**My workloads contain microservices compute, especially ECS related containers, how do I use Amazon CloudWatch to gain insights into the environment?** + +Customers can use [CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html) to collect, aggregate, and summarize metrics & logs from containerized applications and microservices running on [Amazon Elastic Container Service (Amazon ECS)](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-ECS.html) or container platforms on Amazon EC2. [Container Insights](https://aws.amazon.com/cloudwatch/faqs/#Container_Monitoring) also supports collecting metrics from clusters deployed on Fargate for Amazon ECS. CloudWatch automatically [collects metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics.html) for many resources, such as CPU, memory, disk & network and also [provides diagnostic information](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-reference.html), such as container restart failures, to help isolate issues and resolve them quickly. 
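One way to switch on Container Insights for an existing EKS cluster is the Amazon CloudWatch Observability EKS add-on, sketched below with CloudFormation. This assumes the `amazon-cloudwatch-observability` add-on is available for your cluster version and Region, and that the node role (or an IRSA role) already carries the CloudWatchAgentServerPolicy; the cluster name is a placeholder.

```yaml
Resources:
  ContainerInsightsAddon:
    Type: AWS::EKS::Addon
    Properties:
      ClusterName: my-eks-cluster                   # assumed cluster name
      AddonName: amazon-cloudwatch-observability    # installs the CloudWatch agent and Fluent Bit
```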
+ +> Related AWS Observability Workshop: [Container Insights on ECS](https://catalog.workshops.aws/observability/en-US/aws-native/insights/containerinsights/ecs) + +**My workloads contain serverless compute, especially AWS Lambda, how do I use Amazon CloudWatch to gain insights into the environment?** + +Customers can use [CloudWatch Lambda Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Lambda-Insights.html) for monitoring and troubleshooting serverless applications running on AWS Lambda. [CloudWatch Lambda Insights](https://aws.amazon.com/cloudwatch/faqs/#Lambda_Monitoring) collects, aggregates, and summarizes system-level metrics including CPU time, memory, disk, and network & also collects, aggregates, and summarizes diagnostic information such as cold starts and Lambda worker shutdowns to help customers isolate issues with Lambda functions and resolve them quickly. + +> Related AWS Observability Workshop: [Lambda Insights](https://catalog.workshops.aws/observability/en-US/aws-native/insights/lambdainsights) + +**I aggregate lot of logs into Amazon CloudWatch logs, how do I gain observability into those data?** + +[CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) enables customers to interactively search, analyze log data and have customers perform queries to efficiently and effectively respond to operational issues in Amazon CloudWatch Logs. If an issue occurs, customers can use [CloudWatch Logs Insights](https://aws.amazon.com/cloudwatch/faqs/#Log_analytics) to identify potential causes and validate deployed fixes. + +**How do I query logs in Amazon CloudWatch Logs?** + +CloudWatch Logs Insights in Amazon CloudWatch Logs use a [query language](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html) to query log groups. + +**How do I manage logs stored in Amazon CloudWatch Logs for cost optimization, compliance retention or for additional processing?** + +By default, [LogGroups](https://aws.amazon.com/cloudwatch/faqs/#Log_management)Amazon CloudWatch Logs are[kept indefinitely and never expire](https://docs.aws.amazon.com/managedservices/latest/userguide/log-customize-retention.html). Customers can adjust the retention policy of each log group to choose a retention period between one day and 10 years, depending up on how long they want to retain the logs to optimize cost or for compliance purposes. + +Customers can export log data from [log groups to Amazon S3 bucket](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3Export.html) and use this data in custom processing and analysis, or to load onto other systems. + +Customers can also configure log groups in CloudWatch Logs to [stream data to your Amazon OpenSearch Service](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_OpenSearch_Stream.html) cluster in near real-time through a CloudWatch Logs subscription. By doing so, it helps customers to perform interactive log analytics, real-time application monitoring, search, and more. 
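Retention is often declared as code alongside the workload rather than adjusted in the console. A minimal CloudFormation sketch, with an assumed log group name and a 90-day retention period, looks like this:

```yaml
Resources:
  ApplicationLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /my-app/production   # assumed log group name
      RetentionInDays: 90                # must be one of the retention values CloudWatch Logs accepts
```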
+
+**My workloads generate logs that could contain sensitive data. Is there a way to protect them in Amazon CloudWatch?**
+
+Customers can use the [log data protection feature](https://aws.amazon.com/cloudwatch/faqs/#Log_data_protection) in CloudWatch Logs, which lets them [define their own rules and policies to automatically](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/mask-sensitive-log-data.html#mask-sensitive-log-data-start) detect and mask sensitive data within logs collected from systems and applications.
+
+> Related AWS Observability Workshop: [Data Protection](https://catalog.workshops.aws/observability/en-US/aws-native/logs/dataprotection)
+
+**I would like to know about anomaly bands or unexpected changes in my systems and applications. How can Amazon CloudWatch alert me when they occur?**
+
+[Amazon CloudWatch Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) applies statistical and machine learning algorithms to continuously analyze a single time series from systems and applications, determine normal baselines, and surface anomalies with minimal user intervention. The algorithms create an anomaly detection model that generates a range of expected values representing normal metric behavior. Customers can [create alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Anomaly_Detection_Alarm.html) based on the analysis of past metric data and a value set for the anomaly threshold.
+
+> Related AWS Observability Workshop: [Anomaly Detection](https://catalog.workshops.aws/observability/en-US/aws-native/metrics/alarms/anomalydetection)
+
+**I’ve set up metric alarms in Amazon CloudWatch, but I’m getting frequent alarm noise. How can I control and fine-tune this?**
+
+Customers can combine multiple alarms into alarm hierarchies as a [composite alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Composite_Alarm.html), which reduces alarm noise by triggering just once when multiple [alarms](https://aws.amazon.com/cloudwatch/faqs/#Alarms) fire simultaneously. Composite alarms provide an overall state for a group of resources, such as an application, an AWS Region, or an Availability Zone.
+
+> Related AWS Observability Workshop: [Alarms](https://catalog.workshops.aws/observability/en-US/aws-native/metrics/alarms)
+
+**My internet-facing workload is experiencing performance and availability issues. How do I troubleshoot?**
+
+[Amazon CloudWatch Internet Monitor](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-InternetMonitor.html) provides visibility into how internet issues impact the performance and availability of traffic between your applications hosted on AWS and your end users. With [Internet Monitor](https://aws.amazon.com/cloudwatch/faqs/#Internet_Monitoring), you can quickly identify what's impacting your application's performance and availability and track down and address issues, which can significantly reduce the time it takes to diagnose internet problems.
+
+**I’ve got my workload on AWS and I want to be notified before end users experience an impact or latency when accessing the application.
How do I get better visibility and improve the observability of my customer facing workload?** + +Customers can use [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) to create canaries, configurable scripts that run on a schedule, to monitor your endpoints and APIs. Canaries follow the same routes and perform the same actions as a customer, which makes it possible to continually verify end user experience even when there are no live traffic to your applications. Canaries help you discover issues even before your customers do. Canaries check the availability and latency of endpoints and can store load time data and screenshots of the UI as rendered by a headless Chromium browser. + +> Related AWS Observability Workshop: [CloudWatch Synthetics](https://catalog.workshops.aws/observability/en-US/aws-native/app-monitoring/synthetics) + +**I've my workload on AWS and I want to observe end user experience by identifying client-side performance issues and action a faster resolution if there are any real-time issues.** + +[CloudWatch RUM](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html) can perform real user monitoring to collect and view client-side data about your web application performance from actual user sessions in near real time. This collected data helps quickly identify and debug client-side performance issues and also helps to visualize and analyze page load times, client-side errors, and user behavior. When viewing this data, customers can see it all aggregated together and also see breakdowns by the browsers and devices that your customers use. CloudWatch RUM helps visualize anomalies in your application performance and find relevant debugging data such as error messages, stack traces, and user sessions. + +> Related AWS Observability Workshop: [CloudWatch RUM](https://catalog.workshops.aws/observability/en-US/aws-native/app-monitoring/rum) + +**My Organization requires all actions be recorded for audits. Can Amazon CloudWatch events be recorded?** + +Amazon CloudWatch is integrated with [AWS CloudTrail](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html), which provides a record of actions taken by a user, a role, or an AWS service in Amazon CloudWatch. CloudTrail captures all [API calls for Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/logging_cw_api_calls.html) as events that include calls from the console and code calls to API operations. + +**What more information is available?** + +For additional information customers can read the AWS Documentation for [CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html), [CloudWatch Events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) and [CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html), go through the AWS Observability Workshop on [AWS Native Observability](https://catalog.workshops.aws/observability/en-US/aws-native) and also check the [product page](https://aws.amazon.com/cloudwatch/) to know the [features](https://aws.amazon.com/cloudwatch/features/), and [pricing](https://aws.amazon.com/cloudwatch/pricing/) details. Additional [tutorials on CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-tutorials.html) illustrating customer use case scenarios. 
+ +**Product FAQ:** [https://aws.amazon.com/cloudwatch/faqs/](https://aws.amazon.com/cloudwatch/faqs/) diff --git a/docusaurus/observability-best-practices/docs/faq/faq.md b/docusaurus/observability-best-practices/docs/faq/faq.md new file mode 100644 index 000000000..03498db98 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/faq/faq.md @@ -0,0 +1,61 @@ +# General - FAQ + +## How are logs different from traces? + +Logs are limited to a single application and the events that relate to it. For example, if a user logs into a web site hosted on a microservices platform, and makes a purchase on this site, there may be logs related to that user emitted from multiple applications: + +1. A front-end web server +1. Authentication service +1. The inventory service +1. A payment processing backend +1. Outbound mailer that sends the user a receipt + +Every one of these may log something about this user, and that data is all valuable. However, traces will present a single, cohesive view of the user's entire interaction across that single transaction, spanning all of these discrete components. + +In this way, a trace a collection of events from multiple services intended to show a single view of an activity, whereas logs are bound to the context of the application that created them. + +## What signal types are immutable? + +All three of the basic signal types ([metrics](../signals/metrics/), [logs](../signals/logs/), and [traces](../signals/traces/)) are immutable, though some implementations have greater or lesser assurance of this. For example, immutability of logs is a strict requirement in many governance frameworks - and many tools exist to ensure this. Metrics and traces should likewise *always* be immutable. + +This leads to a question as to handling "bad data", or data that was inaccurate. With AWS observability services, there is no facility to delete metrics or traces that were emitted in error. CloudWatch Logs does allow for the deletion of an entire log stream, but you cannot retroactively change data once it has been collected. This is by design, and an important feature to ensure customer data is treated with the utmost care. + +## Why does immutability matter for observability? + +Immutability is paramount to observability! If past data can be modified then you would lose critical errors, or outliers in behaviour, that inform your *choices* when evolving your systems and operations. For example, a metric datapoint that shows a large gap in time does not simply show a lack of data collection, it may indicate a much larger issue in your infrastructure. Likewise, with "null" data - even empty timeseries are valuable. + +From a governance perspective, changing application logs or tracing after the fact violates the principal of [non-reputability](https://en.wikipedia.org/wiki/Non-repudiation), where you would not be able to trust that the data in your system is precisely as it was intended be by the source application. + +## What is a blast radius? + +The blast radius of a change is the potential damage that it can create in your environment. For example, if you make a database schema change then the potential risk could include the data in the database plus all of the applications that depend on it. + +Generally speaking, working to reduce the blast radius of a change is a best practice, and breaking a change into smaller, safer, and reversible chunks is always recommended wherever feasible. + +## What is a "cloud first" approach? 
+
+Cloud-first strategies are where organizations move all or most of their infrastructure to cloud-computing platforms. Instead of using physical resources like servers, they house resources in the cloud.
+
+To those used to co-located hardware, this might seem radical. However, the opposite is also true. Developers who adopt a cloud-first mentality find the idea of tying servers to a physical location unthinkable. Cloud-first teams don’t think of their servers as discrete pieces of hardware or even virtual servers. Instead, they think of them as software to fulfill a business function.
+
+Cloud-first is to the 2020s what mobile-first was to the 2010s, and virtualization was to the early 2000s.
+
+## What is technical debt?
+
+Taken from [Wikipedia](https://en.wikipedia.org/wiki/Technical_debt):
+
+> In software development, technical debt (also known as design debt or code debt) is the implied cost of additional rework caused by choosing an easy (limited) solution now instead of using a better approach that would take longer.
+
+Basically, you accumulate debt over time as you add more to your workload without removing legacy code, applications, or human processes. Technical debt detracts from your absolute productivity.
+
+For example, if you have to spend 10% of your time performing maintenance on a legacy system that provides little or no direct value to your business, then that 10% is a *cost* that you pay. Reducing technical debt increases the effective time available to create new products that add value.
+
+## What is the separation of concerns?
+
+In the context of observability solutions, the separation of concerns means dividing functional areas of a workload or an application into discrete components that are independently managed. Each component addresses a separate concern (such as log structure and the *emitting* of logs). Controlling the configuration of a component without modifying the underlying code means that developers can focus on their concerns (application functionality and feature development), while DevOps personas can focus on optimizing system performance and troubleshooting.
+
+Separation of concerns is a [core concept](https://en.wikipedia.org/wiki/Separation_of_concerns) in computer science.
+
+## What is operational excellence?
+
+Operational excellence is the performance of best practices that align with operating workloads. AWS has an entire framework dedicated to being Well-Architected. See [this page](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html) to get started with operational excellence.
\ No newline at end of file
diff --git a/docusaurus/observability-best-practices/docs/faq/x-ray.md b/docusaurus/observability-best-practices/docs/faq/x-ray.md
new file mode 100644
index 000000000..b79d118bd
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/faq/x-ray.md
@@ -0,0 +1,21 @@
+# AWS X-Ray - FAQ
+
+1. **Does AWS Distro for Open Telemetry (ADOT) support trace propagation across AWS services such as Event Bridge or SQS?**
+ Technically, that’s not ADOT but AWS X-Ray. We are working on expanding the number and types of AWS services that propagate and/or generate spans. If you have a use case depending on this, please reach out to us.
+1. **Will I be able to use the W3C trace header to ingest spans into AWS X-Ray using ADOT?**
+ Yes, later in 2023. We’re working on supporting W3C trace context propagation.
+1. **Can I trace requests across Lambda functions when SQS is involved in the middle?**
+ Yes.
X-Ray now supports tracing across Lambda functions when SQS is involved in the middle. Traces from upstream message producers are [automatically linked to traces](https://docs.aws.amazon.com/xray/latest/devguide/xray-services-sqs.html) from downstream Lambda consumer nodes, creating an end-to-end view of the application.
+1. **Should I use the X-Ray SDK or the OTel SDK to instrument my application?**
+ OTel offers more features than the X-Ray SDK, but to choose which one is right for your use case, see [Choosing between ADOT and X-Ray SDK](https://docs.aws.amazon.com/xray/latest/devguide/xray-instrumenting-your-app.html#xray-instrumenting-choosing).
+1. **Are [span events](https://opentelemetry.io/docs/instrumentation/ruby/manual/#add-span-events) supported in AWS X-Ray?**
+ Span events do not fit into the X-Ray model and are hence dropped.
+1. **How can I extract data out of AWS X-Ray?**
+ You can retrieve Service Graph, Traces, and Root cause analytics data [using the X-Ray APIs](https://docs.aws.amazon.com/xray/latest/devguide/xray-api-gettingdata.html).
+1. **Can I achieve 100% sampling? That is, I want all traces to be recorded without any sampling at all.**
+ You can adjust the sampling rules to capture a significantly larger amount of trace data. As long as the total segments sent do not breach the [service quota limits mentioned here](https://docs.aws.amazon.com/general/latest/gr/xray.html#limits_xray), X-Ray will make an effort to collect data as configured, but there is no guarantee that this will result in 100% trace data capture.
+1. **Can I dynamically increase or decrease sampling rules through APIs?**
+ Yes, you can use the [X-Ray sampling APIs](https://docs.aws.amazon.com/xray/latest/devguide/xray-api-sampling.html) to make adjustments dynamically as necessary. See this [blog for a use-case based explanation](https://aws.amazon.com/blogs/mt/dynamically-adjusting-x-ray-sampling-rules/).
+1. **Product FAQ**
+ [https://aws.amazon.com/xray/faqs/](https://aws.amazon.com/xray/faqs/)
+
diff --git a/docusaurus/observability-best-practices/docs/guides/apm.md b/docusaurus/observability-best-practices/docs/guides/apm.md
new file mode 100644
index 000000000..813b3b9a3
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/guides/apm.md
@@ -0,0 +1 @@
+# Application Performance Monitoring
diff --git a/docusaurus/observability-best-practices/docs/guides/choosing-a-tracing-agent.md b/docusaurus/observability-best-practices/docs/guides/choosing-a-tracing-agent.md
new file mode 100644
index 000000000..912de2101
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/guides/choosing-a-tracing-agent.md
@@ -0,0 +1,34 @@
+# Choosing a tracing agent
+
+## Choose the right agent
+
+AWS directly supports two toolsets for [trace](../signals/traces/) collection (plus our wealth of [observability partners](https://aws.amazon.com/products/management-and-governance/partners/)):
+
+* The [AWS Distro for OpenTelemetry](https://aws-otel.github.io/), commonly called ADOT
+* The X-Ray [SDKs](https://docs.aws.amazon.com/xray/latest/devguide/xray-instrumenting-your-app.html) and [daemon](https://docs.aws.amazon.com/xray/latest/devguide/xray-daemon.html)
+
+The selection of which tool or tools to use is a principal decision you must make as you evolve your observability solution. These tools are not mutually exclusive, and you can mix them together as necessary. And there is a best practice for making this selection. However, first you should understand the current state of [OpenTelemetry (OTEL)](https://opentelemetry.io/).
+
+OTEL is the current industry standard specification for observability signalling, and contains definitions for each of the three core signal types: [metrics](../signals/metrics/), [traces](../signals/traces/), and [logs](../signals/logs). However, OTEL has not always existed and has evolved out of earlier specifications such as [OpenMetrics](https://openmetrics.io) and [OpenTracing](https://opentracing.io). Observability vendors began openly supporting the OpenTelemetry Protocol (OTLP) in recent years.
+
+AWS X-Ray and CloudWatch pre-date the OTEL specification, as do other leading observability solutions. However, the AWS X-Ray service readily accepts OTEL traces using ADOT. ADOT has the integrations already built into it to emit telemetry into X-Ray directly, as well as to other ISV solutions.
+
+Any transaction tracing solution requires an agent and an integration into the underlying application in order to collect signals. And this, in turn, creates [technical debt](../faq/#what-is-technical-debt) in the form of libraries that must be tested, maintained, and upgraded, as well as possible retooling if you choose to change your solution in the future.
+
+The SDKs included with X-Ray are part of a tightly integrated instrumentation solution offered by AWS. ADOT is part of a broader industry solution in which X-Ray is only one of many tracing solutions. You can implement end-to-end tracing in X-Ray using either approach, but it’s important to understand the differences in order to determine the most useful approach for you.
+
+:::info
+ We recommend instrumenting your application with the AWS Distro for OpenTelemetry if you need the following:
+
+ * The ability to send traces to multiple different tracing backends without having to re-instrument your code. For example, if you wish to shift from using the X-Ray console to [Zipkin](https://zipkin.io), then only the configuration of the collector would change, leaving your application code untouched.
+
+ * Support for a large number of library instrumentations for each language, maintained by the OpenTelemetry community.
+:::
+
+:::info
+ We recommend choosing an X-Ray SDK for instrumenting your application if you need the following:
+
+ * A tightly integrated single-vendor solution.
+
+ * Integration with X-Ray centralized sampling rules, including the ability to configure sampling rules from the X-Ray console and automatically use them across multiple hosts, when using Node.js, Python, Ruby, or .NET.
+:::
\ No newline at end of file
diff --git a/docusaurus/observability-best-practices/docs/guides/containers/aws-native/ecs/best-practices-metrics-collection-1.md b/docusaurus/observability-best-practices/docs/guides/containers/aws-native/ecs/best-practices-metrics-collection-1.md
new file mode 100644
index 000000000..da5a150f1
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/guides/containers/aws-native/ecs/best-practices-metrics-collection-1.md
@@ -0,0 +1,54 @@
+# Collecting system metrics with Container Insights
+System metrics pertain to low-level resources that include physical components on a server such as CPU, memory, disks and network interfaces.
+Use [CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html) to collect, aggregate, and summarize system metrics from containerized applications deployed to Amazon ECS. Container Insights also provides diagnostic information, such as container restart failures, to help isolate issues and resolve them quickly.
It is available for Amazon ECS clusters deployed on EC2 and Fargate.
+
+Container Insights collects data as performance log events using [embedded metric format](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html). These performance log events are entries that use a structured JSON schema that enables high-cardinality data to be ingested and stored at scale. From this data, CloudWatch creates aggregated metrics at the cluster, node, service and task level as CloudWatch metrics.
+
+:::note
+ For Container Insights metrics to appear in CloudWatch, you must enable Container Insights on your Amazon ECS clusters. This can be done either at the account level or at the individual cluster level. To enable at the account level, use the following AWS CLI command:
+
+ ```
+ aws ecs put-account-setting --name "containerInsights" --value "enabled"
+ ```
+
+ To enable at the individual cluster level, use the following AWS CLI command:
+
+ ```
+ aws ecs update-cluster-settings --cluster $CLUSTER_NAME --settings name=containerInsights,value=enabled
+ ```
+:::
+
+## Collecting cluster-level and service-level metrics
+By default, CloudWatch Container Insights collects metrics at the task, service and cluster level. The Amazon ECS agent collects these metrics for each task on an EC2 container instance (for both ECS on EC2 and ECS on Fargate) and sends them to CloudWatch as performance log events. You don't need to deploy any agents to the cluster. These log events from which the metrics are extracted are collected under the CloudWatch log group named */aws/ecs/containerinsights/$CLUSTER_NAME/performance*. The complete list of metrics extracted from these events is [documented here](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-ECS.html). The metrics that Container Insights collects are readily viewable in pre-built dashboards available in the CloudWatch console by selecting *Container Insights* from the navigation page and then selecting *performance monitoring* from the dropdown list. They are also viewable in the *Metrics* section of the CloudWatch console.
+
+![Container Insights metrics dashboard](../../../../images/ContainerInsightsMetrics.png)
+
+:::note
+ If you're using Amazon ECS on an Amazon EC2 instance, and you want to collect network and storage metrics from Container Insights, launch that instance using an AMI that includes Amazon ECS agent version 1.29.
+:::
+
+:::warning
+ Metrics collected by Container Insights are charged as custom metrics. For more information about CloudWatch pricing, see [Amazon CloudWatch Pricing](https://aws.amazon.com/cloudwatch/pricing/).
+:::
+
+## Collecting instance-level metrics
+Deploying the CloudWatch agent to an Amazon ECS cluster hosted on EC2 allows you to collect instance-level metrics from the cluster. The agent is deployed as a daemon service and sends instance-level metrics as performance log events from each EC2 container instance in the cluster. The complete list of instance-level metrics extracted from these events is [documented here](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-ECS.html).
+
+:::info
+ Steps to deploy the CloudWatch agent to an Amazon ECS cluster to collect instance-level metrics are documented in the [Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-ECS-instancelevel.html).
Note that this option is not available for Amazon ECS clusters that are hosted on Fargate.
+:::
+
+## Analyzing performance log events with Logs Insights
+Container Insights collects metrics by using performance log events with embedded metric format. Each log event may contain performance data observed on system resources such as CPU and memory, or on ECS resources such as tasks and services. Examples of performance log events that Container Insights collects from an Amazon ECS cluster at the cluster, service, task and container level are [listed here](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-reference-performance-logs-ECS.html). CloudWatch generates metrics based only on some of the performance data in these log events. But you can use these log events to perform a deeper analysis of the performance data using CloudWatch Logs Insights queries.
+
+The user interface to run Logs Insights queries is available in the CloudWatch console by selecting *Logs Insights* from the navigation page. When you select a log group, CloudWatch Logs Insights automatically detects fields in the performance log events in the log group and displays them in *Discovered* fields in the right pane. The results of a query execution are displayed as a bar graph of log events in this log group over time. This bar graph shows the distribution of events in the log group that match your query and time range.
+
+![Logs Insights dashboard](../../../../images/LogInsights.png)
+
+:::info
+ Here's a sample Logs Insights query to display container-level metrics for CPU and memory usage.
+
+ ```
+ stats avg(CpuUtilized) as CPU, avg(MemoryUtilized) as Mem by TaskId, ContainerName | sort Mem, CPU desc
+ ```
+:::
\ No newline at end of file
diff --git a/docusaurus/observability-best-practices/docs/guides/containers/aws-native/ecs/best-practices-metrics-collection-2.md b/docusaurus/observability-best-practices/docs/guides/containers/aws-native/ecs/best-practices-metrics-collection-2.md
new file mode 100644
index 000000000..4d2d5980b
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/guides/containers/aws-native/ecs/best-practices-metrics-collection-2.md
@@ -0,0 +1,120 @@
+# Collecting service metrics with Container Insights
+Service metrics are application-level metrics that are captured by adding instrumentation to your code. These metrics can be captured from an application using two different approaches.
+
+1. Push approach: Here, an application sends the metrics data directly to a destination. For example, using the CloudWatch PutMetricData API, an application can publish metric data points to CloudWatch. An application may also send the data via gRPC or HTTP using the OpenTelemetry Protocol (OTLP) to an agent such as the OpenTelemetry Collector. The latter will then send the metrics data to the final destination.
+2. Pull approach: Here, the application exposes the metrics data at an HTTP endpoint in a pre-defined format. The data are then scraped by an agent that has access to this endpoint and then sent to the destination.
+
+![Push approach for metric collection](../../../../images/PushPullApproach.png)
+
+## CloudWatch Container Insights monitoring for Prometheus
+[Prometheus](https://prometheus.io/docs/introduction/overview/) is a popular open-source systems monitoring and alerting toolkit. It has emerged as the de facto standard for collecting metrics using the pull approach from containerized applications.
To capture metrics using Prometheus, you will have to first instrument your application code using the Prometheus [client library](https://prometheus.io/docs/instrumenting/clientlibs/), which is available in all the major programming languages. Metrics are usually exposed by the application over HTTP, to be read by the Prometheus server.
+When the Prometheus server scrapes your application's HTTP endpoint, the client library sends the current state of all tracked metrics to the server. The server can either store the metrics in a local storage that it manages or send the metrics data to a remote destination such as CloudWatch.
+
+[CloudWatch Container Insights monitoring for Prometheus](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights-Prometheus.html) enables you to leverage the capabilities of Prometheus in an Amazon ECS cluster. It is available for Amazon ECS clusters deployed on EC2 and Fargate. The CloudWatch agent can be used as a drop-in replacement for a Prometheus server, reducing the number of monitoring tools required to improve observability. It automates the discovery of Prometheus metrics from containerized applications deployed to Amazon ECS and sends the metrics data to CloudWatch as performance log events.
+
+:::info
+ Steps to deploy the CloudWatch agent with Prometheus metrics collection on an Amazon ECS cluster are documented in the [Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights-Prometheus-install-ECS.html).
+:::
+:::warning
+ Metrics collected by Container Insights monitoring for Prometheus are charged as custom metrics. For more information about CloudWatch pricing, see [Amazon CloudWatch Pricing](https://aws.amazon.com/cloudwatch/pricing/).
+:::
+### Autodiscovery of targets on Amazon ECS clusters
+The CloudWatch agent supports the standard Prometheus scrape configurations under the [scrape_config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) section in the Prometheus documentation. Prometheus supports both static and dynamic discovery of scraping targets using one of the dozens of supported [service-discovery mechanisms](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config). As Amazon ECS does not have any built-in service discovery mechanism, the agent relies on Prometheus' support for file-based discovery of targets. To set up the agent for file-based discovery of targets, the agent needs two [configuration parameters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights-Prometheus-Setup-configure-ECS.html), which are both defined in the task definition used for launching the agent. You can customize these parameters to have granular control over the metrics collected by the agent.
+
+The first parameter contains Prometheus global configuration that looks like the following sample:
+
+```
+global:
+  scrape_interval: 30s
+  scrape_timeout: 10s
+scrape_configs:
+  - job_name: cwagent_ecs_auto_sd
+    sample_limit: 10000
+    file_sd_configs:
+      - files: [ "/tmp/cwagent_ecs_auto_sd.yaml" ]
+```
+
+The second parameter contains configuration that helps the agent discover scraping targets. The agent periodically makes API calls to Amazon ECS to retrieve the metadata of the running ECS tasks that match the task definition patterns defined in the *ecs_service_discovery* section of this configuration.
All discovered targets are written into the result file */tmp/cwagent_ecs_auto_sd.yaml* that resides on the file system mounted to the CloudWatch agent container. The sample configuration below will result in the agent scraping metrics from all tasks that are named with the prefix *BackendTask*. Refer to the [detailed guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights-Prometheus-Setup-autodiscovery-ecs.html) for autodiscovery of targets in an Amazon ECS cluster.
+
+```
+{
+   "logs":{
+      "metrics_collected":{
+         "prometheus":{
+            "log_group_name":"/aws/ecs/containerinsights/{ClusterName}/prometheus",
+            "prometheus_config_path":"env:PROMETHEUS_CONFIG_CONTENT",
+            "ecs_service_discovery":{
+               "sd_frequency":"1m",
+               "sd_result_file":"/tmp/cwagent_ecs_auto_sd.yaml",
+               "task_definition_list":[
+                  {
+                     "sd_job_name":"backends",
+                     "sd_metrics_ports":"3000",
+                     "sd_task_definition_arn_pattern":".*:task-definition/BackendTask:[0-9]+",
+                     "sd_metrics_path":"/metrics"
+                  }
+               ]
+            },
+            "emf_processor":{
+               "metric_declaration":[
+                  {
+                     "source_labels":[
+                        "job"
+                     ],
+                     "label_matcher":"^backends$",
+                     "dimensions":[
+                        [
+                           "ClusterName",
+                           "TaskGroup"
+                        ]
+                     ],
+                     "metric_selectors":[
+                        "^http_requests_total$"
+                     ]
+                  }
+               ]
+            }
+         }
+      },
+      "force_flush_interval":5
+   }
+}
+```
+
+### Importing Prometheus metrics into CloudWatch
+The metrics collected by the agent are sent to CloudWatch as performance log events based on the filtering rules specified in the *metric_declaration* section of the configuration. This section is also used to specify the array of logs with embedded metric format to be generated. The sample configuration above will generate log events, as shown below, only for a metric named *http_requests_total* with the label *job:backends*. Using this data, CloudWatch will create the metric *http_requests_total* under the CloudWatch namespace *ECS/ContainerInsights/Prometheus* with the dimensions *ClusterName* and *TaskGroup*.
+``` +{ + "CloudWatchMetrics":[ + { + "Metrics":[ + { + "Name":"http_requests_total" + } + ], + "Dimensions":[ + [ + "ClusterName", + "TaskGroup" + ] + ], + "Namespace":"ECS/ContainerInsights/Prometheus" + } + ], + "ClusterName":"ecs-sarathy-cluster", + "LaunchType":"EC2", + "StartedBy":"ecs-svc/4964126209508453538", + "TaskDefinitionFamily":"BackendAlarmTask", + "TaskGroup":"service:BackendService", + "TaskRevision":"4", + "Timestamp":"1678226606712", + "Version":"0", + "container_name":"go-backend", + "exported_job":"storebackend", + "http_requests_total":36, + "instance":"10.10.100.191:3000", + "job":"backends", + "path":"/popular/category", + "prom_metric_type":"counter" +} +``` \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/guides/containers/aws-native/ecs/cost-optimization.md b/docusaurus/observability-best-practices/docs/guides/containers/aws-native/ecs/cost-optimization.md new file mode 100644 index 000000000..e69de29bb diff --git a/docusaurus/observability-best-practices/docs/guides/containers/aws-native/ecs/resource-optimization.md b/docusaurus/observability-best-practices/docs/guides/containers/aws-native/ecs/resource-optimization.md new file mode 100644 index 000000000..e69de29bb diff --git a/docusaurus/observability-best-practices/docs/guides/containers/aws-native/eks/amazon-cloudwatch-container-insights.md b/docusaurus/observability-best-practices/docs/guides/containers/aws-native/eks/amazon-cloudwatch-container-insights.md new file mode 100644 index 000000000..a9c4b67fc --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/containers/aws-native/eks/amazon-cloudwatch-container-insights.md @@ -0,0 +1,250 @@ +# Amazon CloudWatch Container Insights + +In this section of Observability best practices guide, we will deep dive on to following topics related to Amazon CloudWatch Container Insights : + +* Introduction to Amazon CloudWatch Container Insights +* Using Amazon CloudWatch Container Insights with AWS Distro for Open Telemetry +* Fluent Bit Integration in CloudWatch Container Insights for Amazon EKS +* Cost savings with Container Insights on Amazon EKS +* Using EKS Blueprints to setup Container Insights + +### Introduction + +[Amazon CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html) helps customers collect, aggregate, and summarize metrics and logs from containerized applications and microservices. Metrics data is collected as performance log events using the [embedded metric format](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html). These performance log events use a structured JSON schema that enables high-cardinality data to be ingested and stored at scale. From this data, CloudWatch creates aggregated metrics at the cluster, node, pod, task, and service level as CloudWatch metrics. The metrics that Container Insights collects are available in CloudWatch automatic dashboards. Container Insights are available for Amazon EKS clusters with self managed node groups, managed node groups and AWS Fargate profiles. + +From a cost optimization standpoint and to help you manage your Container Insights cost, CloudWatch does not automatically create all possible metrics from the log data. However, you can view additional metrics and additional levels of granularity by using CloudWatch Logs Insights to analyze the raw performance log events. 
Metrics collected by Container Insights are charged as custom metrics. For more information about CloudWatch pricing, see [Amazon CloudWatch Pricing](https://aws.amazon.com/cloudwatch/pricing/). + +In Amazon EKS, Container Insights uses a containerized version of the [CloudWatch agent](https://gallery.ecr.aws/cloudwatch-agent/cloudwatch-agent) which is provided by Amazon via Amazon Elastic Container Registry to discover all of the running containers in a cluster. It then collects performance data at every tier of the performance stack. Container Insights supports encryption with the AWS KMS key for the logs and metrics that it collects. To enable this encryption, you must manually enable AWS KMS encryption for the log group that receives Container Insights data. This results in CloudWatch Container Insights encrypting this data using the provided AWS KMS key. Only symmetric keys are supported and asymmetric AWS KMS keys are not supported to encrypt your log groups. Container Insights are supported only in Linux instances. Container Insights for Amazon EKS is supported in the [these](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html#:~:text=Container%20Insights%20for%20Amazon%20EKS%20and%20Kubernetes%20is%20supported%20in%20the%20following%20Regions%3A) AWS Regions. + +### Using Amazon CloudWatch Container Insights with AWS Distro for Open Telemetry + +We will now deep dive in to [AWS Distro for OpenTelemetry (ADOT)](https://aws-otel.github.io/docs/introduction) which is one of the options to enable collection of Container insight metrics from Amazon EKS workloads. [AWS Distro for OpenTelemetry (ADOT)](https://aws-otel.github.io/docs/introduction) is a secure, AWS-supported distribution of the [OpenTelemetry](https://opentelemetry.io/docs/) project. With ADOT, users can instrument their applications just once to send correlated metrics and traces to multiple monitoring solutions. With ADOT support for CloudWatch Container Insights, customers can collect system metrics such as CPU, memory, disk, and network usage from Amazon EKS clusters running on [Amazon Elastic Cloud Compute](https://aws.amazon.com/pm/ec2/?trk=ps_a134p000004f2ZFAAY&trkCampaign=acq_paid_search_brand&sc_channel=PS&sc_campaign=acquisition_US&sc_publisher=Google&sc_category=Cloud%20Computing&sc_country=US&sc_geo=NAMER&sc_outcome=acq&sc_detail=amazon%20ec2&sc_content=EC2_e&sc_matchtype=e&sc_segment=467723097970&sc_medium=ACQ-P|PS-GO|Brand|Desktop|SU|Cloud%20Computing|EC2|US|EN|Text&s_kwcid=AL!4422!3!467723097970!e!!g!!amazon%20ec2&ef_id=Cj0KCQiArt6PBhCoARIsAMF5waj-FXPUD0G-cm0dJ05Mz6aXDvqEGu-S7pCXwvVusULN6ZbPbc_Alg8aArOHEALw_wcB:G:s&s_kwcid=AL!4422!3!467723097970!e!!g!!amazon%20ec2) (Amazon EC2), providing the same experience as Amazon CloudWatch agent. ADOT Collector is now available with support for CloudWatch Container Insights for Amazon EKS and AWS Fargate profile for Amazon EKS. Customers can now collect container and pod metrics such as CPU and memory utilization for their pods that are deployed to an Amazon EKS cluster and view them in CloudWatch dashboards without any changes to their existing CloudWatch Container Insights experience. This will enable customers to also determine whether to scale up or down to respond to traffic and save costs. + +The ADOT Collector has the [concept of a pipeline](https://opentelemetry.io/docs/collector/configuration/) which comprises three key types of components, namely, receiver, processor, and exporter. 
A [receiver](https://opentelemetry.io/docs/collector/configuration/#receivers) is how data gets into the collector. It accepts data in a specified format, translates it into the internal format and passes it to [processors](https://opentelemetry.io/docs/collector/configuration/#processors) and [exporters](https://opentelemetry.io/docs/collector/configuration/#exporters) defined in the pipeline. It can be pull or push based. A processor is an optional component that is used to perform tasks such as batching, filtering, and transformations on data between being received and being exported. An exporter is used to determine which destination to send the metrics, logs or traces. The collector architecture allows multiple instances of such pipelines to be defined via YAML configuration. The following diagrams illustrates the pipeline components in an ADOT Collector instance deployed to Amazon EKS and Amazon EKS with Fargate profile. + +![CW-ADOT-EKS](../../../../images/Containers/aws-native/eks/cw-adot-collector-pipeline-eks.jpg) + +*Figure: Pipeline components in an ADOT Collector instance deployed to Amazon EKS* + +In the above architecture, we are deploying we are using an instance of [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/awscontainerinsightreceiver) in the pipeline and collect the metrics directly from the Kubelet. AWS Container Insights Receiver (`awscontainerinsightreceiver`) is an AWS specific receiver that supports [CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html). CloudWatch Container Insights collect, aggregate, and summarize metrics and logs from your containerized applications and microservices. Data are collected as as performance log events using [embedded metric format](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html). From the EMF data, Amazon CloudWatch can create the aggregated CloudWatch metrics at the cluster, node, pod, task, and service level. Below is an example of a sample `awscontainerinsightreceiver` configuration : + +``` +receivers: + awscontainerinsightreceiver: + # all parameters are optional + collection_interval: 60s + container_orchestrator: eks + add_service_as_attribute: true + prefer_full_pod_name: false + add_full_pod_name_metric_label: false +``` + +This entails deploying the collector as a DaemonSet using the above configuration on Amazon EKS. You will also have access to a fuller set of metrics collected by this receiver directly from the Kubelet. Having more than one instances of ADOT Collector will suffice to collect resource metrics from all the nodes in a cluster. Having a single instance of ADOT collector can be overwhelming during higher loads so always recommend to deploy more than one collector. + +![CW-ADOT-FARGATE](../../../../images/Containers/aws-native/eks/cw-adot-collector-pipeline.jpg) + +*Figure: Pipeline components in an ADOT Collector instance deployed to Amazon EKS with Fargate profile* + +In the above architecture, the kubelet on a worker node in a Kubernetes cluster exposes resource metrics such as CPU, memory, disk, and network usage at the */metrics/cadvisor* endpoint. However, in EKS Fargate networking architecture, a pod is not allowed to directly reach the kubelet on that worker node. 
Hence, the ADOT Collector calls the Kubernetes API Server to proxy the connection to the kubelet on a worker node, and collect kubelet’s cAdvisor metrics for workloads on that node. These metrics are made available in Prometheus format. Therefore, the collector uses an instance of [Prometheus Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver) as a drop-in replacement for a Prometheus server and scrapes these metrics from the Kubernetes API server endpoint. Using Kubernetes service discovery, the receiver can discover all the worker nodes in an EKS cluster. Hence, more than one instances of ADOT Collector will suffice to collect resource metrics from all the nodes in a cluster. Having a single instance of ADOT collector can be overwhelming during higher loads so always recommend to deploy more than one collector. + +The metrics then go through a series of processors that perform filtering, renaming, data aggregation and conversion, and so on. The following is the list of processors used in the pipeline of an ADOT Collector instance for Amazon EKS illustrated above. + +* [Filter Processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/filterprocessor) is part of the AWS OpenTelemetry distribution to include or exclude metrics based on their name. It can be used as part of the metrics collection pipeline to filter out unwanted metrics. For example, suppose that you want Container Insights to only collect pod-level metrics (with name prefix `pod_`) excluding those for networking, with name prefix `pod_network`. + +``` + # filter out only renamed metrics which we care about + filter: + metrics: + include: + match_type: regexp + metric_names: + - new_container_.* + - pod_.* +``` + +* [Metrics Transform Processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstransformprocessor) can be used to rename metrics, and add, rename or delete label keys and values. It can also be used to perform scaling and aggregations on metrics across labels or label values. + +``` + metricstransform/rename: + transforms: + - include: container_spec_cpu_quota + new_name: new_container_cpu_limit_raw + action: insert + match_type: regexp + experimental_match_labels: {"container": "\\S"} +``` + +* [Cumulative to Delta Processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/cumulativetodeltaprocessor) converts monotonic, cumulative sum and histogram metrics to monotonic, delta metrics. Non-monotonic sums and exponential histograms are excluded. + +``` +` # convert cumulative sum datapoints to delta + cumulativetodelta: + metrics: + - pod_cpu_usage_seconds_total + - pod_network_rx_errors` +``` + +* [Delta to Rate Processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/deltatorateprocessor) to convert delta sum metrics to rate metrics. This rate is a gauge. 
+ +``` +` # convert delta to rate + deltatorate: + metrics: + - pod_memory_hierarchical_pgfault + - pod_memory_hierarchical_pgmajfault + - pod_network_rx_bytes + - pod_network_rx_dropped + - pod_network_rx_errors + - pod_network_tx_errors + - pod_network_tx_packets + - new_container_memory_pgfault + - new_container_memory_pgmajfault + - new_container_memory_hierarchical_pgfault + - new_container_memory_hierarchical_pgmajfault` +``` + +* [Metrics Generation Processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricsgenerationprocessor) can be used to create new metrics using existing metrics following a given rule. + +``` + experimental_metricsgeneration/1: + rules: + - name: pod_memory_utilization_over_pod_limit + unit: Percent + type: calculate + metric1: pod_memory_working_set + metric2: pod_memory_limit + operation: percent +``` + +The final component in the pipeline is [AWS CloudWatch EMF Exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/awsemfexporter), which converts the metrics to embedded metric format (EMF) and then sends them directly to CloudWatch Logs using the [PutLogEvents](https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_PutLogEvents.html) API. The following list of metrics is sent to CloudWatch by the ADOT Collector for each of the workloads running on Amazon EKS. + +* pod_cpu_utilization_over_pod_limit +* pod_cpu_usage_total +* pod_cpu_limit +* pod_memory_utilization_over_pod_limit +* pod_memory_working_set +* pod_memory_limit +* pod_network_rx_bytes +* pod_network_tx_bytes + +Each metric will be associated with the following dimension sets and collected under the CloudWatch namespace named *ContainerInsights*. + +* ClusterName, LaunchType +* ClusterName, Namespace, LaunchType +* ClusterName, Namespace, PodName, LaunchType + +Further, Please learn about [Container Insights Prometheus support for ADOT](https://aws.amazon.com/blogs/containers/introducing-cloudwatch-container-insights-prometheus-support-with-aws-distro-for-opentelemetry-on-amazon-ecs-and-amazon-eks/) and [deploying ADOT collector on Amazon EKS to visualize Amazon EKS resource metrics using CloudWatch Container Insights](https://aws.amazon.com/blogs/containers/introducing-amazon-cloudwatch-container-insights-for-amazon-eks-fargate-using-aws-distro-for-opentelemetry/) to setup ADOT collector pipeline in your Amazon EKS cluster and how to visualize your Amazon EKS resource metrics in CloudWatch Container Insights. Additionally, please reference [Easily Monitor Containerized Applications with Amazon CloudWatch Container Insights](https://community.aws/tutorials/navigating-amazon-eks/eks-monitor-containerized-applications#step-3-use-cloudwatch-logs-insights-query-to-search-and-analyze-container-logs), which includes step-by-step instructions on configuring an Amazon EKS cluster, deploying a containerized application, and monitoring the application's performance using Container Insights. + +### Fluent Bit Integration in CloudWatch Container Insights for Amazon EKS + +[Fluent Bit](https://fluentbit.io/) is an open source and multi-platform log processor and forwarder that allows you to collect data and logs from different sources, and unify and send them to different destinations including CloudWatch Logs. It’s also fully compatible with [Docker](https://www.docker.com/) and [Kubernetes](https://kubernetes.io/) environments. 
Using the newly launched Fluent Bit daemonset, you can send container logs from your EKS clusters to CloudWatch logs for logs storage and analytics. + +Due to its lightweight nature, using Fluent Bit as the default log forwarder in Container Insights on EKS worker nodes will allow you to stream application logs into CloudWatch logs efficiently and reliably. With Fluent Bit, Container Insights is able to deliver thousands of business critical logs at scale in a resource efficient manner, especially in terms of CPU and memory utilization at the pod level. In other words, compared to FluentD, which was the log forwarder used prior, Fluent Bit has a smaller resource footprint and, as a result, is more resource efficient for memory and CPU. On the other hand, [AWS for Fluent Bit image](https://github.com/aws/aws-for-fluent-bit), which includes Fluent Bit and related plugins, gives Fluent Bit an additional flexibility of adopting new AWS features faster as the image aims to provide a unified experience within AWS ecosystem. + +The architecture below shows individual components used by CloudWatch Container Insights for EKS: + +![CW-COMPONENTS](../../../../images/Containers/aws-native/eks/cw-components.jpg) + +*Figure: Individual components used by CloudWatch Container Insights for EKS.* + +While working with containers, it is recommended to push all the logs, including application logs, through the standard output (stdout) and standard error output (stderr) methods whenever possible using the Docker JSON logging driver. For this reason, in EKS, the logging driver is configured by default and everything that a containerized application writes to `stdout` or `stderr` is streamed into a JSON file under `“/var/log/containers"` on the worker node. Container Insights classifies those logs into three different categories by default and creates dedicated input streams for each category within Fluent Bit and independent log groups within CloudWatch Logs. Those categories are: + +* Application logs: All applications logs stored under `“/var/log/containers/*.log"` are streamed into the dedicated `/aws/containerinsights/Cluster_Name/application` log group. All non-application logs such as kube-proxy and aws-node logs are excluded by default. However, additional Kubernetes add-on logs, such as CoreDNS logs, are also processed and streamed into this log group. +* Host logs: system logs for each EKS worker node are streamed into the `/aws/containerinsights/Cluster_Name/host` log group. These system logs include the contents of `“/var/log/messages,/var/log/dmesg,/var/log/secure”` files. Considering the stateless and dynamic nature of containerized workloads, where EKS worker nodes are often being terminated during scaling activities, streaming those logs in real time with Fluent Bit and having those logs available in CloudWatch logs, even after the node is terminated, are critical in terms of observability and monitoring health of EKS worker nodes. It also enables you to debug or troubleshoot cluster issues without logging into worker nodes in many cases and analyze these logs in more systematic way. +* Data plane logs: EKS already provides [control plane logs](https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html). With Fluent Bit integration in Container Insights, the logs generated by EKS data plane components, which run on every worker node and are responsible for maintaining running pods are captured as data plane logs. 
These logs are also streamed into a dedicated CloudWatch log group under `‘ /aws/containerinsights/Cluster_Name/dataplane`. kube-proxy, aws-node, and Docker runtime logs are saved into this log group. In addition to control plane logs, having data plane logs stored in CloudWatch Logs helps to provide a complete picture of your EKS clusters. + +Further, Please learn more on topics such as Fluent Bit Configurations, Fluent Bit Monitoring and Log analysis from [Fluent Bit Integration with Amazon EKS](https://aws.amazon.com/blogs/containers/fluent-bit-integration-in-cloudwatch-container-insights-for-eks/). + +### Cost savings with Container Insights on Amazon EKS + +With the default configuration, the Container Insights receiver collects the complete set of metrics as defined by the [receiver documentation](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/awscontainerinsightreceiver#available-metrics-and-resource-attributes). The number of metrics and dimensions collected is high, and for large clusters this will significantly increase the costs for metric ingestion and storage. We are going to demonstrate two different approaches that you can use to configure the ADOT Collector to send only metrics that bring value and saves cost. + +#### Using processors + +This approach involves the introduction of OpenTelemetry processors as discussed above to filter out metrics or attributes to reduce the size of [EMF logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html). We will demonstrate the basic usage of two processors namely *Filter* and *Resource.* + +[Filter processors](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/filterprocessor/README.md) can be included in the `ConfigMap` named `otel-agent-conf`: + +``` +processors: + # filter processors example + filter/include: + # any names NOT matching filters are excluded from remainder of pipeline + metrics: + include: + match_type: regexp + metric_names: + # re2 regexp patterns + - ^pod_.* + filter/exclude: + # any names matching filters are excluded from remainder of pipeline + metrics: + exclude: + match_type: regexp + metric_names: + - ^pod_network.* +``` + +[Resource processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/resourceprocessor/README.md) is also built into the AWS OpenTelemetry Distro and can be used to remove unwanted metric attributes. For example, if you want to remove the `Kubernetes` and `Sources` fields from the EMF logs, you can add the resource processor to the pipeline: + +``` + # resource processors example + resource: + attributes: + - key: Sources + action: delete + - key: kubernetes + action: delete +``` + +#### Customize Metrics and Dimensions + +In this approach, you will configure the CloudWatch EMF exporter to generate only the set of metrics that you want to send to CloudWatch Logs. The [metric_declaration](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/5ccdbe08c6a2a43b7c6c7f9c0031a4b0348394a9/exporter/awsemfexporter/README.md#metric_declaration) section of CloudWatch EMF exporter configuration can be used to define the set of metrics and dimensions that you want to export. For example, you can keep only pod metrics from the default configuration. 
This `metric_declaration` section will look like the following. To reduce the number of metrics, you can keep only the `[PodName, Namespace, ClusterName]` dimension set if you do not care about the others:

```
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    # Customized metric declaration section
    metric_declarations:
      # pod metrics
      - dimensions: [[PodName, Namespace, ClusterName]]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit
```

This configuration will produce and stream the following four metrics with the single dimension set `[PodName, Namespace, ClusterName]`, rather than the 55 different metrics across multiple dimensions in the default configuration:

* pod_cpu_utilization
* pod_memory_utilization
* pod_cpu_utilization_over_pod_limit
* pod_memory_utilization_over_pod_limit

With this configuration, you only send the metrics that you are interested in rather than all the metrics configured by default. As a result, you can decrease the metric ingestion cost for Container Insights considerably. This flexibility gives Container Insights customers a high level of control over the metrics being exported. Customizing metrics by modifying the `awsemf` exporter configuration is also highly flexible: you can customize both the metrics that you want to send and their dimensions. Note that this is only applicable to logs that are sent to CloudWatch.

The two approaches discussed above are not mutually exclusive. In fact, they can be combined for a high degree of flexibility in customizing the metrics we want ingested into our monitoring system (a combined pipeline is sketched below). We used this approach to decrease the costs associated with metric storage and processing, as shown in the following graph.

![CW-COST-EXPLORER](../../../../images/Containers/aws-native/eks/cw-cost-explorer.jpg)

*Figure: AWS Cost Explorer.*

In the preceding AWS Cost Explorer graph, we can see the daily cost associated with CloudWatch using different configurations of the ADOT Collector in a small EKS cluster (20 worker nodes, 220 pods). *Aug 15th* shows the CloudWatch bill using the ADOT Collector with the default configuration. On *Aug 16th*, we used the [Customize EMF exporter](https://aws.amazon.com/blogs/containers/cost-savings-by-customizing-metrics-sent-by-container-insights-in-amazon-eks/#customize-emf-exporter) approach and can see about 30% cost savings. On *Aug 17th*, we used the [Processors](https://aws.amazon.com/blogs/containers/cost-savings-by-customizing-metrics-sent-by-container-insights-in-amazon-eks/#processors) approach, which achieves about 45% cost savings.
You must consider the trade-offs of customizing the metrics sent by Container Insights: you decrease monitoring costs at the expense of visibility into the monitored cluster. The built-in dashboards provided by Container Insights in the AWS console can also be affected, because you may choose not to send metrics and dimensions that those dashboards rely on.
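Because processors only take effect when they are referenced in a pipeline, combining the two approaches means wiring the filter and resource processors ahead of the customized `awsemf` exporter in the collector's `service` section. The following is a rough sketch, assuming the processor and exporter names match the examples above and that the receiver is the `awscontainerinsightreceiver` used by Container Insights; check the `otel-agent-conf` ConfigMap in your cluster for the exact component names in your pipeline:

```
service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      # keep only pod_* metrics, then drop pod_network_*, then remove the
      # Sources and kubernetes attributes before exporting EMF logs
      processors: [filter/include, filter/exclude, resource]
      exporters: [awsemf]
```

Keep any processors that are already present in your default configuration (for example, a batch processor), since processors run in the order they are listed.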
For further learning, see [Cost savings by customizing metrics sent by Container Insights in Amazon EKS](https://aws.amazon.com/blogs/containers/cost-savings-by-customizing-metrics-sent-by-container-insights-in-amazon-eks/).

### Using EKS Blueprints to set up Container Insights

[EKS Blueprints](https://aws.amazon.com/blogs/containers/bootstrapping-clusters-with-eks-blueprints/) is a collection of Infrastructure as Code (IaC) modules that help you configure and deploy consistent, batteries-included EKS clusters across accounts and regions. You can use EKS Blueprints to easily bootstrap an EKS cluster with [Amazon EKS add-ons](https://docs.aws.amazon.com/eks/latest/userguide/eks-add-ons.html) as well as a wide range of popular open-source add-ons, including Prometheus, Karpenter, Nginx, Traefik, AWS Load Balancer Controller, Container Insights, Fluent Bit, Keda, Argo CD, and more. EKS Blueprints is implemented in two popular IaC frameworks, [HashiCorp Terraform](https://github.com/aws-ia/terraform-aws-eks-blueprints) and [AWS Cloud Development Kit (AWS CDK)](https://github.com/aws-quickstart/cdk-eks-blueprints), which help you automate infrastructure deployments.

As part of your Amazon EKS cluster creation process using EKS Blueprints, you can set up Container Insights as Day 2 operational tooling to collect, aggregate, and summarize metrics and logs from containerized applications and microservices into the Amazon CloudWatch console.

### Conclusion

In this section of the Observability best practices guide, we covered CloudWatch Container Insights in depth. We introduced Amazon CloudWatch Container Insights and how it can help you observe your containerized workloads on Amazon EKS. We then went deeper into using Amazon CloudWatch Container Insights with AWS Distro for OpenTelemetry to collect Container Insights metrics and visualize the metrics of your containerized workloads in the Amazon CloudWatch console. Next, we covered the Fluent Bit integration in CloudWatch Container Insights for Amazon EKS, which creates dedicated input streams within Fluent Bit and independent log groups within CloudWatch Logs for application, host, and data plane logs. We then discussed two approaches, processors and customized metric declarations, for achieving cost savings with CloudWatch Container Insights. Finally, we briefly discussed how to use EKS Blueprints to set up Container Insights during the Amazon EKS cluster creation process. You can get hands-on experience with the [CloudWatch Container Insights module](https://catalog.workshops.aws/observability/en-US/aws-native/insights/containerinsights) within the [One Observability Workshop](https://catalog.workshops.aws/observability/en-US).
diff --git a/docusaurus/observability-best-practices/docs/guides/containers/aws-native/eks/container-tracing-with-aws-xray.md b/docusaurus/observability-best-practices/docs/guides/containers/aws-native/eks/container-tracing-with-aws-xray.md
new file mode 100644
index 000000000..718e9c5bf
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/guides/containers/aws-native/eks/container-tracing-with-aws-xray.md
@@ -0,0 +1,158 @@
# Container Tracing with AWS X-Ray

In this section of the Observability best practices guide, we will deep dive into the following topics related to container tracing with AWS X-Ray:

* Introduction to AWS X-Ray
* Traces collection using Amazon EKS add-ons for AWS Distro for OpenTelemetry
* Conclusion

### Introduction

[AWS X-Ray](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html) is a service that collects data about requests that your application serves, and provides tools that you can use to view, filter, and gain insights into that data to identify issues and opportunities for optimization. For any traced request to your application, you can see detailed information not only about the request and response, but also about calls that your application makes to downstream AWS resources, microservices, databases, and web APIs.

Instrumenting your application involves sending trace data for incoming and outbound requests and other events within your application, along with metadata about each request. Many instrumentation scenarios require only configuration changes. For example, you can instrument all incoming HTTP requests and downstream calls to AWS services that your Java application makes. There are several SDKs, agents, and tools that can be used to instrument your application for X-Ray tracing. See [Instrumenting your application](https://docs.aws.amazon.com/xray/latest/devguide/xray-instrumenting-your-app.html) for more information.

We will learn about containerized application tracing by collecting traces from your Amazon EKS cluster using Amazon EKS add-ons for AWS Distro for OpenTelemetry.

### Traces collection using Amazon EKS add-ons for AWS Distro for OpenTelemetry

[AWS X-Ray](https://aws.amazon.com/xray/) provides application-tracing functionality, giving deep insights into all deployed microservices. With X-Ray, every request can be traced as it flows through the involved microservices. This gives your DevOps teams the insights they need to understand how your services interact with their peers and enables them to analyze and debug issues much faster.

[AWS Distro for OpenTelemetry (ADOT)](https://aws-otel.github.io/docs/introduction) is a secure, AWS-supported distribution of the OpenTelemetry project. Users can instrument their applications just once and, using ADOT, send correlated metrics and traces to multiple monitoring solutions. Amazon EKS now allows users to enable ADOT as an add-on at any time after the cluster is up and running. The ADOT add-on includes the latest security patches and bug fixes and is validated by AWS to work with Amazon EKS.

The ADOT add-on is an implementation of a Kubernetes Operator, which is a software extension to Kubernetes that makes use of custom resources to manage applications and their components. The add-on watches for a custom resource named OpenTelemetryCollector and manages the lifecycle of an ADOT Collector based on the configuration settings specified in the custom resource.
The ADOT Collector has the concept of a pipeline that comprises three key types of components, namely receivers, processors, and exporters. A [receiver](https://opentelemetry.io/docs/collector/configuration/#receivers) is how data gets into the collector. It accepts data in a specific format, translates it into the internal format, and passes it to the [processors](https://opentelemetry.io/docs/collector/configuration/#processors) and [exporters](https://opentelemetry.io/docs/collector/configuration/#exporters) defined in the pipeline. It can be pull- or push-based. A processor is an optional component that performs tasks such as batching, filtering, and transformations on data between being received and being exported. An exporter determines which destination to send the metrics, logs, or traces to. The collector architecture allows multiple instances of such pipelines to be set up via a Kubernetes YAML manifest.

The following diagram illustrates an ADOT Collector configured with a traces pipeline that sends telemetry data to AWS X-Ray. The traces pipeline comprises an instance of the [AWS X-Ray Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/awsxrayreceiver) and the [AWS X-Ray Exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/awsxrayexporter), which sends traces to AWS X-Ray.

![Tracing-1](../../../../images/Containers/aws-native/eks/tracing-1.jpg)

*Figure: Traces collection using Amazon EKS add-ons for AWS Distro for OpenTelemetry.*

Let’s delve into the details of installing the ADOT add-on in an EKS cluster and then collecting telemetry data from workloads. The following is a list of prerequisites needed before we can install the ADOT add-on:

* An EKS cluster supporting Kubernetes version 1.19 or higher. You may create the EKS cluster using one of the [approaches outlined here](https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html).
* [cert-manager](https://cert-manager.io/), if not already installed in the cluster. It can be installed with the default configuration as per [this documentation](https://cert-manager.io/docs/installation/).
* Kubernetes RBAC permissions that allow EKS add-ons to install the ADOT add-on in your cluster. These can be granted by applying the [settings in this YAML](https://amazon-eks.s3.amazonaws.com/docs/addons-otel-permissions.yaml) file to the cluster using a CLI tool such as kubectl.

You can check the list of add-ons enabled for different versions of EKS using the following command:

`aws eks describe-addon-versions`

The JSON output should list the ADOT add-on among others, as shown below. Note that when an EKS cluster is created, the EKS add-ons feature does not install any add-ons on it.
```
{
  "addonName": "adot",
  "type": "observability",
  "addonVersions": [
    {
      "addonVersion": "v0.45.0-eksbuild.1",
      "architecture": [
        "amd64"
      ],
      "compatibilities": [
        {
          "clusterVersion": "1.22",
          "platformVersions": [
            "*"
          ],
          "defaultVersion": true
        },
        {
          "clusterVersion": "1.21",
          "platformVersions": [
            "*"
          ],
          "defaultVersion": true
        },
        {
          "clusterVersion": "1.20",
          "platformVersions": [
            "*"
          ],
          "defaultVersion": true
        },
        {
          "clusterVersion": "1.19",
          "platformVersions": [
            "*"
          ],
          "defaultVersion": true
        }
      ]
    }
  ]
}
```

Next, you can install the ADOT add-on with the following command:

`aws eks create-addon --addon-name adot --addon-version v0.45.0-eksbuild.1 --cluster-name $CLUSTER_NAME`

The version string must match the value of the *addonVersion* field in the previously shown output. The output from a successful execution of this command looks as follows:

```
{
  "addon": {
    "addonName": "adot",
    "clusterName": "k8s-production-cluster",
    "status": "ACTIVE",
    "addonVersion": "v0.45.0-eksbuild.1",
    "health": {
      "issues": []
    },
    "addonArn": "arn:aws:eks:us-east-1:123456789000:addon/k8s-production-cluster/adot/f0bff97c-0647-ef6f-eecf-0b2a13f7491b",
    "createdAt": "2022-04-04T10:36:56.966000+05:30",
    "modifiedAt": "2022-04-04T10:38:09.142000+05:30",
    "tags": {}
  }
}
```

Wait until the add-on is in ACTIVE status before proceeding to the next step. The status of the add-on can be checked using the following command:

`aws eks describe-addon --addon-name adot --cluster-name $CLUSTER_NAME`

#### Deploying the ADOT Collector

As described earlier, the ADOT add-on watches for a custom resource named OpenTelemetryCollector and manages the lifecycle of an ADOT Collector based on the configuration settings specified in that resource. The following figure shows an illustration of how this works.

![Tracing-2](../../../../images/Containers/aws-native/eks/tracing-2.jpg)

*Figure: Deploying the ADOT Collector.*

Next, let’s take a look at how to deploy an ADOT Collector. The [YAML configuration file here](https://github.com/aws-observability/aws-o11y-recipes/blob/main/sandbox/eks-addon-adot/otel-collector-xray-prometheus-complete.yaml) defines an OpenTelemetryCollector custom resource. When deployed to an EKS cluster, this will trigger the ADOT add-on to provision an ADOT Collector that includes traces and metrics pipelines with components as shown in the first illustration above. The collector is launched into the `aws-otel-eks` namespace as a Kubernetes Deployment with the name `${custom-resource-name}-collector`. A ClusterIP service with the same name is launched as well. Let’s look into the individual components that make up the pipelines of this collector.

The AWS X-Ray Receiver in the traces pipeline accepts segments or spans in [X-Ray Segment format](https://docs.aws.amazon.com/xray/latest/devguide/xray-api-segmentdocuments.html), which enables it to process segments sent by microservices instrumented with the X-Ray SDK. It is configured to listen for traffic on UDP port 2000 and is exposed as a ClusterIP service. Per this configuration, workloads that want to send trace data to this receiver should be configured with the environment variable `AWS_XRAY_DAEMON_ADDRESS` set to `observability-collector.aws-otel-eks:2000`.
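To make this concrete, the following is a minimal sketch of an OpenTelemetryCollector custom resource wiring the X-Ray receiver to the X-Ray exporter. The YAML configuration file linked above is the complete, authoritative version; the API version, region, and names below are illustrative and chosen to match the `observability-collector.aws-otel-eks:2000` endpoint mentioned earlier:

```
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: observability            # the Deployment and Service are named observability-collector
  namespace: aws-otel-eks
spec:
  mode: deployment
  serviceAccount: aws-otel-collector   # bound to an IAM role via IRSA, as described below
  config: |
    receivers:
      awsxray:
        endpoint: 0.0.0.0:2000   # accepts X-Ray SDK segments
        transport: udp
    exporters:
      awsxray:
        region: us-east-1        # illustrative; use your cluster's region
    service:
      pipelines:
        traces:
          receivers: [awsxray]
          exporters: [awsxray]
```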
The AWS X-Ray Exporter then sends these segments directly to X-Ray using the [PutTraceSegments](https://docs.aws.amazon.com/xray/latest/api/API_PutTraceSegments.html) API.

The ADOT Collector is configured to be launched under the identity of a Kubernetes service account named `aws-otel-collector`, which is granted the required Kubernetes permissions using a ClusterRoleBinding and ClusterRole, also shown in the [configuration](https://github.com/aws-observability/aws-o11y-recipes/blob/main/sandbox/eks-addon-adot/otel-collector-xray-prometheus-complete.yaml). The exporters need IAM permissions to send data to X-Ray. This is done by associating the service account with an IAM role using the [IAM roles for service accounts](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html) feature supported by EKS. The IAM role should be associated with AWS managed policies such as AWSXRayDaemonWriteAccess. The [helper script here](https://github.com/aws-observability/aws-o11y-recipes/blob/main/sandbox/eks-addon-adot/adot-irsa.sh) may be used, after setting the CLUSTER_NAME and REGION variables, to create an IAM role named `EKS-ADOT-ServiceAccount-Role` that is granted these permissions and is associated with the `aws-otel-collector` service account.

#### End-to-end test of traces collection

Let’s now put all this together and test traces collection from workloads deployed to an EKS cluster. The following illustration shows the setup employed for this test. It comprises a front-end service that exposes a set of REST APIs and interacts with S3, as well as a datastore service that, in turn, interacts with an Aurora PostgreSQL database instance. The services are instrumented with the X-Ray SDK. The ADOT Collector is launched in Deployment mode by deploying an OpenTelemetryCollector custom resource using the YAML manifest that was discussed in the last section. A Postman client is used as an external traffic generator, targeting the front-end service.

![Tracing-3](../../../../images/Containers/aws-native/eks/tracing-3.jpg)

*Figure: End-to-end test of traces collection.*

The following image shows the service graph generated by X-Ray using the segment data captured from the services, with the average response latency for each segment.

![Tracing-4](../../../../images/Containers/aws-native/eks/tracing-4.jpg)

*Figure: CloudWatch Service Map console.*

Please see [Traces pipeline with OTLP Receiver and AWS X-Ray Exporter sending traces to AWS X-Ray](https://github.com/aws-observability/aws-otel-community/blob/master/sample-configs/operator/collector-config-xray.yaml) for OpenTelemetryCollector custom resource definitions that pertain to traces pipeline configurations. Customers who want to use the ADOT Collector in conjunction with AWS X-Ray may start with these configuration templates, replace the placeholder variables with values based on their target environments, and quickly deploy the collector to their Amazon EKS clusters using the EKS add-on for ADOT.


### Using EKS Blueprints to set up container tracing with AWS X-Ray

[EKS Blueprints](https://aws.amazon.com/blogs/containers/bootstrapping-clusters-with-eks-blueprints/) is a collection of Infrastructure as Code (IaC) modules that help you configure and deploy consistent, batteries-included EKS clusters across accounts and regions.
You can use EKS Blueprints to easily bootstrap an EKS cluster with [Amazon EKS add-ons](https://docs.aws.amazon.com/eks/latest/userguide/eks-add-ons.html) as well as a wide range of popular open-source add-ons, including Prometheus, Karpenter, Nginx, Traefik, AWS Load Balancer Controller, Container Insights, Fluent Bit, Keda, Argo CD, and more. EKS Blueprints is implemented in two popular IaC frameworks, [HashiCorp Terraform](https://github.com/aws-ia/terraform-aws-eks-blueprints) and [AWS Cloud Development Kit (AWS CDK)](https://github.com/aws-quickstart/cdk-eks-blueprints), which help you automate infrastructure deployments.

As part of your Amazon EKS cluster creation process using EKS Blueprints, you can set up AWS X-Ray as Day 2 operational tooling to collect and analyze traces from your containerized applications and microservices.

## Conclusion

In this section of the Observability best practices guide, we learned about using AWS X-Ray to trace your containerized applications on Amazon EKS by collecting traces with Amazon EKS add-ons for AWS Distro for OpenTelemetry. For further learning, please see [Metrics and traces collection using Amazon EKS add-ons for AWS Distro for OpenTelemetry to Amazon Managed Service for Prometheus and Amazon CloudWatch](https://aws.amazon.com/blogs/containers/metrics-and-traces-collection-using-amazon-eks-add-ons-for-aws-distro-for-opentelemetry/). Finally, we briefly discussed how to use EKS Blueprints to set up container tracing with AWS X-Ray during the Amazon EKS cluster creation process. For a deeper dive, we highly recommend that you complete the X-Ray Traces module under the **AWS native** observability category of the AWS [One Observability Workshop](https://catalog.workshops.aws/observability/en-US).

diff --git a/docusaurus/observability-best-practices/docs/guides/containers/aws-native/eks/eks-api-server-monitoring.md b/docusaurus/observability-best-practices/docs/guides/containers/aws-native/eks/eks-api-server-monitoring.md
new file mode 100644
index 000000000..9467f8dcf
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/guides/containers/aws-native/eks/eks-api-server-monitoring.md
@@ -0,0 +1,204 @@
# Amazon EKS API Server Monitoring

In this section of the Observability best practices guide, we will deep dive into the following topics related to API server monitoring:

* Introduction to Amazon EKS API Server Monitoring
* Setting up an API Server Troubleshooter Dashboard
* Using API Troubleshooter Dashboard to Understand API Server Problems
* Understanding Unbounded list calls to the API Server
* Stopping bad behavior to the API Server
* API Priority and Fairness
* Identifying slowest API calls and API Server Latency Issues

### Introduction

Monitoring your Amazon EKS managed control plane is a very important Day 2 operational activity for proactively identifying issues with the health of your EKS cluster. Amazon EKS control plane monitoring helps you take proactive measures based on the collected metrics. These metrics help us troubleshoot the API servers and pinpoint problems under the hood.

In this section we will use Amazon Managed Service for Prometheus (AMP) for Amazon EKS API server monitoring and Amazon Managed Grafana (AMG) for metric visualization. Prometheus is a popular open-source monitoring tool that provides powerful querying features and has wide support for a variety of workloads.
Amazon Managed Service for Prometheus is a fully managed Prometheus-compatible service that makes it easier to monitor environments, such as Amazon EKS, [Amazon Elastic Container Service (Amazon ECS)](http://aws.amazon.com/ecs), and [Amazon Elastic Compute Cloud (Amazon EC2)](http://aws.amazon.com/ec2), securely and reliably. [Amazon Managed Grafana](https://aws.amazon.com/grafana/) is a fully managed and secure data visualization service for open-source Grafana that enables customers to instantly query, correlate, and visualize operational metrics, logs, and traces for their applications from multiple data sources.

We will first set up a starter dashboard using Amazon Managed Service for Prometheus and Amazon Managed Grafana to help you troubleshoot [Amazon Elastic Kubernetes Service (Amazon EKS)](https://aws.amazon.com/eks) API servers with Prometheus. In the upcoming sections we will dive deep into understanding problems encountered while troubleshooting the EKS API servers, API priority and fairness, and stopping bad behavior. Finally, we will identify the slowest API calls and API server latency issues, which helps us take action to keep our Amazon EKS cluster healthy.

### Setting up an API Server Troubleshooter Dashboard

We will set up a starter dashboard to help you troubleshoot [Amazon Elastic Kubernetes Service (Amazon EKS)](https://aws.amazon.com/eks) API servers with AMP. We will use this to help you understand the metrics while troubleshooting your production EKS clusters, and then focus on the collected metrics to understand their importance when troubleshooting your Amazon EKS clusters.

First, set up an [ADOT collector to collect metrics from your Amazon EKS cluster into Amazon Managed Service for Prometheus](https://aws.amazon.com/blogs/containers/metrics-and-traces-collection-using-amazon-eks-add-ons-for-aws-distro-for-opentelemetry/). In this setup you will use the EKS ADOT add-on, which allows users to enable ADOT as an add-on at any time after the EKS cluster is up and running. The ADOT add-on includes the latest security patches and bug fixes and is validated by AWS to work with Amazon EKS. This setup will show you how to install the ADOT add-on in an EKS cluster and then use it to collect metrics from your cluster.

Next, [set up your Amazon Managed Grafana workspace to visualize the metrics using AMP](https://aws.amazon.com/blogs/mt/amazon-managed-grafana-getting-started/) as the data source you configured in the first step. Finally, download the [API troubleshooter dashboard](https://github.com/RiskyAdventure/Troubleshooting-Dashboards/blob/main/api-troubleshooter.json) JSON and upload it to Amazon Managed Grafana to visualize the metrics for further troubleshooting.

### Using API Troubleshooter Dashboard to Understand Problems

Let’s say you found an interesting open-source project that you wanted to install in your cluster. That operator deploys a DaemonSet to your cluster that might be using malformed requests, a needlessly high volume of LIST calls, or maybe each of its DaemonSet pods across all 1,000 of your nodes requests the status of all 50,000 pods on your cluster every minute!
Does this really happen often? Yes, it does! Let’s take a quick detour on how that happens.

#### Understanding LIST vs. WATCH

Some applications need to understand the state of the objects in your cluster.
For example, your machine learning (ML) application wants to know the job status by understanding how many pods are not in the *Completed* status. In Kubernetes, there are well-behaved ways to do this with something called a WATCH, and some not-so-well-behaved ways that list every object on the cluster to find the latest status of those pods.

#### A well-behaved WATCH

Using a WATCH, or a single long-lived connection to receive updates via a push model, is the most scalable way to receive updates in Kubernetes. To oversimplify, we ask for the full state of the system, then only update the object in a cache when changes are received for that object, periodically running a re-sync to ensure that no updates were missed.

In the below image we use the `apiserver_longrunning_gauge` metric to get an idea of the number of these long-lived connections across both API servers.

![API-MON-1](../../../../images/Containers/aws-native/eks/api-mon-1.jpg)

*Figure: `apiserver_longrunning_gauge` metric*

Even with this efficient system, we can still have too much of a good thing. For example, if we use many very small nodes, each using two or more DaemonSets that need to talk to the API server, it is quite easy to dramatically increase the number of WATCH calls on the system unnecessarily. To illustrate, let’s look at the difference between eight xlarge nodes vs. a single 8xlarge. Here we see an 8x increase in WATCH calls on the system.

![API-MON-2](../../../../images/Containers/aws-native/eks/api-mon-2.jpg)

*Figure: WATCH calls across eight xlarge nodes.*

Now these are efficient calls, but what if instead they were the ill-behaved calls we alluded to earlier? Imagine if one of the above DaemonSets on each of the 1,000 nodes is requesting updates on each of the total 50,000 pods in the cluster. We will explore this idea of an unbounded list call in the next section.

A quick word of caution before continuing: the type of consolidation in the above example must be done with great care and has many other factors to consider, from the delay caused by many threads competing for a limited number of CPUs on the system, to pod churn rate, to the maximum number of volume attachments a node can safely handle. However, our focus will be on the metrics that lead us to actionable steps that can prevent issues from happening, and maybe give us new insight into our designs.

The WATCH metric is a simple one, but it can be used to track and reduce the number of watches, if that is a problem for you. Here are a few options you could consider to reduce this number:

* Limit the number of ConfigMaps Helm creates to track history
* Use immutable ConfigMaps and Secrets, which do not use a WATCH
* Use sensible node sizing and consolidation

### Understanding Unbounded list calls to the API Server

Now for the LIST call we have been talking about. A LIST call pulls the full history of our Kubernetes objects each time we need to understand an object’s state; nothing is saved in a cache this time.

How impactful is all this? That will vary depending on how many agents are requesting data, how often they are doing so, and how much data they are requesting. Are they asking for everything on the cluster, or just a single namespace? Does that happen every minute, on every node? Let’s use the example of a logging agent that appends Kubernetes metadata to every log sent from a node. This could be an overwhelming amount of data in larger clusters.
There are many ways for the agent to get that data via a LIST call, so let’s look at a few.

The below request asks for pods from a specific namespace.

`/api/v1/namespaces/my-namespace/pods`

Next, we request all 50,000 pods on the cluster, but in chunks of 500 pods at a time.

`/api/v1/pods?limit=500`

The next call is the most disruptive: fetching all 50,000 pods on the entire cluster at the same time.

`/api/v1/pods`

This happens quite commonly in the field and can be seen in the logs.

### Stopping bad behavior to the API Server

How can we protect our cluster from such bad behavior? Before Kubernetes 1.20, the API server would protect itself by limiting the number of *inflight* requests processed per second. Since etcd can only handle so many requests at one time in a performant way, we need to ensure the number of requests is limited to a value per second that keeps etcd reads and writes in a reasonable latency band. Unfortunately, at the time of this writing, there is no dynamic way to do this.

In the below chart we see a breakdown of read requests, which have a default maximum of 400 inflight requests per API server, alongside a default maximum of 200 concurrent write requests. In a default EKS cluster you will see two API servers, for a total of 800 reads and 400 writes. However, caution is advised, as these servers can have asymmetric loads at different times, such as right after an upgrade.

![API-MON-3](../../../../images/Containers/aws-native/eks/api-mon-3.jpg)

*Figure: Grafana chart with breakdown of read requests.*

It turns out that the above was not a perfect scheme. For example, how could we keep this badly behaving new operator we just installed from taking up all the inflight write requests on the API server and potentially delaying important requests such as node keepalive messages?

### API Priority and Fairness

Instead of worrying about how many read/write requests were open per second, what if we treated the capacity as one total number, and each application on the cluster got a fair percentage or share of that total maximum number?

To do that effectively, we would need to identify who sent the request to the API server, then give that request a name tag of sorts. With this new name tag, we can see that all these requests are coming from a new agent we will call “Chatty.” Now we can group all of Chatty’s requests into something called a *flow*, which identifies those requests as coming from the same DaemonSet. This concept gives us the ability to restrict this bad agent and ensure it does not consume the whole cluster.

However, not all requests are created equal. The control plane traffic that is needed to keep the cluster operational should be a higher priority than our new operator. This is where the idea of priority levels comes into play. What if, by default, we had several “buckets” or queues for critical, high, and low priority traffic? We do not want the chatty agent flow getting a fair share of traffic in the critical traffic queue. We can, however, put that traffic in a low priority queue so that flow competes with other chatty agents. We would then want to ensure that each priority level has the right number of shares, or percentage of the overall maximum the API server can handle, so that requests are not delayed for too long.
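To make the flow and priority level ideas concrete, here is an illustrative sketch of the Kubernetes API Priority and Fairness objects involved. The `chatty-agent` service account and its namespace are hypothetical, `workload-low` is one of the priority levels that ships with Kubernetes by default, and the exact `apiVersion` depends on your Kubernetes version:

```
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: FlowSchema
metadata:
  name: restrict-chatty-agent
spec:
  priorityLevelConfiguration:
    name: workload-low            # send this flow to the built-in low-priority queue
  matchingPrecedence: 1000        # schemas with lower values are evaluated first
  distinguisherMethod:
    type: ByUser                  # each user or service account becomes its own flow
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: chatty-agent    # hypothetical noisy DaemonSet's service account
            namespace: monitoring
      resourceRules:
        - verbs: ["get", "list", "watch"]
          apiGroups: [""]
          resources: ["pods"]
          namespaces: ["*"]
```

With a schema like this in place, requests from the noisy agent queue up in the `workload-low` priority level instead of competing with cluster-critical traffic, which is exactly the behavior the dashboard panels below let you observe.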
#### Priority and fairness in action

Since this is a relatively new feature, many existing dashboards will use the older model of maximum inflight reads and maximum inflight writes. Why can this be problematic?

What if we were giving high priority name tags to everything in the kube-system namespace, but we then installed that bad agent into that important namespace, or even simply deployed too many applications in that namespace? We could end up having the same problem we were trying to avoid! So it is best to keep a close eye on such situations.

I have broken out some of the metrics I find most interesting for tracking these kinds of issues:

* What percentage of a priority group’s shares is used?
* What is the longest time a request waited in a queue?
* Which flow is using the most shares?
* Are there unexpected delays on the system?

#### Percent in use

Here we see the different default priority groups on the cluster and what percentage of the maximum is used.

![API-MON-4](../../../../images/Containers/aws-native/eks/api-mon-4.jpg)

*Figure: Priority groups on the cluster.*

#### Time request was in queue

How long, in seconds, the request sat in the priority queue before being processed.

![API-MON-5](../../../../images/Containers/aws-native/eks/api-mon-5.jpg)

*Figure: Time the request was in the priority queue.*

#### Top executed requests by flow

Which flow is taking up the most shares?

![API-MON-6](../../../../images/Containers/aws-native/eks/api-mon-6.jpg)

*Figure: Top executed requests by flow.*

#### Request Execution Time

Are there any unexpected delays in processing?

![API-MON-7](../../../../images/Containers/aws-native/eks/api-mon-7.jpg)

*Figure: Flow control request execution time.*

### Identifying slowest API calls and API Server Latency Issues

Now that we understand the nature of the things that cause API latency, we can take a step back and look at the big picture. It’s important to remember that our dashboard designs are simply trying to give us a quick snapshot of whether there is a problem we should be investigating. For detailed analysis, we would use ad-hoc queries with PromQL or, better yet, logging queries.

What are some ideas for the high-level metrics we would want to look at?

* What API call is taking the most time to complete?
    * What is the call doing? (Listing objects, deleting them, etc.)
    * What objects is it trying to do that operation on? (Pods, Secrets, ConfigMaps, etc.)
* Is there a latency problem on the API server itself?
    * Is there a delay in one of my priority queues causing a backup in requests?
* Does it just look like the API server is slow because the etcd server is experiencing latency?

#### Slowest API call

In the below chart we are looking for the API calls that took the most time to complete for that period. In this case we see that a custom resource definition (CRD) is making a LIST call that is the most latent call during the 05:40 time frame. Armed with this data, we can use CloudWatch Logs Insights to pull LIST requests from the audit log in that time frame to see which application this might be.

![API-MON-8](../../../../images/Containers/aws-native/eks/api-mon-8.jpg)

*Figure: Top 5 slowest API calls.*

#### API Request Duration

This API latency chart helps us understand whether any requests are approaching the timeout value of one minute. I like the histogram-over-time format below, as I can see outliers in the data that a line graph would hide.
![API-MON-9](../../../../images/Containers/aws-native/eks/api-mon-9.jpg)

*Figure: API Request duration heatmap.*

Simply hovering over a bucket shows us the exact number of calls that took around 25 milliseconds.

*Figure: Calls over 25 milliseconds.*

This concept is important when we are working with other systems that cache requests. Cached requests will be fast; we do not want to merge those request latencies with slower requests. Here we can see two distinct bands of latency: requests that have been cached, and those that have not.

![API-MON-10](../../../../images/Containers/aws-native/eks/api-mon-10.jpg)

*Figure: Latency of cached vs. uncached requests.*

#### ETCD Request Duration

etcd latency is one of the most important factors in Kubernetes performance. Amazon EKS allows you to see this performance from the API server’s perspective by looking at the `request_duration_seconds_bucket` metric.

![API-MON-11](../../../../images/Containers/aws-native/eks/api-mon-11.jpg)

*Figure: `request_duration_seconds_bucket` metric.*

We can now start to put the things we learned together by seeing whether certain events are correlated. In the below chart we see API server latency, but we also see that much of this latency is coming from the etcd server. Being able to quickly move to the right problem area with just a glance is what makes a dashboard powerful.

![API-MON-12](../../../../images/Containers/aws-native/eks/api-mon-12.jpg)

*Figure: etcd requests.*

## Conclusion

In this section of the Observability best practices guide, we used a [starter dashboard](https://github.com/RiskyAdventure/Troubleshooting-Dashboards/blob/main/api-troubleshooter.json) built on Amazon Managed Service for Prometheus and Amazon Managed Grafana to help you troubleshoot [Amazon Elastic Kubernetes Service (Amazon EKS)](https://aws.amazon.com/eks) API servers. We then dove deeper into understanding problems encountered while troubleshooting the EKS API servers, API priority and fairness, and stopping bad behavior. Finally, we identified the slowest API calls and API server latency issues, which helps us take action to keep our Amazon EKS cluster healthy. For a deeper dive, we highly recommend that you complete the Application Monitoring module under the AWS native observability category of the AWS [One Observability Workshop](https://catalog.workshops.aws/observability/en-US).

diff --git a/docusaurus/observability-best-practices/docs/guides/containers/aws-native/eks/log-aggregation.md b/docusaurus/observability-best-practices/docs/guides/containers/aws-native/eks/log-aggregation.md
new file mode 100644
index 000000000..1795ce97e
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/guides/containers/aws-native/eks/log-aggregation.md
@@ -0,0 +1,367 @@
# Log Aggregation

In this section of the Observability best practices guide, we will deep dive into the following topics related to Amazon EKS logging with AWS native services:

* Introduction to Amazon EKS logging
* Amazon EKS control plane logging
* Amazon EKS data plane logging
* Amazon EKS application logging
* Unified log aggregation from Amazon EKS and other compute platforms using AWS Native services
* Conclusion

### Introduction

Amazon EKS logging can be divided into three types: control plane logging, node logging, and application logging.
The [Kubernetes control plane](https://kubernetes.io/docs/concepts/overview/components/#control-plane-components) is a set of components that manage Kubernetes clusters and produce logs used for auditing and diagnostic purposes. With Amazon EKS, you can [turn on logs for different control plane components](https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html) and send them to CloudWatch.

Kubernetes also runs system components such as `kubelet` and `kube-proxy` on each Kubernetes node that runs your pods. These components write logs within each node, and you can configure CloudWatch and Container Insights to capture these logs for each Amazon EKS node.

Containers are grouped as [pods](https://kubernetes.io/docs/concepts/workloads/pods/) within a Kubernetes cluster and are scheduled to run on your Kubernetes nodes. Most containerized applications write to standard output and standard error, and the container engine redirects the output to a logging driver. In Kubernetes, the container logs are found in the `/var/log/pods` directory on a node. You can configure CloudWatch and Container Insights to capture these logs for each of your Amazon EKS pods.

There are three common approaches for shipping container logs to a centralized log aggregation system in Kubernetes:

* Node-level agent, like a [Fluentd daemonset](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-logs.html). This is the recommended pattern.
* Sidecar container, like a Fluentd sidecar container.
* Directly writing to a log collection system. In this approach, the application is responsible for shipping the logs. This is the least recommended option because you have to include the log aggregation system’s SDK in your application code instead of reusing community-built solutions like Fluentd. This pattern also disobeys the *principle of separation of concerns*, according to which the logging implementation should be independent of the application. Keeping them separate allows you to change the logging infrastructure without impacting or changing your application.

We will now dive into each of these logging categories for Amazon EKS and also discuss unified log aggregation from Amazon EKS and other compute platforms.

### Amazon EKS control plane logging

An Amazon EKS cluster consists of a high-availability, single-tenant control plane for your Kubernetes cluster and the Amazon EKS nodes that run your containers. The control plane nodes run in an account managed by AWS. The Amazon EKS cluster control plane nodes are integrated with CloudWatch, and you can turn on logging for specific control plane components. Logs are provided for each Kubernetes control plane component instance. AWS manages the health of your control plane nodes and provides a [service-level agreement (SLA) for the Kubernetes endpoint](http://aws.amazon.com/eks/sla/).

[Amazon EKS control plane logging](https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html) consists of the following cluster control plane log types. Each log type corresponds to a component of the Kubernetes control plane. To learn more about these components, see [Kubernetes Components](https://kubernetes.io/docs/concepts/overview/components/) in the Kubernetes documentation.

* **API server (`api`)** – Your cluster's API server is the control plane component that exposes the Kubernetes API.
If you enable API server logs when you launch the cluster, or shortly thereafter, the logs include the API server flags that were used to start the API server. For more information, see [`kube-apiserver`](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/) and the [audit policy](https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/configure-helper.sh#L1129-L1255) in the Kubernetes documentation.
* **Audit (`audit`)** – Kubernetes audit logs provide a record of the individual users, administrators, or system components that have affected your cluster. For more information, see [Auditing](https://kubernetes.io/docs/tasks/debug-application-cluster/audit/) in the Kubernetes documentation.
* **Authenticator (`authenticator`)** – Authenticator logs are unique to Amazon EKS. These logs represent the control plane component that Amazon EKS uses for Kubernetes [Role Based Access Control](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) (RBAC) authentication using IAM credentials. For more information, see [Cluster management](https://docs.aws.amazon.com/eks/latest/userguide/eks-managing.html).
* **Controller manager (`controllerManager`)** – The controller manager manages the core control loops that are shipped with Kubernetes. For more information, see [kube-controller-manager](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/) in the Kubernetes documentation.
* **Scheduler (`scheduler`)** – The scheduler component manages when and where to run pods in your cluster. For more information, see [kube-scheduler](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/) in the Kubernetes documentation.

Please follow the [enabling and disabling control plane logs](https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html#:~:text=the%20Kubernetes%20documentation.-,Enabling%20and%20disabling%20control%20plane%20logs,-By%20default%2C%20cluster) section to enable control plane logs via the AWS console or the AWS CLI.

#### Querying control plane logs from the CloudWatch console

After you enable control plane logging on your Amazon EKS cluster, you can find the EKS control plane logs in the `/aws/eks/cluster-name/cluster` log group. For more information, see [Viewing cluster control plane logs](https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html#viewing-control-plane-logs). Please make sure to replace `cluster-name` with your cluster's name.

You can use CloudWatch Logs Insights to search through the EKS control plane log data. For more information, see [Analyzing log data with CloudWatch Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html). It is important to note that you can view log events in CloudWatch Logs only after you turn on control plane logging in a cluster. Before you select a time range to run queries in CloudWatch Logs Insights, verify that you turned on control plane logging. The screenshot below shows an example of an EKS control plane log query with its output.

![LOG-AGGREG-1](../../../../images/Containers/aws-native/eks/log-aggreg-1.jpg)

*Figure: CloudWatch Logs Insights.*

#### Sample queries for common EKS use cases on CloudWatch Logs Insights

To find the cluster creator, search for the IAM entity that's mapped to the **kubernetes-admin** user.
+ +``` +fields @logStream, @timestamp, @message| sort @timestamp desc +| filter @logStream like /authenticator/ +| filter @message like "username=kubernetes-admin" +| limit 50 +``` + +Example output: + +``` + +@logStream, @timestamp @messageauthenticator-71976 ca11bea5d3083393f7d32dab75b,2021-08-11-10:09:49.020,"time=""2021-08-11T10:09:43Z"" level=info msg=""access granted"" arn=""arn:aws:iam::12345678910:user/awscli"" client=""127.0.0.1:51326"" groups=""[system:masters]"" method=POST path=/authenticate sts=sts.eu-west-1.amazonaws.com uid=""heptio-authenticator-aws:12345678910:ABCDEFGHIJKLMNOP"" username=kubernetes-admin" +``` + +In this output, IAM user **arn:aws:iam::[12345678910](tel:12345678910):user/awscli** is mapped to user **kubernetes-admin**. + +To find requests that a specific user performed, search for operations that the **kubernetes-admin** user performed. + +``` + +fields @logStream, @timestamp, @message| filter @logStream like /^kube-apiserver-audit/ +| filter strcontains(user.username,"kubernetes-admin") +| sort @timestamp desc +| limit 50 +``` + +Example output: + +``` + +@logStream,@timestamp,@messagekube-apiserver-audit-71976ca11bea5d3083393f7d32dab75b,2021-08-11 09:29:13.095,"{...""requestURI"":""/api/v1/namespaces/kube-system/endpoints?limit=500";","string""verb"":""list"",""user"":{""username"":""kubernetes-admin"",""uid"":""heptio-authenticator-aws:12345678910:ABCDEFGHIJKLMNOP"",""groups"":[""system:masters"",""system:authenticated""],""extra"":{""accessKeyId"":[""ABCDEFGHIJKLMNOP""],""arn"":[""arn:aws:iam::12345678910:user/awscli""],""canonicalArn"":[""arn:aws:iam::12345678910:user/awscli""],""sessionName"":[""""]}},""sourceIPs"":[""12.34.56.78""],""userAgent"":""kubectl/v1.22.0 (darwin/amd64) kubernetes/c2b5237"",""objectRef"":{""resource"":""endpoints"",""namespace"":""kube-system"",""apiVersion"":""v1""}...}" +``` + +To find API calls that a specific userAgent made, you can use this example query: + +``` + +fields @logStream, @timestamp, userAgent, verb, requestURI, @message| filter @logStream like /kube-apiserver-audit/ +| filter userAgent like /kubectl\/v1.22.0/ +| sort @timestamp desc +| filter verb like /(get)/ +``` + +Shortened example output: + +``` + +@logStream,@timestamp,userAgent,verb,requestURI,@messagekube-apiserver-audit-71976ca11bea5d3083393f7d32dab75b,2021-08-11 14:06:47.068,kubectl/v1.22.0 (darwin/amd64) kubernetes/c2b5237,get,/apis/metrics.k8s.io/v1beta1?timeout=32s,"{""kind"":""Event"",""apiVersion"":""audit.k8s.io/v1"",""level"":""Metadata"",""auditID"":""863d9353-61a2-4255-a243-afaeb9183524"",""stage"":""ResponseComplete"",""requestURI"":""/apis/metrics.k8s.io/v1beta1?timeout=32s"",""verb"":""get"",""user"":{""username"":""kubernetes-admin"",""uid"":""heptio-authenticator-aws:12345678910:AIDAUQGC5HFOHXON7M22F"",""groups"":[""system:masters"",""system:authenticated""],""extra"":{""accessKeyId"":[""ABCDEFGHIJKLMNOP""],""arn"":[""arn:aws:iam::12345678910:user/awscli""],""canonicalArn"":[""arn:aws:iam::12345678910:user/awscli""],""sourceIPs"":[""12.34.56.78""],""userAgent"":""kubectl/v1.22.0 (darwin/amd64) kubernetes/c2b5237""...}" +``` + +To find mutating changes made to the **aws-auth** ConfigMap, you can use this example query: + +``` + +fields @logStream, @timestamp, @message| filter @logStream like /^kube-apiserver-audit/ +| filter requestURI like /\/api\/v1\/namespaces\/kube-system\/configmaps/ +| filter objectRef.name = "aws-auth" +| filter verb like /(create|delete|patch)/ +| sort @timestamp desc +| limit 50 +``` + +Shortened example 
output: + +``` + +@logStream,@timestamp,@messagekube-apiserver-audit-f01c77ed8078a670a2eb63af6f127163,2021-10-27 05:43:01.850,{""kind"":""Event"",""apiVersion"":""audit.k8s.io/v1"",""level"":""RequestResponse"",""auditID"":""8f9a5a16-f115-4bb8-912f-ee2b1d737ff1"",""stage"":""ResponseComplete"",""requestURI"":""/api/v1/namespaces/kube-system/configmaps/aws-auth?timeout=19s"",""verb"":""patch"",""responseStatus"": {""metadata"": {},""code"": 200 },""requestObject"": {""data"": { contents of aws-auth ConfigMap } },""requestReceivedTimestamp"":""2021-10-27T05:43:01.033516Z"",""stageTimestamp"":""2021-10-27T05:43:01.042364Z"" } +``` + +To find requests that were denied, you can use this example query: + +``` + +fields @logStream, @timestamp, @message| filter @logStream like /^authenticator/ +| filter @message like "denied" +| sort @timestamp desc +| limit 50 +``` + +Example output: + +``` + +@logStream,@timestamp,@messageauthenticator-8c0c570ea5676c62c44d98da6189a02b,2021-08-08 20:04:46.282,"time=""2021-08-08T20:04:44Z"" level=warning msg=""access denied"" client=""127.0.0.1:52856"" error=""sts getCallerIdentity failed: error from AWS (expected 200, got 403)"" method=POST path=/authenticate" +``` + +To find the node that a pod was scheduled on, query the **kube-scheduler** logs. + +``` + +fields @logStream, @timestamp, @message| sort @timestamp desc +| filter @logStream like /kube-scheduler/ +| filter @message like "aws-6799fc88d8-jqc2r" +| limit 50 +``` + +Example output: + +``` + +@logStream,@timestamp,@messagekube-scheduler-bb3ea89d63fd2b9735ba06b144377db6,2021-08-15 12:19:43.000,"I0915 12:19:43.933124 1 scheduler.go:604] ""Successfully bound pod to node"" pod=""kube-system/aws-6799fc88d8-jqc2r"" node=""ip-192-168-66-187.eu-west-1.compute.internal"" evaluatedNodes=3 feasibleNodes=2" +``` + +In this example output, pod **aws-6799fc88d8-jqc2r** was scheduled on node **ip-192-168-66-187.eu-west-1.compute.internal**. + +To find HTTP 5xx server errors for Kubernetes API server requests, you can use this example query: + +``` + +fields @logStream, @timestamp, responseStatus.code, @message| filter @logStream like /^kube-apiserver-audit/ +| filter responseStatus.code >= 500 +| limit 50 +``` + +Shortened example output: + +``` + +@logStream,@timestamp,responseStatus.code,@messagekube-apiserver-audit-4d5145b53c40d10c276ad08fa36d1f11,2021-08-04 07:22:06.518,503,"...""requestURI"":""/apis/metrics.k8s.io/v1beta1?timeout=32s"",""verb"":""get"",""user"":{""username"":""system:serviceaccount:kube-system:resourcequota-controller"",""uid"":""36d9c3dd-f1fd-4cae-9266-900d64d6a754"",""groups"":[""system:serviceaccounts"",""system:serviceaccounts:kube-system"",""system:authenticated""]},""sourceIPs"":[""12.34.56.78""],""userAgent"":""kube-controller-manager/v1.21.2 (linux/amd64) kubernetes/d2965f0/system:serviceaccount:kube-system:resourcequota-controller"",""responseStatus"":{""metadata"":{},""code"":503},..."}}" +``` + +To troubleshoot a CronJob activation, search for API calls that the **cronjob-controller** made. 
+ +``` + +fields @logStream, @timestamp, @message| filter @logStream like /kube-apiserver-audit/ +| filter user.username like "system:serviceaccount:kube-system:cronjob-controller" +| display @logStream, @timestamp, @message, objectRef.namespace, objectRef.name +| sort @timestamp desc +| limit 50 +``` + +Shortened example output: + +``` + +{ "kind": "Event", "apiVersion": "audit.k8s.io/v1", "objectRef": { "resource": "cronjobs", "namespace": "default", "name": "hello", "apiGroup": "batch", "apiVersion": "v1" }, "responseObject": { "kind": "CronJob", "apiVersion": "batch/v1", "spec": { "schedule": "*/1 * * * *" }, "status": { "lastScheduleTime": "2021-08-09T07:19:00Z" } } } +``` + +In this example output, the **hello** job in the **default** namespace runs every minute and was last scheduled at **2021-08-09T07:19:00Z**. + +To find API calls that the **replicaset-controller** made, you can use this example query: + +``` + +fields @logStream, @timestamp, @message| filter @logStream like /kube-apiserver-audit/ +| filter user.username like "system:serviceaccount:kube-system:replicaset-controller" +| display @logStream, @timestamp, requestURI, verb, user.username +| sort @timestamp desc +| limit 50 +``` + +Example output: + +``` + +@logStream,@timestamp,requestURI,verb,user.usernamekube-apiserver-audit-8c0c570ea5676c62c44d98da6189a02b,2021-08-10 17:13:53.281,/api/v1/namespaces/kube-system/pods,create,system:serviceaccount:kube-system:replicaset-controller +kube-apiserver-audit-4d5145b53c40d10c276ad08fa36d1f11,2021-08-04 0718:44.561,/apis/apps/v1/namespaces/kube-system/replicasets/coredns-6496b6c8b9/status,update,system:serviceaccount:kube-system:replicaset-controller +``` + +To find operations that are made against a Kubernetes resource, you can use this example query: + +``` + +fields @logStream, @timestamp, @message| filter @logStream like /^kube-apiserver-audit/ +| filter verb == "delete" and requestURI like "/api/v1/namespaces/default/pods/my-app" +| sort @timestamp desc +| limit 10 +``` + +The preceding example query filters for **delete** API calls on the **default** namespace for pod **my-app**. +Shortened example output: + +``` + +@logStream,@timestamp,@messagekube-apiserver-audit-e7b3cb08c0296daf439493a6fc9aff8c,2021-08-11 14:09:47.813,"...""requestURI"":""/api/v1/namespaces/default/pods/my-app"",""verb"":""delete"",""user"":{""username""""kubernetes-admin"",""uid"":""heptio-authenticator-aws:12345678910:ABCDEFGHIJKLMNOP"",""groups"":[""system:masters"",""system:authenticated""],""extra"":{""accessKeyId"":[""ABCDEFGHIJKLMNOP""],""arn"":[""arn:aws:iam::12345678910:user/awscli""],""canonicalArn"":[""arn:aws:iam::12345678910:user/awscli""],""sessionName"":[""""]}},""sourceIPs"":[""12.34.56.78""],""userAgent"":""kubectl/v1.22.0 (darwin/amd64) kubernetes/c2b5237"",""objectRef"":{""resource"":""pods"",""namespace"":""default"",""name"":""my-app"",""apiVersion"":""v1""},""responseStatus"":{""metadata"":{},""code"":200},""requestObject"":{""kind"":""DeleteOptions"",""apiVersion"":""v1"",""propagationPolicy"":""Background""}, +..." 
+``` + +To retrieve a count of HTTP response codes for calls made to the Kubernetes API server, you can use this example query: + +``` + +fields @logStream, @timestamp, @message| filter @logStream like /^kube-apiserver-audit/ +| stats count(*) as count by responseStatus.code +| sort count desc +``` + +Example output: + +``` + +responseStatus.code,count200,35066 +201,525 +403,125 +404,116 +101,2 +``` + +To find changes that are made to DaemonSets/Addons in the **kube-system** namespace, you can use this example query: + +``` + +filter @logStream like /^kube-apiserver-audit/| fields @logStream, @timestamp, @message +| filter verb like /(create|update|delete)/ and strcontains(requestURI,"/apis/apps/v1/namespaces/kube-system/daemonsets") +| sort @timestamp desc +| limit 50 +``` + +Example output: + +``` + +{ "kind": "Event", "apiVersion": "audit.k8s.io/v1", "level": "RequestResponse", "auditID": "93e24148-0aa6-4166-8086-a689b0031612", "stage": "ResponseComplete", "requestURI": "/apis/apps/v1/namespaces/kube-system/daemonsets/aws-node?fieldManager=kubectl-set", "verb": "patch", "user": { "username": "kubernetes-admin", "groups": [ "system:masters", "system:authenticated" ] }, "userAgent": "kubectl/v1.22.2 (darwin/amd64) kubernetes/8b5a191", "objectRef": { "resource": "daemonsets", "namespace": "kube-system", "name": "aws-node", "apiGroup": "apps", "apiVersion": "v1" }, "requestObject": { "REDACTED": "REDACTED" }, "requestReceivedTimestamp": "2021-08-09T08:07:21.868376Z", "stageTimestamp": "2021-08-09T08:07:21.883489Z", "annotations": { "authorization.k8s.io/decision": "allow", "authorization.k8s.io/reason": "" } } +``` + +In this example output, the **kubernetes-admin** user used **kubectl** v1.22.2 to patch the **aws-node** DaemonSet. + +To find the user that deleted a node, you can use this example query: + +``` + +fields @logStream, @timestamp, @message| filter @logStream like /^kube-apiserver-audit/ +| filter verb == "delete" and requestURI like "/api/v1/nodes" +| sort @timestamp desc +| limit 10 +``` + +Shortened example output: + +``` + +@logStream,@timestamp,@messagekube-apiserver-audit-e503271cd443efdbd2050ae8ca0794eb,2022-03-25 07:26:55.661,"{"kind":"Event"," +``` + +Finally, if you have started using control plane logging feature, we would highly recommend you to learn more about [Understanding and Cost Optimizing Amazon EKS Control Plane Logs](https://aws.amazon.com/blogs/containers/understanding-and-cost-optimizing-amazon-eks-control-plane-logs/). + +### Amazon EKS data plane logging + +We recommend that you use [CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-logs.html) to capture logs and metrics for Amazon EKS. [Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html) implements cluster, node, and pod-level metrics with the CloudWatch agent, and [Fluent Bit](https://fluentbit.io/) or [Fluentd](https://www.fluentd.org/) for log capture to CloudWatch. Container Insights also provides automatic dashboards with layered views of your captured CloudWatch metrics. Container Insights is deployed as CloudWatch DaemonSet and Fluent Bit DaemonSet that runs on every Amazon EKS node. Fargate nodes are not supported by Container Insights because the nodes are managed by AWS and don’t support DaemonSets. Fargate logging for Amazon EKS is covered separately in this guide. 

The following table shows the CloudWatch log groups and logs captured by the [default Fluentd or Fluent Bit log capture configuration](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-logs-FluentBit.html) for Amazon EKS.

|Log group |Log sources |
|--- |--- |
|`/aws/containerinsights/Cluster_Name/host` |Logs from `/var/log/dmesg`, `/var/log/secure`, and `/var/log/messages`. |
|`/aws/containerinsights/Cluster_Name/dataplane` |The logs in `/var/log/journal` for `kubelet.service`, `kubeproxy.service`, and `docker.service`. |

If you don’t want to use Container Insights with Fluent Bit or Fluentd for logging, you can capture node and container logs with the CloudWatch agent installed on Amazon EKS nodes. Amazon EKS nodes are EC2 instances, which means you should include them in your standard system-level logging approach for Amazon EC2. If you install the CloudWatch agent using Distributor and State Manager, then Amazon EKS nodes are also included in the CloudWatch agent installation, configuration, and update. The following table shows logs that are specific to Kubernetes and that you must capture if you aren’t using Container Insights with Fluent Bit or Fluentd for logging.


|Log location |Description |
|--- |--- |
|`/var/log/aws-routed-eni/ipamd.log` and `/var/log/aws-routed-eni/plugin.log` |Logs for the L-IPAM daemon and the VPC CNI plugin. |

Please refer to the [Amazon EKS node logging prescriptive guidance](https://docs.aws.amazon.com/prescriptive-guidance/latest/implementing-logging-monitoring-cloudwatch/kubernetes-eks-logging.html) to learn more about data plane logging.

### Amazon EKS application logging

Application logging becomes essential when you run applications at scale in a Kubernetes environment. To collect application logs you must install a log aggregator, such as [Fluent Bit](https://fluentbit.io/), [Fluentd](https://www.fluentd.org/), or [CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html), in your Amazon EKS cluster.

[Fluent Bit](https://fluentbit.io/) is an open-source log processor and forwarder written in C. It lets you collect data from different sources, enrich it with filters, and send it to multiple destinations. By using this guide's solution you can enable `aws-for-fluent-bit` or `fargate-fluentbit` for logging. [Fluentd](https://www.fluentd.org/) is an open-source data collector for a unified logging layer, written in Ruby. Fluentd acts as a unified logging layer that can aggregate data from multiple sources, unify data with different formats into JSON-formatted objects, and route them to different output destinations. Choosing a log collector matters for CPU and memory utilization when you monitor thousands of servers. If you have multiple Amazon EKS clusters, you can use Fluent Bit as a lightweight shipper to collect data from different nodes in the cluster and forward it to Fluentd for aggregation, processing, and routing to a supported output destination.

We recommend using Fluent Bit as the log collector and forwarder to send application and cluster logs to CloudWatch. You can then stream the logs to Amazon OpenSearch Service by using a subscription filter in CloudWatch. This option is shown in this section's architecture diagram.

![LOG-AGGREG-2](../../../../images/Containers/aws-native/eks/log-aggreg-2.jpg)

*Figure: Amazon EKS application logging architecture.*

The diagram shows the following workflow when application logs from Amazon EKS clusters are streamed to Amazon OpenSearch Service. The Fluent Bit service in the Amazon EKS cluster pushes the logs to CloudWatch. An AWS Lambda function streams the logs to Amazon OpenSearch Service using a subscription filter. You can then use Kibana to visualize the logs in the configured indexes. You can also stream logs by using Amazon Kinesis Data Firehose and store them in an S3 bucket for analysis and querying with [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/what-is.html).

In most clusters, using Fluentd or Fluent Bit for log aggregation needs little optimization. This changes when you’re dealing with larger clusters with thousands of pods and nodes. We have published our findings from studying the [impact of Fluentd and Fluent Bit in clusters with thousands of pods](https://aws.amazon.com/blogs/containers/fluentd-considerations-and-actions-required-at-scale-in-amazon-eks/). We also recommend reviewing the [enhancement to Fluent Bit that is designed to reduce the volume of API calls](https://aws.amazon.com/blogs/containers/capturing-logs-at-scale-with-fluent-bit-and-amazon-eks/) it makes to the Kubernetes API servers: the `Use_Kubelet` option, which allows Fluent Bit to retrieve pod metadata from the kubelet on the host instead. With this feature enabled, Amazon EKS customers can use Fluent Bit to capture logs in clusters that run tens of thousands of pods without overloading the Kubernetes API server. We recommend enabling the feature even if you aren’t running a large Kubernetes cluster.

#### Logging for Amazon EKS on Fargate

With Amazon EKS on Fargate, you can deploy pods without allocating or managing your Kubernetes nodes. This removes the need to capture system-level logs for your Kubernetes nodes. To capture the logs from your Fargate pods, you can use Fluent Bit to forward the logs directly to CloudWatch. This enables you to automatically route logs to CloudWatch without further configuration or a sidecar container for your Amazon EKS pods on Fargate. For more information, see [Fargate logging](https://docs.aws.amazon.com/eks/latest/userguide/fargate-logging.html) in the Amazon EKS documentation and [Fluent Bit for Amazon EKS](http://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/) on the AWS Blog. This solution captures the `STDOUT` and `STDERR` input/output (I/O) streams from your container and sends them to CloudWatch through Fluent Bit, based on the Fluent Bit configuration established for the Amazon EKS cluster on Fargate.

With Fluent Bit support for Amazon EKS, you no longer need to run a sidecar to route container logs from Amazon EKS pods running on Fargate. With the built-in logging support, you can select a destination of your choice to send the records to. Amazon EKS on Fargate uses a version of Fluent Bit for AWS, an upstream-conformant distribution of Fluent Bit managed by AWS.

![LOG-AGGREG-3](../../../../images/Containers/aws-native/eks/log-aggreg-3.jpg)

*Figure: Logging for Amazon EKS on Fargate.*

To learn more about Fluent Bit support for Amazon EKS on Fargate, see [Fargate logging](https://docs.aws.amazon.com/eks/latest/userguide/fargate-logging.html) in the Amazon EKS documentation.
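
As a concrete reference, the built-in Fargate log router is configured through a ConfigMap named `aws-logging` in the `aws-observability` namespace. The following is a minimal sketch that forwards container logs to CloudWatch; the Region and log group name are assumptions to adjust, and the Fargate pod execution role must be allowed to write to CloudWatch Logs.

```yaml
kind: Namespace
apiVersion: v1
metadata:
  name: aws-observability
  labels:
    aws-observability: enabled   # this label is what turns on Fargate log routing
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: aws-logging
  namespace: aws-observability
data:
  output.conf: |
    [OUTPUT]
        Name cloudwatch_logs
        Match *
        region us-east-1                          # assumption: use your Region
        log_group_name /aws/eks/fargate-app-logs  # assumption: any log group name works
        log_stream_prefix fargate-
        auto_create_group true
```

Because the log router runs outside your pods, no sidecar or per-pod configuration is needed; every Fargate pod in the cluster picks up this output configuration automatically.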

In some cases, for pods running on AWS Fargate, you may need to use the sidecar pattern: running a Fluentd (or [Fluent Bit](http://fluentbit.io/)) sidecar container to capture logs produced by your applications. This option requires that the application writes logs to the filesystem instead of `stdout` or `stderr`. A consequence of this approach is that you will not be able to use `kubectl logs` to view container logs. To make logs appear in `kubectl logs`, you can write application logs to both `stdout` and the filesystem simultaneously.

[Pods on Fargate get 20GB of ephemeral storage](https://docs.aws.amazon.com/eks/latest/userguide/fargate-pod-configuration.html), which is available to all the containers that belong to a pod. You can configure your application to write logs to the local filesystem and instruct Fluentd to watch the log directory (or file). Fluentd will read events from the tail of log files and send the events to a destination like CloudWatch for storage. Ensure that you rotate logs regularly to prevent logs from consuming the entire volume.

To learn more, see [How to capture application logs when using Amazon EKS on AWS Fargate](https://aws.amazon.com/blogs/containers/how-to-capture-application-logs-when-using-amazon-eks-on-aws-fargate/), which shows how to operate and observe your Kubernetes applications at scale on AWS Fargate. That walkthrough also uses `tee` to write to both a file and `stdout`, so logs remain visible with `kubectl logs` in this approach.

### Unified log aggregation from Amazon EKS and other compute platforms using AWS Native services

Customers often want to unify and centralize logs across different computing platforms such as [Amazon Elastic Kubernetes Service](https://aws.amazon.com/eks/) (Amazon EKS), [Amazon Elastic Compute Cloud](https://aws.amazon.com/ec2/) (Amazon EC2), [Amazon Elastic Container Service](https://aws.amazon.com/ecs/) (Amazon ECS), [Amazon Kinesis Data Firehose](https://aws.amazon.com/kinesis/data-firehose/), and [AWS Lambda](https://aws.amazon.com/lambda/) using agents, log routers, and extensions. You can then use [Amazon OpenSearch Service](https://aws.amazon.com/opensearch-service/) with OpenSearch Dashboards to visualize and analyze the logs collected across these platforms and gain application insights.

A unified, aggregated log system provides the following benefits:

* A single point of access to all the logs across different computing platforms
* Helps define and standardize the transformations of logs before they get delivered to downstream systems like [Amazon Simple Storage Service](http://aws.amazon.com/s3) (Amazon S3), Amazon OpenSearch Service, [Amazon Redshift](https://aws.amazon.com/redshift), and other services
* The ability to use Amazon OpenSearch Service to quickly index, and OpenSearch Dashboards to search and visualize, logs from routers, applications, and other devices

The following diagram shows the architecture that performs log aggregation across different compute platforms such as [Amazon Elastic Kubernetes Service](https://aws.amazon.com/eks/) (Amazon EKS), [Amazon Elastic Compute Cloud](https://aws.amazon.com/ec2/) (Amazon EC2), [Amazon Elastic Container Service](https://aws.amazon.com/ecs/) (Amazon ECS), and [AWS Lambda](https://aws.amazon.com/lambda/).

![LOG-AGGREG-4](../../../../images/Containers/aws-native/eks/log-aggreg-4.jpg)

*Figure: Log aggregation across different compute platforms.*

The architecture uses various log aggregation tools such as log agents, log routers, and Lambda extensions to collect logs from multiple compute platforms and deliver them to Kinesis Data Firehose. Kinesis Data Firehose streams the logs to Amazon OpenSearch Service. Log records that fail to be persisted in Amazon OpenSearch Service are written to Amazon S3. To scale this architecture, each of these compute platforms streams the logs to a different Firehose delivery stream, added as a separate index, and rotated every 24 hours.

To learn more, see [how to unify and centralize logs across different compute platforms](https://aws.amazon.com/blogs/big-data/unify-log-aggregation-and-analytics-across-compute-platforms/) such as [Amazon Elastic Kubernetes Service](https://aws.amazon.com/eks/) (Amazon EKS), [Amazon Elastic Compute Cloud](https://aws.amazon.com/ec2/) (Amazon EC2), [Amazon Elastic Container Service](https://aws.amazon.com/ecs/) (Amazon ECS), and [AWS Lambda](https://aws.amazon.com/lambda/) using Kinesis Data Firehose and Amazon OpenSearch Service. This approach allows you to quickly analyze logs and identify the root cause of failures, using a single platform rather than a different platform for each service.

## Conclusion

In this section of the Observability best practices guide, we took a deep dive into the three types of Kubernetes logging: control plane logging, node logging, and application logging. We then covered unified log aggregation from Amazon EKS and other compute platforms using AWS native services such as Kinesis Data Firehose and Amazon OpenSearch Service. To go deeper, we highly recommend working through the Logs and Insights modules under the AWS native observability category of the AWS [One Observability Workshop](https://catalog.workshops.aws/observability/en-US).

diff --git a/docusaurus/observability-best-practices/docs/guides/containers/oss/ecs/best-practices-metrics-collection-1.md b/docusaurus/observability-best-practices/docs/guides/containers/oss/ecs/best-practices-metrics-collection-1.md
new file mode 100644
index 000000000..c71638073
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/guides/containers/oss/ecs/best-practices-metrics-collection-1.md
@@ -0,0 +1,168 @@
# Collecting system metrics in an ECS cluster using AWS Distro for OpenTelemetry
[AWS Distro for OpenTelemetry](https://aws-otel.github.io/docs/introduction) (ADOT) is a secure, AWS-supported distribution of the [OpenTelemetry](https://opentelemetry.io/) project. Using ADOT, you can collect telemetry data from multiple sources and send correlated metrics, traces, and logs to multiple monitoring solutions. ADOT can be deployed on an Amazon ECS cluster in two different patterns.

## Deployment patterns for ADOT Collector
1. In the sidecar pattern, the ADOT collector runs inside each task in the cluster and processes telemetry data collected only from the application containers within that task. This deployment pattern is required only when you need the collector to read task metadata from the Amazon ECS [Task Metadata Endpoint](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint.html) and generate resource usage metrics (such as CPU, memory, network, and disk) from it.
![ADOT architecture](../../../../images/ADOT-sidecar.png)

2. 
In the central collector pattern, a single instance of the ADOT collector is deployed on the cluster and processes telemetry data from all the tasks running on the cluster. This is the most commonly used deployment pattern. The collector is deployed using either the REPLICA or DAEMON service scheduler strategy.
![ADOT architecture](../../../../images/ADOT-central.png)

The ADOT collector architecture has the concept of a pipeline. A single collector can contain more than one pipeline. Each pipeline is dedicated to processing one of the three types of telemetry data, namely metrics, traces, and logs. You can configure multiple pipelines for each type of telemetry data. This versatile architecture allows a single collector to perform the role of multiple observability agents that would otherwise have to be deployed on the cluster, significantly reducing the deployment footprint of observability agents on the cluster. The primary components of a collector that make up a pipeline are grouped into three categories, namely Receiver, Processor, and Exporter. There are secondary components called Extensions, which provide capabilities that can be added to the collector but are not part of pipelines.

:::info
 Refer to the OpenTelemetry [documentation](https://opentelemetry.io/docs/collector/configuration/#basics) for a detailed explanation of Receivers, Processors, Exporters and Extensions.
:::

## Deploying ADOT Collector for ECS task metrics collection

To collect resource utilization metrics at the ECS task level, the ADOT collector should be deployed using the sidecar pattern, with a task definition as shown below. The container image used for the collector is bundled with several pipeline configurations. You can choose one of them based on your requirements and specify the configuration file path in the *command* section of the container definition. Setting this value to `--config=/etc/ecs/container-insights/otel-task-metrics-config.yaml` results in the use of a [pipeline configuration](https://github.com/aws-observability/aws-otel-collector/blob/main/config/ecs/container-insights/otel-task-metrics-config.yaml) that collects resource utilization metrics and traces from the other containers running within the same task as the collector and sends them to Amazon CloudWatch and AWS X-Ray. Specifically, the collector uses an [AWS ECS Container Metrics Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/awsecscontainermetricsreceiver) that reads task metadata and Docker stats from the [Amazon ECS Task Metadata Endpoint](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v4.html), and generates resource usage metrics (such as CPU, memory, network, and disk) from them.

```javascript
{
    "family":"AdotTask",
    "taskRoleArn":"arn:aws:iam::123456789012:role/ECS-ADOT-Task-Role",
    "executionRoleArn":"arn:aws:iam::123456789012:role/ECS-Task-Execution-Role",
    "networkMode":"awsvpc",
    "containerDefinitions":[
       {
          "name":"application-container",
          "image":"..."
+ }, + { + "name":"aws-otel-collector", + "image":"public.ecr.aws/aws-observability/aws-otel-collector:latest", + "cpu":512, + "memory":1024, + "command": [ + "--config=/etc/ecs/container-insights/otel-task-metrics-config.yaml" + ], + "portMappings":[ + { + "containerPort":2000, + "protocol":"udp" + } + ], + "essential":true + } + ], + "requiresCompatibilities":[ + "EC2" + ], + "cpu":"1024", + "memory":"2048" + } +``` +:::info + Refer to the [documentation](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-ECS-adot.html) for details about setting up the IAM task role and task execution role that the ADOT collector will use when deployed on an Amazon ECS cluster. +::: + +:::info + The [AWS ECS Container Metrics Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/awsecscontainermetricsreceiver) works only for ECS Task Metadata Endpoint V4. Amazon ECS tasks on Fargate that use platform version 1.4.0 or later and Amazon ECS tasks on Amazon EC2 that are running at least version 1.39.0 of the Amazon ECS container agent can utilize this receiver. For more information, see [Amazon ECS Container Agent Versions](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-versions.html) +::: + +As seen in the default [pipeline configuration](https://github.com/aws-observability/aws-otel-collector/blob/main/config/ecs/container-insights/otel-task-metrics-config.yaml), the collector's pipeline first uses the [Filter Processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/filterprocessor) which filters out a [subset of metrics](https://github.com/aws-observability/aws-otel-collector/blob/09d59966404c2928aaaf6920f27967a84d898254/config/ecs/container-insights/otel-task-metrics-config.yaml#L25) pertaining to CPU, memory, network, and disk usage. Then it uses the [Metrics Transform Processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstransformprocessor) that performs a set of [transformations](https://github.com/aws-observability/aws-otel-collector/blob/09d59966404c2928aaaf6920f27967a84d898254/config/ecs/container-insights/otel-task-metrics-config.yaml#L39) to change the names of these metrics as well as update their attributes. Finally, the metrics are sent to CloudWatch as performance log events using the [Amazon CloudWatch EMF Exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/awsemfexporter). Using this default configuration will result in collection of the following resource usage metrics under the CloudWatch namespace *ECS/ContainerInsights*. + +- MemoryUtilized +- MemoryReserved +- CpuUtilized +- CpuReserved +- NetworkRxBytes +- NetworkTxBytes +- StorageReadBytes +- StorageWriteBytes + +:::info + Note that these are the same [metrics collected by Container Insights for Amazon ECS](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-ECS.html) and are made readily available in CloudWatch when you enable Container Insights at the cluster or account level. Hence, enabling Container Insights is the recommended approach for collecting ECS resource usage metrics in CloudWatch. +::: + +The AWS ECS Container Metrics Receiver emits 52 unique metrics which it reads from the Amazon ECS Task Metadata Endpoint. 
The complete list of metrics collected by the receiver is [documented here](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/awsecscontainermetricsreceiver#available-metrics). You may not want to send all of them to your preferred destination. If you want more explicit control over the ECS resource usage metrics, you can create a custom pipeline configuration, filtering and transforming the available metrics with your choice of processors/transformers and sending them to a destination based on your choice of exporters. Refer to the documentation for [additional examples](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/awsecscontainermetricsreceiver#full-configuration-examples) of pipeline configurations that capture ECS task-level metrics.

If you want to use a custom pipeline configuration, then you can use the task definition shown below and deploy the collector using the sidecar pattern. Here, the configuration of the collector pipeline is loaded from a parameter named *otel-collector-config* in AWS SSM Parameter Store.

:::note
 The SSM Parameter Store parameter name must be exposed to the collector using an environment variable named AOT_CONFIG_CONTENT.
:::

```javascript
{
    "family":"AdotTask",
    "taskRoleArn":"arn:aws:iam::123456789012:role/ECS-ADOT-Task-Role",
    "executionRoleArn":"arn:aws:iam::123456789012:role/ECS-Task-Execution-Role",
    "networkMode":"awsvpc",
    "containerDefinitions":[
       {
          "name":"application-container",
          "image":"..."
       },
       {
          "name":"aws-otel-collector",
          "image":"public.ecr.aws/aws-observability/aws-otel-collector:latest",
          "cpu":512,
          "memory":1024,
          "secrets":[
             {
                "name":"AOT_CONFIG_CONTENT",
                "valueFrom":"arn:aws:ssm:us-east-1:123456789012:parameter/otel-collector-config"
             }
          ],
          "portMappings":[
             {
                "containerPort":2000,
                "protocol":"udp"
             }
          ],
          "essential":true
       }
    ],
    "requiresCompatibilities":[
       "EC2"
    ],
    "cpu":"1024",
    "memory":"2048"
}
```

## Deploying ADOT Collector for ECS container instance metrics collection

To collect EC2 instance-level metrics from your ECS cluster, the ADOT collector can be deployed using a task definition as shown below. It should be deployed with the daemon service scheduler strategy. You can choose a pipeline configuration bundled into the container image; the configuration file path in the *command* section of the container definition should be set to `--config=/etc/ecs/otel-instance-metrics-config.yaml`. The collector uses the [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/awscontainerinsightreceiver#aws-container-insights-receiver) to collect EC2 instance-level infrastructure metrics for many resources such as CPU, memory, disk, and network. The metrics are sent to CloudWatch as performance log events using the [Amazon CloudWatch EMF Exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/awsemfexporter). 
The functionality of the collector with this configuration is equivalent to that of deploying the CloudWatch agent to an Amazon ECS cluster hosted on EC2, + +:::info + The ADOT Collector deployment for collecting EC2 instance-level metrics is not supported on ECS clusters running on AWS Fargate +::: + +```javascript +{ + "family":"AdotTask", + "taskRoleArn":"arn:aws:iam::123456789012:role/ECS-ADOT-Task-Role", + "executionRoleArn":"arn:aws:iam::123456789012:role/ECS-Task-Execution-Role", + "networkMode":"awsvpc", + "containerDefinitions":[ + { + "name":"application-container", + "image":"..." + }, + { + "name":"aws-otel-collector", + "image":"public.ecr.aws/aws-observability/aws-otel-collector:latest", + "cpu":512, + "memory":1024, + "command": [ + "--config=/etc/ecs/otel-instance-metrics-config.yaml" + ], + "portMappings":[ + { + "containerPort":2000, + "protocol":"udp" + } + ], + "essential":true + } + ], + "requiresCompatibilities":[ + "EC2" + ], + "cpu":"1024", + "memory":"2048" + } +``` \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/guides/containers/oss/ecs/best-practices-metrics-collection-2.md b/docusaurus/observability-best-practices/docs/guides/containers/oss/ecs/best-practices-metrics-collection-2.md new file mode 100644 index 000000000..1ab77b036 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/containers/oss/ecs/best-practices-metrics-collection-2.md @@ -0,0 +1,250 @@ +# Collecting service metrics in an ECS cluster using AWS Distro for OpenTelemetry +## Deploying ADOT Collector with default configuration +The ADOT collector can be deployed using a task definition as shown below, using the sidecar pattern. The container image used for the collector is bundled with two collector pipeline configurations which can be specified in the *command* section of the container defintion. Seting this value `--config=/etc/ecs/ecs-default-config.yaml` +will result in the use of a [pipeline configuration](https://github.com/aws-observability/aws-otel-collector/blob/main/config/ecs/ecs-default-config.yaml) that will collect application metrics and traces from other containers running within the same task as the collector and send them to Amazon CloudWatch and AWS X-Ray. Specifically, the collector uses an [OpenTelemetry Protocol (OTLP) Receiver](https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver/otlpreceiver) to receive metrics sent by applications that have been instrumented with OpenTelemetry SDKs as well as a [StatsD Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/statsdreceiver) to collect StatsD metrics. Additionally, it uses an [AWS X-ray Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/awsxrayreceiver) to collect traces from applications that have been instrumented with AWS X-Ray SDK. + +:::info + Refer to the [documentation](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-ECS-adot.html) for details about setting up the IAM task role and task execution role that the ADOT collector will use when deployed on an Amazon ECS cluster. +::: + +```javascript +{ + "family":"AdotTask", + "taskRoleArn":"arn:aws:iam::123456789012:role/ECS-ADOT-Task-Role", + "executionRoleArn":"arn:aws:iam::123456789012:role/ECS-Task-Execution-Role", + "networkMode":"awsvpc", + "containerDefinitions":[ + { + "name":"application-container", + "image":"..." 
+ }, + { + "name":"aws-otel-collector", + "image":"public.ecr.aws/aws-observability/aws-otel-collector:latest", + "cpu":512, + "memory":1024, + "command": [ + "--config=/etc/ecs/ecs-default-config.yaml" + ], + "portMappings":[ + { + "containerPort":2000, + "protocol":"udp" + } + ], + "essential":true + } + ], + "requiresCompatibilities":[ + "EC2" + ], + "cpu":"1024", + "memory":"2048" + } +``` +## Deploying ADOT Collector for Prometheus metrics collection +To deploy ADOT with the central collector pattern, with a pipeline that is different from the default configuration, the task definition shown below can be used. Here, the configuration of the collector pipeline is loaded from a parameter named *otel-collector-config* in AWS SSM Parameter Store. The collector is launched using REPLICA service scheduler strategy. + +```javascript +{ + "family":"AdotTask", + "taskRoleArn":"arn:aws:iam::123456789012:role/ECS-ADOT-Task-Role", + "executionRoleArn":"arn:aws:iam::123456789012:role/ECS-Task-Execution-Role", + "networkMode":"awsvpc", + "containerDefinitions":[ + { + "name":"aws-otel-collector", + "image":"public.ecr.aws/aws-observability/aws-otel-collector:latest", + "cpu":512, + "memory":1024, + "secrets":[ + { + "name":"AOT_CONFIG_CONTENT", + "valueFrom":"arn:aws:ssm:us-east-1:123456789012:parameter/otel-collector-config" + } + ], + "portMappings":[ + { + "containerPort":2000, + "protocol":"udp" + } + ], + "essential":true + } + ], + "requiresCompatibilities":[ + "EC2" + ], + "cpu":"1024", + "memory":"2048" + } +``` + +:::note + The SSM Parameter Store parameter name must be exposed to the collector using an environment variable named AOT_CONFIG_CONTENT. + When using the ADOT collector for Prometheus metrics collection from applications and deploying it with REPLICA service scheduler startegy, make sure that you set the replica count to 1. Deploying more than 1 replica of the collector will result in an incorrect representation of metrics data in the target destination. +::: + +The configuration shown below enables the ADOT collector to collect Prometheus metrics from services in the cluster using a [Prometheus Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver). The receiver is meant to minimally be a drop-in replacement for Prometheus server. To collect metrics with this receiver, you need a mechanism for discovering the set of target services to be scraped. The receiver supports both static and dynamic discovery of scraping targets using one of the dozens of supported [service-discovery mechanisms](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config). + +As Amazon ECS does not have any built-in service discovery mechanism, the collector relies on Prometheus' support for file-based discovery of targets. To setup the Prometheus receiver for file-based discovery of targets, the collector makes use of the [Amazon ECS Observer](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/observer/ecsobserver/README.md) extension. The extension uses ECS/EC2 API to discover Prometheus scrape targets from all running tasks and filter them based on service names, task definitions and container labels listed under the *ecs_observer/task_definitions* section in the configuration. All discovered targets are written into the file specified by the *result_file* field, which resides on the file system mounted to ADOT collector container. 
Subequently, the Prometheus receiver scrapes metrics from the targets listed in this file. + +### Sending metrics data to Amazon Managed Prometheus workspace +The metrics collected by the Prometheus Receiver can be sent to an Amazon Managed Prometheus workspace using a [Prometheus Remote Write Exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/prometheusremotewriteexporter) in the collector pipeline, as shown in the *exporters* section of the configuration below. The exporter is configured with the remote write URL of the workspace and it sends the metrics data using HTTP POST requests. It makes use of the built-in AWS Signature Version 4 authenticator to sign the requests sent to the workspace. + +```yaml +extensions: + health_check: + sigv4auth: + region: us-east-1 + ecs_observer: + refresh_interval: 60s + cluster_name: 'ecs-ec2-cluster' + cluster_region: us-east-1 + result_file: '/etc/ecs_sd_targets.yaml' + services: + - name_pattern: '^WebAppService$' + task_definitions: + - job_name: "webapp-tasks" + arn_pattern: '.*:task-definition/WebAppTask:[0-9]+' + metrics_path: '/metrics' + metrics_ports: + - 3000 + +receivers: + awsxray: + prometheus: + config: + scrape_configs: + - job_name: ecs_services + file_sd_configs: + - files: + - '/etc/ecs_sd_targets.yaml' + refresh_interval: 30s + relabel_configs: + - source_labels: [ __meta_ecs_cluster_name ] + action: replace + target_label: cluster + - source_labels: [ __meta_ecs_service_name ] + action: replace + target_label: service + - source_labels: [ __meta_ecs_task_definition_family ] + action: replace + target_label: taskdefinition + - source_labels: [ __meta_ecs_task_container_name ] + action: replace + target_label: container + +processors: + filter/include: + metrics: + include: + match_type: regexp + metric_names: + - ^http_requests_total$ + +exporters: + awsxray: + prometheusremotewrite: + endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write + auth: + authenticator: sigv4auth + resource_to_telemetry_conversion: + enabled: true + +service: + extensions: + - ecs_observer + - health_check + - sigv4auth + pipelines: + metrics: + receivers: [prometheus] + exporters: [prometheusremotewrite] + traces: + receivers: [awsxray] + exporters: [awsxray] +``` + +### Sending metrics data to Amazon CloudWatch +Alternatively, the metrics data can be sent to Amazon CloudWatch by using the [Amazon CloudWatch EMF Exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/awsemfexporter) in the collector pipeline, as shown in the *exporters* section of the configuration below. This exporter sends metrics data to CloudWatch as performance log events. The *metric_declaration* field in the exporter is used to specify the array of logs with embedded metric format to be generated. The configurtion below will generate log events only for a metric named *http_requests_total*. Using this data, CloudWatch will create the metric *http_requests_total* under the CloudWatch namespace *ECS/ContainerInsights/Prometheus* with the dimensions *ClusterName*, *ServiceName* and *TaskDefinitionFamily*. 
+ + +```yaml +extensions: + health_check: + sigv4auth: + region: us-east-1 + ecs_observer: + refresh_interval: 60s + cluster_name: 'ecs-ec2-cluster' + cluster_region: us-east-1 + result_file: '/etc/ecs_sd_targets.yaml' + services: + - name_pattern: '^WebAppService$' + task_definitions: + - job_name: "webapp-tasks" + arn_pattern: '.*:task-definition/WebAppTask:[0-9]+' + metrics_path: '/metrics' + metrics_ports: + - 3000 + +receivers: + awsxray: + prometheus: + config: + global: + scrape_interval: 15s + scrape_timeout: 10s + scrape_configs: + - job_name: ecs_services + file_sd_configs:: + - files: + - '/etc/ecs_sd_targets.yaml' + relabel_configs: + - source_labels: [ __meta_ecs_cluster_name ] + action: replace + target_label: ClusterName + - source_labels: [ __meta_ecs_service_name ] + action: replace + target_label: ServiceName + - source_labels: [ __meta_ecs_task_definition_family ] + action: replace + target_label: TaskDefinitionFamily + - source_labels: [ __meta_ecs_task_container_name ] + action: replace + target_label: container + +processors: + filter/include: + metrics: + include: + match_type: regexp + metric_names: + - ^http_requests_total$ + +exporters: + awsxray: + awsemf: + namespace: ECS/ContainerInsights/Prometheus + log_group_name: '/aws/ecs/containerinsights/{ClusterName}/prometheus' + dimension_rollup_option: NoDimensionRollup + metric_declarations: + - dimensions: [[ClusterName, ServiceName, TaskDefinitionFamily]] + metric_name_selectors: + - http_requests_total + +service: + extensions: + - ecs_observer + - health_check + - sigv4auth + pipelines: + metrics: + receivers: [prometheus] + processors: [filter/include] + exporters: [awsemf] + traces: + receivers: [awsxray] + exporters: [awsxray] +``` diff --git a/docusaurus/observability-best-practices/docs/guides/containers/oss/ecs/best-practices-metrics-collection.md b/docusaurus/observability-best-practices/docs/guides/containers/oss/ecs/best-practices-metrics-collection.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/containers/oss/ecs/best-practices-metrics-collection.md @@ -0,0 +1 @@ + diff --git a/docusaurus/observability-best-practices/docs/guides/containers/oss/eks/best-practices-metrics-collection.md b/docusaurus/observability-best-practices/docs/guides/containers/oss/eks/best-practices-metrics-collection.md new file mode 100644 index 000000000..8299b8c11 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/containers/oss/eks/best-practices-metrics-collection.md @@ -0,0 +1,277 @@ +# EKS Observability : Essential Metrics + +# Current Landscape + +Monitoring is defined as a solution that allows infrastructure and application owners a way to see and understand both historical and current state of their systems, focused on gathering defined metrics or logs. + +Monitoring has evolved through the years. We started working with debug and dump logs to debug and troubleshoot issues to having basic monitoring using command-line tools like syslogs, top etc, which progressed to being able to visualize them in a dashboard. In the advent of cloud and increase in scale, we are tracking more today that we have ever been. The industry has shifted more into Observability, which is defined as a solution to allow infrastructure and application owners to actively troubleshoot and debug their systems. With Observability focusing more on looking at patterns derived from the metrics. + + +# Metrics, why does it matter? 

Metrics are a series of numerical values that are kept in order by the time at which they are created. They are used to track everything from the number of servers in your environment and their disk usage to the number of requests they handle per second and the latency in completing those requests. Metrics are data that tell you how your systems are performing. Whether you are running a small or large cluster, getting insight into your system's health and performance allows you to identify areas of improvement, troubleshoot and trace issues, and improve your workload's performance and efficiency as a whole. These changes can impact how much time and resources you spend on your cluster, which translates directly into cost.


# Metrics Collection

Collecting metrics from an EKS cluster consists of [three components](https://aws-observability.github.io/observability-best-practices/recipes/telemetry/):

1. Sources: where metrics come from, such as the ones listed in this guide.
2. Agents: applications running in the EKS environment, often called agents, that collect the metrics monitoring data and push this data to the next component. Some examples of this component are [AWS Distro for OpenTelemetry (ADOT)](https://aws-otel.github.io/) and the [CloudWatch Agent](https://aws-observability.github.io/observability-best-practices/tools/cloudwatch_agent/).
3. Destinations: a monitoring data storage and analysis solution. This component is typically a data service that is optimized for [time series formatted data](https://aws-observability.github.io/observability-best-practices/signals/metrics/). Some examples of this component are [Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/) and [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-EKS.html).

Note: In this section, configuration examples are links to relevant sections of the [AWS Observability Accelerator](https://aws-observability.github.io/terraform-aws-observability-accelerator/). This ensures you get up-to-date guidance and examples on EKS metrics collection implementations.

## Managed Open Source Solution

[AWS Distro for OpenTelemetry (ADOT)](https://aws-otel.github.io/) is a supported version of the [OpenTelemetry](https://opentelemetry.io/) project that enables users to send correlated metrics and traces to various monitoring data collection solutions like [Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/) and [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-EKS.html). ADOT can be installed through [EKS Managed Add-ons](https://docs.aws.amazon.com/eks/latest/userguide/eks-add-ons.html) onto an EKS cluster and configured to collect metrics (like the ones listed on this page) and workload traces. AWS has validated that the ADOT add-on is compatible with Amazon EKS, and it is regularly updated with the latest bug fixes and security patches. 
See [ADOT best practices and more information](https://aws-observability.github.io/observability-best-practices/guides/operational/adot-at-scale/operating-adot-collector/) for guidance on operating the ADOT collector at scale.


## ADOT + AMP

The quickest way to get up and running with AWS Distro for OpenTelemetry (ADOT), Amazon Managed Service for Prometheus (AMP), and Amazon Managed Grafana (AMG) is to utilize the [infrastructure monitoring example](https://aws-observability.github.io/terraform-aws-observability-accelerator/eks/) from the AWS Observability Accelerator. The accelerator examples deploy the tools and services in your environment with out-of-the-box metrics collection, alerting rules, and Grafana dashboards.

Please refer to the AWS documentation for additional information on installation, configuration, and operation of the [EKS Managed Add-on for ADOT](https://docs.aws.amazon.com/eks/latest/userguide/opentelemetry.html).

### Sources

EKS metrics are created at multiple locations and at different layers of an overall solution. The following table summarizes the metrics sources that are called out in the essential metrics sections.


|Layer |Source |Tool |Installation and More info |Helm Chart |
|--- |--- |--- |--- |--- |
|Control Plane |*api server endpoint*/metrics |N/A - API server exposes metrics in Prometheus format directly |https://docs.aws.amazon.com/eks/latest/userguide/prometheus.html |N/A |
|Cluster State |*kube-state-metrics-http-endpoint*:8080/metrics |kube-state-metrics |https://github.com/kubernetes/kube-state-metrics#overview |https://github.com/kubernetes/kube-state-metrics#helm-chart |
|Kube Proxy |*kube-proxy-http*:10249/metrics |N/A - kube-proxy exposes metrics in Prometheus format directly |https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/ |N/A |
|VPC CNI |*vpc-cni-metrics-helper*/metrics |cni-metrics-helper |https://github.com/aws/amazon-vpc-cni-k8s/blob/master/cmd/cni-metrics-helper/README.md |https://github.com/aws/amazon-vpc-cni-k8s/tree/master/charts/cni-metrics-helper |
|Core DNS |*core-dns*:9153/metrics |N/A - CoreDNS exposes metrics in Prometheus format directly |https://github.com/coredns/coredns/tree/master/plugin/metrics |N/A |
|Node |*prom-node-exporter-http*:9100/metrics |prom-node-exporter |https://github.com/prometheus/node_exporter and https://prometheus.io/docs/guides/node-exporter/#node-exporter-metrics |https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-node-exporter |
|Kubelet/Pod |*kubelet*/metrics/cadvisor |kubelet or proxied through the API server |https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/ |N/A |

### Agent: AWS Distro for OpenTelemetry

AWS recommends installing, configuring, and operating ADOT on your EKS cluster through the AWS EKS ADOT managed add-on. This add-on uses the ADOT operator/collector custom resource model, allowing you to deploy, configure, and manage multiple ADOT collectors on your cluster. For detailed information on installation, advanced configuration, and operation of this add-on, check out this [documentation](https://aws-otel.github.io/docs/getting-started/adot-eks-add-on).

Note: The AWS EKS ADOT managed add-on web console can be used for [advanced configuration of the ADOT add-on](https://docs.aws.amazon.com/eks/latest/userguide/deploy-collector-advanced-configuration.html).

There are two components to the ADOT collector configuration.

1. 
The [collector configuration](https://github.com/aws-observability/aws-otel-community/blob/master/sample-configs/operator/collector-config-amp.yaml) which includes collector deployment mode (deployment, daemonset, etc). +2. The [OpenTelemetry Pipeline configuration](https://opentelemetry.io/docs/collector/configuration/) which includes what receivers, processors, and exporters are needed for metrics collection. Example configuration snippet: + +``` +config: | + extensions: + sigv4auth: + region: + service: "aps" + + receivers: + # + # Scrape configuration for the Prometheus Receiver + # This is the same configuration used when Prometheus is installed using the community Helm chart + # + prometheus: + config: + global: + scrape_interval: 60s + scrape_timeout: 10s + + scrape_configs: + - job_name: kubernetes-apiservers + bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token + kubernetes_sd_configs: + - role: endpoints + relabel_configs: + - action: keep + regex: default;kubernetes;https + source_labels: + - __meta_kubernetes_namespace + - __meta_kubernetes_service_name + - __meta_kubernetes_endpoint_port_name + scheme: https + tls_config: + ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt + insecure_skip_verify: true + + ... + ... + + exporters: + prometheusremotewrite: + endpoint: + auth: + authenticator: sigv4auth + logging: + loglevel: warn + extensions: + sigv4auth: + region: + service: aps + health_check: + pprof: + endpoint: :1888 + zpages: + endpoint: :55679 + processors: + batch/metrics: + timeout: 30s + send_batch_size: 500 + service: + extensions: [pprof, zpages, health_check, sigv4auth] + pipelines: + metrics: + receivers: [prometheus] + processors: [batch/metrics] + exporters: [logging, prometheusremotewrite] +``` + +A complete best practices collector configuration, ADOT pipeline configuration and Prometheus scrape configuration can be found here as [a Helm Chart in the Observability Accelerator](https://github.com/aws-observability/terraform-aws-observability-accelerator/blob/main/modules/eks-monitoring/otel-config/templates/opentelemetrycollector.yaml). + + +### Destination: Amazon Managed Service for Prometheus + +The ADOT collector pipeline utilizes Prometheus Remote Write capabilities to export metrics to an AMP instance. Example configuration snippet, note the AMP WRITE ENDPOINT URL + +``` + exporters: + prometheusremotewrite: + endpoint: + auth: + authenticator: sigv4auth + logging: + loglevel: warn +``` + +A complete best practices collector configuration, ADOT pipeline configuration and Prometheus scrape configuration can be found here as [a Helm Chart in the Observability Accelerator](https://github.com/aws-observability/terraform-aws-observability-accelerator/blob/main/modules/eks-monitoring/otel-config/templates/opentelemetrycollector.yaml). + +Best practices on AMP configuration and usage is [here](https://aws-observability.github.io/observability-best-practices/recipes/amp/). + +# What are the relevant metrics? + +Gone are the days where you have little metrics available, nowadays it is the opposite, there are hundreds of metrics available. Being able to determine what are the relevant metrics is important towards building a system with an observability first mindset. + +This guide outlines the different grouping of metrics available to you and explains which ones you should focus on as you build observability into your infrastructure and applications. 
The list of metrics below are the list of metrics we recommend monitoring based on best practices. + +The metrics listed in the following sections are in addition to the metrics highlighted in the [AWS Observability Accelerator Grafana Dashboards](https://github.com/aws-observability/terraform-aws-observability-accelerator/tree/main/modules/eks-monitoring) and [Kube Prometheus Stack Dashboards](https://monitoring.mixins.dev/). + +## Control Plane Metrics + +The Amazon EKS control plane is managed by AWS for you and runs in an account managed by AWS. It consists of control plane nodes that run the Kubernetes components, such as etcd and the Kubernetes API server. Kubernetes publishes various events to keep users informed of activities in the cluster, such as spinning up and tearing down pods, deployments, namespaces, and more. The Amazon EKS control plane is a critical component that you need to track to make sure the core components are able function properly and perform the fundamental activities required by your cluster. + +The Control Plane API Server exposes thousands of metrics, the table below lists the essential control plane metrics that we recommend monitoring. + +|Name |Metric |Description |Reason | +|--- |--- |--- |--- | +|API Server total requests |apiserver_request_total |Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code. | | +|API Server latency |apiserver_request_duration_seconds |Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope, and component. | | +|Request latency |rest_client_request_duration_seconds |Request latency in seconds. Broken down by verb and URL. | | +|Total requests |rest_client_requests_total |Number of HTTP requests, partitioned by status code, method, and host. | | +|API Server request duration |apiserver_request_duration_seconds_bucket |Measures the latency for each request to the Kubernetes API server in seconds | | +|API server request latency sum |apiserver_request_latencies_sum |Cumulative Counter which tracks total time taken by the K8 API server to process requests | | +|API server registered watchers |apiserver_registered_watchers |The number of currently registered watchers for a given resource | | +|API server number of objects |apiserver_storage_object |Number of stored objects at the time of last check split by kind. | | +|Admission controller latency |apiserver_admission_controller_admission_duration_seconds |Admission controller latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit). | | +|Etcd latency |etcd_request_duration_seconds |Etcd request latency in seconds for each operation and object type. | | +|Etcd DB size |apiserver_storage_db_total_size_in_bytes |Etcd database size. |This helps you proactively monitor etcd database usage, and avoid overrunning the limit. | + +## Cluster State metrics + +The Cluster State Metrics are generated by `kube-state-metrics` (KSM). KSM is a utility that runs as a pod in the cluster, listening to the Kubernetes API Server, providing you insights into your cluster state and Kubernetes objects in your cluster as Prometheus metrics. KSM will need to be [installed](https://github.com/kubernetes/kube-state-metrics) before these metrics are available. 
Kubernetes uses these metrics for tasks such as pod scheduling, and they focus on the health of the various objects inside the cluster, such as deployments, replica sets, nodes, and pods. Cluster state metrics expose pod information on status, capacity, and availability. It's essential to keep track of how your cluster is performing at scheduling so you can track performance, get ahead of issues, and monitor the health of your cluster. kube-state-metrics exposes a large number of cluster state metrics; the table below lists the essential ones that should be tracked.

|Name |Metric |Description |
|--- |--- |--- |
|Node status |kube_node_status_condition |Current health status of the node. Returns a set of node conditions and `true`, `false`, or `unknown` for each |
|Desired pods |kube_deployment_spec_replicas or kube_daemonset_status_desired_number_scheduled |Number of pods specified for a Deployment or DaemonSet |
|Current pods |kube_deployment_status_replicas or kube_daemonset_status_current_number_scheduled |Number of pods currently running in a Deployment or DaemonSet |
|Pod capacity |kube_node_status_capacity_pods |Maximum pods allowed on the node |
|Available pods |kube_deployment_status_replicas_available or kube_daemonset_status_number_available |Number of pods currently available for a Deployment or DaemonSet |
|Unavailable pods |kube_deployment_status_replicas_unavailable or kube_daemonset_status_number_unavailable |Number of pods currently not available for a Deployment or DaemonSet |
|Pod readiness |kube_pod_status_ready |If a pod is ready to serve client requests |
|Pod status |kube_pod_status_phase |Current status of the pod; value would be pending/running/succeeded/failed/unknown |
|Pod waiting reason |kube_pod_container_status_waiting_reason |Reason a container is in a waiting state |
|Pod termination status |kube_pod_container_status_terminated |Whether the container is currently in a terminated state or not |
|Pods pending scheduling |pending_pods |Number of pods awaiting node assignment |
|Pod scheduling attempts |pod_scheduling_attempts |Number of attempts made to schedule pods |

## Cluster Add-on Metrics

A cluster add-on is software that provides supporting operational capabilities to Kubernetes applications. This includes software like observability agents or Kubernetes drivers that allow the cluster to interact with underlying AWS resources for networking, compute, and storage. Add-on software is typically built and maintained by the Kubernetes community, cloud providers like AWS, or third-party vendors. Amazon EKS automatically installs self-managed add-ons such as the Amazon VPC CNI plugin for Kubernetes, `kube-proxy`, and CoreDNS for every cluster.

These cluster add-ons provide operational support in different areas, such as networking and domain name resolution. They provide insight into how the critical supporting infrastructure and components are operating. Tracking add-on metrics is important for understanding your cluster's operational health.

Below are the essential add-ons that you should consider monitoring, along with their essential metrics.

## Amazon VPC CNI Plugin

Amazon EKS implements cluster networking through the Amazon VPC Container Network Interface (VPC CNI) plugin. The CNI plugin allows Kubernetes Pods to have the same IP address as they do on the VPC network. More specifically, all containers inside the Pod share a network namespace, and they can communicate with each other using local ports. 
The VPC CNI add-on enables you to continuously ensure the security and stability of your Amazon EKS clusters and decrease the amount of effort required to install, configure, and update add-ons.

VPC CNI add-on metrics are exposed by the CNI Metrics Helper. Monitoring IP address allocation is fundamental to ensuring a healthy cluster and avoiding IP exhaustion issues. [Here are the latest networking best practices and VPC CNI metrics to collect and monitor](https://aws.github.io/aws-eks-best-practices/networking/vpc-cni/#monitor-ip-address-inventory).

## CoreDNS Metrics

CoreDNS is a flexible, extensible DNS server that can serve as the Kubernetes cluster DNS. The CoreDNS pods provide name resolution for all pods in the cluster. DNS-intensive workloads can sometimes experience intermittent CoreDNS failures due to DNS throttling, which can impact applications.

Check out the latest best practices for tracking key [CoreDNS performance metrics here](https://aws.github.io/aws-eks-best-practices/reliability/docs/dataplane/#monitor-coredns-metrics) and for [monitoring CoreDNS traffic for DNS throttling issues](https://aws.github.io/aws-eks-best-practices/networking/monitoring/).


## Pod/Container Metrics

Tracking usage across all layers of your application is important; this includes taking a closer look at the nodes and pods running inside your cluster. Out of all the metrics available at the pod dimension, the following metrics are of practical use for understanding the state of the workloads running on your cluster. Tracking CPU, memory, and network usage allows you to diagnose and troubleshoot application-related issues. Tracking your workload metrics provides insight into your resource utilization, helping you right-size your workloads running on EKS.

|Metric |Example PromQL Query |Dimension |
|--- |--- |--- |
|Number of running pods per namespace |count by(namespace) (kube_pod_info) |Per Cluster by Namespace |
|CPU usage per container per pod |sum(rate(container_cpu_usage_seconds_total\{container!=""\}[5m])) by (namespace, pod) |Per Cluster by Namespace by Pod |
|Memory utilization per pod |sum(container_memory_usage_bytes\{container!=""\}) by (namespace, pod) |Per Cluster by Namespace by Pod |
|Network Received Bytes per pod |sum by(pod) (rate(container_network_receive_bytes_total[5m])) |Per Cluster by Namespace by Pod |
|Network Transmitted Bytes per pod |sum by(pod) (rate(container_network_transmit_bytes_total[5m])) |Per Cluster by Namespace by Pod |
|The number of container restarts per container |increase(kube_pod_container_status_restarts_total[15m]) > 3 |Per Cluster by Namespace by Pod |

## Node Metrics

Kube State Metrics and the Prometheus node exporter gather metric statistics on the nodes in your cluster. Tracking your nodes' status, CPU usage, memory, filesystem, and traffic is important to understand node utilization. Understanding how your nodes' resources are being utilized is important for properly selecting instance types and storage to effectively support the types of workloads you expect to run on your cluster. The metrics below are some of the essential metrics that you should be tracking. 
|Metric |Example PromQL Query |Dimension |
|--- |--- |--- |
|Node CPU Utilization |sum(rate(container_cpu_usage_seconds_total\{container!=""\}[5m])) by (node) |Per Cluster by Node |
|Node Memory Utilization |sum(container_memory_usage_bytes\{container!=""\}) by (node) |Per Cluster by Node |
|Node Network Total Bytes |sum by (instance) (rate(node_network_receive_bytes_total[3m]))+sum by (instance) (rate(node_network_transmit_bytes_total[3m])) |Per Cluster by Node |
|Node CPU Reserved Capacity |sum(kube_node_status_capacity\{cluster!=""\}) by (node) |Per Cluster by Node |
|Number of Running Pods per Node |sum(kubelet_running_pods) by (instance) |Per Cluster by Node |
|Node Filesystem Usage |rate(container_fs_reads_bytes_total\{job="kubelet", device=~"mmcblk.p.+|.*nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+", container!="", cluster="", namespace!=""\}[$__rate_interval]) + rate(container_fs_writes_bytes_total\{job="kubelet", device=~"mmcblk.p.+|.*nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+", container!="", cluster="", namespace!=""\}[$__rate_interval]) |Per Cluster by Node |
|Cluster CPU Utilization |sum(rate(node_cpu_seconds_total\{mode!="idle",mode!="iowait",mode!="steal"\}[5m])) |Per Cluster |
|Cluster Memory Utilization |1 - sum(:node_memory_MemAvailable_bytes:sum\{cluster=""\}) / sum(node_memory_MemTotal_bytes\{job="node-exporter",cluster=""\}) |Per Cluster |
|Cluster Network Total Bytes |sum(rate(node_network_receive_bytes_total[3m]))+sum(rate(node_network_transmit_bytes_total[3m])) |Per Cluster |
|Number of Running Pods |sum(kubelet_running_pod_count\{cluster=""\}) |Per Cluster |
|Number of Running Containers |sum(kubelet_running_container_count\{cluster=""\}) |Per Cluster |
|Cluster CPU Limit |sum(kube_node_status_allocatable\{resource="cpu"\}) |Per Cluster |
|Cluster Memory Limit |sum(kube_node_status_allocatable\{resource="memory"\}) |Per Cluster |
|Cluster Node Count |count(kube_node_info) OR sum(kubelet_node_name\{cluster=""\}) |Per Cluster |

# Additional Resources

## AWS Services

[https://aws-otel.github.io/](https://aws-otel.github.io/)

[https://aws.amazon.com/prometheus](https://aws.amazon.com/prometheus)

[https://aws.amazon.com/cloudwatch/features/](https://aws.amazon.com/cloudwatch/features/)

## Blogs

[https://aws.amazon.com/blogs/containers/](https://aws.amazon.com/blogs/containers/)

[https://aws.amazon.com/blogs/containers/metrics-and-traces-collection-using-amazon-eks-add-ons-for-aws-distro-for-opentelemetry/](https://aws.amazon.com/blogs/containers/metrics-and-traces-collection-using-amazon-eks-add-ons-for-aws-distro-for-opentelemetry/)

[https://aws.amazon.com/blogs/containers/introducing-amazon-cloudwatch-container-insights-for-amazon-eks-fargate-using-aws-distro-for-opentelemetry/](https://aws.amazon.com/blogs/containers/introducing-amazon-cloudwatch-container-insights-for-amazon-eks-fargate-using-aws-distro-for-opentelemetry/)

## Infrastructure as Code Resources

[https://github.com/aws-observability/terraform-aws-observability-accelerator](https://github.com/aws-observability/terraform-aws-observability-accelerator)

[https://github.com/aws-ia/terraform-aws-eks-blueprints](https://github.com/aws-ia/terraform-aws-eks-blueprints) diff --git a/docusaurus/observability-best-practices/docs/guides/containers/oss/eks/cost-optimization.md b/docusaurus/observability-best-practices/docs/guides/containers/oss/eks/cost-optimization.md new file mode 100644 index
000000000..e69de29bb diff --git a/docusaurus/observability-best-practices/docs/guides/containers/oss/eks/keda-amp-eks.md b/docusaurus/observability-best-practices/docs/guides/containers/oss/eks/keda-amp-eks.md new file mode 100644 index 000000000..a171ff560 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/containers/oss/eks/keda-amp-eks.md @@ -0,0 +1,53 @@
# Autoscaling applications using KEDA on AMP and EKS

# Current Landscape

Handling increased traffic on Amazon EKS applications is challenging, and manual scaling is inefficient and error-prone. Autoscaling offers a better solution for resource allocation. KEDA enables Kubernetes autoscaling based on various metrics and events, while Amazon Managed Service for Prometheus provides secure metric monitoring for EKS clusters. This solution combines KEDA with Amazon Managed Service for Prometheus, demonstrating autoscaling based on Requests Per Second (RPS) metrics. The approach delivers automated scaling tailored to workload demands, which users can apply to their own EKS workloads. Amazon Managed Grafana is used for monitoring and visualizing scaling patterns, allowing users to gain insights into autoscaling behaviors and correlate them with business events.

# Autoscaling applications based on AMP metrics with KEDA

![keda-arch](../../../../images/Containers/oss/eks/arch.png)

This solution demonstrates AWS integration with open-source software to create an automated scaling pipeline. It combines Amazon EKS for managed Kubernetes, AWS Distro for OpenTelemetry (ADOT) for metric collection, KEDA for event-driven autoscaling, Amazon Managed Service for Prometheus for metric storage, and Amazon Managed Grafana for visualization. The architecture involves deploying KEDA on EKS, configuring ADOT to scrape metrics, defining autoscaling rules with a KEDA ScaledObject, and using Grafana dashboards to monitor scaling. The autoscaling process begins with user requests to the microservice, with ADOT collecting metrics and sending them to Prometheus. KEDA queries these metrics at regular intervals, determines scaling needs, and interacts with the Horizontal Pod Autoscaler (HPA) to adjust pod replicas. This setup enables metrics-driven autoscaling for Kubernetes microservices, providing a flexible, cloud-native architecture that can scale based on various utilization indicators.

# Cross account EKS application scaling with KEDA on AMP metrics
In this case, let's assume the KEDA EKS cluster is running in the AWS account ending with 117 and the central AMP account ID ends with 814. In the KEDA EKS account, set up the cross-account IAM role as below:

![keda1](../../../../images/Containers/oss/eks/keda1.png)

Also update the trust relationship as below:
![keda2](../../../../images/Containers/oss/eks/keda2.png)

In the EKS cluster, you can see we don't use Pod Identity, since IRSA is being used here:
![keda3](../../../../images/Containers/oss/eks/keda3.png)

In the central AMP account, the AMP access is set up as below:
![keda4](../../../../images/Containers/oss/eks/keda4.png)

The trust relationship grants the access as well:
![keda5](../../../../images/Containers/oss/eks/keda5.png)

Take note of the workspace ID as below:
![keda6](../../../../images/Containers/oss/eks/keda6.png)

## KEDA configuration
With the setup in place, let's ensure KEDA is running as below.
For setup instructions, refer to the blog link shared below.

![keda7](../../../../images/Containers/oss/eks/keda7.png)

Ensure you use the central AMP role defined above in the configuration:
![keda8](../../../../images/Containers/oss/eks/keda8.png)

In the KEDA scaler configuration, point to the central AMP account as below:
![keda9](../../../../images/Containers/oss/eks/keda9.png)

Now you can see that the pods are scaled appropriately:
![keda10](../../../../images/Containers/oss/eks/keda10.png)

## Blogs

[https://aws.amazon.com/blogs/mt/autoscaling-kubernetes-workloads-with-keda-using-amazon-managed-service-for-prometheus-metrics/](https://aws.amazon.com/blogs/mt/autoscaling-kubernetes-workloads-with-keda-using-amazon-managed-service-for-prometheus-metrics/) diff --git a/docusaurus/observability-best-practices/docs/guides/containers/oss/eks/resource-optimization.md b/docusaurus/observability-best-practices/docs/guides/containers/oss/eks/resource-optimization.md new file mode 100644 index 000000000..e6ec647fd --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/containers/oss/eks/resource-optimization.md @@ -0,0 +1,84 @@
# Resource Optimization best practices for Kubernetes workloads
Kubernetes adoption continues to accelerate as many move to microservice-based architectures. A lot of the initial focus was on designing and building new cloud native architectures to support the applications. As environments grow, we are starting to see customers focus on optimizing resource allocation. Resource optimization is the second most important topic operations teams ask about, after security.
Let's talk about guidance on how to optimize resource allocation and right-size applications in Kubernetes environments. This includes applications running on Amazon EKS deployed with managed node groups, self-managed node groups, and AWS Fargate.

## Reasons for Right-sizing applications on Kubernetes
In Kubernetes, resource right-sizing is done through setting resource specifications on applications. These settings directly impact:

* Performance — Without proper resource specifications, Kubernetes applications will arbitrarily compete for resources. This can adversely impact application performance.
* Cost Optimization — Applications deployed with oversized resource specifications will result in increased costs and underutilized infrastructure.
* Autoscaling — The Kubernetes Cluster Autoscaler and Horizontal Pod Autoscaler require resource specifications to function.

The most common resource specifications in Kubernetes are for [CPU and memory requests and limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits).

## Requests and Limits

Containerized applications are deployed on Kubernetes as Pods. CPU and memory requests and limits are an optional part of the Pod definition. CPU is specified in units of [Kubernetes CPUs](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu) while memory is specified in bytes, usually as [mebibytes (Mi)](https://simple.wikipedia.org/wiki/Mebibyte).

Requests and limits each serve different functions in Kubernetes and impact scheduling and resource enforcement differently.

## Recommendations
An application owner needs to choose the "right" values for their CPU and memory resource requests. An ideal way is to load test the application in a development environment and measure resource usage using observability tooling.
While that might make sense for your organization's most critical applications, it's likely not feasible for every containerized application deployed in your cluster. Let's talk about the tools that can help us optimize and right-size the workloads:

### Vertical Pod Autoscaler (VPA)
[VPA](https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler) is a Kubernetes sub-project owned by the Autoscaling special interest group (SIG). It's designed to automatically set Pod requests based on observed application performance. VPA collects resource usage using the [Kubernetes Metrics Server](https://github.com/kubernetes-sigs/metrics-server) by default but can be optionally configured to use Prometheus as a data source.
VPA has a recommendation engine that measures application performance and makes sizing recommendations. The VPA recommendation engine can be deployed stand-alone so that VPA does not perform any autoscaling actions. It's configured by creating a VerticalPodAutoscaler custom resource for each application, and VPA updates the object's status field with resource sizing recommendations.
Creating VerticalPodAutoscaler objects for every application in your cluster and trying to read and interpret the JSON results is challenging at scale. [Goldilocks](https://github.com/FairwindsOps/goldilocks) is an open source project that makes this easy.

### Goldilocks
Goldilocks is an open source project from Fairwinds that is designed to help organizations get their Kubernetes application resource requests "just right". The default configuration of Goldilocks is an opt-in model. You choose which workloads are monitored by adding the `goldilocks.fairwinds.com/enabled: true` label to a namespace.

![Goldilocks-Architecture](../../../../images/goldilocks-architecture.png)

The Metrics Server collects resource metrics from the kubelet running on worker nodes and exposes them through the Metrics API for use by the Vertical Pod Autoscaler. The Goldilocks controller watches for namespaces with the `goldilocks.fairwinds.com/enabled: true` label and creates VerticalPodAutoscaler objects for each workload in those namespaces.

To enable namespaces for resource recommendations, run the below command:

```
kubectl create ns javajmx-sample
kubectl label ns javajmx-sample goldilocks.fairwinds.com/enabled=true
```

To deploy Goldilocks in the Amazon EKS cluster, run the below command:

```
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm upgrade --install goldilocks fairwinds-stable/goldilocks --namespace goldilocks --create-namespace --set vpa.enabled=true
```

The goldilocks-dashboard service exposes the dashboard on port 8080, and we can access it to get the resource recommendations. Let's run the below command to access the dashboard:

```
kubectl -n goldilocks port-forward svc/goldilocks-dashboard 8080:80
```
Then open your browser to http://localhost:8080

![Goldilocks-Dashboard](../../../../images/goldilocks-dashboard.png)

Let's analyze the sample namespace to see the recommendations provided by Goldilocks. We should be able to see the recommendations for the deployment.
![Goldilocks-Recommendation](../../../../images/goldilocks-recommendation.png)

We can see the request and limit recommendations for the javajmx-sample workload. The Current column under each Quality of Service (QoS) indicates the currently configured CPU and Memory requests and limits.
The Guaranteed and Burstable columns indicate the recommended CPU and Memory requests and limits for the respective QoS.

We can clearly see that we have over-provisioned the resources, and Goldilocks has made recommendations to optimize the CPU and Memory requests. For the Guaranteed QoS, the recommended CPU request and limit are 15m and 15m, compared to the current 100m and 300m, and the recommended Memory request and limit are 105M and 105M, compared to 180Mi and 300Mi.
You can simply copy the manifest for the QoS class you are interested in and deploy workloads that are right-sized and optimized.

### Understand throttling using cAdvisor metrics and configure resources appropriately
When we configure limits, we are telling the Linux node how long a specific containerized application can run during a specific period of time. We do this to protect the rest of the workloads on a node from a wayward set of processes taking an unreasonable amount of CPU cycles. We are not defining several physical "cores" sitting on a motherboard; however, we are configuring how much time a grouping of processes or threads in a single container can run before we want to temporarily pause the container to avoid overwhelming other applications.

There is a handy cAdvisor metric called `container_cpu_cfs_throttled_seconds_total` which adds up all the throttled 5 ms slices and gives us an idea of how far over the quota the process is. This metric is in seconds, so we divide the value by 10 to get 100 ms, which is the real period of time associated with the container.

PromQL query to understand the top three pods' CPU usage over a 100 ms period:
```
topk(3, max by (pod, container)(rate(container_cpu_usage_seconds_total{image!="", instance="$instance"}[$__rate_interval]))) / 10
```
A value of 400 ms of vCPU usage is observed.

![Throttled-Period](../../../../images/throttled-period.png)

PromQL gives us per-second throttling, with 10 periods in a second. To get the per-period throttling, we divide by 10. If we want to know how much to increase the limits setting, then we can multiply by 10 (e.g., 400 ms * 10 = 4000 m).

While the above tools provide ways to identify opportunities for resource optimization, application teams should spend time identifying whether a given application is CPU or memory intensive and allocate resources to prevent throttling or over-provisioning.

diff --git a/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/AmazonManagedServiceforPrometheus.md b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/AmazonManagedServiceforPrometheus.md new file mode 100644 index 000000000..1c65e47ff --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/AmazonManagedServiceforPrometheus.md @@ -0,0 +1,45 @@
# Real-time cost monitoring

Amazon Managed Service for Prometheus is a serverless, Prometheus-compatible monitoring service for container metrics that makes it easier to securely monitor container environments at scale. The Amazon Managed Service for Prometheus pricing model is based on Metric samples ingested, Query samples processed, and Metrics stored. You can find the latest pricing details [here][pricing].

As a managed service, Amazon Managed Service for Prometheus automatically scales the ingestion, storage, and querying of operational metrics as workloads scale up and down.
Some of our customers asked us for guidance on how to track the `metric samples ingestion rate` and its cost in real time. Let's explore how you can achieve that.

### Solution
Amazon Managed Service for Prometheus [vends usage metrics][vendedmetrics] to Amazon CloudWatch. These metrics can be used to help you gain better visibility into your Amazon Managed Service for Prometheus workspace. The vended metrics can be found in the `AWS/Usage` and `AWS/Prometheus` namespaces in CloudWatch, and these [metrics][AMPMetrics] are available in CloudWatch at no additional charge. You can always create a CloudWatch dashboard to further explore and visualize these metrics.

Today, you will be using Amazon CloudWatch as a data source for Amazon Managed Grafana and build dashboards in Grafana to visualize those metrics. The architecture diagram below illustrates the following:

- Amazon Managed Service for Prometheus publishing vended metrics to Amazon CloudWatch

- Amazon CloudWatch as a data source for Amazon Managed Grafana

- Users accessing the dashboards created in Amazon Managed Grafana

![prometheus-ingestion-rate](../../../images/ampmetricsingestionrate.png)

### Amazon Managed Grafana Dashboards

The dashboard created in Amazon Managed Grafana will enable you to visualize:

1. Prometheus Ingestion Rate per workspace
![prometheus-ingestion-rate-dash1](../../../images/ampwsingestionrate-1.png)

2. Prometheus Ingestion Rate and Real-time Cost per workspace
 For real-time cost tracking, you will be using a `math expression` based on the pricing of the `Metrics Ingested Tier` for the `First 2 billion samples` mentioned in the official [AWS pricing document][pricing]. Math operations take numbers and time series as input and change them to different numbers and time series. Refer to this [document][mathexpression] for further customization to fit your business requirements.
![prometheus-ingestion-rate-dash2](../../../images/ampwsingestionrate-2.png)

3. Prometheus Active Series per workspace
![prometheus-ingestion-rate-dash3](../../../images/ampwsingestionrate-3.png)

A dashboard in Grafana is represented by a JSON object, which stores its metadata. Dashboard metadata includes dashboard properties, metadata from panels, template variables, panel queries, etc.

You can access the **JSON template** of the above dashboard [here](AmazonPrometheusMetrics.json).

With the preceding dashboard, you can now identify the ingestion rate per workspace and monitor real-time cost per workspace based on the metrics ingestion rate for Amazon Managed Service for Prometheus. You can use other Grafana [dashboard panels][panels] to build visuals to suit your requirements.
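As an illustration of how the dashboard's math expression converts the ingestion rate into cost (assuming the `IngestionRate` usage metric reports samples per second, and using the $0.90 per 10 million samples rate that the expression `$A*(0.90/10000000)` in the dashboard JSON encodes; confirm current pricing for your Region): a workspace ingesting 10,000 samples per second ingests about 10,000 * 86,400 = 864,000,000 samples per day, which works out to roughly 864,000,000 * 0.90 / 10,000,000 ≈ $78 per day at that tier. Higher-volume tiers are priced lower, so treat this as an upper-bound estimate for large workspaces.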
+ +[pricing]: https://aws.amazon.com/prometheus/pricing/ +[AMPMetrics]: https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-CW-usage-metrics.html +[vendedmetrics]: https://aws.amazon.com/blogs/mt/introducing-vended-metrics-for-amazon-managed-service-for-prometheus/ +[mathexpression]: https://grafana.com/docs/grafana/latest/panels-visualizations/query-transform-data/expression-queries/#math +[panels]: https://docs.aws.amazon.com/grafana/latest/userguide/Grafana-panels.html \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/AmazonPrometheus.json b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/AmazonPrometheus.json new file mode 100644 index 000000000..958c0d090 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/AmazonPrometheus.json @@ -0,0 +1,685 @@ +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": "-- Grafana --", + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "target": { + "limit": 100, + "matchAny": false, + "tags": [], + "type": "dashboard" + }, + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 0, + "id": 14, + "iteration": 1678997649417, + "links": [], + "liveNow": false, + "panels": [ + { + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "decimals": 2, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "orange", + "value": 70 + }, + { + "color": "red", + "value": 85 + } + ] + }, + "unit": "currencyUSD" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 10, + "x": 0, + "y": 0 + }, + "id": 4, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "/^cost$/", + "values": false + }, + "textMode": "auto" + }, + "pluginVersion": "8.4.7", + "targets": [ + { + "column": "cost", + "connectionArgs": { + "catalog": "__default", + "database": "__default", + "region": "__default" + }, + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "format": 1, + "rawSQL": "select\r\n sum(line_item_unblended_cost) AS cost\r\nfrom curs3isengard_with_resourceids\r\nwhere\r\n (\"line_item_product_code\" = 'AmazonPrometheus')", + "refId": "A", + "table": "prometheus_cost_only" + } + ], + "title": "Prometheus Cost - (YTD)", + "type": "stat" + }, + { + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisGridShow": true, + "axisLabel": "", + "axisPlacement": "auto", + "axisSoftMin": 0, + "fillOpacity": 80, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineWidth": 1, + "scaleDistribution": { + "type": "linear" + } + }, + "mappings": [ + { + "options": { + "": { + "color": "dark-blue", + "index": 0 + } + }, + "type": "value" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + }, + "unit": "none" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "cost" + }, + "properties": [ + { + "id": "unit", + "value": "currencyUSD" + } + 
] + } + ] + }, + "gridPos": { + "h": 8, + "w": 14, + "x": 10, + "y": 0 + }, + "id": 7, + "options": { + "barRadius": 0, + "barWidth": 0.97, + "colorByField": "line_item_usage_account_id", + "groupWidth": 0.7, + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom" + }, + "orientation": "horizontal", + "showValue": "always", + "stacking": "none", + "text": {}, + "tooltip": { + "mode": "single", + "sort": "none" + }, + "xField": "line_item_usage_account_id", + "xTickLabelRotation": 0, + "xTickLabelSpacing": 0 + }, + "pluginVersion": "8.4.7", + "targets": [ + { + "connectionArgs": { + "catalog": "__default", + "database": "__default", + "region": "__default" + }, + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "format": 1, + "rawSQL": "select\r\n line_item_usage_account_id,\r\n sum (line_item_unblended_cost) AS cost\r\nfrom curs3isengard_with_resourceids\r\nwhere\r\n (\"line_item_product_code\" = 'AmazonPrometheus')\r\nGROUP BY line_item_usage_account_id", + "refId": "A" + } + ], + "title": "AWS Accounts - Cost (YTD)", + "type": "barchart" + }, + { + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisLabel": "", + "axisPlacement": "auto", + "axisSoftMin": 0, + "fillOpacity": 80, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineWidth": 1, + "scaleDistribution": { + "type": "linear" + } + }, + "decimals": 2, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "currencyUSD" + }, + "overrides": [] + }, + "gridPos": { + "h": 11, + "w": 11, + "x": 0, + "y": 8 + }, + "id": 5, + "options": { + "barRadius": 0, + "barWidth": 0.97, + "groupWidth": 0.7, + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom" + }, + "orientation": "auto", + "showValue": "auto", + "stacking": "none", + "tooltip": { + "mode": "single", + "sort": "none" + }, + "xTickLabelRotation": 0, + "xTickLabelSpacing": 0 + }, + "pluginVersion": "8.4.7", + "targets": [ + { + "connectionArgs": { + "catalog": "__default", + "database": "__default", + "region": "__default" + }, + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "format": 1, + "hide": false, + "rawSQL": "select\r\n line_item_operation,\r\n sum (line_item_unblended_cost) AS cost\r\nfrom curs3isengard_with_resourceids\r\nwhere\r\n (\"line_item_product_code\" = 'AmazonPrometheus')\r\nGROUP BY line_item_operation", + "refId": "A" + } + ], + "title": "Operations Cost - (YTD)", + "type": "barchart" + }, + { + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + } + }, + "mappings": [] + }, + "overrides": [] + }, + "gridPos": { + "h": 11, + "w": 13, + "x": 11, + "y": 8 + }, + "id": 8, + "options": { + "displayLabels": [ + "percent" + ], + "legend": { + "displayMode": "list", + "placement": "right", + "values": [] + }, + "pieType": "donut", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "8.4.7", + "targets": [ + { + "connectionArgs": { + 
"catalog": "__default", + "database": "__default", + "region": "__default" + }, + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "format": 1, + "hide": false, + "rawSQL": "select\r\n line_item_resource_id,\r\n sum (line_item_unblended_cost) AS cost\r\nfrom curs3isengard_with_resourceids\r\nwhere\r\n (\"line_item_product_code\" = 'AmazonPrometheus')\r\nGROUP BY line_item_resource_id\r\nORDER BY cost;\r\n\r\n", + "refId": "A" + } + ], + "title": "Workspaces Cost - (YTD)", + "type": "piechart" + }, + { + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "continuous-GrYlRd" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "cost" + }, + "properties": [ + { + "id": "unit", + "value": "currencyUSD" + } + ] + } + ] + }, + "gridPos": { + "h": 9, + "w": 24, + "x": 0, + "y": 19 + }, + "id": 9, + "options": { + "displayMode": "lcd", + "orientation": "horizontal", + "reduceOptions": { + "calcs": [], + "fields": "", + "values": true + }, + "showUnfilled": true + }, + "pluginVersion": "8.4.7", + "targets": [ + { + "connectionArgs": { + "catalog": "__default", + "database": "__default", + "region": "__default" + }, + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "format": 1, + "hide": false, + "rawSQL": "select\r\n line_item_operation,\r\n line_item_resource_id,\r\n sum (line_item_unblended_cost) AS cost\r\nfrom curs3isengard_with_resourceids\r\nwhere\r\n (\"line_item_product_code\" = 'AmazonPrometheus')\r\nGROUP BY line_item_operation, line_item_resource_id\r\nORDER BY cost;", + "refId": "A" + } + ], + "title": "Operations per Workspace Cost - (YTD)", + "type": "bargauge" + }, + { + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "displayMode": "auto" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "cost" + }, + "properties": [ + { + "id": "custom.width", + "value": 164 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "line_item_resource_id" + }, + "properties": [ + { + "id": "custom.width", + "value": 495 + } + ] + } + ] + }, + "gridPos": { + "h": 15, + "w": 24, + "x": 0, + "y": 28 + }, + "id": 2, + "options": { + "footer": { + "fields": "", + "reducer": [ + "sum" + ], + "show": false + }, + "showHeader": true, + "sortBy": [ + { + "desc": true, + "displayName": "year" + } + ] + }, + "pluginVersion": "8.4.7", + "targets": [ + { + "connectionArgs": { + "catalog": "__default", + "database": "__default", + "region": "__default" + }, + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "format": 1, + "rawSQL": "SELECT\r\n line_item_usage_account_id\r\n, line_item_resource_id\r\n, line_item_operation\r\n, line_item_usage_type\r\n, month\r\n, year\r\n, \"sum\"(cast(line_item_unblended_cost as DECIMAL(16,2))) AS cost\r\n, \"sum\"(line_item_usage_amount) \"Usage\"\r\nFROM\r\n curs3isengard_with_resourceids\r\nWHERE (\"line_item_product_code\" = 'AmazonPrometheus')\r\nGROUP BY 1, 2, 3, 4, 5, 6", + "refId": "A", + "table": 
"curs3isengard_with_resourceids" + } + ], + "title": "Prometheus", + "type": "table" + } + ], + "refresh": false, + "schemaVersion": 35, + "style": "dark", + "tags": [], + "templating": { + "list": [ + { + "current": { + "selected": true, + "text": [ + "All" + ], + "value": [ + "$__all" + ] + }, + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "definition": "", + "hide": 0, + "includeAll": true, + "label": "AWS Account", + "multi": true, + "name": "Account", + "options": [], + "query": { + "column": "line_item_usage_account_id", + "connectionArgs": { + "catalog": "__default", + "database": "__default", + "region": "__default" + }, + "format": 1, + "rawSQL": "select\r\n line_item_usage_account_id,\r\n sum (line_item_unblended_cost) AS cost\r\nfrom curs3isengard_with_resourceids\r\nwhere\r\n (\"line_item_product_code\" = 'AmazonPrometheus')\r\nGROUP BY line_item_usage_account_id\r\n", + "table": "prometheus_cost_only" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + }, + { + "current": { + "selected": true, + "text": [ + "All" + ], + "value": [ + "$__all" + ] + }, + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "definition": "", + "hide": 0, + "includeAll": true, + "multi": true, + "name": "Workspace", + "options": [], + "query": { + "connectionArgs": { + "catalog": "__default", + "database": "__default", + "region": "__default" + }, + "format": 1, + "rawSQL": "select\r\n line_item_resource_id\r\nfrom curs3isengard_with_resourceids\r\nwhere\r\n (\"line_item_product_code\" = 'AmazonPrometheus')\r\nGROUP BY line_item_resource_id" + }, + "refresh": 1, + "regex": "/.*workspace/([^]*).*/", + "skipUrlSync": false, + "sort": 0, + "type": "query" + }, + { + "current": { + "selected": true, + "text": [ + "All" + ], + "value": [ + "$__all" + ] + }, + "datasource": { + "type": "grafana-athena-datasource", + "uid": "u-g8lI04k" + }, + "definition": "", + "hide": 0, + "includeAll": true, + "label": "Operation", + "multi": true, + "name": "Operation", + "options": [], + "query": { + "column": "line_item_operation", + "connectionArgs": { + "catalog": "__default", + "database": "__default", + "region": "__default" + }, + "format": 1, + "rawSQL": "select\r\n line_item_operation\r\nfrom curs3isengard_with_resourceids\r\nwhere\r\n (\"line_item_product_code\" = 'AmazonPrometheus')\r\nGROUP BY line_item_operation", + "table": "prometheus_cost_only" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + } + ] + }, + "time": { + "from": "now-6h", + "to": "now" + }, + "timepicker": {}, + "timezone": "", + "title": "Amazon Prometheus 2023", + "uid": "yCmLT01Vz", + "version": 26, + "weekStart": "" + } \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/AmazonPrometheusMetrics.json b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/AmazonPrometheusMetrics.json new file mode 100644 index 000000000..b7375e671 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/AmazonPrometheusMetrics.json @@ -0,0 +1,546 @@ +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "target": { + "limit": 100, + "matchAny": false, + "tags": [], + "type": "dashboard" + }, + "type": 
"dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 0, + "id": 14, + "links": [], + "liveNow": false, + "panels": [ + { + "datasource": { + "type": "cloudwatch", + "uid": "hRQvHbX4k" + }, + "description": "Amazon Prometheus Ingestion Rate", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 12, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 4, + "options": { + "legend": { + "calcs": [], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "9.4.7", + "targets": [ + { + "datasource": { + "type": "cloudwatch", + "uid": "hRQvHbX4k" + }, + "dimensions": { + "Resource": "IngestionRate" + }, + "expression": "", + "hide": false, + "id": "", + "label": "", + "logGroups": [], + "matchExact": false, + "metricEditorMode": 0, + "metricName": "ResourceCount", + "metricQueryType": 0, + "namespace": "AWS/Usage", + "period": "", + "queryMode": "Metrics", + "refId": "A", + "region": "default", + "sqlExpression": "", + "statistic": "Average" + } + ], + "title": "Prometheus Ingestion Rate", + "type": "timeseries" + }, + { + "datasource": { + "type": "cloudwatch", + "uid": "hRQvHbX4k" + }, + "description": "Amazon Prometheus Ingestion Rate & Cost (workspace 1)", + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "none" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "ResourceCount" + }, + "properties": [ + { + "id": "unit", + "value": "none" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Cost (in USD) {Resource=\"IngestionRate\", ResourceId=\"ws-8fd76312-db92-41a4-b775-72cd73c1c28f\"}" + }, + "properties": [ + { + "id": "unit", + "value": "currencyUSD" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "ResourceCount" + }, + "properties": [ + { + "id": "displayName", + "value": "Ingestion Rate" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Cost (in USD) {Resource=\"IngestionRate\", ResourceId=\"ws-8fd76312-db92-41a4-b775-72cd73c1c28f\"}" + }, + "properties": [ + { + "id": "displayName", + "value": "Cost (in USD)" + } + ] + } + ] + }, + "gridPos": { + "h": 12, + "w": 24, + "x": 0, + "y": 12 + }, + "id": 5, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "auto" + }, + "pluginVersion": "9.4.7", + "targets": [ + { + "datasource": { + "type": "cloudwatch", + "uid": "hRQvHbX4k" + }, + "dimensions": { + 
"Resource": "IngestionRate", + "ResourceId": "ws-8fd76312-db92-41a4-b775-72cd73c1c28f" + }, + "expression": "METRICS()", + "hide": false, + "id": "", + "label": "", + "logGroups": [], + "matchExact": false, + "metricEditorMode": 0, + "metricName": "ResourceCount", + "metricQueryType": 0, + "namespace": "AWS/Usage", + "period": "", + "queryMode": "Metrics", + "refId": "A", + "region": "default", + "sqlExpression": "", + "statistic": "Average" + }, + { + "datasource": { + "name": "Expression", + "type": "__expr__", + "uid": "__expr__" + }, + "expression": "$A*(0.90/10000000)", + "hide": false, + "refId": "Cost (in USD)", + "type": "math" + } + ], + "title": "Prometheus Ingestion Rate & Cost (workspace 1)", + "transformations": [], + "type": "stat" + }, + { + "datasource": { + "type": "cloudwatch", + "uid": "hRQvHbX4k" + }, + "description": "Amazon Prometheus Ingestion Rate & Cost (workspace 2)", + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "none" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "ResourceCount" + }, + "properties": [ + { + "id": "unit", + "value": "none" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Cost (in USD) {Resource=\"IngestionRate\", ResourceId=\"ws-fe9c4f9b-ff1d-4e17-acab-4fb140e830c8\"}" + }, + "properties": [ + { + "id": "unit", + "value": "currencyUSD" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "ResourceCount" + }, + "properties": [ + { + "id": "displayName", + "value": "Ingestion Rate" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Cost (in USD) {Resource=\"IngestionRate\", ResourceId=\"ws-fe9c4f9b-ff1d-4e17-acab-4fb140e830c8\"}" + }, + "properties": [ + { + "id": "displayName", + "value": "Cost (in USD)" + } + ] + } + ] + }, + "gridPos": { + "h": 12, + "w": 24, + "x": 0, + "y": 24 + }, + "id": 6, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "textMode": "auto" + }, + "pluginVersion": "9.4.7", + "targets": [ + { + "datasource": { + "type": "cloudwatch", + "uid": "hRQvHbX4k" + }, + "dimensions": { + "Resource": "IngestionRate", + "ResourceId": "ws-fe9c4f9b-ff1d-4e17-acab-4fb140e830c8" + }, + "expression": "METRICS()", + "hide": false, + "id": "", + "label": "", + "logGroups": [], + "matchExact": false, + "metricEditorMode": 0, + "metricName": "ResourceCount", + "metricQueryType": 0, + "namespace": "AWS/Usage", + "period": "", + "queryMode": "Metrics", + "refId": "A", + "region": "default", + "sqlExpression": "", + "statistic": "Average" + }, + { + "datasource": { + "name": "Expression", + "type": "__expr__", + "uid": "__expr__" + }, + "expression": "$A*(0.90/10000000)", + "hide": false, + "refId": "Cost (in USD)", + "type": "math" + } + ], + "title": "Prometheus Ingestion Rate & Cost (workspace 2)", + "transformations": [], + "type": "stat" + }, + { + "datasource": { + "type": "cloudwatch", + "uid": "hRQvHbX4k" + }, + "description": "Amazon Prometheus Active Series", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + 
"legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 12, + "w": 24, + "x": 0, + "y": 36 + }, + "id": 3, + "options": { + "legend": { + "calcs": [], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "9.4.7", + "targets": [ + { + "datasource": { + "type": "cloudwatch", + "uid": "hRQvHbX4k" + }, + "dimensions": { + "Resource": "ActiveSeries" + }, + "expression": "", + "hide": false, + "id": "", + "label": "", + "logGroups": [], + "matchExact": false, + "metricEditorMode": 0, + "metricName": "ResourceCount", + "metricQueryType": 0, + "namespace": "AWS/Usage", + "period": "", + "queryMode": "Metrics", + "refId": "A", + "region": "default", + "sqlExpression": "", + "statistic": "Average" + } + ], + "title": "Prometheus Active Series", + "type": "timeseries" + } + ], + "refresh": "", + "revision": 1, + "schemaVersion": 38, + "style": "dark", + "tags": [], + "templating": { + "list": [] + }, + "time": { + "from": "now-6h", + "to": "now" + }, + "timepicker": {}, + "timezone": "", + "title": "Amazon Prometheus (Metrics ingestion rate real-time monitoring)", + "uid": "J4527S94k", + "version": 1, + "weekStart": "" +} \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/amazon-cloudwatch.md b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/amazon-cloudwatch.md new file mode 100644 index 000000000..537dadeb8 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/amazon-cloudwatch.md @@ -0,0 +1,59 @@ +# Amazon CloudWatch + +Amazon CloudWatch cost and usage visuals will allow you to gain insights into cost of individual AWS Accounts, AWS Regions, and all CloudWatch operations like GetMetricData, PutLogEvents, GetMetricStream, ListMetrics, MetricStorage, HourlyStorageMetering, and ListMetrics to name a few! + +To visualize and analyze the CloudWatch cost and usage data, you need to create a custom Athena view. An Amazon Athena [view][view] is a logical table and it creates a subset of columns from the original CUR table to simplify the querying of data. + +1. Before proceeding, make sure that you’ve created the CUR (step #1) and deployed the AWS Conformation Template (step #2) mentioned in the [Implementation overview][cid-implement]. + +2. Now, Create a new Amazon Athena [view][view] by using the following query. This query fetches cost and usage of Amazon CloudWatch across all the AWS Accounts in your Organization. 
    CREATE OR REPLACE VIEW "cloudwatch_cost" AS
    SELECT
      line_item_usage_type
    , line_item_resource_id
    , line_item_operation
    , line_item_usage_account_id
    , month
    , year
    , "sum"(line_item_usage_amount) "Usage"
    , "sum"(line_item_unblended_cost) cost
    FROM
      database.tablename #replace database.tablename with your database and table name
    WHERE ("line_item_product_code" = 'AmazonCloudWatch')
    GROUP BY 1, 2, 3, 4, 5, 6

### Create Amazon QuickSight dashboard

Now, let's create a QuickSight dashboard to visualize the cost and usage of Amazon CloudWatch.

1. On the AWS Management Console, navigate to the QuickSight service and then select your AWS Region from the top right corner. Note that the QuickSight Dataset should be in the same AWS Region as the Amazon Athena table.
2. Make sure that QuickSight can [access][access] Amazon S3 and AWS Athena.
3. [Create QuickSight Dataset][create-dataset] by selecting the Amazon Athena view that you created before as the data source. Use this procedure to [schedule refreshing][schedule-refresh] the Dataset on a daily basis.
4. Create a QuickSight [Analysis][analysis].
5. Create QuickSight [Visuals][visuals] to meet your needs.
6. [Format][format] the Visual to meet your needs.
7. Now, you can [publish][publish] your dashboard from the Analysis.
8. You can send the dashboard in [report][report] format to individuals or groups, either once or on a schedule.

The following **QuickSight dashboard** shows Amazon CloudWatch cost and usage across all AWS Accounts in your AWS Organization along with CloudWatch operations like GetMetricData, PutLogEvents, GetMetricStream, ListMetrics, MetricStorage, and HourlyStorageMetering, to name a few.

![cloudwatch-cost1](../../../images/cloudwatch-cost-1.PNG)
![cloudwatch-cost2](../../../images/cloudwatch-cost-2.PNG)

With the preceding dashboard, you can now identify the cost of Amazon CloudWatch in the AWS accounts across your Organization. You can use other QuickSight [visual types][types] to build different dashboards to suit your requirements.
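If you want to explore the data before building visuals, you can also query the `cloudwatch_cost` view directly in Athena. For example, a query along these lines (a sketch; adjust names to your environment) shows which CloudWatch operations drive the most cost each month:

    SELECT
      line_item_operation
    , year
    , month
    , SUM(cost) AS monthly_cost
    FROM cloudwatch_cost
    GROUP BY 1, 2, 3
    ORDER BY monthly_cost DESC
    LIMIT 20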
[view]: https://athena-in-action.workshop.aws/30-basics/303-create-view.html
[access]: https://docs.aws.amazon.com/quicksight/latest/user/accessing-data-sources.html
[create-dataset]: https://docs.aws.amazon.com/quicksight/latest/user/create-a-data-set-athena.html
[schedule-refresh]: https://docs.aws.amazon.com/quicksight/latest/user/refreshing-imported-data.html
[analysis]: https://docs.aws.amazon.com/quicksight/latest/user/creating-an-analysis.html
[visuals]: https://docs.aws.amazon.com/quicksight/latest/user/creating-a-visual.html
[format]: https://docs.aws.amazon.com/quicksight/latest/user/formatting-a-visual.html
[publish]: https://docs.aws.amazon.com/quicksight/latest/user/creating-a-dashboard.html
[report]: https://docs.aws.amazon.com/quicksight/latest/user/sending-reports.html
[types]: https://docs.aws.amazon.com/quicksight/latest/user/working-with-visual-types.html
[cid-implement]: ../../../guides/cost/cost-visualization/cost.md#implementation

diff --git a/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/amazon-grafana.md b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/amazon-grafana.md new file mode 100644 index 000000000..fc4cd3013 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/amazon-grafana.md @@ -0,0 +1,31 @@
# Amazon Managed Grafana

Amazon Managed Grafana cost and usage visuals will allow you to gain insights into the cost of individual AWS Accounts, AWS Regions, and specific Grafana workspace instances, as well as the licensing cost of Admin, Editor, and Viewer users!

To visualize and analyze the cost and usage data, you need to create a custom Athena view.

1. Before proceeding, make sure that you've created the CUR (step #1) and deployed the AWS CloudFormation template (step #2) mentioned in the [Implementation overview][cid-implement].

2. Now, create a new Amazon Athena [view][view] by using the following query. This query fetches the cost and usage of Amazon Managed Grafana across all the AWS Accounts in your Organization.

    CREATE OR REPLACE VIEW "grafana_cost" AS
    SELECT
      line_item_usage_type
    , line_item_resource_id
    , line_item_operation
    , line_item_usage_account_id
    , month
    , year
    , "sum"(line_item_usage_amount) "Usage"
    , "sum"(line_item_unblended_cost) cost
    FROM
      database.tablename #replace database.tablename with your database and table name
    WHERE ("line_item_product_code" = 'AmazonGrafana')
    GROUP BY 1, 2, 3, 4, 5, 6

Using Athena as a data source, you can build dashboards in either Amazon Managed Grafana or Amazon QuickSight to suit your business requirements. You could also directly run [SQL queries][sql-query] against the Athena view that you created.
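For example, a query like the following (a sketch; adjust names to your environment) breaks down the Amazon Managed Grafana spend by usage type and account for each month, which is a quick way to see how workspace and user license charges are distributed:

    SELECT
      line_item_usage_account_id
    , line_item_usage_type
    , year
    , month
    , SUM(cost) AS monthly_cost
    FROM grafana_cost
    GROUP BY 1, 2, 3, 4
    ORDER BY monthly_cost DESC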
[view]: https://athena-in-action.workshop.aws/30-basics/303-create-view.html
[sql-query]: https://docs.aws.amazon.com/athena/latest/ug/querying-athena-tables.html
[cid-implement]: ../../../guides/cost/cost-visualization/cost.md#implementation \ No newline at end of file

diff --git a/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/amazon-prometheus.md b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/amazon-prometheus.md new file mode 100644 index 000000000..2b839160f --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/amazon-prometheus.md @@ -0,0 +1,42 @@
# Amazon Managed Service for Prometheus

Amazon Managed Service for Prometheus cost and usage visuals will allow you to gain insights into the cost of individual AWS Accounts, AWS Regions, and specific Prometheus workspace instances, along with operations like RemoteWrite, Query, and HourlyStorageMetering!

To visualize and analyze the cost and usage data, you need to create a custom Athena view.

1. Before proceeding, make sure that you've created the CUR (step #1) and deployed the AWS CloudFormation template (step #2) mentioned in the [Implementation overview][cid-implement].

2. Now, create a new Amazon Athena [view][view] by using the following query. This query fetches the cost and usage of Amazon Managed Service for Prometheus across all the AWS Accounts in your Organization.

    CREATE OR REPLACE VIEW "prometheus_cost" AS
    SELECT
      line_item_usage_type
    , line_item_resource_id
    , line_item_operation
    , line_item_usage_account_id
    , month
    , year
    , "sum"(line_item_usage_amount) "Usage"
    , "sum"(line_item_unblended_cost) cost
    FROM
      database.tablename #replace database.tablename with your database and table name
    WHERE ("line_item_product_code" = 'AmazonPrometheus')
    GROUP BY 1, 2, 3, 4, 5, 6

## Create Amazon Managed Grafana dashboard

With Amazon Managed Grafana, you can add Athena as a data source by using the AWS data source configuration option in the Grafana workspace console. This feature simplifies adding Athena as a data source by discovering your existing Athena accounts and managing the configuration of the authentication credentials that are required to access Athena. For prerequisites associated with using the Athena data source, see [Prerequisites][Prerequisites].

The following **Grafana dashboard** shows Amazon Managed Service for Prometheus cost and usage across all AWS Accounts in your AWS Organization, along with the cost of individual Prometheus workspace instances and operations like RemoteWrite, Query, and HourlyStorageMetering.

![prometheus-cost](../../../images/prometheus-cost.png)

A dashboard in Grafana is represented by a JSON object, which stores its metadata. Dashboard metadata includes dashboard properties, metadata from panels, template variables, panel queries, etc. Access the JSON template of the above dashboard [here](AmazonPrometheus.json).

With the preceding dashboard, you can now identify the cost and usage of Amazon Managed Service for Prometheus in the AWS accounts across your Organization. You can use other Grafana [dashboard panels][panels] to build visuals to suit your requirements.
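If you also want to explore the data outside of Grafana, you can query the `prometheus_cost` view directly in Athena. For example, a query along these lines (a sketch; adjust names to your environment) shows which workspaces and operations account for the most spend:

    SELECT
      line_item_resource_id
    , line_item_operation
    , SUM("Usage") AS total_usage
    , SUM(cost) AS total_cost
    FROM prometheus_cost
    GROUP BY 1, 2
    ORDER BY total_cost DESC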
[Prerequisites]: https://docs.aws.amazon.com/grafana/latest/userguide/Athena-prereq.html
[view]: https://athena-in-action.workshop.aws/30-basics/303-create-view.html
[panels]: https://docs.aws.amazon.com/grafana/latest/userguide/Grafana-panels.html
[cid-implement]: ../../../guides/cost/cost-visualization/cost.md#implementation \ No newline at end of file

diff --git a/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/aws-xray.md b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/aws-xray.md new file mode 100644 index 000000000..aada242ad --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/aws-xray.md @@ -0,0 +1,29 @@
# AWS X-Ray

AWS X-Ray cost and usage visuals will allow you to gain insights into the cost of individual AWS Accounts, AWS Regions, and TracesStored!

To visualize and analyze the cost and usage data, you need to create a custom Athena view.

1. Before proceeding, make sure that you've created the CUR (step #1) and deployed the AWS CloudFormation template (step #2) mentioned in the [Implementation overview][cid-implement].

2. Now, create a new Amazon Athena [view][view] by using the following query. This query fetches the cost and usage of AWS X-Ray across all the AWS Accounts in your Organization.

    CREATE OR REPLACE VIEW "xray_cost" AS
    SELECT
      line_item_usage_type
    , line_item_resource_id
    , line_item_usage_account_id
    , month
    , year
    , "sum"(line_item_usage_amount) "Usage"
    , "sum"(line_item_net_unblended_cost) cost
    FROM
      database.tablename #replace database.tablename with your database and table name
    WHERE ("line_item_product_code" = 'AWSXRay')
    GROUP BY 1, 2, 3, 4, 5

Using Athena as a data source, you can build dashboards in either Amazon Managed Grafana or Amazon QuickSight to suit your business requirements. You could also directly run [SQL queries][sql-query] against the Athena view that you created.

[view]: https://athena-in-action.workshop.aws/30-basics/303-create-view.html
[sql-query]: https://docs.aws.amazon.com/athena/latest/ug/querying-athena-tables.html
[cid-implement]: ../../../guides/cost/cost-visualization/cost.md#implementation \ No newline at end of file

diff --git a/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/cost.md b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/cost.md new file mode 100644 index 000000000..1b32e378a --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/cost.md @@ -0,0 +1,59 @@
# AWS Observability services and Cost

As you invest in your Observability stack, it's important that you monitor the **cost** of your observability products on a regular basis. This allows you to ensure that you are only incurring the costs you need and that you are not overspending on resources you don't need.

## AWS Tools for Cost Optimization

Most organizations focus on scaling their IT infrastructure in the cloud, and are often unprepared for and unaware of their actual or forthcoming cloud spend. To help you track, report, and analyze costs over time, AWS provides several cost-optimization tools:

[AWS Cost Explorer][cost-explorer] – See patterns in AWS spending over time, project future costs, identify areas that need further inquiry, observe Reserved Instance utilization and coverage, and receive Reserved Instance recommendations.
[AWS Cost and Usage Report (CUR)][CUR] – Granular raw data files detailing your hourly AWS usage across accounts, used for Do-It-Yourself (DIY) analysis. The AWS Cost and Usage Report has dynamic columns that populate depending on the services you use.

## Architecture overview: Visualizing AWS Cost and Usage Report

You can build AWS cost and usage dashboards in Amazon Managed Grafana or Amazon QuickSight. The following architecture diagram illustrates both solutions.

![Architecture diagram](../../../images/cur-architecture.png)
*Architecture diagram*

## Cloud Intelligence Dashboards

The [Cloud Intelligence Dashboards][cid] are a collection of [Amazon QuickSight][quicksight] dashboards built on top of the AWS Cost and Usage Report (CUR). These dashboards work as your own cost management and optimization (FinOps) tool. You get in-depth, granular, and recommendation-driven dashboards that can help you get a detailed view of your AWS usage and costs.

### Implementation

1. Create a [CUR report][cur-report] with [Amazon Athena][amazon-athnea] integration enabled.
*During the initial configuration, it can take up to 24 hours for AWS to start delivering reports to your Amazon S3 bucket. Reports are delivered once a day. To streamline and automate integration of your Cost and Usage Reports with Athena, AWS provides an AWS CloudFormation template with several key resources along with the reports that you set up for Athena integration.*

2. Deploy the [AWS CloudFormation template][cloudformation].
*This template includes an AWS Glue crawler, an AWS Glue database, and an AWS Lambda event. At this point, CUR data is made available through tables in Amazon Athena for you to query.*

    - Run [Amazon Athena][athena-query] queries directly on your CUR data.
*To run Athena queries on your data, first use the Athena console to check whether AWS is refreshing your data and then run your query on the Athena console.*

3. Deploy the Cloud Intelligence Dashboards.
    - For manual deployment, refer to the AWS Well-Architected **[Cost Optimization lab][cost-optimization-lab]**.
    - For automated deployment, refer to the [GitHub repo][GitHub-repo].

Cloud Intelligence Dashboards are great for Finance teams, Executives, and IT managers. However, one common question that we get from customers is how to gain insights into the organization-wide cost of individual AWS Observability products like Amazon CloudWatch, AWS X-Ray, Amazon Managed Service for Prometheus, and Amazon Managed Grafana.

In the next sections, you will dive deep into the cost and usage of each of those products. Companies of any size can adopt this proactive approach to cloud cost optimization and improve business efficiency through cloud cost analytics and data-driven decisions, without any performance impact or operational overhead.
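Once the CUR data is available in Athena, a quick end-to-end sanity check (a sketch; the database and table names below are placeholders created by the CloudFormation template, so substitute your own) is to query the monthly spend of the observability services covered in the next sections directly from the CUR table:

    SELECT
      line_item_product_code
    , year
    , month
    , SUM(line_item_unblended_cost) AS cost
    FROM cur_database.cur_table -- placeholder names; use your Glue database and CUR table
    WHERE line_item_product_code IN ('AmazonCloudWatch', 'AmazonPrometheus', 'AmazonGrafana', 'AWSXRay')
    GROUP BY 1, 2, 3
    ORDER BY year, month, cost DESC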
+ + +[cost-explorer]: https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/ce-what-is.html +[CUR]: https://docs.aws.amazon.com/cur/latest/userguide/what-is-cur.html +[cid]: https://wellarchitectedlabs.com/cost/200_labs/200_cloud_intelligence/ +[quicksight]: https://aws.amazon.com/quicksight/ +[cur-report]: https://docs.aws.amazon.com/cur/latest/userguide/cur-create.html +[amazon-athnea]: https://aws.amazon.com/athena/ +[cloudformation]: https://docs.aws.amazon.com/cur/latest/userguide/use-athena-cf.html +[athena-query]: https://docs.aws.amazon.com/cur/latest/userguide/cur-ate-run.html +[cost-optimization-lab]: https://www.wellarchitectedlabs.com/cost/200_labs/200_cloud_intelligence/ +[GitHub-repo]: https://github.com/aws-samples/aws-cudos-framework-deployment + + + + + + diff --git a/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/reducing-cw-cost.md b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/reducing-cw-cost.md new file mode 100644 index 000000000..ecdc7765a --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/cost/cost-visualization/reducing-cw-cost.md @@ -0,0 +1,23 @@ +# Reducing CloudWatch cost + +## GetMetricData + +Typically `GetMetricData` is caused by calls from 3rd party Observability tools and/or cloud financial tools using the CloudWatch Metrics in their platform. + +- Consider reducing the frequency with which the 3rd party tool is making requests. For example, reducing frequency from 1 min to 5 mins should result in a 1/5 (20%) of the cost. +- To identify the trend, consider turning off any data collection from 3rd party tools for a short while. + +## CloudWatch Logs + +- Find the top contributors using this [knowledge center document][log-article]. +- Reduce the logging level of top contributors unless deemed necessary. +- Find out if you are using 3rd party tooling for logging in addition to Cloud Watch. +- VPC Flow Log costs can add up quick if you have enabled it on every VPC and has a lot of traffic. If you still need it, consider delivering it to Amazon S3. +- See if logging is necessary on all AWS Lambda functions. If it’s not, deny “logs:PutLogEvents” permissions in the Lambda role. +- CloudTrail logs are often a top contributor. Sending them to Amazon S3 and using Amazon Athena to query and Amazon EventBridge for alarms/notifications is cheaper. + +Refer this [knowledge center article][article] for further details. + + +[article]: https://aws.amazon.com/premiumsupport/knowledge-center/cloudwatch-understand-and-reduce-charges/ +[log-article]: https://aws.amazon.com/premiumsupport/knowledge-center/cloudwatch-logs-bill-increase/ \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/guides/cost/kubecost.md b/docusaurus/observability-best-practices/docs/guides/cost/kubecost.md new file mode 100644 index 000000000..2ae77b153 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/cost/kubecost.md @@ -0,0 +1,129 @@ +# Using Kubecost +Kubecost provides customers with visibility into spend and resource efficiency in Kubernetes environments. At a high level, Amazon EKS cost monitoring is deployed with Kubecost, which includes Prometheus, an open-source monitoring system and time series database. Kubecost reads metrics from Prometheus then performs cost allocation calculations and writes the metrics back to Prometheus. Finally, the Kubecost front end reads metrics from Prometheus and shows them on the Kubecost user interface (UI). 
The architecture is illustrated by the following diagram: + +![Architecture](../../images/kubecost-architecture.png) + +## Reasons to use Kubecost +As customers modernize their applications and deploy workloads using Amazon EKS, they gain efficiencies by consolidating the compute resources required to run their applications. However, this utilization efficiency comes at a tradeoff of increased difficulty measuring application costs. Today, you can use one of these methods to distribute costs by tenant: + +* Hard multi-tenancy — Run separate EKS clusters in dedicated AWS accounts. +* Soft multi-tenancy — Run multiple node groups in a shared EKS cluster. +* Consumption based billing — Use resource consumption to calculate the cost incurred in a shared EKS cluster. + +With Hard multi-tenancy, workloads get deployed in separate EKS clusters and you can identify the cost incurred for the cluster and its dependencies without having to run reports to determine each tenant’s spend. +With Soft multi-tenancy, you can use Kubernetes features like [Node Selectors](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) and [Node Affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity) to instruct Kubernetes Scheduler to run a tenant’s workload on dedicated node groups. You can tag the EC2 instances in a node group with an identifier (like product name or team name) and use [tags](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html) to distribute costs. +A downside of the above two approach is that you may end up with unused capacity and may not fully utilize the cost savings that come when you run a densely packed cluster. You still need ways to allocate cost of shared resources like Elastic Load Balancing, network transfer charges. + +The most efficient way to track costs in multi-tenant Kubernetes clusters is to distribute incurred costs based on the amount of resources consumed by workloads. This pattern allows you to maximize the utilization of your EC2 instances because different workloads can share nodes, which allows you to increase the pod-density on your nodes. However, calculating costs by workload or namespaces is a challenging task. Understanding the cost-responsibility of a workload requires aggregating all the resources consumed or reserved during a time-frame, and evaluating the charges based on the cost of the resource and the duration of the usage. This is the exact challenge that Kubecost is dedicated to tackling. + +:::tip + Take a look at our [One Observability Workshop](https://catalog.workshops.aws/observability/en-US/aws-managed-oss/amp/ingest-kubecost-metrics) to get a hands-on experience on Kubecost. +::: + +## Recommendations +### Cost Allocation +The Kubecost Cost Allocation dashboard allows you to quickly see allocated spend and optimization opportunity across all native Kubernetes concepts, e.g. namespace, k8s label, and service. It also allows for allocating cost to organizational concepts like team, product/project, department, or environment. You can modify Date range, filters to derive insights about specific workload and save the report. To optimize the Kubernetes cost, you should be paying attention to the efficiency and cluster idle costs. + +![Allocations](../../images/allocations.png) + +### Efficiency + +Pod resource efficiency is defined as the resource utilization versus the resource request over a given time window. 
It is cost-weighted and can be expressed as follows:
+```
+(((CPU Usage / CPU Requested) * CPU Cost) + ((RAM Usage / RAM Requested) * RAM Cost)) / (RAM Cost + CPU Cost)
+```
+where CPU Usage = rate(container_cpu_usage_seconds_total) over the time window, and RAM Usage = avg(container_memory_working_set_bytes) over the time window.
+
+As explicit RAM, CPU or GPU prices are not provided by AWS, the Kubecost model falls back to the ratio of base CPU, GPU and RAM price inputs supplied. The default values for these parameters are based on the marginal resource rates of the cloud provider, but they can be customized within Kubecost. These base resource (RAM/CPU/GPU) prices are normalized to ensure the sum of each component is equal to the total price of the node provisioned, based on billing rates from your provider.
+
+It is the responsibility of each service team to move towards maximum efficiency and fine-tune the workloads to achieve the goal.
+
+### Idle Cost
+Cluster idle cost is defined as the difference between the cost of allocated resources and the cost of the hardware they run on. Allocation is defined as the max of usage and requests. It can also be expressed as follows:
+```
+idle_cost = sum(node_cost) - (cpu_allocation_cost + ram_allocation_cost + gpu_allocation_cost)
+```
+where allocation = max(request, usage)
+
+Idle cost can also be thought of as the cost of the space where the Kubernetes scheduler could schedule pods without disrupting any existing workloads, but currently is not doing so. It can be distributed to workloads, to the cluster, or by nodes, depending on how you want to configure it.
+
+
+### Network Cost
+
+Kubecost makes a best effort to allocate network transfer costs to the workloads generating those costs. The most accurate way of determining network cost is to use the combination of [AWS Cloud Integration](https://docs.kubecost.com/install-and-configure/install/cloud-integration/aws-cloud-integrations) and the [Network costs daemonset](https://docs.kubecost.com/install-and-configure/advanced-configuration/network-costs-configuration).
+
+Take into account your efficiency score and idle cost to fine-tune the workloads and ensure you utilize the cluster to its full potential. This takes us to the next topic: right-sizing workloads.
+
+### Right-Sizing Workloads
+
+Kubecost provides right-sizing recommendations for your workloads based on Kubernetes-native metrics. The savings panel in the Kubecost UI is a great place to start.
+
+![Savings](../../images/savings.png)
+
+![Right-sizing](../../images/right-sizing.png)
+
+Kubecost can give you recommendations on:
+
+* Right-sizing container requests by looking at both over-provisioned and under-provisioned container requests
+* Adjusting the number and size of the cluster nodes to stop over-spending on unused capacity
+* Scaling down, deleting, or resizing pods that don't send or receive a meaningful rate of traffic
+* Identifying workloads ready for spot nodes
+* Identifying volumes that are unused by any pods
+
+
+Kubecost also has a pre-release feature that can automatically implement its recommendations for container resource requests if you have the Cluster Controller component enabled. Using automatic request right-sizing allows you to instantly optimize resource allocation across your entire cluster, without editing excessive YAML or running complicated kubectl commands. You can easily eliminate resource over-allocation in your cluster, which paves the way for vast savings via cluster right-sizing and other optimizations.
+
+### Integrating Kubecost with Amazon Managed Service for Prometheus
+
+Kubecost leverages the open-source Prometheus project as a time series database and post-processes the data in Prometheus to perform cost allocation calculations. Depending on the cluster size and scale of the workload, it could be overwhelming for a Prometheus server to scrape and store the metrics. In such cases, you can use Amazon Managed Service for Prometheus, a managed Prometheus-compatible monitoring service, to store the metrics reliably and monitor Kubernetes cost at scale.
+
+You must set up [IAM roles for Kubecost service accounts](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html). Using the OIDC provider for the cluster, you grant IAM permissions to your cluster’s service accounts. You must grant appropriate permissions to the kubecost-cost-analyzer and kubecost-prometheus-server service accounts. These will be used to send and retrieve metrics from the workspace. Run the following commands on the command line:
+
+```
+eksctl create iamserviceaccount \
+--name kubecost-cost-analyzer \
+--namespace kubecost \
+--cluster CLUSTER_NAME \
+--region REGION \
+--attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess \
+--attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
+--override-existing-serviceaccounts \
+--approve
+
+eksctl create iamserviceaccount \
+--name kubecost-prometheus-server \
+--namespace kubecost \
+--cluster CLUSTER_NAME --region REGION \
+--attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess \
+--attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
+--override-existing-serviceaccounts \
+--approve
+
+```
+`CLUSTER_NAME` is the name of the Amazon EKS cluster where you want to install Kubecost and `REGION` is the region of the Amazon EKS cluster.
+
+Once complete, upgrade the Kubecost Helm chart as shown below:
+```
+helm upgrade -i kubecost \
+oci://public.ecr.aws/kubecost/cost-analyzer --version $VERSION \
+--namespace kubecost --create-namespace \
+-f https://tinyurl.com/kubecost-amazon-eks \
+-f https://tinyurl.com/kubecost-amp \
+--set global.amp.prometheusServerEndpoint=${QUERYURL} \
+--set global.amp.remoteWriteService=${REMOTEWRITEURL}
+```
+### Accessing Kubecost UI
+
+Kubecost provides a web dashboard that you can access either through kubectl port-forward, an ingress, or a load balancer. The enterprise version of Kubecost also supports restricting access to the dashboard using [SSO/SAML](https://docs.kubecost.com/install-and-configure/advanced-configuration/user-management-oidc) and providing varying levels of access, for example, restricting a team's view to only the products they are responsible for.
+
+In an AWS environment, consider using the [AWS Load Balancer Controller](https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html) to expose Kubecost and use [Amazon Cognito](https://aws.amazon.com/cognito/) for authentication, authorization, and user management.
You can learn more on this [How to use Application Load Balancer and Amazon Cognito to authenticate users for your Kubernetes web apps](https://aws.amazon.com/blogs/containers/how-to-use-application-load-balancer-and-amazon-cognito-to-authenticate-users-for-your-kubernetes-web-apps/) + + +### Multi-cluster view + +Your FinOps team would want to review the EKS cluster to share recommendations with business owners. When operating at large scale, it becomes challenging for the teams to log into each cluster to view the recommendations. Multi cluster allows you to have a single-pane-of-glass view into all aggregated cluster costs globally. There are three options that Kubecost supports for environments with multiple clusters: Kubecost Free, Kubecost Business, and Kubecost enterprise. In the free and business mode, the cloud-billing reconciliation will be performed at each cluster level. In the enterprise mode, the cloud billing reconciliation will be performed in a primary cluster that serves the kubecost UI and uses the shared bucket where the metrics are stored. +It is important to note that metrics retention is unlimited only when you use enterprise mode. + +### References +* [Hands-On Kubecost experience on One Observability Workshop](https://catalog.workshops.aws/observability/en-US/aws-managed-oss/amp/ingest-kubecost-metrics) +* [Blog - Integrating Kubecost with Amazon Managed Service for Prometheus](https://aws.amazon.com/blogs/mt/integrating-kubecost-with-amazon-managed-service-for-prometheus/) diff --git a/docusaurus/observability-best-practices/docs/guides/dashboards.md b/docusaurus/observability-best-practices/docs/guides/dashboards.md new file mode 100644 index 000000000..7773bbaa1 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/dashboards.md @@ -0,0 +1,18 @@ +# Dashboarding + +## Best practices + +### Create views for your important personas + +### Share your business metrics along side operational ones + +### Dashboards should build bridges, not walls + +### Your dashboards should tell a story + +### Dashboards should be available to technical and non-technical users + +## Recommendations + +### Leverage existing identity providers + diff --git a/docusaurus/observability-best-practices/docs/guides/databases/rds-and-aurora.md b/docusaurus/observability-best-practices/docs/guides/databases/rds-and-aurora.md new file mode 100644 index 000000000..ee0413617 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/databases/rds-and-aurora.md @@ -0,0 +1,212 @@ +# Monitor Amazon RDS and Aurora databases + +Monitoring is a critical part of maintaining the reliability, availability, and performance of Amazon RDS and Aurora database clusters. AWS provides several tools for monitoring health of your Amazon RDS and Aurora databases resources, detect issues before they become critical and optimize performance for consistent user experience. This guide provides the observability best practices to ensure your databases are running smoothly. + +## Performance guidelines + +As a best practice, you want to start with establishing a baseline performance for your workloads. When you set up a DB instance and run it with a typical workload, capture the average, maximum, and minimum values of all performance metrics. Do so at a number of different intervals (for example, one hour, 24 hours, one week, two weeks). This can give you an idea of what is normal. It helps to get comparisons for both peak and off-peak hours of operation. 
You can then use this information to identify when performance is dropping below standard levels.
+
+## Monitoring Options
+
+### Amazon CloudWatch metrics
+
+[Amazon CloudWatch](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/monitoring-cloudwatch.html) is a critical tool for monitoring and managing your [RDS](https://aws.amazon.com/rds/) and [Aurora](https://aws.amazon.com/rds/aurora/) databases. It provides valuable insights into database performance and helps you identify and resolve issues quickly. Both Amazon RDS and Aurora databases send metrics to CloudWatch for each active database instance at 1-minute granularity. Monitoring is enabled by default and metrics are available for 15 days. RDS and Aurora publish instance-level metrics to Amazon CloudWatch in the **AWS/RDS** namespace.
+
+Using CloudWatch Metrics, you can identify trends or patterns in your database performance, and use this information to optimize your configurations and improve your application's performance. Here are key metrics to monitor:
+
+* **CPU Utilization** - Percentage of computer processing capacity used.
+* **DB Connections** - The number of client sessions that are connected to the DB instance. Consider constraining database connections if you see high numbers of user connections in conjunction with decreases in instance performance and response time. The best number of user connections for your DB instance will vary based on your instance class and the complexity of the operations being performed. To determine the number of database connections, associate your DB instance with a parameter group.
+* **Freeable Memory** - How much RAM is available on the DB instance, in megabytes. The red line in the Monitoring tab metrics is marked at 75% for CPU, Memory and Storage Metrics. If instance memory consumption frequently crosses that line, then this indicates that you should check your workload or upgrade your instance.
+* **Network throughput** - The rate of network traffic to and from the DB instance in bytes per second.
+* **Read/Write Latency** - The average time for a read or write operation in milliseconds.
+* **Read/Write IOPS** - The average number of disk read or write operations per second.
+* **Free Storage Space** - How much disk space is not currently being used by the DB instance, in megabytes. Investigate disk space consumption if space used is consistently at or above 85 percent of the total disk space. See if it is possible to delete data from the instance or archive data to a different system to free up space.
+
+![db_cw_metrics.png](../../images/db_cw_metrics.png)
+
+For troubleshooting performance-related issues, the first step is to tune the most used and most expensive queries. Tune them to see if doing so lowers the pressure on system resources. For more information, see [Tuning queries](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_BestPractices.html#CHAP_BestPractices.TuningQueries).
+
+If your queries are tuned and the issue still persists, consider upgrading your database instance class. You can upgrade to an instance with more resources (CPU, RAM, disk space, network bandwidth, I/O capacity).
+
+Then, you can set up alarms to alert when these metrics reach critical thresholds, and take action to resolve any issues as quickly as possible.
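+As an illustration, the sketch below creates an alarm on the `CPUUtilization` metric in the `AWS/RDS` namespace from the AWS CLI. It is a minimal sketch only: the DB instance identifier, SNS topic ARN, and threshold are placeholder values to adapt to your own environment.
+
+```
+# Alarm when average CPU of the DB instance stays above 80% for three 5-minute periods
+aws cloudwatch put-metric-alarm \
+  --alarm-name my-db-high-cpu \
+  --namespace AWS/RDS \
+  --metric-name CPUUtilization \
+  --dimensions Name=DBInstanceIdentifier,Value=my-db-instance \
+  --statistic Average \
+  --period 300 \
+  --evaluation-periods 3 \
+  --threshold 80 \
+  --comparison-operator GreaterThanThreshold \
+  --alarm-actions arn:aws:sns:us-east-1:111122223333:my-db-alarms
+```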
+ +For more information on CloudWatch metrics, refer [Amazon CloudWatch metrics for Amazon RDS]( https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-metrics.html) and [Viewing DB instance metrics in the CloudWatch console and AWS CLI](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/metrics_dimensions.html). + +#### CloudWatch Logs Insights + +[CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) enables you to interactively search and analyze your log data in Amazon CloudWatch Logs. You can perform queries to help you more efficiently and effectively respond to operational issues. If an issue occurs, you can use CloudWatch Logs Insights to identify potential causes and validate deployed fixes. + +To publish logs from RDS or Aurora database cluster to CloudWatch, see [Publish logs for Amazon RDS or Aurora for MySQL instances to CloudWatch](https://repost.aws/knowledge-center/rds-aurora-mysql-logs-cloudwatch) + +For more information on monitoring RDS or Aurora logs with CloudWatch, see [Monitoring Amazon RDS log file](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_LogAccess.html). + +#### CloudWatch Alarms + +To identify when performance is degraded for your database clusters, you should monitor and alert on key performance metrics on a regular basis. Using [Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html), you can watch a single metric over a time period that you specify. If the metric exceeds a given threshold, a notification is sent to an Amazon SNS topic or AWS Auto Scaling policy. CloudWatch alarms do not invoke actions simply because they are in a particular state. Rather the state must have changed and been maintained for a specified number of periods. Alarms invoke actions only when alarm change state occurs. Being in alarm state is not enough. + +To set a CloudWatch alarm - + +* Navigate to AWS Management Console and open the Amazon RDS console at [https://console.aws.amazon.com/rds/](https://console.aws.amazon.com/rds/). +* In the navigation pane, choose Databases, and then choose a DB instance. +* Choose Logs & events. + +In the CloudWatch alarms section, choose Create alarm. + +![db_cw_alarm.png](../../images/db_cw_alarm.png) + +* For Send notifications, choose Yes, and for Send notifications to, choose New email or SMS topic. +* For Topic name, enter a name for the notification, and for With these recipients, enter a comma-separated list of email addresses and phone numbers. +* For Metric, choose the alarm statistic and metric to set. +* For Threshold, specify whether the metric must be greater than, less than, or equal to the threshold, and specify the threshold value. +* For Evaluation period, choose the evaluation period for the alarm. For consecutive period(s) of, choose the period during which the threshold must have been reached in order to trigger the alarm. +* For Name of alarm, enter a name for the alarm. +* Choose Create Alarm. + +The alarm appears in the CloudWatch alarms section. + +Take a look at this [example](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/multi-az-db-cluster-cloudwatch-alarm.html) to create an Amazon CloudWatch alarm for Multi-AZ DB cluster replica lag. + +#### Database Audit Logs + +Database Audit Logs provide a detailed record of all actions taken on your RDS and Aurora databases, enabling you to monitor for unauthorized access, data changes, and other potentially harmful activities. 
Here are some best practices for using Database Audit Logs:
+
+* Enable Database Audit Logs for all of your RDS and Aurora instances, and configure them to capture all relevant data.
+* Use a centralized log management solution, such as Amazon CloudWatch Logs or Amazon Kinesis Data Streams, to collect and analyze your Database Audit Logs.
+* Monitor your Database Audit Logs regularly for suspicious activity, and take action to investigate and resolve any issues as quickly as possible.
+
+For more information on how to configure database audit logs, see [Configuring an Audit Log to Capture database activities for Amazon RDS and Aurora](https://aws.amazon.com/blogs/database/configuring-an-audit-log-to-capture-database-activities-for-amazon-rds-for-mysql-and-amazon-aurora-with-mysql-compatibility/).
+
+#### Database Slow Query and Error Logs
+
+Slow query logs help you find slow-performing queries in the database so you can investigate the reasons behind the slowness and tune the queries if needed. Error logs help you find query errors, which in turn helps you find the changes needed in the application due to those errors.
+
+You can monitor the slow query log and error log by creating a CloudWatch dashboard using Amazon CloudWatch Logs Insights (which enables you to interactively search and analyze your log data in Amazon CloudWatch Logs).
+
+To activate and monitor the error log, the slow query log, and the general log for an Amazon RDS instance, see [Manage slow query logs and general logs for RDS MySQL](https://repost.aws/knowledge-center/rds-mysql-logs). To activate the slow query log for Aurora PostgreSQL, see [Enable slow query logs for PostgreSQL](https://catalog.us-east-1.prod.workshops.aws/workshops/31babd91-aa9a-4415-8ebf-ce0a6556a216/en-US/postgresql-logs/enable-slow-query-log).
+
+## Performance Insights and operating-system metrics
+
+#### Enhanced Monitoring
+
+[Enhanced Monitoring](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring.OS.html) enables you to get fine-grained metrics in real time for the operating system (OS) that your DB instance runs on.
+
+RDS delivers the metrics from Enhanced Monitoring into your Amazon CloudWatch Logs account. By default, these metrics are stored for 30 days in the **RDSOSMetrics** log group in Amazon CloudWatch. You have the option to choose a granularity between 1 second and 60 seconds. You can create custom metric filters in CloudWatch from CloudWatch Logs and display the graphs on the CloudWatch dashboard.
+
+![db_enhanced_monitoring_loggroup.png](../../images/db_enhanced_monitoring_loggroup.png)
+
+Enhanced Monitoring also includes the OS-level process list. Currently, Enhanced Monitoring is available for the following database engines:
+
+* MariaDB
+* Microsoft SQL Server
+* MySQL
+* Oracle
+* PostgreSQL
+
+**Difference between CloudWatch and Enhanced Monitoring**
+CloudWatch gathers metrics about CPU utilization from the hypervisor for a DB instance. In contrast, Enhanced Monitoring gathers its metrics from an agent on the DB instance. A hypervisor creates and runs virtual machines (VMs). Using a hypervisor, an instance can support multiple guest VMs by virtually sharing memory and CPU. You might find differences between the CloudWatch and Enhanced Monitoring measurements, because the hypervisor layer performs a small amount of work. The differences can be greater if your DB instances use smaller instance classes. In this scenario, more virtual machines (VMs) are probably managed by the hypervisor layer on a single physical instance.
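+Enhanced Monitoring can be turned on for an existing instance from the console or the AWS CLI. The sketch below assumes a hypothetical DB instance identifier and an existing IAM role that already grants the Enhanced Monitoring permissions.
+
+```
+# Enable Enhanced Monitoring at 1-second granularity on an existing DB instance
+aws rds modify-db-instance \
+  --db-instance-identifier my-db-instance \
+  --monitoring-interval 1 \
+  --monitoring-role-arn arn:aws:iam::111122223333:role/rds-monitoring-role \
+  --apply-immediately
+```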
+
+
+To learn about all the metrics available with Enhanced Monitoring, please refer to [OS metrics in Enhanced Monitoring](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring-Available-OS-Metrics.html).
+
+
+![db-enhanced-monitoring.png](../../images/db_enhanced_monitoring.png)
+
+#### Performance Insights
+
+[Amazon RDS Performance Insights](https://aws.amazon.com/rds/performance-insights/) is a database performance tuning and monitoring feature that helps you quickly assess the load on your database, and determine when and where to take action. With the Performance Insights dashboard, you can visualize the database load on your DB cluster and filter the load by waits, SQL statements, hosts, or users. It allows you to pinpoint the root cause rather than chasing symptoms. Performance Insights uses lightweight data collection methods that do not impact the performance of your applications and make it easy to see which SQL statements are causing the load and why.
+
+Performance Insights provides seven days of free performance history retention, and you can extend that up to 2 years for a fee. You can enable Performance Insights from the RDS management console or the AWS CLI. Performance Insights also exposes a publicly available API to enable customers and third parties to integrate Performance Insights with their own custom tooling.
+
+:::note
+    Currently, RDS Performance Insights is available only for Aurora (PostgreSQL- and MySQL-compatible editions), Amazon RDS for PostgreSQL, MySQL, MariaDB, SQL Server and Oracle.
+:::
+
+**DBLoad** is the key metric, which represents the average number of active database sessions. In Performance Insights, this data is queried as the **db.load.avg** metric.
+
+![db_perf_insights.png](../../images/db_perf_insights.png)
+
+For more information on using Performance Insights with Aurora, refer to [Monitoring DB load with Performance Insights on Amazon Aurora](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/USER_PerfInsights.html).
+
+
+## Open-source Observability Tools
+
+#### Amazon Managed Grafana
+[Amazon Managed Grafana](https://aws.amazon.com/grafana/) is a fully managed service that makes it easy to visualize and analyze data from RDS and Aurora databases.
+
+The **AWS/RDS namespace** in Amazon CloudWatch includes the key metrics that apply to database entities running on Amazon RDS and Amazon Aurora. To visualize and track the health and potential performance issues of our RDS/Aurora databases in Amazon Managed Grafana, we can leverage the CloudWatch data source.
+
+![amg-rds-aurora.png](../../images/amg-rds-aurora.png)
+
+As of now, only basic Performance Insights metrics are available in CloudWatch, which is not sufficient to analyze database performance and identify bottlenecks in your database. To visualize RDS Performance Insights metrics in Amazon Managed Grafana and have single-pane-of-glass visibility, customers can use a custom Lambda function to collect all the RDS Performance Insights metrics and publish them in a custom CloudWatch metrics namespace. Once you have these metrics available in Amazon CloudWatch, you can visualize them in Amazon Managed Grafana.
+
+To deploy the custom Lambda function to gather RDS Performance Insights metrics, clone the following GitHub repository and run the install.sh script.
+ +``` +$ git clone https://github.com/aws-observability/observability-best-practices.git +$ cd sandbox/monitor-aurora-with-grafana + +$ chmod +x install.sh +$ ./install.sh +``` + +Above script uses AWS CloudFormation to deploy a custom lambda function and an IAM role. Lambda function auto triggers every 10 mins to invoke RDS Performance Insights API and publish custom metrics to /AuroraMonitoringGrafana/PerformanceInsights custom namespace in Amazon CloudWatch. + +![db_performanceinsights_amg.png](../../images/db_performanceinsights_amg.png) + +For detailed step-by-step information on custom lambda function deployment and grafana dashboards, refer [Performance Insights in Amazon Managed Grafana](https://aws.amazon.com/blogs/mt/monitoring-amazon-rds-and-amazon-aurora-using-amazon-managed-grafana/). + +By quickly identifying unintended changes in your database and notifying using alerts, you can take actions to minimize disruptions. Amazon Managed Grafana supports multiple notification channels such as SNS, Slack, PagerDuty etc. to which you can send alerts notifications. [Grafana Alerting](https://docs.aws.amazon.com/grafana/latest/userguide/alerts-overview.html) will show you more information on how to set up alerts in Amazon Managed Grafana. + + +
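+Once the stack is deployed, you can verify that the custom metrics are arriving before building Grafana panels and alerts on top of them. This is a minimal sketch; the namespace below is the one created by the deployment described above.
+
+```
+# List the Performance Insights metrics published to the custom namespace
+aws cloudwatch list-metrics \
+  --namespace "/AuroraMonitoringGrafana/PerformanceInsights"
+```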
+ +
+ + +## AIOps - Machine learning based performance bottlenecks detection + +#### Amazon DevOps Guru for RDS + +With [Amazon DevOps Guru for RDS](https://aws.amazon.com/devops-guru/features/devops-guru-for-rds/), you can monitor your databases for performance bottlenecks and operational issues. It uses Performance Insights metrics, analyzes them using Machine Learning (ML) to provide database-specific analyses of performance issues, and recommends corrective actions. DevOps Guru for RDS can identify and analyze various performance-related database issues, such as over-utilization of host resources, database bottlenecks, or misbehavior of SQL queries, among others. When an issue or anomalous behavior is detected, DevOps Guru for RDS displays the finding on the DevOps Guru console and sends notifications using [Amazon EventBridge](https://aws.amazon.com/pm/eventbridge) or [Amazon Simple Notification Service (SNS)](https://aws.amazon.com/pm/sns), allowing DevOps or SRE teams to take real-time action on performance and operational issues before they become customer-impacting outages. + +DevOps Guru for RDS establishes a baseline for the database metrics. Baselining involves analyzing the database performance metrics over a period of time to establish a normal behavior. Amazon DevOps Guru for RDS then uses ML to detect anomalies against the established baseline. If your workload pattern changes, then DevOps Guru for RDS establishes a new baseline that it uses to detect anomalies against the new normal. + +:::note + For new database instances, Amazon DevOps Guru for RDS takes up to 2 days to establish an initial baseline, because it requires an analysis of the database usage patterns and establishing what is considered a normal behavior. +::: + +![db_dgr_anomaly.png.png](../../images/db_dgr_anomaly.png) + +![db_dgr_recommendation.png](../../images/db_dgr_recommendation.png) + +For more information on how to get started, please visit [Amazon DevOps Guru for RDS to Detect, Diagnose, and Resolve Amazon Aurora-Related Issues using ML](https://aws.amazon.com/blogs/aws/new-amazon-devops-guru-for-rds-to-detect-diagnose-and-resolve-amazon-aurora-related-issues-using-ml/) + + +
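+If your database resources are managed with AWS CloudFormation, coverage for DevOps Guru can also be enabled from the AWS CLI. This is a hedged sketch only, and the stack name `my-aurora-stack` is hypothetical; replace it with the stack that contains your RDS or Aurora resources.
+
+```
+# Add a CloudFormation stack to the DevOps Guru resource collection
+aws devops-guru update-resource-collection \
+  --action ADD \
+  --resource-collection '{"CloudFormation":{"StackNames":["my-aurora-stack"]}}'
+```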
+ +
+
+
+## Auditing and Governance
+
+#### AWS CloudTrail Logs
+
+[AWS CloudTrail](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html) provides a record of actions taken by a user, role, or an AWS service in RDS. CloudTrail captures all API calls for RDS as events, including calls from the console and from code calls to RDS API operations. Using the information collected by CloudTrail, you can determine the request that was made to RDS, the IP address from which the request was made, who made the request, when it was made, and additional details. For more information, see [Monitoring Amazon RDS API calls in AWS CloudTrail](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/logging-using-cloudtrail.html).
+
+## References for more information
+
+[Blog - Monitor RDS and Aurora databases with Amazon Managed Grafana](https://aws.amazon.com/blogs/mt/monitoring-amazon-rds-and-amazon-aurora-using-amazon-managed-grafana/)
+
+[Video - Monitor RDS and Aurora databases with Amazon Managed Grafana](https://www.youtube.com/watch?v=Uj9UJ1mXwEA)
+
+[Blog - Monitor RDS and Aurora databases with Amazon CloudWatch](https://aws.amazon.com/blogs/database/creating-an-amazon-cloudwatch-dashboard-to-monitor-amazon-rds-and-amazon-aurora-mysql/)
+
+[Blog - Build proactive database monitoring for Amazon RDS with Amazon CloudWatch Logs, AWS Lambda, and Amazon SNS](https://aws.amazon.com/blogs/database/build-proactive-database-monitoring-for-amazon-rds-with-amazon-cloudwatch-logs-aws-lambda-and-amazon-sns/)
+
+[Official Doc - Amazon Aurora Monitoring Guide](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/MonitoringOverview.html)
+
+[Hands-on Workshop - Observe and Identify SQL Performance Issues in Amazon Aurora](https://catalog.workshops.aws/awsauroramysql/en-US/provisioned/perfobserve)
+
+
diff --git a/docusaurus/observability-best-practices/docs/guides/ec2-monitoring.md b/docusaurus/observability-best-practices/docs/guides/ec2-monitoring.md
new file mode 100644
index 000000000..26fdbc232
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/guides/ec2-monitoring.md
@@ -0,0 +1,275 @@
+# EC2 Monitoring and Observability
+
+## Introduction
+
+Continuous monitoring and observability increase agility, improve customer experience, and reduce risk in the cloud environment. According to Wikipedia, [Observability](https://en.wikipedia.org/wiki/Observability) is a measure of how well internal states of a system can be inferred from the knowledge of its external outputs. The term observability itself originates from the field of control theory, where it means that you can infer the internal state of the components in a system by learning about the external signals/outputs it is producing.
+
+The difference between Monitoring and Observability is that Monitoring tells you whether a system is working or not, while Observability tells you why the system isn’t working. Monitoring is usually a reactive measure, whereas the goal of Observability is to be able to improve your Key Performance Indicators in a proactive manner. A system cannot be controlled or optimized unless it is observed.
Instrumenting workloads through collection of metrics, logs, or traces and gaining meaningful insights & detailed context using the right monitoring and observability tools help customers control and optimize the environment. + +![three-pillars](../images/three-pillars.png) + +AWS enables customers to transform from monitoring to observability so that they can have full end-to-end service visibility. In this article we focus on Amazon Elastic Compute Cloud (Amazon EC2) and the best practices for improving the monitoring and observability of the service in AWS Cloud environment through AWS native and open-source tools. + +## Amazon EC2 + +[Amazon Elastic Compute Cloud](https://aws.amazon.com/ec2/) (Amazon EC2) is a highly scalable compute platform in Amazon Web Services (AWS) Cloud. Amazon EC2 eliminates the need for up front hardware investment, so customers can develop and deploy applications faster while paying just for what they use. Some of the key features that EC2 provide are virtual computing environments called Instances, pre-configured templates of Instances called Amazon Machine Images, various configurations of resources like CPU, Memory, Storage and Networking capacity available as Instance Types. + +## Monitoring and Observability using AWS Native Tools + +### Amazon CloudWatch + +[Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) is a monitoring and management service that provides data and actionable insights for AWS, hybrid, and on-premises applications and infrastructure resources. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events. It also provides a unified view of AWS resources, applications, and services that run on AWS and on-premises servers. CloudWatch helps you gain system-wide visibility into resource utilization, application performance, and operational health. + +![CloudWatch Overview](../images/cloudwatch-intro.png) + +### Unified CloudWatch Agent + +The Unified CloudWatch Agent is an open-source software under the MIT license which supports most operating systems utilizing x86-64 and ARM64 architectures. The CloudWatch Agent helps collect system-level metrics from Amazon EC2 Instances & on-premise servers in a hybrid environment across operating systems, retrieve custom metrics from applications or services and collect logs from Amazon EC2 instances and on-premises servers. + +![CloudWatch Agent](../images/cw-agent.png) + +### Installing CloudWatch Agent on Amazon EC2 Instances + +#### Command Line Install + +The CloudWatch Agent can be installed through the [command line](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/installing-cloudwatch-agent-commandline.html). The required package for various architectures and various operating systems are available for [download](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/download-cloudwatch-agent-commandline.html). Create the necessary [IAM role](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create-iam-roles-for-cloudwatch-agent-commandline.html) which provides permissions for CloudWatch agent to read information from the Amazon EC2 instance and write it to CloudWatch. Once the required IAM role is created, you can [install and run](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-commandline-fleet.html) the CloudWatch agent on the required Amazon EC2 Instance. 
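+For reference, a typical command-line installation on an Amazon Linux instance (x86-64) looks roughly like the sketch below; the download URL and paths follow the documented defaults, and the configuration file is assumed to have been generated beforehand (for example, with the agent's configuration wizard).
+
+```
+# Download and install the CloudWatch agent package
+wget https://amazoncloudwatch-agent.s3.amazonaws.com/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
+sudo rpm -U ./amazon-cloudwatch-agent.rpm
+
+# Start the agent with a previously created configuration file
+sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
+  -a fetch-config -m ec2 -s \
+  -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
+```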
+
+:::note References
+
+    Documentation: [Installing the CloudWatch agent using the command line](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/installing-cloudwatch-agent-commandline.html)
+
+    AWS Observability Workshop: [Setup and install CloudWatch agent](https://catalog.workshops.aws/observability/en-US/aws-native/ec2-monitoring/install-ec2)
+:::
+
+#### Installation through AWS Systems Manager
+
+The CloudWatch Agent can also be installed through [AWS Systems Manager](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/installing-cloudwatch-agent-ssm.html). Create the necessary IAM role which provides permissions for the CloudWatch agent to read information from the Amazon EC2 instance, write it to CloudWatch, and communicate with AWS Systems Manager. Before installing the CloudWatch agent on the EC2 instances, [install or update](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/download-CloudWatch-Agent-on-EC2-Instance-SSM-first.html#update-SSM-Agent-EC2instance-first) the SSM agent on the required EC2 instances. The CloudWatch agent can be downloaded through AWS Systems Manager. A JSON configuration file can be created to specify which metrics (including custom metrics) and logs are to be collected. Once the required IAM role and the configuration file are created, you can install and run the CloudWatch agent on the required Amazon EC2 Instances.
+
+:::note References
+
+    Documentation: [Installing the CloudWatch agent using AWS Systems Manager](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/installing-cloudwatch-agent-ssm.html)
+
+    AWS Observability Workshop: [Install CloudWatch agent using AWS Systems Manager Quick Setup](https://catalog.workshops.aws/observability/en-US/aws-native/ec2-monitoring/install-ec2/ssm-quicksetup)
+
+    Related Blog Article: [Amazon CloudWatch Agent with AWS Systems Manager Integration – Unified Metrics & Log Collection for Linux & Windows](https://aws.amazon.com/blogs/aws/new-amazon-cloudwatch-agent-with-aws-systems-manager-integration-unified-metrics-log-collection-for-linux-windows/)
+
+    YouTube Video: [Collect Metrics and Logs from Amazon EC2 instances with the CloudWatch Agent](https://www.youtube.com/watch?v=vAnIhIwE5hY)
+:::
+
+#### Installing CloudWatch Agent on on-premises servers in a hybrid environment
+
+In hybrid customer environments, where servers are on-premises as well as in the cloud, a similar approach can be taken to accomplish unified observability in Amazon CloudWatch. The CloudWatch agent can be directly downloaded from Amazon S3 or through AWS Systems Manager. Create an IAM user that allows the on-premises server to send data to Amazon CloudWatch, then install and start the agent on the on-premises servers.
+
+:::note
+    Documentation: [Installing the CloudWatch agent on on-premises servers](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-premise.html)
+:::
+
+### Monitoring of Amazon EC2 Instances using Amazon CloudWatch
+
+A key aspect of maintaining the reliability, availability, and performance of your Amazon EC2 instances and your applications is [continuous monitoring](https://catalog.workshops.aws/observability/en-US/aws-native/ec2-monitoring). With the CloudWatch agent installed on the required Amazon EC2 instances, monitoring the health of the instances and their performance is necessary to maintain a stable environment.
As a baseline, items like CPU utilization, Network utilization, Disk performance, Disk Reads/Writes, Memory utilization, disk swap utilization, disk space utilization, page file utilization, and log collection of EC2 Instances are recommended.
+
+#### Basic & Detailed Monitoring
+
+Amazon CloudWatch collects and processes raw data from Amazon EC2 into readable near real-time metrics. By default, Amazon EC2 sends metric data to CloudWatch in 5-minute periods as Basic Monitoring for an instance. To send metric data for your instance to CloudWatch in 1-minute periods, [detailed monitoring](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch-new.html) can be enabled on the instance.
+
+#### Automated & Manual Tools for Monitoring
+
+AWS provides two types of tools, automated and manual, that help customers monitor their Amazon EC2 instances and report back when something is wrong. Some of these tools require a little configuration and a few require manual intervention.
+[Automated Monitoring tools](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring_automated_manual.html#monitoring_automated_tools) include AWS System status checks, Instance status checks, Amazon CloudWatch alarms, Amazon EventBridge, Amazon CloudWatch Logs, the CloudWatch agent, and the AWS Management Pack for Microsoft System Center Operations Manager. [Manual monitoring](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring_automated_manual.html#monitoring_manual_tools) tools include dashboards, which we'll look at in detail in a separate section below in this article.
+
+:::note
+    Documentation: [Automated and manual monitoring](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring_automated_manual.html)
+:::
+### Metrics from Amazon EC2 Instances using CloudWatch Agent
+
+Metrics are the fundamental concept in CloudWatch. A metric represents a time-ordered set of data points that are published to CloudWatch. Think of a metric as a variable to monitor, and the data points as representing the values of that variable over time. For example, the CPU usage of a particular EC2 instance is one metric provided by Amazon EC2.
+
+![cw-metrics](../images/cw-metrics.png)
+
+#### Default Metrics using CloudWatch Agent
+
+Amazon CloudWatch collects metrics from Amazon EC2 instances, which can be viewed through the AWS Management Console, the AWS CLI, or an API. The available metrics are data points collected at a 5-minute interval through Basic Monitoring, or at a 1-minute interval if detailed monitoring is turned on.
+
+![default-metrics](../images/default-metrics.png)
+
+#### Custom Metrics using CloudWatch Agent
+
+Customers can also publish their own custom metrics to CloudWatch using the API or CLI at standard resolution (1-minute granularity) or at high resolution (down to a 1-second interval). The unified CloudWatch agent supports retrieval of custom metrics through [StatsD](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-custom-metrics-statsd.html) and [collectd](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-custom-metrics-collectd.html).
+
+Custom metrics from applications or services can be retrieved using the CloudWatch agent with the StatsD protocol. StatsD is a popular open-source solution that can gather metrics from a wide variety of applications. StatsD is especially useful for instrumenting your own metrics, and it is supported on both Linux and Windows based servers.
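+With the StatsD section enabled in the agent configuration, any StatsD client can emit metrics to the agent. As a quick smoke test (a sketch, assuming the agent is listening on its default StatsD port 8125 on the same host and that `nc` is available), you can push a counter from the shell:
+
+```
+# Send a single StatsD counter increment to the local CloudWatch agent over UDP
+echo "myapp.logins:1|c" | nc -u -w1 127.0.0.1 8125
+```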
+
+Custom metrics from applications or services can also be retrieved using the CloudWatch agent with the collectd protocol, which is a popular open-source solution supported only on Linux servers, with plugins that can gather system statistics for a wide variety of applications. By combining the system metrics that the CloudWatch agent can already collect with the additional metrics from collectd, you can better monitor, analyze, and troubleshoot your systems and applications.
+
+#### Additional Custom Metrics using CloudWatch Agent
+
+The CloudWatch agent supports collecting custom metrics from your EC2 instances. A few popular examples are:
+
+- Network performance metrics for EC2 instances running on Linux that use the Elastic Network Adapter (ENA).
+- Nvidia GPU metrics from Linux servers.
+- Process metrics from individual processes on Linux & Windows servers, using the procstat plugin.
+
+### Logs from Amazon EC2 Instances using CloudWatch Agent
+
+Amazon CloudWatch Logs helps customers monitor and troubleshoot systems and applications in near real time using existing system, application and custom log files. To collect logs from Amazon EC2 instances and on-premises servers into CloudWatch, the unified CloudWatch agent needs to be installed. The latest unified CloudWatch agent is recommended, since it can collect both logs and advanced metrics. It also supports a variety of operating systems. If the instance uses Instance Metadata Service Version 2 (IMDSv2), then the unified agent is required.
+
+![cw-logs](../images/cw-logs.png)
+
+The logs collected by the unified CloudWatch agent are processed and stored in Amazon CloudWatch Logs. Logs can be collected from Windows or Linux servers, and from both Amazon EC2 and on-premises servers. The CloudWatch agent configuration wizard can be used to set up the JSON configuration file that defines the setup of the CloudWatch agent.
+
+![logs-view](../images/logs-view.png)
+
+:::note
+    AWS Observability Workshop: [Logs](https://catalog.workshops.aws/observability/en-US/aws-native/logs)
+:::
+
+### Amazon EC2 Instance Events
+
+An event indicates a change in your AWS environment. AWS resources and applications can generate events when their state changes. CloudWatch Events provides a near real-time stream of system events that describe changes to your AWS resources and applications. For example, Amazon EC2 generates an event when the state of an EC2 instance changes from pending to running. Customers can also generate custom application-level events and publish them to CloudWatch Events.
+
+Customers can [monitor the status of Amazon EC2 Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check.html) by viewing status checks and scheduled events. A status check provides the results from automated checks performed by Amazon EC2. These automated checks detect whether specific issues are affecting the instances. The status check information, together with the data provided by Amazon CloudWatch, gives detailed operational visibility into each of the instances.
+
+#### Amazon EventBridge Rule for Amazon EC2 Instance Events
+
+Amazon CloudWatch Events uses Amazon EventBridge to respond automatically to system events, such as resource changes or issues. Events from AWS services including Amazon EC2 are delivered to CloudWatch Events in near real time, and customers can create EventBridge rules to take appropriate actions when an event matches a rule.
+Actions can include invoking an AWS Lambda function, invoking Amazon EC2 Run Command, relaying the event to Amazon Kinesis Data Streams, activating an AWS Step Functions state machine, notifying an Amazon SNS topic or an Amazon SQS queue, or piping the event to an internal or external incident response application or SIEM tool.
+
+:::note
+    AWS Observability Workshop: [Incident Response - EventBridge Rule](https://catalog.workshops.aws/observability/en-US/aws-native/ec2-monitoring/incident-response/create-eventbridge-rule)
+:::
+
+#### Amazon CloudWatch Alarms for Amazon EC2 Instances
+
+Amazon [CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) can watch a metric over a time period you specify, and perform one or more actions based on the value of the metric relative to a given threshold over a number of time periods. An alarm invokes actions only when the alarm changes state. The action can be a notification sent to an Amazon Simple Notification Service (Amazon SNS) topic or to Amazon EC2 Auto Scaling, or another appropriate action such as [stopping, terminating, rebooting, or recovering an EC2 instance](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/UsingAlarmActions.html).
+
+![CloudWatch Alarm](../images/cw-alarm.png)
+
+Once the alarm is triggered, an email notification is sent to an SNS topic as an action.
+
+![sns-alert](../images/sns-alert.png)
+
+#### Monitoring for Auto Scaling Instances
+
+Amazon EC2 Auto Scaling helps customers ensure that the correct number of Amazon EC2 instances is available to handle the load for your application. [Amazon EC2 Auto Scaling metrics](https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-cloudwatch-monitoring.html) collect information about Auto Scaling groups and are in the AWS/AutoScaling namespace. Amazon EC2 instance metrics representing CPU and other usage data from Auto Scaling instances are in the AWS/EC2 namespace.
+
+### Dashboarding in CloudWatch
+
+Knowing the inventory of resources in your AWS accounts, along with their performance and health checks, is important for stable resource management. [Amazon CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) are customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view, even those resources that are spread across different Regions. There are several ways to get a good view of, and details about, the Amazon EC2 instances that are available.
+
+#### Automatic Dashboards in CloudWatch
+
+Automatic Dashboards are available in all AWS public Regions and provide an aggregated view of the health and performance of all AWS resources, including Amazon EC2 instances, under CloudWatch. They help customers quickly get started with monitoring, get a resource-based view of metrics and alarms, and easily drill down to understand the root cause of performance issues. Automatic Dashboards are pre-built with AWS service recommended [best practices](https://docs.aws.amazon.com/prescriptive-guidance/latest/implementing-logging-monitoring-cloudwatch/cloudwatch-dashboards-visualizations.html), remain resource aware, and dynamically update to reflect the latest state of important performance metrics.
+ +![ec2 dashboard]((../imagesec2-auto-dashboard.png) + +#### Custom Dashboards in CloudWatch + +With [Custom Dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create_dashboard.html) Customers can create as many additional dashboards as they want with different widgets and customize it accordingly . Dashboards can be configured for cross-region and cross account view and can be added to a favorites list. + +![ec2 custom dashboard]((../imagesec2-custom-dashboard.png) + +#### Resource Health Dashboards in CloudWatch + +Resource Health in CloudWatch ServiceLens is a fully managed solution that customers can use to automatically discover, manage, and visualize the [health and performance of Amazon EC2 hosts](https://aws.amazon.com/blogs/mt/introducing-cloudwatch-resource-health-monitor-ec2-hosts/) across their applications. Customers can visualize the health of their hosts by performance dimension such as CPU or memory, and slice and dice hundreds of hosts in a single view using filters such as instance type, instance state, or security groups. It enables a side-by-side comparison of a group of EC2 hosts and provides granular insights into an individual host. + +![ec2 resource health]((../imagesec2-resource-health.png) + +## Monitoring And Observability using Open Source Tools + +### Monitoring of Amazon EC2 Instances using AWS Distro for OpenTelemetry + +[AWS Distro for OpenTelemetry (ADOT)](https://aws.amazon.com/otel) is a secure, production-ready, AWS-supported distribution of the OpenTelemetry project. Part of the Cloud Native Computing Foundation, OpenTelemetry provides open source APIs, libraries, and agents to collect distributed traces and metrics for application monitoring. With AWS Distro for OpenTelemetry, customers can instrument applications just once to send correlated metrics and traces to multiple AWS and Partner monitoring solutions. + +![AWS Distro for Open Telemetry Overview]((../imagesadot.png) + +AWS Distro for OpenTelemetry (ADOT) provides a distributed monitoring framework that enables correlating data for monitoring application performance and health in an easy way which is critical for greater service visibility and maintenance. + +The key components of ADOT are SDKs, auto-instrumentation agents, collectors and exporters to send data to back-end services. + +[OpenTelemetry SDK](https://github.com/aws-observability): To enable the collection of AWS resource-specific metadata, support to the OpenTelemetry SDKs for the X-Ray trace format and context. OpenTelemetry SDKs now correlate ingested trace and metrics data from AWS X-Ray and CloudWatch. + +[Auto-instrumentation agent](https://aws-otel.github.io/docs/getting-started/java-sdk/auto-instr): Support in the OpenTelemetry Java auto-instrumentation agent are added for AWS SDK and AWS X-Ray trace data. + +[OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector): The collector in the distribution is built using the upstream OpenTelemetry collector. Added AWS-specific exporters to the upstream collector to send data to AWS services including AWS X-Ray, Amazon CloudWatch and Amazon Managed Service for Prometheus. 
+
+![adot architecture](../images/adot-arch.png)
+
+#### Metrics & Traces through ADOT Collector & Amazon CloudWatch
+
+The AWS Distro for OpenTelemetry (ADOT) Collector and the CloudWatch agent can be installed side by side on an Amazon EC2 instance, and the OpenTelemetry SDKs can be used to collect application traces and metrics from workloads running on Amazon EC2 instances.
+
+To support OpenTelemetry metrics in Amazon CloudWatch, the [AWS EMF Exporter for OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/awsemfexporter) converts OpenTelemetry-format metrics to the CloudWatch Embedded Metric Format (EMF), which enables applications instrumented with OpenTelemetry metrics to send high-cardinality application metrics to CloudWatch. [The X-Ray exporter](https://aws-otel.github.io/docs/getting-started/x-ray#configuring-the-aws-x-ray-exporter) allows traces collected in OTLP format to be exported to [AWS X-Ray](https://aws.amazon.com/xray/).
+
+![adot emf architecture](../images/adot-emf.png)
+
+The ADOT Collector on Amazon EC2 can be installed through AWS CloudFormation or using [AWS Systems Manager Distributor](https://catalog.workshops.aws/observability/en-US/aws-managed-oss/ec2-monitoring/configure-adot-collector) to collect application metrics.
+
+### Monitoring of Amazon EC2 Instances using Prometheus
+
+[Prometheus](https://prometheus.io/) is a standalone, independently maintained open-source project for systems monitoring and alerting. Prometheus collects and stores metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.
+
+![Prometheus Architecture](../images/Prometheus.png)
+
+Prometheus is configured via command-line flags and a configuration file (prometheus.yaml), which holds all the configuration details. The 'scrape_config' section within the configuration file specifies the scrape targets and the parameters describing how to scrape them. [Prometheus Service Discovery](https://github.com/prometheus/prometheus/tree/main/discovery) (SD) is a methodology for finding endpoints to scrape for metrics. Amazon EC2 service discovery configurations, defined in the `ec2_sd_config` section, allow scrape targets to be retrieved from AWS EC2 instances.
+
+#### Metrics through Prometheus & Amazon CloudWatch
+
+The CloudWatch agent on EC2 instances can be installed and configured with Prometheus to scrape metrics for monitoring in CloudWatch. This can be helpful to customers who prefer container workloads on EC2 and require custom metrics that are compatible with open-source Prometheus monitoring. The CloudWatch agent can be installed by following the steps explained in the earlier section above. The CloudWatch agent with Prometheus monitoring needs two configurations to scrape the Prometheus metrics: one for the standard Prometheus configuration, as documented for 'scrape_config' in the Prometheus documentation, and the other for the [CloudWatch agent configuration](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-PrometheusEC2.html#CloudWatch-Agent-PrometheusEC2-configure).
+
+#### Metrics through Prometheus & ADOT Collector
+
+Customers can also choose an all open-source setup for their observability needs. In that case, the AWS Distro for OpenTelemetry (ADOT) Collector can be configured to scrape a Prometheus-instrumented application and send the metrics to a Prometheus server.
There are three OpenTelemetry components involved in this flow: the Prometheus Receiver, the Prometheus Remote Write Exporter, and the SigV4 Authentication Extension. The Prometheus Receiver receives metric data in Prometheus format, the Prometheus Remote Write Exporter exports data in Prometheus remote write format, and the SigV4 authentication extension provides SigV4 authentication for requests made to AWS services.
+
+![adot prometheus architecture](../images/adot-prom-arch.png)
+
+#### Prometheus Node Exporter
+
+[Prometheus Node Exporter](https://github.com/prometheus/node_exporter) is an open-source exporter that exposes a wide variety of hardware- and OS-level host metrics. Amazon EC2 instances can be instrumented with Node Exporter to expose node-level metrics, which Prometheus collects and stores as time series data, recording each measurement with a timestamp. By default, Node Exporter exposes these host metrics at the URL http://localhost:9100/metrics.
+
+![prometheus metrics screenshot](../images/prom-metrics.png)
+
+Once the metrics are collected, they can be sent to [Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/).
+
+![amp overview](../images/amp-overview.png)
+
+### Streaming Logs from Amazon EC2 Instances using Fluent Bit Plugin
+
+[Fluent Bit](https://fluentbit.io/) is an open-source, multi-platform log processor for handling data collection at scale. It collects and aggregates diverse data from a variety of sources and in a variety of formats, while addressing data reliability, security, flexible routing, and multiple destinations.
+
+![fluent architecture](../images/fluent-arch.png)
+
+Fluent Bit provides an easy extension point for streaming logs from Amazon EC2 to AWS services, including Amazon CloudWatch, for log retention and analytics. The [Fluent Bit plugin](https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit#new-higher-performance-core-fluent-bit-plugin) can route logs to Amazon CloudWatch.
+
+### Dashboarding with Amazon Managed Grafana
+
+[Amazon Managed Grafana](https://aws.amazon.com/grafana/) is a fully managed service based on the open source Grafana project, with rich, interactive, and secure data visualizations to help customers instantly query, correlate, analyze, monitor, and alarm on metrics, logs, and traces across multiple data sources. Customers can create interactive dashboards and share them with anyone in their organization using an automatically scaled, highly available, and enterprise-secure service. With Amazon Managed Grafana, customers can manage user and team access to dashboards across AWS accounts, AWS Regions, and data sources.
+
+![grafana overview](../images/grafana-overview.png)
+
+Amazon CloudWatch can be added as a data source in Amazon Managed Grafana by using the AWS data source configuration option in the Grafana workspace console. This feature simplifies adding CloudWatch as a data source by discovering existing CloudWatch accounts and managing the configuration of the authentication credentials that are required to access CloudWatch. Amazon Managed Grafana also supports [Prometheus data sources](https://docs.aws.amazon.com/grafana/latest/userguide/prometheus-data-source.html), i.e. both self-managed Prometheus servers and Amazon Managed Service for Prometheus workspaces.
+
+Amazon Managed Grafana comes with a variety of panels and makes it easy to construct the right queries and customize the display properties, allowing customers to create the dashboards they need.
+
+![grafana dashboard](../images/grafana-dashboard.png)
+
+## Conclusion
+
+Monitoring keeps you informed of whether a system is working properly. Observability lets you understand why the system is not working properly. Good observability allows you to answer the questions you didn't know you needed to ask. Together, monitoring and observability pave the way for measuring the internal state of a system, which can be inferred from its outputs.
+
+Modern applications, such as those running in the cloud on microservices, serverless, and asynchronous architectures, generate large volumes of data in the form of metrics, logs, traces, and events. Amazon CloudWatch, along with open-source-based tools such as AWS Distro for OpenTelemetry, Amazon Managed Service for Prometheus, and Amazon Managed Grafana, enables customers to collect, access, and correlate this data on a unified platform, helping customers break down data silos so they can easily gain system-wide visibility and quickly resolve issues.
+
diff --git a/docusaurus/observability-best-practices/docs/guides/full-stack.md b/docusaurus/observability-best-practices/docs/guides/full-stack.md
new file mode 100644
index 000000000..100ce8657
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/guides/full-stack.md
@@ -0,0 +1 @@
+# Full-stack
diff --git a/docusaurus/observability-best-practices/docs/guides/hybrid-and-multicloud.md b/docusaurus/observability-best-practices/docs/guides/hybrid-and-multicloud.md
new file mode 100644
index 000000000..acb1247b1
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/guides/hybrid-and-multicloud.md
@@ -0,0 +1,97 @@
+# Best practices for hybrid and multicloud
+
+## Intro
+
+We consider multicloud to be the concurrent use of more than one cloud services provider to operate your own workloads, and hybrid to be the extension of your workloads across both on-premises and cloud environments. Observability across hybrid and multicloud environments may add significant complexity due to tool diversity, latency, and heterogeneous workloads. Regardless, this remains a common goal for both development and business users, and a rich ecosystem of products and services addresses it.
+
+However, the usefulness of observability tools for cloud-native workloads can vary dramatically. Consider the different requirements of monitoring a containerized batch processing workload, compared to a real-time banking application using a serverless framework: both have logs, metrics, and traces; however, the toolchain for observing them will vary, with a number of cloud-native, open source, and ISV products available. An open-source tool such as Prometheus may be an excellent fit for one, whereas a cloud-native tool provided as a managed service could better meet your requirements.
+
+Add to this the complexity of multicloud and hybrid, and gaining insights from your applications becomes considerably harder.
+
+In order to deal with these added dimensions and facilitate approaches to observability, customers tend to invest in a single toolchain with a unified interface. After all, improving the signal-to-noise ratio is usually a good thing! However, a single approach does not work for all use cases, and the operating models of various platforms may add confusion. Our goal is to help you make informed decisions that complement your needs and reduce your mean time to remediation when issues do occur. Below are the best practices that we have learned through working with customers of all sizes, and across every industry.
+
+:::tip
+ These best practices are intended for a broad set of roles: enterprise architects, developers, DevOps, and more. We suggest evaluating them through the lens of your organization’s business needs, and how observability in distributed environments can provide as much value as possible.
+:::
+
+## Don’t let your tooling dictate your decisions
+
+Your applications, tools, and processes exist to help achieve business outcomes, such as increasing sales and customer satisfaction. A well-advised technology strategy is one that does everything possible to help you achieve those business goals. But the things that help you get there are simply tools, and they are meant to support your strategy – not be the strategy. To make an analogy, if you needed to build a house, you would not ask your tools how to design and build it. Rather, your tools are a means to an end.
+
+In a single, homogeneous environment, the decisions around tooling are easier. After all, if you run a single application in one environment, then your tooling can easily be the same across the board. But for hybrid and multicloud environments things are less clear, and keeping an eye on your business outcomes - and [the value added](https://arxiv.org/abs/2303.13402) by observing your workloads across these environments - is critical. Each Cloud Services Provider (CSP) has their own native observability solutions, and a rich set of partners and Independent Software Vendors (ISVs) that you can use as well.
+
+Just because you operate in multiple environments does not mean that a single tool for every workload is advisable. This can potentially mean using multiple services, frameworks, or providers, to observe your workloads. See “[a single pane of glass is less important than your workload’s context](#a-single-pane-of-glass-is-less-important-than-your-workloads-context)” below for details of how your operating model needs to reflect your needs. Regardless, when implementing your tools, remember to create “[two-way doors](https://aws.amazon.com/executive-insights/content/how-amazon-defines-and-operationalizes-a-day-1-culture/)” so you can evolve your observability solution in the future.
+
+Here are some examples of “tool-first” outcomes to avoid:
+
+1. Focusing on implementation of a single tool without a two-way door to upgrade it, or move to a new solution in the future, may create technical debt that is otherwise avoidable. This can happen when the tool is the solution, and one day may become the problem you need to solve.
+2. A company standard to use a single tool due to a volume discount may end up without features that teams would benefit from. This may be “cost over quality”, and inadvertently creates a monolithic anti-pattern. This may discourage the collection of telemetry in order to remain under a volume threshold, thereby disincentivizing the use of observability tooling.
+3. Not collecting an entire type of telemetry (usually traces) due to a lack of existing trace collection infrastructure, but a rich set of log and metric collectors, can lead to an incomplete observability solution.
+4. Training support staff on only a single toolchain, in the desire to reduce labour and training costs, reduces the potential value of other observability patterns.
+
+:::info
+ If your tooling is dictating your observability strategy, then you need to invert the approach. Tools are meant to enable and empower observability, not to limit your choices.
+:::
+
+:::info
+ Tool sprawl is a very real issue that companies struggle with; however, a radical shift to a singular toolchain can likewise reduce your observability solution’s usefulness. Hybrid and multicloud workloads have technologies that are unique to each platform, and higher-level services from each CSP are useful – though the trade-offs in using a single-source product require a value-based analysis. See “[Invest in OpenTelemetry](#invest-in-opentelemetry)” for an approach that mitigates some of these risks.
+:::
+
+## (Observability) data has gravity
+
+All data has gravity – which is to say that it attracts workloads, solutions, tools, people, processes, and projects around it. For example, a database with your customer transactions in it will be the attractive force that brings compute and analytics workloads to it. This has direct implications for where you place your workloads, in which environment, and how you operate them going forward. And the same is true for observability signals, though the gravity this data creates is tied to your workload and organizational context (see “[a single pane of glass is less important than your workload’s context](#a-single-pane-of-glass-is-less-important-than-your-workloads-context)”).
+
+One cannot completely separate the context of your observability telemetry from the underlying workload and data that it relates to. The same rule applies here: your telemetry is data, and it has gravity to it. This should influence where you place your telemetry agents, collectors, or systems that aggregate and analyze signals.
+
+:::tip
+ The value of observability data decays over time much faster than that of most other data types. You could call it the “half-life” of observability data. Consider the additional latency in relaying telemetry to another environment as a potential forced devaluation of this data prior to its potential use, and then weigh that against the requirements you have for alerting when issues occur.
+:::
+
+:::info
+ The best practice is to emit data between environments only when there is business value to be gained from this aggregation. Having a single source for querying data does not solve many business needs on its own, and may create a more expensive solution than desired, with more points of failure.
+:::
+
+## A single pane of glass is less important than your workload’s context
+
+A common ask is for a “single pane of glass” to observe all of your workloads. This arises from a natural desire to view as much data as possible, in as simple a way as can be achieved, and to reduce churn, frustration, and diagnosis time. Creating this one interface to see your entire observability solution at once is useful, but can come with the trade-off of separating your telemetry from the context it came from.
+
+For example, a dashboard with the CPU utilization of a hundred servers may show some anomalous spikes in consumption, but this does nothing to explain why this has happened, or what the contributing factors are for this behavior. And the importance of this metric may not be immediately clear.
+
+We have seen customers sometimes pursue the single pane of glass so aggressively that all business context is lost, and trying to see everything in one tool can actually dilute the value of that data. Your dashboards, and your tools, need to [tell a story](https://aws-observability.github.io/observability-best-practices/tools/dashboards/).
And this story needs to include the business metrics and outcomes that are impacted by events in your workloads.
+
+Moreover, your tooling needs to align to your operating model. A single pane of glass can add value when your support teams are global with access to all of your environments, but if they are limited to only accessing a single workload, in a single CSP or hybrid environment, then there is no value added through this approach. In these instances, allowing teams to create dashboards within each environment natively may hasten time to value, and be more flexible to changes in the future.
+
+:::info
+ The value of observability data is deeply integrated into the application from which it came. Your telemetry requires contextual awareness that comes from its environment. In hybrid and multicloud environments, the differences between technologies make the need for context even greater (though systems such as Kubernetes can be similar between different cloud providers and on-premises).
+:::
+
+:::info
+ When building a single pane of glass for a distributed system, display your business metrics and Service Level Objectives (SLOs) in the same view as other data (such as infrastructure metrics) that contributes to these SLOs. This gives context that may otherwise be lacking.
+:::
+
+:::tip
+ A single pane of glass can help to rapidly diagnose issues and reduce Mean Time to Detection (MTTD) and thereby Mean Time to Resolution (MTTR), but only if the meaning of telemetry data can be preserved. Without this, a single pane of glass approach can increase the time to value, or become a net-negative for operations teams.
+:::
+
+:::info
+ If the value of a single pane of glass cannot be determined, or if workloads are bound entirely to a single CSP or on-premises environment, consider only rolling up top-level business metrics to a single pane of glass, leaving the raw metrics and other contributing factors in their original environments.
+:::
+
+## Invest in OpenTelemetry
+
+Across the observability vendor landscape, OpenTelemetry (OTel) has become the de-facto standard. OTel can marshal each of your telemetry types into one or many collectors, which can include cloud-native services, or a wide variety of SaaS and ISV products. OTel agents and collectors communicate using the OpenTelemetry Protocol (OTLP), which encapsulates signals into a format allowing a wide variety of deployment patterns.
+
+To collect transaction traces with the most value, and with your business and infrastructure context, you will need to integrate trace collection into your application. Some auto-instrumentation agents can perform this with almost no effort. However, the most sophisticated use cases do require code changes on your behalf to support transaction traces. This creates some technical debt and ties down your workload to a particular technology.
+
+OTel captures logs, metrics, and traces using the concept of a span. Spans contain these signals grouped together from a single transaction, packaging them into a contextualized, searchable object. This means you can view your signals from a single application event in one simple entity. For example, a user logging into a web site, and the requests this creates to all the downstream services this integrates with, can be presented as a single span.
+
+:::tip
+ OTel is not limited to application traces, and is widely used for logs and metrics. And many [ISV products accept OTLP directly today](https://opentelemetry.io/ecosystem/vendors/).
+:::
+
+:::info
+ By instrumenting your applications using OTel, you remove the need to replace this instrumentation at the application layer in the future, should you choose to move from one observability platform to another. This turns part of your observability solution into a [two-way door](https://aws.amazon.com/executive-insights/content/how-amazon-defines-and-operationalizes-a-day-1-culture/).
+:::
+
+:::info
+ OTel is future-proof, scalable, and makes it easier to change your collection and analysis systems in the future without having to change application code, making it an efficient [shift to the left](https://www.youtube.com/watch?v=99r7cxKW8Rg).
+:::
diff --git a/docusaurus/observability-best-practices/docs/guides/index.md b/docusaurus/observability-best-practices/docs/guides/index.md
new file mode 100644
index 000000000..86bb37afa
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/guides/index.md
@@ -0,0 +1,95 @@
+
+# Best practices overview
+
+Observability is a broad topic with a mature landscape of tools. Not every tool is right for every solution though! To help you navigate through your observability requirements, configuration, and final deployment, we have summarized five key best practices that will inform the decision-making process for your observability strategy.
+
+## Monitor what matters
+
+The most important consideration with observability is not your servers, network, applications, or customers. It is what matters to *you*, your business, your project, or your users.
+
+Start first with what your success criteria are. For example, if you run an e-commerce application, your measure of success may be the number of purchases made over the past hour. If you run a non-profit, then it may be donations vs. your target for the month. A payment processor may watch for transaction processing time, whereas universities would want to measure student attendance.
+
+:::tip
+ Success metrics are different for everyone! We may use an e-commerce application as an example here, but your projects can have a very different measurement. Regardless, the advice remains the same: know what *good* looks like and measure for it.
+:::
+
+Regardless of your application, you must start with identifying your key metrics. Then *work backwards[^1]* from that to see what impacts it from an application or infrastructure perspective. For example, if high CPU on your web servers endangers customer satisfaction, and in turn your sales, then monitoring CPU utilization is important!
+
+#### Know your objectives, and measure them!
+
+Having identified your important top-level KPIs, your next job is to have an automated way to track and measure them. A critical success factor is doing so in the same system that watches your workload's operations. For our e-commerce workload example this may mean:
+
+* Publishing sales data into a [*time series*](https://en.wikipedia.org/wiki/Time_series)
+* Tracking user registrations in this same system
+* Measuring how long customers stay on web pages, and (again) pushing this data to a time series
+
+Most customers have this data already, though not necessarily in the right places from an observability perspective. Sales data can typically be found in relational databases or business intelligence reporting systems, along with user registrations. And data from visit duration can be extracted from logs or from [Real User Monitoring](../tools/rum).
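+
+As an illustration only, here is a minimal, hypothetical sketch of publishing such a business KPI as a time series metric, using the AWS SDK for Java 2.x. The namespace `ECommerce/Business`, the metric name, the dimension, and the hard-coded value are placeholders you would replace with data from your own sales system.
+
+``` java
+// Hypothetical sketch: publish a business KPI (e.g. purchases completed in the
+// last hour) to Amazon CloudWatch as a custom time series metric.
+import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
+import software.amazon.awssdk.services.cloudwatch.model.Dimension;
+import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
+import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
+import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;
+
+import java.time.Instant;
+
+public class PurchasesMetricPublisher {
+
+    public static void main(String[] args) {
+        try (CloudWatchClient cloudWatch = CloudWatchClient.create()) {
+            // One datapoint of the time series: the KPI value at this timestamp.
+            MetricDatum datum = MetricDatum.builder()
+                    .metricName("PurchasesCompleted")   // placeholder KPI name
+                    .dimensions(Dimension.builder().name("Channel").value("web").build())
+                    .unit(StandardUnit.COUNT)
+                    .value(42.0)                        // placeholder value from your sales data
+                    .timestamp(Instant.now())
+                    .build();
+
+            cloudWatch.putMetricData(PutMetricDataRequest.builder()
+                    .namespace("ECommerce/Business")    // custom namespace for business KPIs
+                    .metricData(datum)
+                    .build());
+        }
+    }
+}
+```
+
+Publishing business KPIs into the same system that already holds your infrastructure and application telemetry is what makes the correlation described in the rest of this guide possible.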
+
+Regardless of your metric data's original location or format, it must be maintained as a [*time series*](https://en.wikipedia.org/wiki/Time_series). Every key metric that matters most to you, whether it is business, personal, academic, or for any other purpose, must be in a time series format for you to correlate it with other observability data (sometimes known as *signals* or *telemetry*).
+
+![Example of a time series](../images/time_series.png)
+*Figure 1: example of a time series*
+
+## Context propagation and tool selection
+
+Tool selection is important and makes a profound difference in how you operate and remediate problems. But worse than choosing a sub-optimal tool is not having tooling for all basic signal types. For example, collecting basic [logs](../signals/logs) from a workload, but missing transaction traces, leaves you with a gap. The result is an incomplete view of your entire application experience. All modern approaches to observability depend on "connecting the dots" with application traces.
+
+A complete picture of your health and operations requires tools that collect [logs](../signals/logs), [metrics](../signals/metrics), and [traces](../signals/traces), and then perform correlation, analysis, [anomaly detection](../signals/anomalies), [dashboarding](../tools/dashboards), [alarms](../tools/alarms) and more.
+
+:::info
+ Some observability solutions may not contain all of the above but are intended to augment, extend, or give added value to existing systems. In all cases, tool interoperability and extensibility is an important consideration when beginning an observability project.
+:::
+
+#### Every workload is different, but common tools make for faster results
+
+Using a common set of tools across every workload has added benefits, such as reducing operational friction and training, and generally you should strive for a reduced number of tools or vendors. Doing so lets you rapidly deploy existing observability solutions to new environments or workloads, and achieve faster time-to-resolution when things go wrong.
+
+Your tools should be broad enough to observe every tier of your workload: basic infrastructure, applications, web sites, and everything in between. In places where a single tool is not possible, the best practice is to use those that have an open standard, are open source, and therefore have the broadest cross-platform integration possibilities.
+
+#### Integrate with existing tools and processes
+
+Don't reinvent the wheel! "Round" is a great shape already, and we should always be building collaborative and open systems, not data silos.
+
+* Integrate with existing identity providers (e.g. Active Directory, SAML-based IdPs).
+* If you have an existing IT trouble tracking system (e.g. JIRA, ServiceNow) then integrate with it to quickly manage problems as they arise.
+* Use existing workload management and escalation tools (e.g. PagerDuty, OpsGenie) if you already have them!
+* Infrastructure as code tools such as Ansible, SaltStack, CloudFormation, Terraform, and CDK are all great tools. Use them to manage your observability as well as everything else, and build your observability solution with the same infrastructure as code tools you already use today (see [include observability from day one](#include-observability-from-day-one)).
+
+#### Use automation and machine learning
+
+Computers are good at finding patterns, and at finding when data does *not* follow a pattern!
If you have hundreds, thousands, or even millions of datapoints to monitor, then it would be impossible to understand healthy thresholds for every single one of them. But many observability solutions have anomaly detection and machine learning capabilities that manage the undifferentiated heavy lifting of baselining your data.
+
+We refer to this as "knowing what good looks like". If you have load-tested your workload thoroughly then you may know these healthy performance metrics already, but for a complex distributed application it can be unwieldy to create baselines for every metric. This is where anomaly detection, automation, and machine learning are invaluable.
+
+Leverage tools that manage the baselining and alerting of application health on your behalf, thereby letting you focus on your goals, and [monitor what matters](#monitor-what-matters).
+
+## Collect telemetry from all tiers of your workload
+
+Your applications do not exist in isolation, and interactions with your network infrastructure, cloud providers, internet service providers, SaaS partners, and other components both within and outside your control can all impact your outcomes. It is important that you have a holistic view of your entire workload.
+
+#### Focus on integrations
+
+If you have to pick one area to instrument, it should undoubtedly be the integrations between components. This is where the power of observability is most evident. As a rule, every time one component or service calls another, that call must have at least these data points measured:
+
+1. The duration of the request and response
+1. The status of the response
+
+And to create the cohesive, holistic view that observability requires, a [single unique identifier](../signals/traces) for the entire request chain must be included in the signals collected.
+
+#### Don't forget about the end-user experience
+
+Having a complete view of your workload means understanding it at all tiers, including how your end users experience it. Measuring, quantifying, and understanding when your objectives are at risk from a poor user experience is just as important as watching for free disk space or CPU utilization - if not more important!
+
+If your workloads are ones that interact directly with the end user (such as any application served as a web site or mobile app) then [Real User Monitoring](../tools/rum) monitors not just the "last mile" of delivery to the user, but how they actually have experienced your application. Ultimately, none of the observability journey matters if your users are unable to actually use your services.
+
+## Data is power, but don't sweat the small stuff
+
+Depending on the size of your application, you may have a very large number of components to collect signals from. While doing so is important and empowering, there can be diminishing returns from your efforts. This is why the best practice is to start by [monitoring what matters](#monitor-what-matters), use this as a way to map your important integrations and critical components, and focus on the right details.
+
+## Include observability from day one
+
+Like security, observability should not be an afterthought to your development or operations. The best practice is to put observability early in your planning, which creates a model for people to work with and reduces opaque corners of your application. Adding transaction tracing after major development work is done takes time, even with auto-instrumentation. The effort yields far greater returns!
But doing so late in your development cycle may create some rework.
+
+Rather than bolting observability onto your workload later on, use it to help *accelerate* your work. Proper [logging](../signals/logs), [metric](../signals/metrics), and [trace](../signals/traces) collection enables faster application development, fosters good practices, and lays the foundation for rapid problem solving going forward.
+
+[^1]: Amazon uses the *working backwards* process extensively as a way to obsess over our customers and their outcomes, and we highly recommend that anyone working on observability solutions work backwards from their own objectives in the same way. You can read more about *working backwards* on [Werner Vogels's blog](https://www.allthingsdistributed.com/2006/11/working_backwards.html).
\ No newline at end of file
diff --git a/docusaurus/observability-best-practices/docs/guides/observability-maturity-model.md b/docusaurus/observability-best-practices/docs/guides/observability-maturity-model.md
new file mode 100644
index 000000000..e2f094e59
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/guides/observability-maturity-model.md
@@ -0,0 +1,156 @@
+# AWS Observability Maturity Model
+
+## Introduction
+
+At its core, observability is the ability to understand and gain insights into the internal state of a system by analyzing its external outputs. This concept has evolved from traditional monitoring approaches that focus on predefined metrics or events, to a more holistic approach that encompasses the collection, analysis, and visualization of data generated by various components in an environment. A system cannot be controlled or optimized unless it is observed. An effective observability strategy allows teams to quickly identify and resolve issues, optimize resource usage, and gain insights into the overall health of their systems. Observability gives teams the ability to efficiently detect, investigate, and remediate issues, which can and should improve overall operational availability and the health of their workloads.
+
+![Why Observability](../images/Why_is_Observability_Important.png)
+
+The difference between Monitoring and Observability is that Monitoring tells whether a system is working or not, while Observability tells why the system isn’t working. Monitoring is usually a reactive measure whereas the goal of Observability is to be able to improve your Key Performance Indicators (KPIs) in a proactive manner. Continuous Monitoring & Observability increases agility, improves customer experience and reduces risk in the cloud environment.
+
+## Observability maturity model
+
+The observability maturity model serves as an essential framework for organizations looking to optimize their workload observability and management processes. This model provides a comprehensive roadmap for businesses to assess their current capabilities, identify areas for improvement, and strategically invest in the right tools and processes to achieve optimal observability. In the era of cloud computing, microservices, ephemeral and distributed systems, observability has become a critical factor in ensuring the reliability and performance of digital services. By providing a structured approach to improving observability, this model allows organizations to gain a more profound understanding of, and control over, their systems, paving the way for a more resilient, efficient, and high-performing business.
+
+## Stages of Observability Maturity Model
+
+As organizations expand their workloads, the observability maturity model is expected to mature as well. However, observability maturity doesn’t always grow along with the workload. The intention is to help customers achieve the required maturity level as they expand and grow their organizational capabilities.
+
+1. The first stage in the observability maturity model typically involves establishing a baseline understanding of the organization's current state. This entails assessing existing monitoring tools and processes, as well as identifying gaps in visibility or functionality. At this stage, organizations can take stock of their current capabilities and set realistic goals for improvement, starting even at the early stages of the engineering cycle.
+
+2. In the next stage, organizations move towards a more sophisticated approach by adopting advanced observability strategies and services. This may include implementing proactive alerting and distributed tracing to gain insight into the interactions between disparate systems, allowing organizations to begin to reap the benefits of increased visibility, reduced cognitive load, and more efficient troubleshooting.
+
+3. As businesses progress through the third stage of the observability maturity model, they can leverage additional capabilities such as automated remediation, artificial intelligence and machine learning technologies to automate anomaly detection and root cause analysis. These advanced features enable organizations to not only detect issues but also take corrective actions before they impact end-users or disrupt business operations. By integrating observability tools with other critical systems such as incident management platforms, organizations can streamline their incident response processes and minimize the time it takes to resolve issues.
+
+4. The final stage of the observability maturity model involves leveraging the wealth of data generated by monitoring and observability tools to drive continuous improvement. This can involve using advanced analytics to identify patterns and trends in workload performance, as well as feeding this information back into engineering and operations processes to optimize resource allocation, architecture, and deployment strategies.
+
+![Observability maturity model stages](../images/AWS-Observability-maturity-model.png)
+
+### Stage 1: Foundational monitoring - Collecting Telemetry Data
+
+Adopted as the bare minimum and operated in silos, basic monitoring lacks a defined strategy for what is required to monitor the totality of the systems or workloads in an organization. Most of the time, different teams like application owners, Network Operations Center (NOC) or CloudOps or DevOps teams use different tools for their monitoring needs, hence this approach is of little value for debugging across the environment or for optimizing it.
+
+Typically, customers at this stage have disparate solutions for monitoring their workloads. Different teams often gather the same data in different ways, since there is little or no partnership between them. The teams tend to optimize what they need by working with the data they obtain. Also, teams cannot use each other’s data since the data obtained from another team could be in a dissimilar format. Creating a plan to identify critical workloads, aiming for a unified observability solution, and defining metrics and logs are key aspects at this level.
Designing your workload to capture the essential [telemetry](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/implement-observability.html) it provides is necessary to understand its internal state and the [workload health](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/utilizing-workload-observability.html).
+
+To build a foundation towards improving the maturity level, instrumenting workloads through the collection of metrics, logs, and traces, and gaining meaningful insights using the right monitoring and observability tools, helps customers control and optimize the environment. Instrumentation refers to measuring, tracking and capturing key data from environments that can be used to observe the behavior and performance of workloads. Examples include application metrics such as errors, successful or non-successful transactions, and infrastructure metrics such as the utilization of CPU and disk resources.
+
+### Stage 2: Intermediate Monitoring - Telemetry Analysis and Insights
+
+In this stage, organizations have become clearer in terms of collecting signals from various environments, such as on-premises and the cloud. They have devised mechanisms to collect metrics, logs, and traces from workloads (as these form the foundational structure of observability), created visualizations and alerting strategies, and have the ability to prioritize issues based on well-defined criteria. Instead of being reactive and guessing, customers have a workflow that invokes required actions, and relevant teams are able to analyze and troubleshoot based on captured information and historical knowledge. Customers at this level work towards establishing observability practices for their environments, which could be traditional or modern, highly scalable, distributed, agile, microservices architectures.
+
+![Observability pillars](../images/three-pillars.png)
+
+Although monitoring seems to be working well in most cases, organizations tend to spend more time debugging issues, and as a result the overall Mean Time-To-Resolution (MTTR) is not consistent or meaningfully improved over a period of time. Also, debugging issues requires more cognitive time and effort than expected, and hence incident response takes longer. There tends to be a data overload situation that overwhelms operations as well. We find most enterprises caught in this stage without realizing where they could go next. Specific actions that can be taken to move the organization to the next level are: 1) Review your systems’ architecture designs at regular intervals and deploy policies and practices to reduce the impact and downtime, leading to fewer alerts. 2) Prevent alert fatigue by defining actionable [KPIs](https://aws-observability.github.io/observability-best-practices/guides/operational/business/key-performance-indicators/), adding valuable context to the alert findings, categorizing them by severity and urgency, and sending them to different tools and teams to help engineers resolve issues faster.
+
+Analyze these alerts on a regular basis and automate remediation for common repeated alerts. Share and communicate the alert findings with relevant teams to provide feedback on operational and process improvement.
+
+Develop a plan to gradually build a knowledge graph that helps you correlate different entities and understand the dependencies between different parts of a system. It enables customers to visualize the impact of changes to a system, helping to predict and mitigate potential issues.
+
+### Stage 3: Advanced Observability - Correlation and Anomaly Detection
+
+In this stage organizations are able to clearly understand the root cause of issues without having to spend a lot of time troubleshooting. When an issue arises, alerts provide enough contextual information to relevant teams like Network Operations Center (NOC) or CloudOps or DevOps teams. The monitoring team is able to look at an alert and immediately determine the root cause of the issue through correlation of signals such as metrics, logs, and traces. Traces are data collected from your application about requests that can be used with tools to view, filter, and gain insights to identify issues and opportunities for optimization. Traced requests of your application provide detailed information not only about the request and response, but also about calls that your application makes to downstream AWS resources, microservices, databases, and web APIs. Teams can look at a trace, find the corresponding log events captured alongside it, and also look at metrics from the infrastructure and applications, obtaining a 360° view of the situation they are in.
+
+Appropriate teams can take remedial actions at once by providing a fix that solves the issue. In this scenario, the MTTR is very small, the Service Level Objectives (SLO) are green, and the burn rate through the error budget is tolerable. Typically, customers at this level have established observability practices for their modern, agile, highly scalable, microservices environments.
+
+There are many organizations that have achieved this level of sophistication and maturity in their observability environments. This stage already gives organizations the ability to support complex infrastructure, operate their systems with high availability, provide higher Service Level Availability (SLA) for their applications and achieve business innovation by providing reliable infrastructure. Customers also use anomaly detectors to monitor for anomalies and outliers that do not match usual patterns, with near-real-time alerting mechanisms.
+
+However, teams in such organizations always want to go beyond the art of the possible. Teams would like to understand repeated issues and create a knowledge base that they can use to model scenarios and predict issues that might arise in the future. That is when customers move to the next stage of the maturity model, in which they get insights into the unknown. In order to get there, new tools are needed, and new skills and techniques for storing and making use of the data need to be identified. One can make use of Artificial intelligence for IT operations (AIOps) to create systems that automatically correlate signals, identify root causes, and create resolution plans based on models trained using data collected in the past.
+
+![Observability with AIOps](../images/o11y4AIOps.png)
+
+### Stage 4: Proactive Observability - Automatic and Proactive Root Cause Identification
+
+Here, observability data is not only used “after” an issue occurs; it is also used in real time “before” an issue occurs. Using well-trained models, issues are identified proactively and resolutions become easier and simpler. By analyzing collected signals, the monitoring system is able to provide insights into the issue automatically and also lay out resolution option(s) to resolve the issue.
+
+Observability software vendors are continuously expanding their capabilities into this space, and this has only accelerated with Generative AI becoming popular, so that organizations aspiring to this maturity level can achieve it with ease. Once this stage matures and takes shape, customers see a situation where the observability services are able to automatically create dynamic dashboards. The dashboards contain only information that is relevant to the issue at hand. This saves time and cost in querying and visualizing data that doesn’t really matter. With Generative AI (GenAI) and compute to perform Machine Learning being democratized by the day, we may see proactive monitoring capabilities becoming more common in the future than they are now.
+
+The following overview of the observability portfolio provides a holistic picture of the various AWS-native and open-source solutions for data collection, data processing, and data insight & analysis, from which customers can choose appropriate solutions for their end-to-end observability needs.
+
+![AWS Observability stack](../images/AWS_O11y_Stack.png)
+
+## AWS Well-Architected and Cloud Adoption Framework for Observability
+
+Organizations can leverage [AWS Well-Architected](https://aws.amazon.com/architecture/well-architected/) and the [Cloud Adoption Framework](https://docs.aws.amazon.com/whitepapers/latest/aws-caf-operations-perspective/observability.html) to enhance their observability capabilities and effectively monitor and troubleshoot their cloud environment.
+
+The AWS Well-Architected and Cloud Adoption Frameworks for observability provide a structured approach for designing, deploying, and operating workloads, ensuring best practices are followed. This leads to improved availability, system performance, scalability and reliability. These frameworks also provide organizations with a standardized set of practices and prescriptive guidance, making it easier to collaborate, share knowledge and implement consistent solutions across the organization.
+
+To leverage these frameworks effectively, organizations need to understand the key components called the pillars ([operational excellence](https://docs.aws.amazon.com/wellarchitected/latest/framework/operational-excellence.html), security, [reliability](https://docs.aws.amazon.com/wellarchitected/latest/framework/reliability.html), [performance efficiency](https://docs.aws.amazon.com/wellarchitected/latest/framework/performance-efficiency.html), cost optimization and sustainability) of the AWS Well-Architected Framework, which provide a holistic approach for designing and operating a cloud environment. On the other hand, the Cloud Adoption Framework provides a structured approach to cloud adoption, focusing on areas such as business, people, governance, and platform. By aligning these components with observability requirements, organizations can build robust and scalable workloads.
+
+Implementing the AWS Well-Architected and Cloud Adoption Frameworks for observability involves a few steps. Firstly, organizations need to assess their current state and identify areas for improvement. This can be done by conducting an Observability Maturity Model assessment, which evaluates the workloads against these frameworks. Based on the review findings, organizations can prioritize and plan their observability initiatives. This includes defining monitoring and logging requirements, selecting appropriate AWS services, and implementing the necessary infrastructure and tools.
Lastly, organizations need to continuously monitor and optimize their observability solutions to ensure ongoing effectiveness.
+
+Also, customers can utilize the [AWS Well-Architected Tool](https://aws.amazon.com/well-architected-tool/), an AWS service for documenting and measuring their workloads using the best practices of the AWS Well-Architected Framework. This tool provides a consistent process for measuring workloads through the pillars of the AWS Well-Architected Framework, assisting with documenting the decisions that they make, providing recommendations for improving their workloads, and guiding them in making their workloads more reliable, secure, efficient, and cost-effective.
+
+## Assessment
+
+An Observability Maturity Model assessment can be used to gauge your current state of observability and identify areas for improvement. An assessment of each stage involves evaluating existing monitoring and management practices across different teams, identifying gaps and areas for improvement, and determining the overall readiness for the next stage. A maturity assessment begins with a business process outline, workload inventory and tools discovery, identification of current challenges, and an understanding of organizational priorities and objectives.
+
+The assessment helps identify the targeted metrics and KPIs that lay the foundation for further development and optimization of the existing layout. The assessment of your Observability Maturity Model plays a crucial role in ensuring that your business is prepared to handle the complex, dynamic nature of modern systems. It aids in identifying blind spots and areas of weakness that could potentially lead to system failures or performance issues.
+
+Moreover, regular assessments ensure that your business remains agile and adaptable. They allow you to keep pace with evolving technologies and methodologies, thereby ensuring that your systems are always at the peak of efficiency and reliability.
+
+The assessment is designed to help you review the state of your observability strategy against AWS best practices, identify opportunities for improvement, and track progress over time. The questions below should help you assess your current observability maturity level. To have an assessment performed using our "AWS Observability Maturity Model Assessment" tool at no cost to you, please contact your AWS account team.
+
+**Logs**
+
+1. How do you collect logs?
+2. How do you use logs?
+3. How do you access logs?
+4. What is your log retention policy for security and regulatory compliance?
+5. Do you use any ML/AI capability today?
+
+**Metrics**
+
+6. What type of metrics do you collect?
+7. How do you use metrics?
+8. How do you access metrics?
+
+**Traces**
+
+9. How do you collect traces?
+10. How do you use traces?
+
+**Dashboards and Alerting**
+
+11. How do you use alarms?
+12. How do you use dashboards?
+
+**Organization**
+
+13. Do you have an enterprise observability strategy?
+14. How do you use SLOs?
+
+## Building the observability strategy
+
+Once an organization has identified its observability stage, it should start to build the strategy to optimize the current processes and tools and work towards greater maturity. Organizations want to ensure that their customers have a great customer experience, so they start with those customer requirements and work backwards from there. They then work with their stakeholders, who understand those requirements really well.
When shaping an observability strategy, organizations must first define their observability goals. These goals should be aligned with the overall business objectives and should clearly articulate what the organization aims to achieve through the strategy, providing a roadmap for building and implementing the observability plan.
+
+Next, organizations need to identify key metrics (KPIs) that will provide insights into system performance. These could range from latency and error rates to resource utilization and transaction volumes. It is important to note that the choice of metrics will largely depend on the nature of the business and its specific needs.
+
+Once the key metrics have been identified, organizations can then decide on the tools and technologies required for data collection. The choice of tool should be based on its alignment with the organization's goals, its ease of integration with existing systems, and its ability to optimize cost, achieve scalability, meet customer needs, and improve the overall customer experience.
+
+Finally, organizations should also encourage a culture that values observability. This involves training team members on the importance of observability, encouraging them to proactively monitor system performance, and fostering a culture of continuous learning and improvement. This strategy creates a virtuous cycle of continuous collection, action, and improvement for the best possible customer experience.
+
+![Observability virtuous cycle](../images/o11y-virtuous-cycle.png)
+
+In summary, to build an observability strategy, three main aspects need to be considered: 1) what needs to be collected, 2) which systems and workloads need to be observed, and 3) how to react when there are issues and what mechanisms should be in place to remediate them.
+
+## Conclusion
+
+The observability maturity model serves as a roadmap for organizations assessing their current state and seeking ways to improve their ability to understand, analyze, and respond to the behavior of workloads and infrastructure. By following a structured approach to assess current capabilities, adopt advanced monitoring techniques, and leverage data-driven insights, businesses can achieve a higher level of observability and make more informed decisions about their workloads and infrastructure. This model outlines the key capabilities and practices that organizations need to develop in order to progress through different levels of maturity, ultimately reaching a state where they can fully leverage the benefits of proactive observability.
+ +## Helpful Resources + +- [Building an effective observability strategy](https://youtu.be/7PQv9eYCJW8?si=gsn0qPyIMhrxU6sy) - AWS re:Invent 2023 +- [AWS Observability Best Practices](https://aws-observability.github.io/observability-best-practices/) +- [What is observability and Why does it matter?](https://aws.amazon.com/blogs/mt/what-is-observability-and-why-does-it-matter-part-1/) +- [How to develop an Observability strategy?](https://aws.amazon.com/blogs/mt/how-to-develop-an-observability-strategy/) +- [Guidance for Deep Application Observability on AWS](https://aws.amazon.com/solutions/guidance/deep-application-observability-on-aws/) +- [How Discovery increased operational efficiency with AWS observability](https://www.youtube.com/watch?v=zm30JNYmxlY) - AWS re:Invent 2022 +- [Developing an observability strategy](https://www.youtube.com/watch?v=Ub3ATriFapQ) - AWS re:Invent 2022 +- [Explore Cloud Native Observability with AWS](https://www.youtube.com/watch?v=UW7aT25Mbng) - AWS Virtual Workshop +- [Increase availability with AWS observability solutions](https://www.youtube.com/watch?v=_d_9xCfVBTM) - AWS re:Invent 2020 +- [Observability best practices at Amazon](https://www.youtube.com/watch?v=zZPzXEBW4P8) - AWS re:Invent 2022 +- [Observability: Best practices for modern applications](https://www.youtube.com/watch?v=YiegAlC_yyc) - AWS re:Invent 2022 +- [Observability the open-source way](https://www.youtube.com/watch?v=2IJPpdp9xU0) - AWS re:Invent 2022 +- [Elevate your Observability Strategy with AIOps](https://www.youtube.com/watch?v=L4b_eDSAwfE) +- [Let’s Architect! Monitoring production systems at scale](https://aws.amazon.com/blogs/architecture/lets-architect-monitoring-production-systems-at-scale/) +- [Full-stack observability and application monitoring with AWS](https://www.youtube.com/watch?v=or7uFFyHIX0) - AWS Summit SF 2022 diff --git a/docusaurus/observability-best-practices/docs/guides/operational/adot-at-scale/adot-java-spring/adot-java-spring.md b/docusaurus/observability-best-practices/docs/guides/operational/adot-at-scale/adot-java-spring/adot-java-spring.md new file mode 100644 index 000000000..e73c39fe2 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/operational/adot-at-scale/adot-java-spring/adot-java-spring.md @@ -0,0 +1,151 @@ +# Instrumenting Java Spring Integration Applications + +This article describes an approach for manually instrumenting [Spring-Integration](https://docs.spring.io/spring-integration/reference/overview.html) applications utilizing [Open Telemetry](https://opentelemetry.io/) and [X-ray](https://aws.amazon.com/xray/). + +The Spring-Integration framework is designed to enable the development of integration solutions typical of event-driven architectures and messaging-centric architectures. On the other hand, OpenTelemetry tends to be more focused on micro services architectures, in which services communicate and coordinate with each other using HTTP requests. Therefore this guide will provide an example of how to instrument Spring-Integration applications using manual instrumentation with the OpenTelemetry API. + +## Background Information + +### What is tracing? + +The following quote from the [OpenTelemetry documentation](https://opentelemetry.io/docs/concepts/signals/traces/) gives a good overview of what a trace's purpose is: + +:::note + Traces give us the big picture of what happens when a request is made to an application. 
Whether your application is a monolith with a single database or a sophisticated mesh of services, traces are essential to understanding the full “path” a request takes in your application.
:::
Given that one of the main benefits of tracing is end-to-end visibility of a request, it is important for traces to link properly all the way from the request origin to the backend. A common way of doing this in OpenTelemetry is to utilize [nested spans](https://opentelemetry.io/docs/instrumentation/java/manual/#create-nested-spans). This works in a microservices architecture where the spans are passed from service to service until they reach the final destination. In a Spring Integration application, we need to create parent/child relationships between spans created both remotely AND locally.

## Tracing Utilizing Context Propagation

We will demonstrate an approach using context propagation. Although this approach is traditionally used when you need to create parent/child relationships between spans created locally and in remote locations, it is used for the Spring Integration application because it simplifies the code and allows the application to scale: messages can be processed in parallel across multiple threads, and the application can also scale horizontally if we need to process messages on different hosts.

Here is an overview of what is necessary to achieve this:

- Create a ```ChannelInterceptor``` and register it as a ```GlobalChannelInterceptor``` so that it can capture messages being sent across all channels.

- In the ```ChannelInterceptor```:
    - In the ```preSend``` method:
        - Try to read the context from the previous message generated upstream. This is where we are able to connect spans from upstream messages. If no context exists, a new trace is started (this is done by the OpenTelemetry SDK).
        - Create a Span with a unique name that identifies that operation. This can be the name of the channel where this message is being processed.
        - Save the current context in the message.
        - Store the context and scope in thread.local so that they can be closed afterwards.
        - Inject the context into the message being sent downstream.
    - In the ```afterSendCompletion``` method:
        - Restore the context and scope from thread.local.
        - Recreate the span from the context.
        - Register any exceptions raised while processing the message.
        - Close the Scope.
        - End the Span.

This is a simplified description of what needs to be done. We are providing a functional sample application that uses the Spring-Integration framework. The code for this application can be found [here](https://github.com/rapphil/spring-integration-samples/tree/rapphil-5.5.x-otel/applications/file-split-ftp).

To view only the changes that were put in place to instrument the application, view this [diff](https://github.com/rapphil/spring-integration-samples/compare/30e01ce9eefd8dae288eca44013810afa8c1a585..6f056a76350340a9658db0cad7fc12dbda505437).
### To run this sample application, use:

``` bash
# build and run
mvn spring-boot:run
# create sample input file to trigger flow
echo 'testcontent\nline2content\nlastline' > /tmp/in/testfile.txt
```

To experiment with this sample application, you will need to have the [ADOT collector](https://aws-otel.github.io/docs/getting-started/collector) running on the same machine as the application with a configuration similar to the following one:

``` yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch/traces:
    timeout: 1s
    send_batch_size: 50
  batch/metrics:
    timeout: 60s
exporters:
  awsxray:
    region: us-west-2
  awsemf:
    region: us-west-2
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch/traces]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      processors: [batch/metrics]
      exporters: [awsemf]
```

## Results

If we run the sample application and then run the following command, this is what we get:

``` bash
echo 'foo123\nbar123\nfoo1234' > /tmp/in/testfile.txt
```

![X-ray Results](x-ray-results.png)

We can see that the segments above match the workflow described in the sample application. Exceptions are expected when some of the messages are processed; we can see that they are properly registered, which allows us to troubleshoot them in X-Ray.


## FAQ

### How do we create nested spans?

There are three mechanisms in OpenTelemetry that can be used to connect spans:

##### Explicitly

You need to pass the parent span to the place where the child span is created and link both of them using:

``` java
    Span childSpan = tracer.spanBuilder("child")
        .setParent(Context.current().with(parentSpan))
        .startSpan();
```

##### Implicitly

The span context will be stored in thread.local under the hood.
This method is suitable when you are sure that you are creating the spans in the same thread.

``` java
    void parentTwo() {
      Span parentSpan = tracer.spanBuilder("parent").startSpan();
      try (Scope scope = parentSpan.makeCurrent()) {
        childTwo();
      } finally {
        parentSpan.end();
      }
    }

    void childTwo() {
      Span childSpan = tracer.spanBuilder("child")
          // NOTE: setParent(...) is not required;
          // `Span.current()` is automatically added as the parent
          .startSpan();
      try (Scope scope = childSpan.makeCurrent()) {
        // do stuff
      } finally {
        childSpan.end();
      }
    }
```

##### Context Propagation

This method stores the context somewhere (HTTP headers or in a message) so that it can be transported to a remote location where the child span is created. The destination is not strictly required to be remote; this can be used within the same process as well.

### How are OpenTelemetry properties translated into X-Ray properties?

Please see the following [guide](https://opentelemetry.io/docs/instrumentation/java/manual/#context-propagation) to view the relationship.
+ + + + diff --git a/docusaurus/observability-best-practices/docs/guides/operational/adot-at-scale/adot-java-spring/x-ray-results.png b/docusaurus/observability-best-practices/docs/guides/operational/adot-at-scale/adot-java-spring/x-ray-results.png new file mode 100644 index 000000000..63ed894a9 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/guides/operational/adot-at-scale/adot-java-spring/x-ray-results.png differ diff --git a/docusaurus/observability-best-practices/docs/guides/operational/adot-at-scale/operating-adot-collector.md b/docusaurus/observability-best-practices/docs/guides/operational/adot-at-scale/operating-adot-collector.md new file mode 100644 index 000000000..28d622763 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/operational/adot-at-scale/operating-adot-collector.md @@ -0,0 +1,375 @@ +# Operating the AWS Distro for OpenTelemetry (ADOT) Collector + +The [ADOT collector](https://aws-otel.github.io/) is a downstream distribution of the open-source [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) by [CNCF](https://www.cncf.io/). + +Customers can use the ADOT Collector to collect signals such as metrics and traces from different environments including on-prem, AWS and from other cloud providers. + +In order to operate the ADOT Collector in a real world environment and at scale, operators should monitor the collector health, and scale as needed. In this guide, you will learn about the actions one can take to operate the ADOT Collector in a production environment. + +## Deployment architecture + +Depending on the your requirements, there are a few deployment options that you might want to consider. + +* No Collector +* Agent +* Gateway + + +:::tip + Check out the [OpenTelemetry documentation](https://opentelemetry.io/docs/collector/deployment/) + for additional information on these concepts. +::: + +### No Collector +This option essentially skips the collector from the equation completely. If you are not aware, it is possible to make the API calls to destination services directly from the OTEL SDK and send the signals. Think about you making calls to the AWS X-Ray's [PutTraceSegments](https://docs.aws.amazon.com/xray/latest/api/API_PutTraceSegments.html) API directly from your application process instead of sending the spans to an out-of-process agent such as the ADOT Collector. + +We strongly encourage you to take a look at the [section](https://opentelemetry.io/docs/collector/deployment/no-collector/) in the upstream documentation for more specifics as there isn't any AWS specific aspect that changes the guidance for this approach. + +![No Collector option](../../../images/adot-collector-deployment-no-collector.png) + +### Agent +In this approach, you will run the collector in a distributed manner and collect signals into the destinations. Unlike the `No Collector` option, here we separate the concerns and decouple the application from having to use its resources to make remote API calls and instead communicate to a locally accessible agent. + +Essentially it will look like this below in an Amazon EKS environment **running the collector as a Kubernetes sidecar:** + +![ADOT Collector Sidecar](../../../images/adot-collector-eks-sidecar.png) + +In this above architecture, your scrape configuration shouldn't really have to make use of any service discovery mechanisms at all since you will be scraping the targets from `localhost` given that the collector is running in the same pod as the application container. 
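For illustration, here is a minimal sketch of what such a sidecar scrape configuration could look like. It assumes, hypothetically, that the application container exposes Prometheus metrics on port 8080 at `/metrics`; adjust the port and path to match your application.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        # The application container shares the pod's network namespace,
        # so it is reachable on localhost and no service discovery is needed.
        - job_name: 'application-sidecar'
          scrape_interval: 15s
          metrics_path: /metrics
          static_configs:
            - targets: ['localhost:8080']
```

Because the targets are static and local, the Collector does not need Kubernetes service discovery (or the associated RBAC permissions) in this model.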
The same architecture applies to collecting traces as well. You will simply have to create an OTEL pipeline as [shown here](https://aws-otel.github.io/docs/getting-started/x-ray#sample-collector-configuration-putting-it-together).

##### Pros and Cons
* One argument advocating for this design is that you don't have to allocate an extraordinary amount of resources (CPU, Memory) for the Collector to do its job, since the targets are limited to localhost sources.

* The disadvantage of this approach is that the number of collector configurations you have to maintain is directly proportional to the number of applications you are running on the cluster.
This means you will have to manage CPU, Memory and other resource allocations individually for each Pod, depending on the workload expected for that Pod. If you are not careful with this, you might over- or under-allocate resources for the Collector Pod, which will result in either under-performing or locking up CPU cycles and Memory that could otherwise be used by other Pods on the Node.

You could also deploy the collector in other models such as Deployments, Daemonsets, Statefulsets etc. based on your needs.

#### Running the collector as a Daemonset on Amazon EKS

You can choose to run the collector as a [Daemonset](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/) in case you want to evenly distribute the load (scraping and sending the metrics to an Amazon Managed Service for Prometheus workspace) of the collectors across the EKS Nodes.

![ADOT Collector Daemonset](../../../images/adot-collector-eks-daemonset.png)

Ensure you have the `keep` action that makes the collector only scrape targets from its own host/Node.

See the sample below for reference. Find more such configuration details [here.](https://aws-otel.github.io/docs/getting-started/adot-eks-add-on/config-advanced#daemonset-collector-configuration)

```yaml
scrape_configs:
  - job_name: kubernetes-apiservers
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - action: keep
        regex: $K8S_NODE_NAME
        source_labels: [__meta_kubernetes_endpoint_node_name]
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
```

The same architecture can also be used for collecting traces. In this case, instead of the Collector reaching out to the endpoints to scrape Prometheus metrics, the trace spans will be sent to the Collector by the application pods.

##### Pros and Cons
**Advantages**

* Minimal scaling concerns
* Can be easy for Logs support

**Disadvantages**

* Configuring High-Availability is a challenge
* Too many copies of the Collector in use
* Not the most optimal in terms of resource utilization
* Disproportionate resource allocation


#### Running the collector on Amazon EC2
As there is no sidecar approach when running the collector on EC2, you would be running the collector as an agent on the EC2 instance. You can set a static scrape configuration such as the one below to discover targets on the instance to scrape metrics from.

The config below scrapes endpoints at ports `9090` and `8081` on localhost.
+ +Get a hands-on deep dive experience in this topic by going through our [EC2 focused module in the One Observability Workshop.](https://catalog.workshops.aws/observability/en-US/aws-managed-oss/ec2-monitoring) + +```yaml +global: + scrape_interval: 15s # By default, scrape targets every 15 seconds. + +scrape_configs: +- job_name: 'prometheus' + static_configs: + - targets: ['localhost:9090', 'localhost:8081'] +``` + +#### Running the collector as Deployment on Amazon EKS + +Running the collector as a Deployment is particularly useful when you want to also provide High Availability for your collectors. Depending on the number of targets, metrics available to scrape etc the resources for the Collector should be adjusted to ensure the collector isn't starving and hence causing issues in signal collection. + +[Read more about this topic in the guide here.](https://aws-observability.github.io/observability-best-practices/guides/containers/oss/eks/best-practices-metrics-collection) + +The following architecture shows how a collector is deployed in a separate node outside of the workload nodes to collect metrics and traces. + +![ADOT Collector Deployment](../../../images/adot-collector-deployment-deployment.png) + +To setup High-Availability for metric collection, [read our docs that provide detailed instructions on how you can set that up](https://docs.aws.amazon.com/prometheus/latest/userguide/Send-high-availability-prom-community.html) + +#### Running the collector as a central task on Amazon ECS for metrics collection + +You can use the [ECS Observer extension](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/observer/ecsobserver) to collect Prometheus metrics across different tasks in an ECS cluster or across clusters. + +![ADOT Collector Deployment ECS](../../../images/adot-collector-deployment-ecs.png) + +Sample collector configuration for the extension: + +```yaml +extensions: + ecs_observer: + refresh_interval: 60s # format is https://golang.org/pkg/time/#ParseDuration + cluster_name: 'Cluster-1' # cluster name need manual config + cluster_region: 'us-west-2' # region can be configured directly or use AWS_REGION env var + result_file: '/etc/ecs_sd_targets.yaml' # the directory for file must already exists + services: + - name_pattern: '^retail-.*$' + docker_labels: + - port_label: 'ECS_PROMETHEUS_EXPORTER_PORT' + task_definitions: + - job_name: 'task_def_1' + metrics_path: '/metrics' + metrics_ports: + - 9113 + - 9090 + arn_pattern: '.*:task-definition/nginx:[0-9]+' +``` + + +##### Pros and Cons +* An advantage in this model is that there are fewer collectors and configurations to manage yourself. +* When the cluster is rather large and there are thousands of targets to scrape, you will have to carefully design the architecture in such a way that the load is balanced across the collectors. Adding this to having to run near-clones of the same collectors for HA reasons should be done carefully in order to avoid operational issues. + +### Gateway + +![ADOT Collector Gateway](../../../images/adot-collector-deployment-gateway.png) + + +## Managing Collector health +The OTEL Collector exposes several signals for us to keep tab of its health and performance. 
It is essential that the collector's health is closely monitored in order to take corrective actions such as, + +* Scaling the collector horizontally +* Provisioning additional resources to the collector for it to function as desired + + +### Collecting health metrics from the Collector + +The OTEL Collector can be configured to expose metrics in Prometheus Exposition Format by simply adding the `telemetry` section to the `service` pipeline. The collector also can expose its own logs to stdout. + +More details on telemetry configuration can be found in the [OpenTelemetry documentation here.](https://opentelemetry.io/docs/collector/configuration/#service) + +Sample telemetry configuration for the collector. + +```yaml +service: + telemetry: + logs: + level: debug + metrics: + level: detailed + address: 0.0.0.0:8888 +``` +Once configured, the collector will start exporting metrics such as this below at `http://localhost:8888/metrics`. + +```bash +# HELP otelcol_exporter_enqueue_failed_spans Number of spans failed to be added to the sending queue. +# TYPE otelcol_exporter_enqueue_failed_spans counter +otelcol_exporter_enqueue_failed_spans{exporter="awsxray",service_instance_id="523a2182-539d-47f6-ba3c-13867b60092a",service_name="aws-otel-collector",service_version="v0.25.0"} 0 + +# HELP otelcol_process_runtime_total_sys_memory_bytes Total bytes of memory obtained from the OS (see 'go doc runtime.MemStats.Sys') +# TYPE otelcol_process_runtime_total_sys_memory_bytes gauge +otelcol_process_runtime_total_sys_memory_bytes{service_instance_id="523a2182-539d-47f6-ba3c-13867b60092a",service_name="aws-otel-collector",service_version="v0.25.0"} 2.4462344e+07 + +# HELP otelcol_process_memory_rss Total physical memory (resident set size) +# TYPE otelcol_process_memory_rss gauge +otelcol_process_memory_rss{service_instance_id="523a2182-539d-47f6-ba3c-13867b60092a",service_name="aws-otel-collector",service_version="v0.25.0"} 6.5675264e+07 + +# HELP otelcol_exporter_enqueue_failed_metric_points Number of metric points failed to be added to the sending queue. +# TYPE otelcol_exporter_enqueue_failed_metric_points counter +otelcol_exporter_enqueue_failed_metric_points{exporter="awsxray",service_instance_id="d234b769-dc8a-4b20-8b2b-9c4f342466fe",service_name="aws-otel-collector",service_version="v0.25.0"} 0 +otelcol_exporter_enqueue_failed_metric_points{exporter="logging",service_instance_id="d234b769-dc8a-4b20-8b2b-9c4f342466fe",service_name="aws-otel-collector",service_version="v0.25.0"} 0 +``` + +In the above sample output, you can see that the collector is exposing a metric called `otelcol_exporter_enqueue_failed_spans` showing the number of spans that were failed to get added to the sending queue. This metric is one to watch out to understand if the collector is having issues in sending trace data to the destination configured. In this case, you can see that the `exporter` label with value `awsxray` indicating the trace destination in use. + +The other metric `otelcol_process_runtime_total_sys_memory_bytes` is an indicator to understand the amount of memory being used by the collector. If this memory goes too close to the value in `otelcol_process_memory_rss` metric, that is an indication that the Collector is getting close to exhausting the allocated memory for the process and it might be time for you to take action such as allocating more memory for the collector to avoid issues. 
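As a sketch of how these self-telemetry metrics could be collected alongside your workload metrics (assuming the Prometheus receiver is already part of your pipeline and the `telemetry` section above is in place), you could add a scrape job that targets the collector's own endpoint:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        # Scrape the collector's own telemetry endpoint exposed by the
        # service.telemetry.metrics address configured above.
        - job_name: 'otel-collector-self-telemetry'
          scrape_interval: 60s
          static_configs:
            - targets: ['localhost:8888']
```

You can then alert on metrics such as `otelcol_exporter_enqueue_failed_spans` in the same way as any other Prometheus metric.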
+ +Likewise, you can see that there is another counter metric called `otelcol_exporter_enqueue_failed_metric_points` that indicates the number of metrics that failed to be sent to the remote destination + +#### Collector health check +There is a liveness probe that the collector exposes in-order for you to check whether the collector is live or not. It is recommended to use that endpoint to periodically check the collector's availability. + +The [`healthcheck`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/healthcheckextension) extension can be used to have the collector expose the endpoint. See sample configuration below: + +```yaml +extensions: + health_check: + endpoint: 0.0.0.0:13133 +``` + +For the complete configuration options, refer to [the GitHub repo here.](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/healthcheckextension) + +```bash +❯ curl -v http://localhost:13133 +* Trying 127.0.0.1:13133... +* Connected to localhost (127.0.0.1) port 13133 (#0) +> GET / HTTP/1.1 +> Host: localhost:13133 +> User-Agent: curl/7.79.1 +> Accept: */* +> +* Mark bundle as not supporting multiuse +< HTTP/1.1 200 OK +< Date: Fri, 24 Feb 2023 19:09:22 GMT +< Content-Length: 0 +< +* Connection #0 to host localhost left intact +``` + +#### Setting limits to prevent catastrophic failures +Given that resources (CPU, Memory) are finite in any environment, you should set limits to the collector components in-order to avoid failures due to unforeseen situations. + +It is particularly important when you are operating the ADOT Collector to collect Prometheus metrics. +Take this scenario - You are in the DevOps team and are responsible for deploying and operating the ADOT Collector in an Amazon EKS cluster. Your application teams can simply drop their application Pods at will anytime of the day, and they expect the metrics exposed from their pods to be collected into an Amazon Managed Service for Prometheus workspace. + +Now it is your responsibility to ensure that this pipeline works without any hiccups. There are two ways to solve this problem at a high level: + +* Scaling the collector infinitely (hence adding Nodes to the cluster if needed) to support this requirement +* Set limits on metric collection and advertise the upper threshold to the application teams + +There are pros and cons to both approaches. You can argue that you want to choose option 1, if you are fully committed to supporting your ever growing business needs not considering the costs or the overhead that it might bring in. While supporting the ever growing business needs infinitely might sound like `cloud is for infinite scalability` point of view, this can bring in a lot of operational overhead and might lead into much more catastrophical situations if not given infinite amount of time, and people resources to ensure continual uninterrupted operations, which in most cases is not practical. + +A much more pragmatic and frugal approach would be to choose option 2, where you are setting upper limits (and potentially increasing gradually based on needs progressively) at any given time to ensure the operational boundary is obvious. + +Here is an example of how you can do that with using Prometheus receiver in the ADOT Collector. + +In Prometheus [scrape_config,](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config) you can set several limits for any particular scrape job. 
You could put limits on:

* The total body size of the scrape
* The number of labels to accept (the scrape will be discarded if this limit is exceeded, and you can see that in the Collector logs)
* The number of targets to scrape
* ...and more

You can see all available options in the [Prometheus documentation.](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config)

##### Limiting Memory usage
The Collector pipeline can be configured to use the [`memorylimiterprocessor`](https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/memorylimiterprocessor) to limit the amount of memory that the processor components will use. It is common to see customers wanting the Collector to do complex operations that have intense Memory and CPU requirements.

While processors such as the [`redactionprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/redactionprocessor), [`filterprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/filterprocessor) and [`spanprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/spanprocessor) are exciting and very useful, you should also remember that processors in general deal with data transformation tasks, which requires them to keep data in memory in order to complete those tasks. This can lead to a specific processor breaking the Collector entirely, and also to the Collector not having enough memory to expose its own health metrics.

You can avoid this by limiting the amount of memory the Collector can use with the [`memorylimiterprocessor`](https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/memorylimiterprocessor). The recommendation is to leave buffer memory for the Collector to use for exposing health metrics and performing other tasks, so that the processors do not take all the allocated memory.

For example, if your EKS Pod has a memory limit of `10Gi`, then set the `memorylimiterprocessor` to less than `10Gi`, for example `9Gi`, so that the buffer of `1Gi` can be used for other operations such as exposing health metrics and performing receiver and exporter tasks.

#### Backpressure management

Some architecture patterns (such as the Gateway pattern shown below) can be used to centralize operational tasks such as (but not limited to) filtering sensitive data out of signal data to maintain compliance requirements.

![ADOT Collector Simple Gateway](../../../images/adot-collector-deployment-simple-gateway.png)

However, it is possible to overwhelm the Gateway Collector with too many such _processing_ tasks, which can cause issues. The recommended approach is to distribute the processing/memory intensive tasks between the individual collectors and the gateway so the workload is shared.

For example, you could use the [`resourceprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/resourceprocessor) to process resource attributes and use the [`transformprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor) to transform the signal data from within the individual Collectors as soon as the signal collection happens.
+ +Then you could use the [`filterprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/filterprocessor) to filter out certain parts of the signal data and use the [`redactionprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/redactionprocessor) to redact sensitive information such as Credit Card numbers etc. + +The high-level architecture diagram would look like the one below: + +![ADOT Collector Simple Gateway with processors](../../../images/adot-collector-deployment-simple-gateway-pressure.png) + +As you might have observed already, the Gateway Collector can soon become a single point of failure. One obvious choice there is to spin up more than one Gateway Collector and proxy requests through a load balancer like [AWS Application Load Balancer (ALB)](https://aws.amazon.com/elasticloadbalancing/application-load-balancer/) as shown below. + +![ADOT Collector Gateway batching pressure](../../../images/adot-collector-deployment-gateway-batching-pressure.png) + + +##### Handling out-of-order samples in Prometheus metric collection + +Consider the following scenario in the architecture below: + +![ADOT Collector Gateway batching pressure](../../../images/adot-collector-deployment-gateway-batching.png) + +1. Assume that metrics from **ADOT Collector-1** in the Amazon EKS Cluster are sent to the Gateway cluster, which is being directed to the **Gateway ADOT Collector-1** +1. In a moment, the metrics from the same **ADOT Collector-1** (which is collecting the same targets, hence the same metric samples are being dealt with) is being sent to **Gateway ADOT Collector-2** +1. Now if the **Gateway ADOT Collector-2** happens to dispatch the metrics to Amazon Managed Service for Prometheus workspace first and then followed by the **Gateway ADOT Collector-1** which contains older samples for the same metrics series, you will receive the `out of order sample` error from Amazon Managed Service for Prometheus. + +See example error below: + +```bash +Error message: + 2023-03-02T21:18:54.447Z error exporterhelper/queued_retry.go:394 Exporting failed. The error is not retryable. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(): user=820326043460_ws-5f42c3b6-3268-4737-b215-1371b55a9ef2: err: out of order sample. 
timestamp=2023-03-02T21:17:59.782Z, series={__name__=\"otelcol_exporter_send_failed_metric_points\", exporter=\"logging\", http_scheme=\"http\", instance=\"10.195.158.91:28888\", ", "dropped_items": 6474}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
    go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/queued_retry.go:394
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
    go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/metrics.go:135
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
    go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/queued_retry.go:205
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
    go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/internal/bounded_memory_queue.go:61
```

###### Solving out of order sample error

You can solve the out of order sample error in this particular setup in a couple of ways:

* Use a sticky load balancer to direct requests from a particular source to the same target based on IP address.

    Refer to the [link here](https://aws.amazon.com/premiumsupport/knowledge-center/elb-route-requests-with-source-ip-alb/) for additional details.


* As an alternate option, you can add an external label in the Gateway Collectors to distinguish the metric series, so that Amazon Managed Service for Prometheus treats these metrics as individual metric series rather than duplicate samples of the same series.

:::warning
    Using this solution will result in multiplying the metric series in proportion to the number of Gateway Collectors in the setup. This might mean that you can overrun some limits, such as the [`Active time series limits`](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP_quotas.html)
:::

* **If you are deploying the ADOT Collector as a Daemonset**: make sure you are using `relabel_configs` to only keep samples from the same node where each ADOT Collector pod is running. Check the links below to learn more.
    - [Advanced Collector Configuration for Amazon Managed Prometheus](https://aws-otel.github.io/docs/getting-started/adot-eks-add-on/config-advanced) - Expand the *Click to View* section, and look for entries similar to the following:
    ```yaml
    relabel_configs:
      - action: keep
        regex: $K8S_NODE_NAME
    ```
    - [ADOT Add-On Advanced Configuration](https://aws-otel.github.io/docs/getting-started/adot-eks-add-on/add-on-configuration) - Learn how to deploy the ADOT Collector using the ADOT Add-On for EKS advanced configurations.
    - [ADOT Collector deployment strategies](https://aws-otel.github.io/docs/getting-started/adot-eks-add-on/installation#deploy-the-adot-collector) - Learn more about the different alternatives to deploy the ADOT Collector at scale and the advantages of each approach.


#### Open Agent Management Protocol (OpAMP)

OpAMP is a client/server protocol that supports communication over HTTP and over WebSockets. OpAMP is implemented in the OTel Collector, and hence the OTel Collector can be used as a server as part of the control plane to manage other agents that support OpAMP, like the OTel Collector itself. The "manage" portion here involves being able to update configurations for collectors, monitor their health, or even upgrade the Collectors.
The details of this protocol are well [documented in the upstream OpenTelemetry website.](https://opentelemetry.io/docs/collector/management/)

### Horizontal Scaling
It may become necessary to horizontally scale an ADOT Collector depending on your workload. The requirement to horizontally scale is entirely dependent on your use case, Collector configuration, and telemetry throughput.

Platform specific horizontal scaling techniques can be applied to a Collector as you would to any other application, while being cognizant of stateful, stateless, and scraper Collector components.

Most collector components are `stateless`, meaning that they do not hold state in memory, and if they do it is not relevant for scaling purposes. Additional replicas of stateless Collectors can be scaled behind an application load balancer.

`Stateful` Collector components are collector components that retain information in memory which is crucial for the operation of that component.

Examples of stateful components in the ADOT Collector include, but are not limited to:

* [Tail Sampling Processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor) - requires all spans for a trace to make an accurate sampling decision. Advanced sampling scaling techniques are [documented on the ADOT developer portal](https://aws-otel.github.io/docs/getting-started/advanced-sampling).
* [AWS EMF Exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/awsemfexporter) - performs cumulative to delta conversions on some metric types. This conversion requires the previous metric value to be stored in memory.
* [Cumulative to Delta Processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/cumulativetodeltaprocessor#cumulative-to-delta-processor) - cumulative to delta conversion requires storing the previous metric value in memory.

Collector components that are `scrapers` actively obtain telemetry data rather than passively receive it. Currently, the [Prometheus receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver) is the only scraper type component in the ADOT Collector. Horizontally scaling a collector configuration that contains a Prometheus receiver will require splitting the scraping jobs per collector to ensure that no two Collectors scrape the same endpoint. Failure to do this may lead to Prometheus out of order sample errors.

The process and techniques of scaling collectors are [documented in greater detail in the upstream OpenTelemetry website](https://opentelemetry.io/docs/collector/scaling/).


diff --git a/docusaurus/observability-best-practices/docs/guides/operational/alerting/amp-alertmgr.md b/docusaurus/observability-best-practices/docs/guides/operational/alerting/amp-alertmgr.md
new file mode 100644
index 000000000..945551d23
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/guides/operational/alerting/amp-alertmgr.md
@@ -0,0 +1,334 @@
# Amazon Managed Service for Prometheus Alert Manager

## Introduction

[Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/) (AMP) supports two types of rules, namely '**Recording rules**' and '**Alerting rules**', which can be imported from your existing Prometheus server and are evaluated at regular intervals.
+ +[Alerting rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) allow customers to define alert conditions based on [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/) and a threshold. When the value of the alerting rule exceeds threshold, a notification is sent to Alert manager in Amazon Managed Service for Prometheus which provides similar functionality to alert manager in standalone Prometheus. An alert is the outcome of an alerting rule in Prometheus when it is active. + +## Alerting Rules File + +An Alerting rule in Amazon Managed Service for Prometheus is defined by a rules file in YAML format, which follows the same format as a rules file in standalone Prometheus. Customers can have multiple rules files in an Amazon Managed Service for Prometheus workspace. A workspace is a logical space dedicated to the storage and querying of Prometheus metrics. + +A rules file typically has the following fields: + +``` +groups: + - name: + rules: + - alert: + expr: + for: + labels: + annotations: +``` + +```console +Groups: A collection of rules that are run sequentially at a regular interval +Name: Name of the group +Rules: The rules in a group +Alert: Name of the alert +Expr: The expression for the alert to trigger +For: Minimum duration for an alert’s expression to be exceeding threshold before updating to a firing status +Labels: Any additional labels attached to the alert +Annotations: Contextual details such as a description or link +``` + +A sample rule file looks like below + +``` +groups: + - name: test + rules: + - record: metric:recording_rule + expr: avg(rate(container_cpu_usage_seconds_total[5m])) + - name: alert-test + rules: + - alert: metric:alerting_rule + expr: avg(rate(container_cpu_usage_seconds_total[5m])) > 0 + for: 2m +``` + +## Alert Manager Configuration File + +The Amazon Managed Service for Prometheus Alert Manager uses a configuration file in YAML format to set up the alerts (for the receiving service) that is in the same structure as an alert manager config file in standalone Prometheus. The configuration file consists of two key sections for alert manager and templating + +1. [template_files](https://prometheus.io/docs/prometheus/latest/configuration/template_reference/), contains the templates of annotations and labels in alerts exposed as the `$value`, `$labels`, `$externalLabels`, and `$externalURL` variables for convenience. The `$labels` variable holds the label key/value pairs of an alert instance. The configured external labels can be accessed via the `$externalLabels` variable. The `$value` variable holds the evaluated value of an alert instance. `.Value`, `.Labels`, `.ExternalLabels`, and `.ExternalURL` contain the alert value, the alert labels, the globally configured external labels, and the external URL (configured with `--web.external-url`) respectively. + +2. [alertmanager_config](https://prometheus.io/docs/alerting/latest/configuration/), contains the alert manager configuration that uses the same structure as an alert manager config file in standalone Prometheus. 
+ +A sample alert manager configuration file having both template_files and alertmanager_config looks like below, + +``` +template_files: + default_template: | + {{ define "sns.default.subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]{{ end }} + {{ define "__alertmanager" }}AlertManager{{ end }} + {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver | urlquery }}{{ end }} +alertmanager_config: | + global: + templates: + - 'default_template' + route: + receiver: default + receivers: + - name: 'default' + sns_configs: + - topic_arn: arn:aws:sns:us-east-2:accountid:My-Topic + sigv4: + region: us-east-2 + attributes: + key: severity + value: SEV2 +``` + +## Key aspects of alerting + +There are three important aspects to be aware of when creating Amazon Managed Service for Prometheus [Alert Manager configuration file](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-alert-manager.html). + +- **Grouping**: This helps collect similar alerts into a single notification, which is useful when the blast radius of failure or outage is large affecting many systems and several alerts fire simultaneously. This can also be used to group into categories (e.g., node alerts, pod alerts). The [route](https://prometheus.io/docs/alerting/latest/configuration/#route) block in the alert manager configuration file can be used to configure this grouping. +- **Inhibition**: This is a way to suppress certain notifications to avoid spamming similar alerts that are already active and fired. [inhibit_rules](https://prometheus.io/docs/alerting/latest/configuration/#inhibit_rule) block can be used to write inhibition rules. +- **Silencing**: Alerts can be muted for a specified duration, such as during a maintenance window or a planned outage. Incoming alerts are verified for matching all equality or regular expression before silencing the alert. [PutAlertManagerSilences](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-APIReference.html#AMP-APIReference-PutAlertManagerSilences) API can be used to create silencing. + +## Route alerts through Amazon Simple Notification Service (SNS) + +Currently [Amazon Managed Service for Prometheus Alert Manager supports Amazon SNS](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-alertmanager-receiver-AMPpermission.html) as the only receiver. The key section in the alertmanager_config block is the receivers, which lets customers configure [Amazon SNS to receive alerts](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-alertmanager-receiver-config.html). The following section can be used as a template for the receivers block. + +``` +- name: name_of_receiver + sns_configs: + - sigv4: + region: + topic_arn: + subject: somesubject + attributes: + key: + value: +``` + +The Amazon SNS configuration uses the following template as default unless its explicitly overridden. 
+ +``` +{{ define "sns.default.message" }}{{ .CommonAnnotations.SortedPairs.Values | join " " }} + {{ if gt (len .Alerts.Firing) 0 -}} + Alerts Firing: + {{ template "__text_alert_list" .Alerts.Firing }} + {{- end }} + {{ if gt (len .Alerts.Resolved) 0 -}} + Alerts Resolved: + {{ template "__text_alert_list" .Alerts.Resolved }} + {{- end }} +{{- end }} +``` + +Additional Reference: [Notification Template Examples](https://prometheus.io/docs/alerting/latest/notification_examples/) + +## Routing alerts to other destinations beyond Amazon SNS + +Amazon Managed Service for Prometheus Alert Manager can use [Amazon SNS to connect to other destinations](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-alertmanager-SNS-otherdestinations.html) such as email, webhook (HTTP), Slack, PageDuty, and OpsGenie. + +- **Email** A successful notification will result in an email received from Amazon Managed Service for Prometheus Alert Manager through Amazon SNS topic with the alert details as one of the targets. +- Amazon Managed Service for Prometheus Alert Manager can [send alerts in JSON format](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-alertmanager-receiver-JSON.html), so that they can be processed downstream from Amazon SNS in AWS Lambda or in webhook-receiving endpoints. +- **Webhook** An existing Amazon SNS topic can be configured to output messages to a webhook endpoint. Webhooks are messages in serialized form-encoded JSON or XML formats, exchanged over HTTP between applications based on event driven triggers. This can be used to hook any existing [SIEM or collaboration tools](https://repost.aws/knowledge-center/sns-lambda-webhooks-chime-slack-teams) for alerting, ticketing or incident management systems. +- **Slack** Customers can integrate with [Slack’s](https://aws.amazon.com/blogs/mt/how-to-integrate-amazon-managed-service-for-prometheus-with-slack/) email-to-channel integration where Slack can accept an email and forward it to a Slack channel, or use a Lambda function to rewrite the SNS notification to Slack. +- **PagerDuty** The template used in `template_files` block in the `alertmanager_config` definition can be customized to send the payload to [PagerDuty](https://aws.amazon.com/blogs/mt/using-amazon-managed-service-for-prometheus-alert-manager-to-receive-alerts-with-pagerduty/) as a destination of Amazon SNS. + +Additional Reference: [Custom Alert manager Templates](https://prometheus.io/blog/2016/03/03/custom-alertmanager-templates/) + +## Alert status + +Alerting rules define alert conditions based on expressions to send alerts to any notification service, whenever the set threshold is crossed. An example rule and its expression is shown below. + +``` +rules: +- alert: metric:alerting_rule + expr: avg(rate(container_cpu_usage_seconds_total[5m])) > 0 + for: 2m + +``` + +Whenever the alert expression results in one or more vector elements at a given point in time, the alert counts as active. The alerts take active (pending | firing) or resolved status. + +- **Pending**: The time elapsed since threshold breach is less than the recording interval +- **Firing**: The time elapsed since threshold breach is more than the recording interval and Alert Manager is routing alerts. +- **Resolved**: The alert is no longer firing because the threshold is no longer breached. 
This can be manually verified by querying the Amazon Managed Service for Prometheus Alert Manager endpoint with the [ListAlerts](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-APIReference.html#AMP-APIReference-ListAlerts) API using the [awscurl](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-compatible-APIs.html) command. A sample request is shown below.

```
awscurl https://aps-workspaces.us-east-1.amazonaws.com/workspaces/$WORKSPACE_ID/alertmanager/api/v2/alerts --service="aps" -H "Content-Type: application/json"
```

## Amazon Managed Service for Prometheus Alert Manager rules in Amazon Managed Grafana

The Amazon Managed Grafana (AMG) alerting feature allows customers to gain visibility into Amazon Managed Service for Prometheus Alert Manager alerts from their Amazon Managed Grafana workspace. Customers using Amazon Managed Service for Prometheus workspaces to collect Prometheus metrics utilize the fully managed Alert Manager and Ruler features in the service to configure alerting and recording rules. With this feature, they can visualize all the alert and recording rules configured in their Amazon Managed Service for Prometheus workspace. The Prometheus alerts view can be enabled in the Amazon Managed Grafana (AMG) console by checking the Grafana alerting checkbox in the Workspace configuration options tab. Once enabled, this will also migrate native Grafana alerts that were previously created in Grafana dashboards into a new Alerting page in the Grafana workspace.

Reference: [Announcing Prometheus Alert manager rules in Amazon Managed Grafana](https://aws.amazon.com/blogs/mt/announcing-prometheus-alertmanager-rules-in-amazon-managed-grafana/)

![List of AMP alerts in Grafana](../../../images/amp-alerting.png)

## Recommended alerts for baseline monitoring

Alerting is a key aspect of robust monitoring and observability best practices. The alerting mechanism should strike a balance between alert fatigue and missing critical alerts. Here are some alerts that are recommended to improve the overall reliability of workloads. Various teams in the organization monitor their infrastructure and workloads from different perspectives, so this list could be expanded or changed based on your requirements and scenarios; it is certainly not a comprehensive list.

- A Container Node is using more than a certain percentage (ex. 80%) of its allocated memory limit.
- A Container Node is using more than a certain percentage (ex. 80%) of its allocated CPU limit.
- A Container Node is using more than a certain percentage (ex. 90%) of its allocated disk space.
- A container in a pod in a namespace is using more than a certain percentage (ex. 80%) of its allocated CPU limit.
- A container in a pod in a namespace is using more than a certain percentage (ex. 80%) of its memory limit.
- A container in a pod in a namespace had too many restarts.
- A Persistent Volume in a namespace is using more than a certain percentage (max 75%) of its disk space.
- A Deployment currently has no active pods running.
- A Horizontal Pod Autoscaler (HPA) in a namespace is running at max capacity.

The essential thing when setting up alerts for the above or any similar scenario is to adjust the alerting expression as needed.
For example, + +``` +expr: | + ((sum(irate(container_cpu_usage_seconds_total{image!="",container!="POD", namespace!="kube-sys"}[30s])) by (namespace,container,pod) / +sum(container_spec_cpu_quota{image!="",container!="POD", namespace!="kube-sys"} / +container_spec_cpu_period{image!="",container!="POD", namespace!="kube-sys"}) by (namespace,container,pod) ) * 100) > 80 + for: 5m +``` + +## ACK Controller for Amazon Managed Service for Prometheus + +Amazon Managed Service for Prometheus [AWS Controller for Kubernetes](https://github.com/aws-controllers-k8s/community) (ACK) controller is available for Workspace, Alert Manager and Ruler resources which lets customers take advantage of Prometheus using [custom resource definitions](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) (CRDs) and native objects or services that provide supporting capabilities without having to define any resources outside of Kubernetes cluster. The [ACK controller for Amazon Managed Service for Prometheus](https://aws.amazon.com/blogs/mt/introducing-the-ack-controller-for-amazon-managed-service-for-prometheus/) can be used to manage all resources directly from the Kubernetes cluster that you’re monitoring, allowing Kubernetes to act as your ‘source of truth’ for your workload’s desired state. [ACK](https://aws-controllers-k8s.github.io/community/docs/community/overview/) is a collection of Kubernetes CRDs and custom controllers working together to extend the Kubernetes API and manage AWS resources. + +A snippet of alerting rules configured using ACK is shown below: + +``` +apiVersion: prometheusservice.services.k8s.aws/v1alpha1 +kind: RuleGroupsNamespace +metadata: + name: default-rule +spec: + workspaceID: WORKSPACE-ID + name: default-rule + configuration: | + groups: + - name: example + rules: + - alert: HostHighCpuLoad + expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 60 + for: 5m + labels: + severity: warning + event_type: scale_up + annotations: + summary: Host high CPU load (instance {{ $labels.instance }}) + description: "CPU load is > 60%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" + - alert: HostLowCpuLoad + expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) < 30 + for: 5m + labels: + severity: warning + event_type: scale_down + annotations: + summary: Host low CPU load (instance {{ $labels.instance }}) + description: "CPU load is < 30%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" +``` + +## Restricting access to rules using IAM policy + +Organizations require various teams to have their own rules to be created & administered for their recording and alerting requirements. Rules management in Amazon Managed Service for Prometheus allows rules to be access controlled using AWS Identity and Access Management (IAM) policy so that each team can control their own set of rules & alerts grouped by rulegroupnamespaces. + +The below image shows two example rulegroupnamespaces called devops and engg added into Rules management of Amazon Managed Service for Prometheus. + +![Recording and Alerting rule namespaces in AMP console](../../../images/AMP_rules_namespaces.png) + +The below JSON is a sample IAM policy which restricts access to the devops rulegroupnamespace (shown above) with the Resource ARN specified. 
The notable actions in the below IAM policy are [PutRuleGroupsNamespace](https://docs.aws.amazon.com/cli/latest/reference/amp/put-rule-groups-namespace.html) and [DeleteRuleGroupsNamespace](https://docs.aws.amazon.com/cli/latest/reference/amp/delete-rule-groups-namespace.html) which are restricted to the specified Resource ARN of the rulegroupsnamespace of AMP workspace. Once the policy is created, it can be assigned to any required user, group or role for desired access control requirement. The Action in the IAM policy can be modified/restricted as required based on [IAM permissions](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-APIReference.html) for required & allowable actions. + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "VisualEditor0", + "Effect": "Allow", + "Action": [ + "aps:RemoteWrite", + "aps:DescribeRuleGroupsNamespace", + "aps:PutRuleGroupsNamespace", + "aps:DeleteRuleGroupsNamespace" + ], + "Resource": [ + "arn:aws:aps:us-west-2:XXXXXXXXXXXX:workspace/ws-8da31ad6-f09d-44ff-93a3-xxxxxxxxxx", + "arn:aws:aps:us-west-2:XXXXXXXXXXXX:rulegroupsnamespace/ws-8da31ad6-f09d-44ff-93a3-xxxxxxxxxx/devops" + ] + } + ] +} +``` + +The below awscli interaction shows an example of an IAM user having restricted access to a rulegroupsnamespace specified through Resource ARN (i.e. devops rulegroupnamespace) in IAM policy and how the same user is denied access to other resources (i.e. engg rulegroupnamespace) not having access. + +``` +$ aws amp describe-rule-groups-namespace --workspace-id ws-8da31ad6-f09d-44ff-93a3-xxxxxxxxxx --name devops +{ + "ruleGroupsNamespace": { + "arn": "arn:aws:aps:us-west-2:XXXXXXXXXXXX:rulegroupsnamespace/ws-8da31ad6-f09d-44ff-93a3-xxxxxxxxxx/devops", + "createdAt": "2023-04-28T01:50:15.408000+00:00", + "data": "Z3JvdXBzOgogIC0gbmFtZTogZGV2b3BzX3VwZGF0ZWQKICAgIHJ1bGVzOgogICAgLSByZWNvcmQ6IG1ldHJpYzpob3N0X2NwdV91dGlsCiAgICAgIGV4cHI6IGF2ZyhyYXRlKGNvbnRhaW5lcl9jcHVfdXNhZ2Vfc2Vjb25kc190b3RhbFsybV0pKQogICAgLSBhbGVydDogaGlnaF9ob3N0X2NwdV91c2FnZQogICAgICBleHByOiBhdmcocmF0ZShjb250YWluZXJfY3B1X3VzYWdlX3NlY29uZHNfdG90YWxbNW1dKSkKICAgICAgZm9yOiA1bQogICAgICBsYWJlbHM6CiAgICAgICAgICAgIHNldmVyaXR5OiBjcml0aWNhbAogIC0gbmFtZTogZGV2b3BzCiAgICBydWxlczoKICAgIC0gcmVjb3JkOiBjb250YWluZXJfbWVtX3V0aWwKICAgICAgZXhwcjogYXZnKHJhdGUoY29udGFpbmVyX21lbV91c2FnZV9ieXRlc190b3RhbFs1bV0pKQogICAgLSBhbGVydDogY29udGFpbmVyX2hvc3RfbWVtX3VzYWdlCiAgICAgIGV4cHI6IGF2ZyhyYXRlKGNvbnRhaW5lcl9tZW1fdXNhZ2VfYnl0ZXNfdG90YWxbNW1dKSkKICAgICAgZm9yOiA1bQogICAgICBsYWJlbHM6CiAgICAgICAgc2V2ZXJpdHk6IGNyaXRpY2FsCg==", + "modifiedAt": "2023-05-01T17:47:06.409000+00:00", + "name": "devops", + "status": { + "statusCode": "ACTIVE", + "statusReason": "" + }, + "tags": {} + } +} + + +$ cat > devops.yaml < groups: +> - name: devops_new +> rules: +> - record: metric:host_cpu_util +> expr: avg(rate(container_cpu_usage_seconds_total[2m])) +> - alert: high_host_cpu_usage +> expr: avg(rate(container_cpu_usage_seconds_total[5m])) +> for: 5m +> labels: +> severity: critical +> - name: devops +> rules: +> - record: container_mem_util +> expr: avg(rate(container_mem_usage_bytes_total[5m])) +> - alert: container_host_mem_usage +> expr: avg(rate(container_mem_usage_bytes_total[5m])) +> for: 5m +> labels: +> severity: critical +> EOF + + +$ base64 devops.yaml > devops_b64.yaml + + +$ aws amp put-rule-groups-namespace --workspace-id ws-8da31ad6-f09d-44ff-93a3-xxxxxxxxxx --name devops --data file://devops_b64.yaml +{ + "arn": 
"arn:aws:aps:us-west-2:XXXXXXXXXXXX:rulegroupsnamespace/ws-8da31ad6-f09d-44ff-93a3-xxxxxxxxxx/devops", + "name": "devops", + "status": { + "statusCode": "UPDATING" + }, + "tags": {} +} +``` + +`$ aws amp describe-rule-groups-namespace --workspace-id ws-8da31ad6-f09d-44ff-93a3-xxxxxxxxxx --name engg +An error occurred (AccessDeniedException) when calling the DescribeRuleGroupsNamespace operation: User: arn:aws:iam::XXXXXXXXXXXX:user/amp_ws_user is not authorized to perform: aps:DescribeRuleGroupsNamespace on resource: arn:aws:aps:us-west-2:XXXXXXXXXXXX:rulegroupsnamespace/ws-8da31ad6-f09d-44ff-93a3-xxxxxxxxxx/engg` + +`$ aws amp put-rule-groups-namespace --workspace-id ws-8da31ad6-f09d-44ff-93a3-xxxxxxxxxx --name engg --data file://devops_b64.yaml +An error occurred (AccessDeniedException) when calling the PutRuleGroupsNamespace operation: User: arn:aws:iam::XXXXXXXXXXXX:user/amp_ws_user is not authorized to perform: aps:PutRuleGroupsNamespace on resource: arn:aws:aps:us-west-2:XXXXXXXXXXXX:rulegroupsnamespace/ws-8da31ad6-f09d-44ff-93a3-xxxxxxxxxx/engg` + +`$ aws amp delete-rule-groups-namespace --workspace-id ws-8da31ad6-f09d-44ff-93a3-xxxxxxxxxx --name engg +An error occurred (AccessDeniedException) when calling the DeleteRuleGroupsNamespace operation: User: arn:aws:iam::XXXXXXXXXXXX:user/amp_ws_user is not authorized to perform: aps:DeleteRuleGroupsNamespace on resource: arn:aws:aps:us-west-2:XXXXXXXXXXXX:rulegroupsnamespace/ws-8da31ad6-f09d-44ff-93a3-xxxxxxxxxx/engg` + +The user permissions to use rules can also be restricted using an [IAM policy](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-alertmanager-IAM-permissions.html) (documentation sample). + +For more information customers can read the [AWS Documentation](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-alert-manager.html), go through the [AWS Observability Workshop](https://catalog.workshops.aws/observability/en-US/aws-managed-oss/amp/setup-alert-manager) on Amazon Managed Service for Prometheus Alert Manager. 
+ +Additional Reference: [Amazon Managed Service for Prometheus Is Now Generally Available with Alert Manager and Ruler](https://aws.amazon.com/blogs/aws/amazon-managed-service-for-prometheus-is-now-generally-available-with-alert-manager-and-ruler/) diff --git a/docusaurus/observability-best-practices/docs/guides/operational/alerts/amg-alerts.md b/docusaurus/observability-best-practices/docs/guides/operational/alerts/amg-alerts.md new file mode 100644 index 000000000..e69de29bb diff --git a/docusaurus/observability-best-practices/docs/guides/operational/alerts/cw-alarms.md b/docusaurus/observability-best-practices/docs/guides/operational/alerts/cw-alarms.md new file mode 100644 index 000000000..e69de29bb diff --git a/docusaurus/observability-best-practices/docs/guides/operational/alerts/prometheus-alerts.md b/docusaurus/observability-best-practices/docs/guides/operational/alerts/prometheus-alerts.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/operational/alerts/prometheus-alerts.md @@ -0,0 +1 @@ + diff --git a/docusaurus/observability-best-practices/docs/guides/operational/business/key-performance-indicators.md b/docusaurus/observability-best-practices/docs/guides/operational/business/key-performance-indicators.md new file mode 100644 index 000000000..ec4ca7c21 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/operational/business/key-performance-indicators.md @@ -0,0 +1,288 @@ +## 1.0 Understanding KPIs ("Golden Signals") +Organizations utilize key performance indicators (KPIs) a.k.a 'Golden Signals' that provide insight into the health or risk of the business and operations. Different parts of an organization would have unique KPIs that cater to measurement of their respective outcomes. For example, the product team of an eCommerce application would track the ability to process cart orders successfully as its KPI. An on-call operations team would measure their KPI as mean-time to detect (MTTD) an incident. For the financial team a KPI for cost of resources under budget is important. + +Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) are essential components of service reliability management. This guide outlines best practices for using Amazon CloudWatch and its features to calculate and monitor SLIs, SLOs, and SLAs, with clear and concise examples. + +- **SLI (Service Level Indicator):** A quantitative measure of a service's performance. +- **SLO (Service Level Objective):** The target value for an SLI, representing the desired performance level. +- **SLA (Service Level Agreement):** A contract between a service provider and its users specifying the expected level of service. + +Examples of common SLIs: + +- Availability: Percentage of time a service is operational +- Latency: Time taken to fulfill a request +- Error rate: Percentage of failed requests + +## 2.0 Discover customer and stakeholder requirements (using template as suggested below) + +1. Start with the top question: “What is the business value or business problem in scope for the given workload (ex. Payment portal, eCommerce order placement, User registration, Data reports, Support portal etc) +2. Break down the business value into categories such as User-Experience (UX); Business-Experience (BX); Operational-Experience (OpsX); Security-Experience(SecX); Developer-Experience (DevX) +3. 
Derive core signals aka “Golden Signals” for each category; the top signals around UX & BX will typically constitute the business metrics

| ID | Initials | Customer | Business Needs | Measurements | Information Sources | What does good look like? | Alerts | Dashboards | Reports |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|M1 |Example |External End User |User Experience |Response time (Page latency) |Logs / Traces |< 5s for 99.9% |No |Yes |No |
|M2 |Example |Business |Availability |Successful RPS (Requests per second) |Health Check |>85% in 5 min window |Yes |Yes |Yes |
|M3 |Example |Security |Compliance |Critical non-compliant resources |Config data |\<10 under 15 days |No |Yes |Yes |
|M4 |Example |Developers |Agility |Deployment time |Deployment logs |Always < 10 min |Yes |No |Yes |
|M5 |Example |Operators |Capacity |Queue Depth |App logs/metrics |Always < 10 |Yes |Yes |Yes |

### 2.1 Golden Signals

|Category |Signal |Notes |References |
|--- |--- |--- |--- |
|UX |Performance (Latency) |See M1 in template |Whitepaper: [Availability and Beyond (Measuring latency)](https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/measuring-availability.html#latency) |
|BX |Availability |See M2 in template |Whitepaper: [Availability and Beyond (Measuring availability)](https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/measuring-availability.html) |
|BX |Business Continuity Plan (BCP) |Amazon Resilience Hub (ARH) resilience score against defined RTO/RPO |Docs: [ARH user guide (Understanding resilience scores)](https://docs.aws.amazon.com/resilience-hub/latest/userguide/resil-score.html) |
|SecX |(Non)-Compliance |See M3 in template |Docs: [AWS Control Tower user guide (Compliance status in the console)](https://docs.aws.amazon.com/controltower/latest/userguide/compliance-statuses.html) |
|DevX |Agility |See M4 in template |Docs: [DevOps Monitoring Dashboard on AWS (DevOps metrics list)](https://docs.aws.amazon.com/solutions/latest/devops-monitoring-dashboard-on-aws/devops-metrics-list.html) |
|OpsX |Capacity (Quotas) |See M5 in template |Docs: [Amazon CloudWatch user guide (Visualizing your service quotas and setting alarms)](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Quotas-Visualize-Alarms.html) |
|OpsX |Budget Anomalies | |Docs:
1. [AWS Billing and Cost Management (AWS Cost Anomaly Detection)](https://docs.aws.amazon.com/cost-management/latest/userguide/getting-started-ad.html)
2. [AWS Budgets](https://aws.amazon.com/aws-cost-management/aws-budgets/) | + + + +## 3.0 Top Level Guidance ‘TLG’ + + +### 3.1 TLG General + +1. Work with business, architecture and security teams to help refine the business, compliance and governance requirements and ensure they accurately reflect the business needs. This includes [establishing recovery-time and recovery-point targets](https://aws.amazon.com/blogs/mt/establishing-rpo-and-rto-targets-for-cloud-applications/) (RTOs, RPOs). Formulate methods to measure requirements such as [measuring availability](https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/measuring-availability.html) and latency (ex. Uptime could allow a small percentage of faults over a 5 min window). + +2. Build an effective [tagging strategy](https://docs.aws.amazon.com/whitepapers/latest/tagging-best-practices/defining-and-publishing-a-tagging-schema.html) with purpose built schema that aligns to various business functional outcomes. This should especially cover [operational observability](https://docs.aws.amazon.com/whitepapers/latest/tagging-best-practices/operational-observability.html) and [incident management](https://docs.aws.amazon.com/whitepapers/latest/tagging-best-practices/incident-management.html). + +3. Where possible leverage dynamic thresholds for alarms (esp. for metrics that do not have baseline KPIs) using [CloudWatch anomaly detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) which provides machine learning algorithms to establish the baselines. When utilizing AWS available services that publish CW metrics (or other sources like prometheus metrics) to configure alarms consider creating [composite alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Composite_Alarm.html) to reduce alarm noise. Example: a composite alarm that comprises of a business metric indicative of availability (tracked by successful requests) and latency when configured to alarm when both drop below a critical threshold during deployments could be deterministic indicator of deployment bug. + +4. (NOTE: Requires AWS Business support or higher) AWS publishes events of interest using AWS Health service related to your resources in Personal Health Dashboard. Leverage [AWS Health Aware (AHA)](https://aws.amazon.com/blogs/mt/aws-health-aware-customize-aws-health-alerts-for-organizational-and-personal-aws-accounts/) framework (that uses AWS Health) to ingest proactive and real-time alerts aggregated across your AWS Organization from a central account (such as a management account). These alerts can be sent to preferred communication platforms such as Slack and integrates with ITSM tools like ServiceNow and Jira. +![Image: AWS Health Aware 'AHA'](../../../images/AHA-Integration.jpg) + +5. Leverage Amazon CloudWatch [Application Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-application-insights.html) to setup best monitors for resources and continuously analyze data for signs of problems with your applications. It also provides automated dashboards that show potential problems with monitored applications to quickly isolate/troubleshoot application/infrastructure issues. Leverage [Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html) to aggregate metrics and logs from containers and can be integrated seamlessly with CloudWatch Application Insights. 
+![Image: CW Application Insights](../../../images/CW-ApplicationInsights.jpg) + +6. Leverage [AWS Resilience Hub](https://aws.amazon.com/resilience-hub/) to analyze applications against defined RTOs and RPOs. Validate if the availability, latency and business continuity requirements are met by using controlled experiments using tools like [AWS Fault Injection Simulator](https://aws.amazon.com/fis/). Conduct additional Well-Architected reviews and service specific deep-dives to ensure workloads are designed to meet business requirements following AWS best practices. + +7. For further details refer to other sections of [AWS Observability Best Practices](https://aws-observability.github.io/observability-best-practices/) guidance, AWS Cloud Adoption Framework: [Operations Perspective](https://docs.aws.amazon.com/whitepapers/latest/aws-caf-operations-perspective/observability.html) whitepaper and AWS Well-Architected Framework Operational Excellence Pillar whitepaper content on '[Understading workload health](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/understanding-workload-health.html)'. + + +### 3.2 TLG by Domain (emphasis on business metrics i.e. UX, BX) + +Suitable examples are provided below using services such as CloudWatch (CW) (Ref: AWS Services that publish [CloudWatch metrics documentation](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/aws-services-cloudwatch-metrics.html)) + +#### 3.2.1 Canaries (aka Synthetic transactions) and Real-User Monitoring (RUM) + +* TLG: One of the easiest and most effective ways to understand client/customer experience is to simulate customer traffic with Canaries (Synthetic transactions) which regularly probes your services and records metrics. + +|AWS Service |Feature |Measurement |Metric |Example |Notes | +|--- |--- |--- |--- |--- |--- | +|CW |Synthetics |Availability |**SuccessPercent** |(Ex. SuccessPercent > 90 or CW Anomaly Detection for 1min Period)
**Metric Math, where m1 is SuccessPercent and the canary runs each weekday 7a-8a (CloudWatchSynthetics):**
`IF(((DAY(m1)<6) AND (HOUR(m1)>=7 AND HOUR(m1)<8)),m1)` | |
| | | | | | |
|CW |Synthetics |Availability |VisualMonitoringSuccessPercent |(Ex. VisualMonitoringSuccessPercent > 90 for 5 min Period for UI screenshot comparisons)
**Metric Math, where m1 is VisualMonitoringSuccessPercent and the canary runs each weekday 7a-8a (CloudWatchSynthetics):**
`IF(((DAY(m1)<6) AND (HOUR(m1)>=7 AND HOUR(m1)<8)),m1)` |If the customer expects the canary to match a predetermined UI screenshot |
| | | | | | |
|CW |RUM |Response Time |Apdex Score |(Ex. Apdex score:
NavigationFrustratedCount < ‘N’ expected value) | | +| | | | | | | + + +#### 3.2.2 API Frontend + + +|AWS Service |Feature |Measurement |Metric |Example |Notes | +|--- |--- |--- |--- |--- |--- | +|CloudFront | |Availability |Total error rate |(Ex. [Total error rate] < 10 or CW Anomaly Detection for 1min Period) |Availability as a measure of error rate | +| | | | | | | +|CloudFront |(Requires turning on additional metrics) |Peformance |Cache hit rate |(Ex.Cache hit rate < 10 CW Anomaly Detection for 1min Period) | | +| | | | | | | +|Route53 |Health checks |(Cross region) Availability |HealthCheckPercentageHealthy |(Ex. [Minimum of HealthCheckPercentageHealthy] > 90 or CW Anomaly Detection for 1min Period) | | +| | | | | | | +|Route53 |Health checks |Latency |TimeToFirstByte |(Ex. [p99 TimeToFirstByte] < 100 ms or CW Anomaly Detection for 1min Period) | | +| | | | | | | +|API Gateway | |Availability |Count |(Ex. [(4XXError + 5XXError) / Count) * 100] < 10 or CW Anomaly Detection for 1min Period) |Availability as a measure of "abandoned" requests | +| | | | | | | +|API Gateway | |Latency |Latency (or IntegrationLatency i.e. backend latency) |(Ex. p99 Latency < 1 sec or CW Anomaly Detection for 1min Period) |p99 will have greater tolerance than lower percentile like p90. (p50 is same as average) | +| | | | | | | +|API Gateway | |Performance |CacheHitCount (and Misses) |(Ex. [CacheMissCount / (CacheHitCount + CacheMissCount) * 100] < 10 or CW Anomaly Detection for 1min Period) |Performance as a measure of Cache (Misses) | +| | | | | | | +|Application Load Balancer (ALB) | |Availability |RejectedConnectionCount |(Ex.[RejectedConnectionCount/(RejectedConnectionCount + RequestCount) * 100] < 10 CW Anomaly Detection for 1min Period) |Availability as a measure of rejected requests due to max connections breached | +| | | | | | | +|Application Load Balancer (ALB) | |Latency |TargetResponseTime |(Ex. p99 TargetResponseTime < 1 sec or CW Anomaly Detection for 1min Period) |p99 will have greater tolerance than lower percentile like p90. (p50 is same as average) | +| | | | | | | + + +#### 3.2.3 Serverless + +|AWS Service |Feature |Measurement |Metric |Example |Notes | +|--- |--- |--- |--- |--- |--- | +|S3 |Request metrics |Availability |AllRequests |(Ex. [(4XXErrors + 5XXErrors) / AllRequests) * 100] < 10 or CW Anomaly Detection for 1min Period) |Availability as a measure of "abandoned" requests | +| | | | | | | +|S3 |Request metrics |(Overall) Latency |TotalRequestLatency |(Ex. [p99 TotalRequestLatency] < 100 ms or CW Anomaly Detection for 1min Period) | | +| | | | | | | +|DynamoDB (DDB) | |Availability |ThrottledRequests |(Ex. [ThrottledRequests] < 100 or CW Anomaly Detection for 1min Period) |Availability as a measure of "throttled" requests | +| | | | | | | +|DynamoDB (DDB) | |Latency |SuccessfulRequestLatency |(Ex. [p99 SuccessfulRequestLatency] < 100 ms or CW Anomaly Detection for 1min Period) | | +| | | | | | | +|Step Functions | |Availability |ExecutionsFailed |(Ex. ExecutionsFailed = 0)
**[ex. Metric Math where m1 is ExecutionsFailed (Step function Execution) UTC time: `IF(((DAY(m1)<6 OR ** ** DAY(m1)==7) AND (HOUR(m1)>21 AND HOUR(m1)<7)),m1)]` |Assuming business flow that requests completion of step functions as a daily operation 9p-7a during weekdays (start of day business operations) | +| | | | | | | + + +#### 3.2.4 Compute and Containers + +|AWS Service |Feature |Measurement |Metric |Example |Notes | +|--- |--- |--- |--- |--- |--- | +|EKS |Prometheus metrics |Availability |APIServer Request Success Ratio |(ex. Prometheus metric like [APIServer Request Success Ratio](https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/service/cwagent-prometheus/sample_cloudwatch_dashboards/kubernetes_api_server/cw_dashboard_kubernetes_api_server.json)) |See [best practices for monitoring EKS control plane metrics](https://aws.github.io/aws-eks-best-practices/reliability/docs/controlplane/#monitor-control-plane-metrics) and [EKS observability](https://docs.aws.amazon.com/eks/latest/userguide/eks-observe.html) for details. | +| | | | | | | +|EKS |Prometheus metrics |Performance |apiserver_request_duration_seconds, etcd_request_duration_seconds |apiserver_request_duration_seconds, etcd_request_duration_seconds | | +| | | | | | | +|ECS | |Availability |Service RUNNING task count |Service RUNNING task count |See ECS CW metrics [documentation](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch-metrics.html#cw_running_task_count) | +| | | | | | | +|ECS | |Performance |TargetResponseTime |(ex. [p99 TargetResponseTime] < 100 ms or CW Anomaly Detection for 1min Period) |See ECS CW metrics [documentation](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch-metrics.html#cw_running_task_count) | +| | | | | | | +|EC2 (.NET Core) |CW Agent Performance Counters |Availability |(ex. [ASP.NET Application Errors Total/Sec](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/appinsights-metrics-ec2.html#appinsights-metrics-ec2-built-in) < 'N') |(ex. [ASP.NET Application Errors Total/Sec](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/appinsights-metrics-ec2.html#appinsights-metrics-ec2-built-in) < 'N') |See EC2 CW Application Insights [documentation](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/appinsights-metrics-ec2.html#appinsights-metrics-ec2-built-in) | +| | | | | | | + + +#### 3.2.5 Databases (RDS) + +|AWS Service |Feature |Measurement |Metric |Example |Notes | +|--- |--- |--- |--- |--- |--- | +|RDS Aurora |Performance Insights (PI) |Availability |Average active sessions |(Ex. Average active serssions with CW Anomaly Detection for 1min Period) |See RDS Aurora CW PI [documentation](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/USER_PerfInsights.Overview.ActiveSessions.html#USER_PerfInsights.Overview.ActiveSessions.AAS) | +| | | | | | | +|RDS Aurora | |Disaster Recovery (DR) |AuroraGlobalDBRPOLag |(Ex. AuroraGlobalDBRPOLag < 30000 ms for 1min Period) |See RDS Aurora CW [documentation](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.AuroraMonitoring.Metrics.html) | +| | | | | | | +|RDS Aurora | |Performance |Commit Latency, Buffer Cache Hit Ratio, DDL Latency, DML Latency |(Ex. 
Commit Latency with CW Anomaly Detection for 1min Period) |See RDS Aurora CW PI [documentation](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/USER_PerfInsights.Overview.ActiveSessions.html#USER_PerfInsights.Overview.ActiveSessions.AAS) | +| | | | | | | +|RDS (MSSQL) |PI |Performance |SQL Compilations |(Ex.
SQL Compliations > 'M' for 5 min Period) |See RDS CW PI [documentation](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights_Counters.html#USER_PerfInsights_Counters.SQLServer) | +| | | | | | | + + +## 4.0 Using Amazon CloudWatch and Metric Math for Calculating SLIs, SLOs, and SLAs + +### 4.1 Amazon CloudWatch and Metric Math + +Amazon CloudWatch provides monitoring and observability services for AWS resources. Metric Math allows you to perform calculations using CloudWatch metric data, making it an ideal tool for calculating SLIs, SLOs, and SLAs. + +#### 4.1.1 Enabling Detailed Monitoring + +Enable Detailed Monitoring for your AWS resources to get 1-minute data granularity, allowing for more accurate SLI calculations. + +#### 4.1.2 Organizing Metrics with Namespaces and Dimensions + +Use Namespaces and Dimensions to categorize and filter metrics for easier analysis. For example, use Namespaces to group metrics related to a specific service, and Dimensions to differentiate between various instances of that service. + +### 4.2 Calculating SLIs with Metric Math + +#### 4.2.1 Availability + +To calculate availability, divide the number of successful requests by the total number of requests: + +``` +availability = 100 * (successful_requests / total_requests) +``` + + +**Example:** + +Suppose you have an API Gateway with the following metrics: +- `4XXError`: Number of 4xx client errors +- `5XXError`: Number of 5xx server errors +- `Count`: Total number of requests + +Use Metric Math to calculate the availability: + +``` +availability = 100 * ((Count - 4XXError - 5XXError) / Count) +``` + + +#### 4.2.2 Latency + +To calculate average latency, use the `SampleCount` and `Sum` statistics provided by CloudWatch: + +``` +average_latency = Sum / SampleCount +``` + + +**Example:** + +Suppose you have a Lambda function with the following metric: +- `Duration`: Time taken to execute the function + +Use Metric Math to calculate the average latency: + +``` +average_latency = Duration.Sum / Duration.SampleCount +``` + + +#### 4.2.3 Error Rate + +To calculate the error rate, divide the number of failed requests by the total number of requests: + +``` +error_rate = 100 * (failed_requests / total_requests) +``` + + +**Example:** + +Using the API Gateway example from before: + +``` +error_rate = 100 * ((4XXError + 5XXError) / Count) +``` + + +### 4.4 Defining and Monitoring SLOs + +#### 4.4.1 Setting Realistic Targets + +Define SLO targets based on user expectations and historical performance data. Set achievable targets to ensure a balance between service reliability and resource utilization. + +#### 4.4.2 Monitoring SLOs with CloudWatch + +Create CloudWatch Alarms to monitor your SLIs and notify you when they approach or breach SLO targets. This enables you to proactively address issues and maintain service reliability. + +#### 4.4.3 Reviewing and Adjusting SLOs + +Periodically review your SLOs to ensure they remain relevant as your service evolves. Adjust targets if necessary and communicate any changes to stakeholders. + +### 4.5 Defining and Measuring SLAs + +#### 4.5.1 Setting Realistic Targets + +Define SLA targets based on historical performance data and user expectations. Set achievable targets to ensure a balance between service reliability and resource utilization. + +#### 4.5.2 Monitoring and Alerting + +Set up CloudWatch Alarms to monitor SLIs and notify you when they approach or breach SLA targets. This enables you to proactively address issues and maintain service reliability. 
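To make 4.5.2 concrete, here is a rough sketch of creating such an alarm from the AWS CLI, using the API Gateway availability expression from section 4.2.1. The API name (`my-api`), SNS topic ARN, account/region, and thresholds are illustrative placeholders rather than values from this guide.

```
# Alarm when calculated availability drops below 99.9% for 3 consecutive 1-minute periods
aws cloudwatch put-metric-alarm \
  --alarm-name "my-api-availability-slo" \
  --comparison-operator LessThanThreshold \
  --threshold 99.9 \
  --evaluation-periods 3 \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:slo-alerts \
  --metrics '[
    {"Id":"availability","Expression":"100*((requests-errors4xx-errors5xx)/requests)","Label":"Availability","ReturnData":true},
    {"Id":"requests","MetricStat":{"Metric":{"Namespace":"AWS/ApiGateway","MetricName":"Count","Dimensions":[{"Name":"ApiName","Value":"my-api"}]},"Period":60,"Stat":"Sum"},"ReturnData":false},
    {"Id":"errors4xx","MetricStat":{"Metric":{"Namespace":"AWS/ApiGateway","MetricName":"4XXError","Dimensions":[{"Name":"ApiName","Value":"my-api"}]},"Period":60,"Stat":"Sum"},"ReturnData":false},
    {"Id":"errors5xx","MetricStat":{"Metric":{"Namespace":"AWS/ApiGateway","MetricName":"5XXError","Dimensions":[{"Name":"ApiName","Value":"my-api"}]},"Period":60,"Stat":"Sum"},"ReturnData":false}
  ]'
```

Treating missing data as `notBreaching` is one reasonable choice for low-traffic windows; adjust it to match how you want gaps in traffic to count against the SLA.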
+ +#### 4.5.3 Regularly Reviewing SLAs + +Periodically review SLAs to ensure they remain relevant as your service evolves. Adjust targets if necessary and communicate any changes to stakeholders. + +### 4.6 Measuring SLA or SLO Performance Over a Set Period + +To measure SLA or SLO performance over a set period, such as a calendar month, use CloudWatch metric data with custom time ranges. + +**Example:** + +Suppose you have an API Gateway with an SLO target of 99.9% availability. To measure the availability for the month of April, use the following Metric Math expression: + +``` +availability = 100 * ((Count - 4XXError - 5XXError) / Count) +``` + + +Then, configure the CloudWatch metric data query with a custom time range: + +- **Start Time:** `2023-04-01T00:00:00Z` +- **End Time:** `2023-04-30T23:59:59Z` +- **Period:** `2592000` (30 days in seconds) + +Finally, use the `AVG` statistic to calculate the average availability over the month. If the average availability is equal to or greater than the SLO target, you have met your objective. + +## 5.0 Summary + +Key Performance Indicators (KPIs) a.k.a 'Golden Signals' must align to business and stake-holder requirements. Calculating SLIs, SLOs, and SLAs using Amazon CloudWatch and Metric Math is crucial for managing service reliability. Follow the best practices outlined in this guide to effectively monitor and maintain the performance of your AWS resources. Remember to enable Detailed Monitoring, organize metrics with Namespaces and Dimensions, use Metric Math for SLI calculations, set realistic SLO and SLA targets, and establish monitoring and alerting systems with CloudWatch Alarms. By applying these best practices, you can ensure optimal service reliability, better resource utilization, and improved customer satisfaction. + + + + diff --git a/docusaurus/observability-best-practices/docs/guides/operational/business/monitoring-for-business-outcomes.md b/docusaurus/observability-best-practices/docs/guides/operational/business/monitoring-for-business-outcomes.md new file mode 100644 index 000000000..6c18a8c22 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/operational/business/monitoring-for-business-outcomes.md @@ -0,0 +1,92 @@ +# Why should you do observability? + +See [Developing an Observability Strategy](https://www.youtube.com/watch?v=Ub3ATriFapQ) on YouTube + +## What really matters? + +Everything that you do at work should align to your organization's mission. All of us that are employed work to fulfill our organization's mission and towards its vision. At Amazon, our mission states that: + +> Amazon strives to be Earth’s most customer-centric company, Earth’s best employer, and Earth’s safest place to work. + +— [About Amazon](https://www.aboutamazon.com/about-us) + +In IT, every project, deployment, security measure or optimization should work towards a business outcome. It seems obvious, but you should not do anything that does not add value to the business. As ITIL puts it: + +> Every change should deliver business value. + +— ITIL Service Transition, AXELOS, 2011, page 44. +— See [Change Management in the Cloud AWS Whitepaper](https://docs.aws.amazon.com/whitepapers/latest/change-management-in-the-cloud/change-management-in-the-cloud.html) + +Mission and business value are important because they should inform everything that you do. 
There are many benefits to observability, these include: + +- Better availability +- More reliability +- Understanding of application health and performance +- Better collaboration +- Proactive detection of issues +- Increase customer satisfaction +- Reduce time to market +- Reduce operational costs +- Automation + +All of these benefits have one thing in common, they all deliver business value, either directly to the customer or indrectly to the organization. When thinking about observability, everything should come back to thinking about whether or not your application is delivering business value. + +This means that observability should be measuring things that contribute towards delivering business value, focusing on business outcomes and when they are at risk: you should think about what your customers want and what they need. + +## Where do I start? + +Now that you know what matters, you need to think about what you need to measure. At Amazon, we start with the customer and work backwards from their needs: + +> We are internally driven to improve our services, adding benefits and features, before we have to. We lower prices and increase value for customers before we have to. We invent before we have to. + +— Jeff Bezos, [2012 Shareholder Letter](https://s2.q4cdn.com/299287126/files/doc_financials/annual/2012-Shareholder-Letter.pdf) + +Let's take a simple example, using an e-commerce site. First, think about what you want as a customer when you are buying products online, it may not be the same for everyone, but you probably care about things like: + +- Delivery +- Price +- Security +- Page Speed +- Search (can you find the product you are looking for?) + +Once you know what your customers care about, you can start to measure them and how they affect your business outcomes. Page speed directly impacts your conversion rate and search engine ranking. A 2017 study showed that more than half (53%) of mobile users abandon a page if it takes more than 3 seconds to load. There are of course, many studies that show the importance of page speed, and it is an obvious metric to measure, but you need to measure it and take action because it has a measureable impact on conversion and you can use that data to make improvements. + +## Working backwards + +You cannot be expected to know everything that you customers care about. If you are reading this, you are probably in a technical role. You need to talk to the stakeholders in your organisation, this isn't always easy, but it is vital to ensuring that you are measuring what's important. + +Let's continue with the e-commerce example. This time, consider search: it may be obvious that customers need to be able to search for a product in order to buy it, but did you know that according to a [Forrester Research report](https://www.forrester.com/report/MustHave+eCommerce+Features/-/E-RES89561), 43% of visitors navigate immediately to the search box and searches are 2-3 times more likely to convert compared to non-searchers. Search is really important, it has to work well and you need to monitor it - maybe you discover that particular searches are yeilding no results and that you need to move from naive pattern matching to natural language processing. This is an example of monitoring for a business outcome and then acting to improve the customer experience. + +At Amazon: + +> We strive to deeply understand customers and work backwards from their pain points to rapidly develop innovations that create meaningful solutions in their lives. 
+ +— Daniel Slater - Worldwide Lead, Culture of Innovation, AWS in [Elements of Amazon’s Day 1 Culture](https://aws.amazon.com/executive-insights/content/how-amazon-defines-and-operationalizes-a-day-1-culture/) + +We start with the customer and work backwards from their needs. This isn't the only approach to success in business, but it is a good approach to observability. Work with stakeholders to understand what's important to your customers and then work backwards from there. + +As an added benefit, if you collect metrics that are important to your customers and stakeholders, you can visualize these in near real-time dashboards and avoid having to create reports or answer questions such as "how long is it taking to load the landing page?" or "how much is it costing to run the website?" - stakeholders and executives should be able to self serve this information. + +These are the kind of high level metrics that **really matter** for your application and they are also almost always the best indicator that there is an issue. For example: an alert indicating that there are fewer orders than you would normally expect in a given time period tells you that there is probably an issue that is impacting customers; an alert indicating that a volume on a server is nearly full or that you have a high number of 5xx errors for a particular service may be something that requires fixing, but you still have to understand customer impact and then prioritize accordingly - this can take time. + +Issues that impact customers are easy to identify when you are measuring these high level business metrics. These metrics are the **what** is happening. Other metrics and other forms of observability such as tracing and logs are the **why** is this happening, which will lead you to what you can do to fix it or improve it. + +## What to observe + +Now you have an idea of what matters to your customers, you can identify Key Performance Indicators (KPIs). These are your high level metrics that will tell you if business outcomes are at risk. You also need to gather information from many different sources that may impact those KPIs, this is where you need to start thinking about metrics that could impact those KPIs. As was discussed earlier, the number of 5xx errors, does not indicate impact, but it could have an effect on your KPIs. Work your way backwards from what will impact business outcomes to things that may impact business outcomes. + +Once you know what you need to collect, you need to identify the sources of information that will provide you with the metrics you can use to measure KPIs and related metrics that may impact those KPIs. This is the basis of what you observe. + +This data is likely to come from Metrics, Logs and Traces. Once you have this data, you can use it to alert when outcomes are at risk. + +You can then evaluate the impact and attempt to rectify the issue. Almost always, this data will tell you that there’s a problem, before an isolated technical metric (such as cpu or memory) does. + +You can use observability reactively to fix an issue impacting business outcomes or you can use the data proactively to do something like improve your customer's search experience. + +## Conclusion + +Whilst CPU, RAM, Disk Space and other technical metrics are important for scaling, performance, capacity and cost – they don’t really tell you how your application is doing and don’t give any insight in to customer experience. 
+ +Your customers are what’s important and it’s their experience that you should be monitoring. + +That’s why you should work backwards from your customers’ requirements, working with your stakeholders and establish KPIs and metrics that matter. diff --git a/docusaurus/observability-best-practices/docs/guides/operational/business/sla-percentile.md b/docusaurus/observability-best-practices/docs/guides/operational/business/sla-percentile.md new file mode 100644 index 000000000..f5c231805 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/operational/business/sla-percentile.md @@ -0,0 +1,36 @@ +# Percentiles are important + +Percentiles are important in monitoring and reporting because they provide a more detailed and accurate view of data distribution compared to just relying on averages. An average can sometimes hide important information, such as outliers or variations in the data, that can significantly impact performance and user experience. Percentiles, on the other hand, can reveal these hidden details and give a better understanding of how the data is distributed. + +In [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/), percentiles can be used to monitor and report on various metrics, such as response times, latency, and error rates, across your applications and infrastructure. By setting up alarms on percentiles, you can get alerted when specific percentile values exceed thresholds, allowing you to take action before they impact more customers. + +To use [percentiles in CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html#Percentiles), choose your metric in **All metrics** in the CloudWatch console and use an existing metric and set the **statistic** to **p99**, you can then edit the value after the p to whichever percentile you would like. You can then view percentile graphs, add them to [CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) and set alarms on these metrics. For example, you could set an alarm to notify you when the 95th percentile of response times exceeds a certain threshold, indicating that a significant percentage of users are experiencing slow response times. + +The histogram below was created in [Amazon Managed Grafana](https://aws.amazon.com/grafana/) using a [CloudWath Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) query from [CloudWatch RUM](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html) logs. The query used was: + +``` +fields @timestamp, event_details.duration +| filter event_type = "com.amazon.rum.performance_navigation_event" +| sort @timestamp desc +``` + +The histogram plots the page load time in milliseconds. With this view, it's possible to clearly see the outliers. This data is hidden if average is used. + +![Histogram](../../../images/percentiles-histogram.png) + +The same data shown in CloudWatch using the average value indicates that pages are taking under two seconds to load. You can see from the histogram above, that most pages are actually taking less than a second and we have outliers. + +![Histogram](../../../images/percentiles-average.png) + +Using the same data again with a percentile (p99) indicates that there is an issue, the CloudWatch graph now shows that 99 percent of page loads are taking less than 23 seconds. 
+ +![Histogram](../../../images/percentiles-p99.png) + +To make this easier to visualize, the graphs below compare the average value to the 99th percentile. In this case, the target page load time is two seconds, it is possible to use alternative [CloudWatch statistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Statistics-definitions.html#Percentile-versus-Trimmed-Mean) and [metric math](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html) to make other calculations. In this case Percentile rank (PR) is used with the statistic **PR(:2000)** to show that 92.7% of page loads are happening within the target of 2000ms. + +![Histogram](../../../images/percentiles-comparison.png) + +Using percentiles in CloudWatch can help you gain deeper insights into your system's performance, detect issues early, and improve your customer's experience by identifying outliers that would otherwise be hidden. + + + diff --git a/docusaurus/observability-best-practices/docs/guides/operational/gitops-with-amg/gitops-with-amg.md b/docusaurus/observability-best-practices/docs/guides/operational/gitops-with-amg/gitops-with-amg.md new file mode 100644 index 000000000..2b3f9dd7a --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/operational/gitops-with-amg/gitops-with-amg.md @@ -0,0 +1,64 @@ +# Using GitOps and Grafana Operator with Amazon Managed Grafana + +## How to use this guide + +This Observability best practices guide is meant for developers and architects who want to understand how to use [grafana-operator](https://github.com/grafana-operator/grafana-operator#:~:text=The%20grafana%2Doperator%20is%20a,an%20easy%20and%20scalable%20way.) as a Kubernetes operator on your Amazon EKS cluster to create and manage the lifecycle of Grafana resources and Grafana dashboards in Amazon Managed Grafana in a Kubernetes native way. + +## Introduction + +Customers use Grafana as an observability platform for open source analytics and monitoring solution. We have seen customers running their workloads in Amazon EKS want to shift their focus towards workload gravity and rely on Kubernetes-native controllers to deploy and manage the lifecycle of external resources such as Cloud resources. We have seen customers installing [AWS Controllers for Kubernetes (ACK)](https://aws-controllers-k8s.github.io/community/docs/community/overview/) to create, deploy and manage AWS services. Many customers these days opt to offload the Prometheus and Grafana implementations to managed services and in case of AWS these services are [Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/?icmpid=docs_homepage_mgmtgov) and [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/?icmpid=docs_homepage_mgmtgov) for monitoring their workloads. + +One common challenge customers face while using Grafana is, in creating and managing the lifecycle of Grafana resources and Grafana dashboards in external Grafana instances such as Amazon Managed Grafana from their Kubernetes cluster. Customers face challenges in finding ways to completely automate and manage infrastructure and application deployment of their whole system using Git based workflows which also includes creating of Grafana resources in Amazon Managed Grafana. 
In this Observability best practices guide, we will focus on the following topics: + +* Introduction on Grafana Operator - A Kubernetes operator to manage external Grafana instances from your Kubernetes cluster +* Introduction to GitOps - Automated mechanisms to create and manage your infrastructure using Git based workflows +* Using Grafana Operator on Amazon EKS to manage Amazon Managed Grafana +* Using GitOps with Flux on Amazon EKS to manage Amazon Managed Grafana + +## Introduction on Grafana Operator + +The [grafana-operator](https://github.com/grafana-operator/grafana-operator#:~:text=The%20grafana%2Doperator%20is%20a,an%20easy%20and%20scalable%20way.) is a Kubernetes operator built to help you manage your Grafana instances inside Kubernetes. Grafana Operator makes it possible for you to manage and create Grafana dashboards, datasources etc. declaratively between multiple instances in an easy and scalable way. The Grafana operator now supports managing resources such as dashboards, datasources etc hosted on external environments like Amazon Managed Grafana. This ultimately enables us to use GitOps mechanisms using CNCF projects such as [Flux](https://fluxcd.io/) to create and manage the lifecyle of resources in Amazon Managed Grafana from Amazon EKS cluster. + +## Introduction to GitOps + +### What is GitOps and Flux + +GitOps is a software development and operations methodology that uses Git as the source of truth for deployment configurations. It involves keeping the desired state of an application or infrastructure in a Git repository and using Git-based workflows to manage and deploy changes. GitOps is a way of managing application and infrastructure deployment so that the whole system is described declaratively in a Git repository. It is an operational model that offers you the ability to manage the state of multiple Kubernetes clusters leveraging the best practices of version control, immutable artifacts, and automation. + +Flux is a GitOps tool that automates the deployment of applications on Kubernetes. It works by continuously monitoring the state of a Git repository and applying any changes to a cluster. Flux integrates with various Git providers such as GitHub, [GitLab](https://dzone.com/articles/auto-deploy-spring-boot-app-using-gitlab-cicd), and Bitbucket. When changes are made to the repository, Flux automatically detects them and updates the cluster accordingly. + +### Advantages of using Flux + +* **Automated deployments**: Flux automates the deployment process, reducing manual errors and freeing up developers to focus on other tasks. +* **Git-based workflow**: Flux leverages Git as a source of truth, which makes it easier to track and revert changes. +* **Declarative configuration**: Flux uses [Kubernetes](https://dzone.com/articles/kubernetes-full-stack-example-with-kong-ingress-co) manifests to define the desired state of a cluster, making it easier to manage and track changes. + +### Challenges in adopting Flux + +* **Limited customization**: Flux only supports a limited set of customizations, which may not be suitable for all use cases. +* **Steep learning curve**: Flux has a steep learning curve for new users and requires a deep understanding of Kubernetes and Git. + +## Using Grafana Operator on Amazon EKS to manage resources in Amazon Managed Grafana + +As discussed in previous section, Grafana Operator enables us to use our Kubernetes cluster to create and manage the lifecyle of resources in Amazon Managed Grafana in a Kubernetes native way. 
The below architecture diagram shows the demonstration of Kubernetes cluster as a control plane with using Grafana Operator to setup an identity with AMG, adding Amazon Managed Service for Prometheus as a data source and creating dashboards on Amazon Managed Grafana from Amazon EKS cluster in a Kubernetes native way. + +![GitOPS-WITH-AMG-2](../../../images/Operational/gitops-with-amg/gitops-with-amg-2.jpg) + +Please refer to our post on [Using Open Source Grafana Operator on your Kubernetes cluster to manage Amazon Managed Grafana](https://aws.amazon.com/blogs/mt/using-open-source-grafana-operator-on-your-kubernetes-cluster-to-manage-amazon-managed-grafana/) for detailed demonstration of how to deploy the above solution on your Amazon EKS cluster. + +## Using GitOps with Flux on Amazon EKS to manage resources in Amazon Managed Grafana + +As discussed above, Flux automates the deployment of applications on Kubernetes. It works by continuously monitoring the state of a Git repository such as GitHub and when changes are made to the repository, Flux automatically detects them and updates the cluster accordingly. Please reference the below architecture where we will be demonstrating how to use Grafana Operator from your Kubernetes cluster and GitOps mechanisms using Flux to add Amazon Managed Service for Prometheus as a data source and create dashboards in Amazon Managed Grafana in a Kubernetes native way. + +![GitOPS-WITH-AMG-1](../../../images/Operational/gitops-with-amg/gitops-with-amg-1.jpg) + +Please refer to our One Observability Workshop module - [GitOps with Amazon Managed Grafana](https://catalog.workshops.aws/observability/en-US/aws-managed-oss/gitops-with-amg). This module sets up required "Day 2" operational tooling such as the following on your EKS cluster: + +* [External Secrets Operator](https://github.com/external-secrets/external-secrets/tree/main/deploy/charts/external-secrets) is installed successfully to read Amazon Managed Grafana secrets from AWS Secret Manager +* [Prometheus Node Exporter](https://github.com/prometheus/node_exporter)to measure various machine resources such as memory, disk and CPU utilization +* [Grafana Operator](https://github.com/grafana-operator/grafana-operator) to use our Kubernetes cluster to create and manage the lifecyle of resources in Amazon Managed Grafana in a Kubernetes native way. +* [Flux](https://fluxcd.io/) to automate the deployment of applications on Kubernetes using GitOps mechanisms. + +## Conclusion + +In this section of Observability best practices guide, we learned about using Grafana Operator and GitOps with Amazon Managed Grafana. We started with learning about GitOps and Grafana Operator. Then we focussed on how to use Grafana Operator on Amazon EKS to manage resources in Amazon Managed Grafana and on how to use GitOps with Flux on Amazon EKS to manage resources in Amazon Managed Grafana to setup an identity with AMG, adding AWS data sources on Amazon Managed Grafana from Amazon EKS cluster in a Kubernetes native way. 
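For orientation, the Kubernetes-native workflow described above boils down to applying Grafana Operator custom resources to the EKS cluster. The sketch below registers an external Amazon Managed Grafana workspace and attaches a dashboard to it; the workspace URL, secret name, and dashboard JSON are placeholders, and the field names follow grafana-operator v5 conventions, so check the operator documentation for your installed version.

```
# Register an Amazon Managed Grafana workspace as an external Grafana instance,
# then attach a minimal dashboard to it via label selection.
kubectl apply -f - <<'EOF'
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: external-grafana
  labels:
    dashboards: external-grafana
spec:
  external:
    url: https://g-xxxxxxxxxx.grafana-workspace.us-west-2.amazonaws.com   # placeholder workspace URL
    apiKey:
      name: grafana-admin-credentials    # Kubernetes secret holding the workspace API key
      key: GF_SECURITY_ADMIN_APIKEY
---
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: cluster-overview
spec:
  instanceSelector:
    matchLabels:
      dashboards: external-grafana
  json: >
    {
      "title": "Cluster Overview",
      "panels": [],
      "schemaVersion": 36
    }
EOF
```

Because these are plain Kubernetes manifests, committing them to the Git repository watched by Flux is all that is needed to fold them into the GitOps flow described in this guide.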
\ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/guides/operational/observability-driven-dev.md b/docusaurus/observability-best-practices/docs/guides/operational/observability-driven-dev.md new file mode 100644 index 000000000..e69de29bb diff --git a/docusaurus/observability-best-practices/docs/guides/partners/databricks.md b/docusaurus/observability-best-practices/docs/guides/partners/databricks.md new file mode 100644 index 000000000..9c57fb8fa --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/partners/databricks.md @@ -0,0 +1,117 @@ +# Databricks Monitoring and Observability Best Practices in AWS + +Databricks is a platform for managing data analytics and AI/ML workloads. This guide aim at supporting customers running [Databricks on AWS](https://aws.amazon.com/solutions/partners/databricks/) with monitoring these workloads using AWS Native services for observability or OpenSource Managed Services. + +## Why monitor Databricks + +Operation teams managing Databricks clusters benefit from having an integrated, customized dashboard to track workload status, errors, performance bottlenecks; alerting on unwanted behaviour, such as total resource usage over time, or percentual amount of errors; and centralized logging, for root cause analysis, as well as extracting additional customized metrics. + +## What to monitor + +Databricks run Apache Spark in its cluster instances, which has native features to expose metrics. These metrics will give information regarding drivers, workers, and the workloads being executed in the cluster. + +The instances running Spark will have additional useful information about storage, CPU, memory, and networking. It´s important to understand what external factors could be affecting the performance of a Databricks cluster. In the case of clusters with numerous instances, understanding bottlenecks and general health is important as well. + +## How to monitor + +To install collectors and it's dependencies, Databricks init scripts will be needed. These are scripts that are runned in each instance of a Databricks cluster at boot time. + +Databricks cluster permissions will also need permission to send metrics and logs using instance profiles. + +Finally, it's a best practice to configure metrics namespace in Databricks cluster Spark configuration, replacing `testApp` with a proper reference to the cluster. + +![Databricks Spark Config](../../images/databricks_spark_config.png) +*Figure 1: example of metrics namespace Spark configuration* + +## Key parts of a good Observability solution for DataBricks + +**1) Metrics:** Metrics are numbers that describe activity or a particular process measured over a period of time. Here are different types of metrics on Databricks: + +System resource-level metrics, such as CPU, memory, disk, and network. +Application Metrics using Custom Metrics Source, StreamingQueryListener, and QueryExecutionListener, +Spark Metrics exposed by MetricsSystem. + +**2) Logs:** Logs are a representation of serial events that have happened, and they tell a linear story about them. Here are different types of logs on Databricks: + +- Event logs +- Audit logs +- Driver logs: stdout, stderr, log4j custom logs (enable structured logging) +- Executor logs: stdout, stderr, log4j custom logs (enable structured logging) + +**3) Traces:** Stack traces provide end-to-end visibility, and they show the entire flow through stages. 
This is useful when you must debug to identify which stages/codes cause errors/performance issues. + +**4) Dashboards:** Dashboards provide a great summary view of an application/service’s golden metrics. + +**5) Alerts:** Alerts notify engineers about conditions that require attention. + +## AWS Native Observability options + +Native solutions, such as Ganglia UI and Log Delivery, are great solutions for collecting system metrics and querying Apache Spark™ metrics. However, some areas can be improved: + +- Ganglia doesn’t support alerts. +- Ganglia doesn’t support creating metrics derived from logs (e.g., ERROR log growth rate). +- You can’t use custom dashboards to track SLO (Service Level Objectives) and SLI (Service Level Indicators) related to data-correctness, data-freshness, or end-to-end latency, and then visualize them with ganglia. + +[Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) is a critical tool for monitoring and managing your Databricks clusters on AWS. It provides valuable insights into cluster performance and helps you identify and resolve issues quickly. Integrating Databricks with CloudWatch and enabling structured logging can help improve those areas. CloudWatch Application Insights can help you automatically discover the fields contained in the logs, and CloudWatch Logs Insights provides a purpose-built query language for faster debugging and analysis. + +![Databricks With CloudWatch](../../images/databricks_cw_arch.png) +*Figure 2: Databricks CloudWatch Architecture* + +For more informaton on how to use CloudWatch to monitor Databricks, see: +[How to Monitor Databricks with Amazon CloudWatch](https://aws.amazon.com/blogs/mt/how-to-monitor-databricks-with-amazon-cloudwatch/) + +## Open-source software observability options + +[Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/) is a Prometheus-compatible monitoring managed, serverless service, that will be responsible for storing metrics, and managing alerts created on top of these metrics. Prometheus is a popular open source monitoring technology, being the second project belonging to the Cloud Native Computing Foundation, right after Kubernetes. + +[Amazon Managed Grafana](https://aws.amazon.com/grafana/) is a managed service for Grafana. Grafana is an open source technology for time-series data visualization, commonly used for observability. We can use Grafana to visualize data from several sources, such as Amazon Managed Service for Prometheus, Amazon CloudWatch, and many others. It will be used to visualize Databricks metrics and alerts. + +[AWS Distro for OpenTelemetry](https://aws-otel.github.io/) is the AWS-supported distribution of OpenTelemetry project, which provides open source standards, libraries, and services for collecting traces and metrics. Through OpenTelemetry, we can collect several different observability data formats, such as Prometheus or StatsD, enrich this data, and send it to several destinations, such as CloudWatch or Amazon Managed Service for Prometheus. + +### Use cases + +While AWS Native services will deliver the observability needed to manage Databricks clusters, there are some scenarios where using Open Source managed services is the best choice. + +Both Prometheus and Grafana are very popular technologies, and are already being used in many companies. 
AWS Open Source services for observability will allow operations teams to use the same existing infrastructure, the same query language, and existing dashboards and alerts to monitor Databricks workloads, without the heavy lifting of managing these services infrastructure, scalability, and performance. + +ADOT is the best alternative for teams that need to send metrics and traces to different destinations, such as CloudWatch and Prometheus, or work with different types of data sources, such as OTLP and StatsD. + +Finally, Amazon Managed Grafana supports many different Data Sources, including CloudWatch and Prometheus, and help correlate data for teams that decide on using more than one tool, allowing for the creation of templates that will enable observability for all Databricks Clusters, and a powerful API that allow its provisioning and configuration through Infrastructure as Code. + +![Databricks OpenSource Observability Diagram](../../images/databricks_oss_diagram.png) +*Figure 3: Databricks Open Source Observability Architecture* + +To observe metrics from a Databricks cluster using AWS Managed Open Source Services for Observability, you will need an Amazon Managed Grafana workspace for visualizing both metrics and alerts, and an Amazon Managed Service for Prometheus workspace, configured as a datasource in the Amazon Managed Grafana workspace. + +There are two important kind of metrics that must be collected: Spark and node metrics. + +Spark metrics will bring information such as current number of workers in the cluster, or executors; shuffles, that happen when nodes exchenge data during processing; or spills, when data go from RAM to disk and from disk to RAM. To expose these metrics, Spark native Prometheus - available since version 3.0 - must be enabled through Databricks management console, and configured through a `init_script`. + +To keep track of node metrics, such as disk usage, CPU time, memory, storage performance, we use the `node_exporter`, that can be used without any further configuration, but should only expose important metrics. + +An ADOT Collector must be installed in each node of the cluster, scraping the metrics exposed by both Spark and the `node_exporter`, filtering these metrics, injecting metadata such as `cluster_name`, and sending these metrics to the Prometheus workspace. + +Both the ADOT Collector and the `node _exporter` must be installed and configured through a `init_script`. + +The Databricks cluster must be configured with an IAM Role with permission to write metrics in the Prometheus workspace. + +## Best Practices + +### Prioritize valuable metrics + +Spark and node_exporter both expose several metrics, and several formats for the same metrics. Without filtering which metrics are useful for monitoring and incident response, the mean time to detect problems increase, costs with storing samples increase, valuable information will be harder to be found and understood. Using OpenTelemetry processors, it is possible to filter and keep only valuable metrics, or filter out metrics that doesn't make sense; aggregate and calculate metrics before sending them to AMP. + +### Avoid alerting fatigue + +Once valuable metrics are being ingested into AMP, it's essential to configure alerts. However, alerting on every resource usage burst may cause alerting fatigue, that is when too much noise will decrease the confidence in alerts severity, and leave important events undetected. 
AMP alerting rules group feature should be use to avoid ambiqguity, i.e., several connected alerts generating separated notifications. Also, alerts should receive the proper severity, and it should reflect business priorities. + +### Reuse Amazon Managed Grafana dashboards + +Amazon Managed Grafana leverages Grafana native templating feature, which allow the creation for dashboards for all existing and new Databricks clusters. It removes the need of manually creating and keeping visualizations for each cluster. To use this feature, its important to have the correct labels in the metrics to group these metrics per cluster. Once again, it's possible with OpenTelemetry processors. + +## References and More Information + +- [Create Amazon Managed Service for Prometheus workspace](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-onboard-create-workspace.html) +- [Create Amazon Managed Grafana workspace](https://docs.aws.amazon.com/grafana/latest/userguide/Amazon-Managed-Grafana-create-workspace.html) +- [Configure Amazon Managed Service for Prometheus datasource](https://docs.aws.amazon.com/grafana/latest/userguide/prometheus-data-source.html) +- [Databricks Init Scripts](https://docs.databricks.com/clusters/init-scripts.html) diff --git a/docusaurus/observability-best-practices/docs/guides/rust-custom-metrics/README.md b/docusaurus/observability-best-practices/docs/guides/rust-custom-metrics/README.md new file mode 100644 index 000000000..8ffb277cd --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/rust-custom-metrics/README.md @@ -0,0 +1,322 @@ +# Creating Custom Metrics with the AWS Rust SDK + +## Introduction + +Rust, a systems programming language focused on safety, performance, and concurrency, has been gaining popularity in the software development world. Its unique approach to memory management and thread safety makes it an attractive choice for building robust and efficient applications, particularly in the cloud. With the rise of serverless architectures and the need for high-performance, scalable services, Rust's capabilities make it an excellent choice for building cloud-native applications. In this guide, we'll explore how to leverage the AWS Rust SDK to create custom CloudWatch metrics, enabling you to gain deeper insights into your applications' performance and behavior within the AWS ecosystem. + +## Pre-Requesites + +In order to use this guide we will need to install Rust and also create a CloudWatch log group and log stream to store some of our data we will use later. + +### Installing Rust + +On Mac or Linux: + +``` +curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh +``` + +On Windows, download and run [rustup-init.exe](https://static.rust-lang.org/rustup/dist/i686-pc-windows-gnu/rustup-init.exe) + +### Creating a CloudWatch Log Group and Log Stream + +1. Create the CloudWatch Log Group: + +``` +aws logs create-log-group --log-group-name rust_custom +``` + +2. Create the CloudWatch Log Stream: + +``` +aws logs create-log-stream --log-group-name rust_custom --log-stream-name diceroll_log_stream +``` + +## The Code + +You can find the complete code in the sandbox section of this repository. + +``` +git clone https://github.com/aws-observability/observability-best-practices.git +cd observability-best-practices/sandbox/rust-custom-metrics +``` + +This code will first simulate a diceroll , we will pretend that we care about the value of this diceroll as a custom metric. 
We will then show 3 different ways of adding the metric to CloudWatch and viewing it on a dashboard. + +### Setting up the Application + +First we need to import some crates to use in our application. + +```rust +use crate::cloudwatch::types::Dimension; +use crate::cloudwatchlogs::types::InputLogEvent; +use aws_sdk_cloudwatch as cloudwatch; +use aws_sdk_cloudwatch::config::BehaviorVersion; +use aws_sdk_cloudwatch::types::MetricDatum; +use aws_sdk_cloudwatchlogs as cloudwatchlogs; +use rand::prelude::*; +use serde::Serialize; +use serde_json::json; +use std::time::{SystemTime, UNIX_EPOCH}; +``` + +In this import block we mainly are importing the aws sdk libraries we will use. We also bring in the 'rand' crate so we can create a random diceroll value. Finally we have a few libraries like 'serde' and 'time' to handle some of the data creation that we use to populate our sdk calls. + +Now we can create our diceroll value in our main function, this value will be used by all 3 AWS SDK calls that we make. + +```rust +//select a random number 1-6 to represent a diceroll +let mut rng = rand::thread_rng(); +let roll_value = rng.gen_range(1..7); +``` + +Now that we have our diceroll number, let's explore 3 different ways of adding the value to CloudWatch as a custom metric. Once the value is a custom metric we gain the ability to set up alarms on the value, set up anomaly detection, plot the value on a dashboard, and much more. + +### Put Metric Data + +The first method we will use to add the value to CloudWatch is PutMetricData. By using PutMetricData we are writing the time-series value of the metric directly to CloudWatch. This is the most efficient way of adding the value. When we use PutMetricData we need to provide the namespace, as well as any dimensions to each AWS SDK call along side the metric value. Here is the code: + +First we will set up a function that takes in our metric (diceroll value) and it returns a Result type, which in Rust indicates success of failure. The first thing we do within the function is initialize our AWS Rust SDK client. Our client will inherit credentials and region from the local environment. So make sure those are configured by running `aws configure` from your command line prior to running this code. + +```rust +async fn put_metric_data(roll_value: i32) -> Result<(), cloudwatch::Error> { + //Create a reusable aws config that we can pass to our clients + let config = aws_config::load_defaults(BehaviorVersion::v2023_11_09()).await; + + //Create a cloudwatch client + let client = cloudwatch::Client::new(&config); +``` + +After initializing our client we can start to setup the input needed for our PutMetricData API call. We need to define the dimensions and then the MetricDatum itself, which is the combination of dimensions and value. + +```rust +//Use fluent builders to build the required input for pmd call, starting with dimensions. +let dimensions = Dimension::builder() + .name("roll_value_pmd_dimension") + .value(roll_value.to_string()) + .build(); + +let put_metric_data_input = MetricDatum::builder() + .metric_name("roll_value_pmd") + .dimensions(dimensions) + .value(f64::from(roll_value)) + .build(); +``` + +Finally we can make the PutMetricData API call using the input we defined previously. + +```rust +let response = client + .put_metric_data() + .namespace("rust_custom_metrics") + .metric_data(put_metric_data_input) + .send() + .await?; +println!("Metric Submitted: {:?}", response); +Ok(()) +``` +Notice that the sdk call is in an async function. 
Since the function completes asynchronously, we need to `await` it's completion. Then we return the Result type as defined in the top level of our function. + +When it's time to call our function from main it will just look like this: + +```rust +//call the put_metric_data function with the roll value +println!("First we will write a custom metric with PutMetricData API call"); +put_metric_data(roll_value).await.unwrap(); +``` +Again we are awaiting the function call to complete and then we `unwrap` the value as in our case we are only interested in the 'Ok' result and not the error. In a production scenario you would likely error handle in a different way. + +### PutLogEvent + Metric Filter + +The next way to create a custom metric is to simply write it to a CloudWatch log group. Once the metric is in a CloudWatch log group we can use a [Metric Filter](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringPolicyExamples.html) to extract the metric data from the log data. + +First we will define a struct for our log messages. This is optional, as we could just manually build a json. But in a more complex application you would likely want this logging struct for re-usability. + +```rust +//Make a simple struct for the log message. We could also just create a json string manually. +#[derive(Serialize)] +struct DicerollValue { + welcome_message: String, + roll_value: i32, +} +``` + +Once our struct is defined we are ready to make our AWS API call. Again we will create an API client, this time using the logs sdk. We will also define the system time using unix epoch timing. + +```rust +//Create a reusable aws config that we can pass to our clients +let config = aws_config::load_defaults(BehaviorVersion::v2023_11_09()).await; + +//Create a cloudwatch logs client +let client = cloudwatchlogs::Client::new(&config); + +//Let's get the time in ms from unix epoch, this is required for CWlogs +let time_now = SystemTime::now() + .duration_since(UNIX_EPOCH) + .unwrap() + .as_millis() as i64; +``` + +First we will create json from a new instantiation of our struct we defined earlier. Then use this to create a log event. + +```rust +let log_json = json!(DicerollValue { + welcome_message: String::from("Hello from rust!"), + roll_value +}); + +let log_event = InputLogEvent::builder() + .timestamp(time_now) + .message(log_json.to_string()) + .build(); +``` + +Now we can complete our API call in a similar way to what we did with PutMetricData + +```rust +let response = client + .put_log_events() + .log_group_name("rust_custom") + .log_stream_name("diceroll_log_stream") + .log_events(log_event.unwrap()) + .send() + .await?; + +println!("Log event submitted: {:?}", response); +Ok(()) +``` + +Once the log event has been submitted, we need to go to CloudWatch and create a Metric Filter for the log group to properly extract the metric. + +In the CloudWatch console go to the rust_custom log group that we created. Then create a metric filter. The filter pattern should be `{$.roll_value = *}` . Then for the Metric Value use `$.roll_value` . You can use any namespace and metric name that you like. This Metric Filter can be explained like so: + +"Trigger the filter whenever we get a field called 'roll_value', no matter what the value is. Once triggered, use the 'roll_value' as the number to write to CloudWatch Metrics". + +This way of creating metrics is very powerful for extracting time series values from log-data when you do not have control over the log formatting. 
Since we are directly instrumenting code, we do have control over the format of our log data, therefore a better method may be to use CloudWatch Embedded Metric Format, which we will discuss in the next step. + +### PutLogEvent + Embedded Metric Format + +CloudWatch [Embedded Metric Format](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html)(EMF) is a way of embedding time series metrics directly in your logs. CloudWatch will then extract the metrics without the need for Metric Filters. Let's take a look at the code. + +Create a logs client again along with grabbing system time in unix epoch. + +```rust +//Create a reusable aws config that we can pass to our clients +let config = aws_config::load_defaults(BehaviorVersion::v2023_11_09()).await; + +//Create a cloudwatch logs client +let client = cloudwatchlogs::Client::new(&config); + +//get the time in unix epoch ms +let time_now = SystemTime::now() + .duration_since(UNIX_EPOCH) + .unwrap() + .as_millis() as i64; +``` + +Now we can create our EMF json string. This needs to have all the data required for CloudWatch to create the custom metric, so we embed the namespace, dimensions, and value in the string. + +```rust +//Create a json string in embedded metric format with our diceroll value. +let json_emf = json!( + { + "_aws": { + "Timestamp": time_now, + "CloudWatchMetrics": [ + { + "Namespace": "rust_custom_metrics", + "Dimensions": [["roll_value_emf_dimension"]], + "Metrics": [ + { + "Name": "roll_value_emf" + } + ] + } + ] + }, + "roll_value_emf_dimension": roll_value.to_string(), + "roll_value_emf": roll_value + } +); +``` + +Notice how we actually create a dimension out of our roll value as well as using it for the value. This let's us perform a GroupBy on the roll value so we can see how many times each roll value was landed on. + +Now we can make the API call to write the log event just like we did before: + +```rust +let log_event = InputLogEvent::builder() + .timestamp(time_now) + .message(json_emf.to_string()) + .build(); + +let response = client + .put_log_events() + .log_group_name("rust_custom") + .log_stream_name("diceroll_log_stream_emf") + .log_events(log_event.unwrap()) + .send() + .await?; + +println!("EMF Log event submitted: {:?}", response); +Ok(()) +``` + +Once the log event is submitted to CloudWatch, the metric will be extracted without any need for a metric filter. This is a great way of creating high-cardinality metrics where it may be easier to write these values as log messages instead of doing a PutMetricData API call with all the different dimensions. 
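If you want to inspect the raw EMF events behind the extracted metric, a CloudWatch Logs Insights query along these lines is one way to do it; this query is a sketch that reuses the field names from the example above and is not part of the sample project.

```
fields @timestamp, roll_value_emf
| filter ispresent(roll_value_emf)
| stats count(*) by roll_value_emf_dimension
```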
+ +### Putting it all together + +Our final main function will call all three API calls like this + +```rust +#[::tokio::main] +async fn main() { + println!("Let's have some fun by creating custom metrics with the Rust SDK"); + + //select a random number 1-6 to represent a dicerolll + let mut rng = rand::thread_rng(); + let roll_value = rng.gen_range(1..7); + + //call the put_metric_data function with the roll value + println!("First we will write a custom metric with PutMetricData API call"); + put_metric_data(roll_value).await.unwrap(); + + println!("Now let's write a log event, which we will then extract a custom metric from."); + //call the put_log_data function with the roll value + put_log_event(roll_value).await.unwrap(); + + //call the put_log_emf function with the roll value + println!("Now we will put a log event with embedded metric format to directly submit the custom metric."); + put_log_event_emf(roll_value).await.unwrap(); +} +``` + +In order to generate some test data, we can build the application and then run it in a loop to generate some data to view in CloudWatch. From the root directory run the following + +``` +cargo build +``` + +Now we will run it 50 times with a 2 second sleep. The sleep is just to space the metrics out a little bit to make them easier to view in a CloudWatch Dashboard. + +``` +for run in {1..50}; do ./target/debug/custom-metrics; sleep 2; done +``` + +Now we can review the results in CloudWatch. I like to do a GroupBy on the dimensions, this lets me see how much each time the roll value was selected. The metric insights query should look like this. Change up the metric name and dimension name based on if you changed anything. + +``` +SELECT COUNT(roll_value_emf) FROM rust_custom_metrics GROUP BY roll_value_emf_dimension +``` + +Now we can put them all three on a dashboard and see as expected the same graph. + +![dashboard](./dashboard.png) + +## Cleanup + +Make sure to delete your `rust_custom` log group. + +``` +aws logs delete-log-group --log-group-name rust_custom +``` \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/guides/rust-custom-metrics/dashboard.png b/docusaurus/observability-best-practices/docs/guides/rust-custom-metrics/dashboard.png new file mode 100644 index 000000000..24753d7d1 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/guides/rust-custom-metrics/dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/guides/serverless/aws-native/lambda-based-observability.md b/docusaurus/observability-best-practices/docs/guides/serverless/aws-native/lambda-based-observability.md new file mode 100644 index 000000000..1c47f4a4e --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/serverless/aws-native/lambda-based-observability.md @@ -0,0 +1,325 @@ +# AWS Lambda based Serverless Observability + +In the world of distributed systems and serverless computing, achieving observability is the key to ensuring application reliability and performance. It involves more than traditional monitoring. By leveraging AWS observability tools like Amazon CloudWatch and AWS X-Ray, you can gain insights into your serverless applications, troubleshoot issues, and optimize application performance. In this guide, we will learn essential concepts, tools and best practices to implement Observability of your Lambda based serverless application. 
The first step before you implement observability for your infrastructure or application is to determine your key objectives. These could be enhanced user experience, increased developer productivity, meeting service level objectives (SLOs), increasing business revenue, or any other objective specific to your application type. Clearly define these key objectives and establish how you will measure them, then work backwards from there to design your observability strategy. Refer to “[Monitor what matters](https://aws-observability.github.io/observability-best-practices/guides/#monitor-what-matters)” to learn more.

## Pillars of Observability

There are three main pillars of observability:

* Logs: Timestamped records of discrete events that happened within an application or system, such as a failure, an error, or a state transformation
* Metrics: Numeric data measured at various time intervals (time series data); SLIs (request rate, error rate, duration, CPU%, etc.)
* Traces: A trace represents a single user’s journey across multiple applications and systems (usually microservices)

AWS offers both native and open source tools to facilitate logging, monitoring metrics, and tracing, so you can obtain actionable insights for your AWS Lambda application.

## **Logs**

In this section of the observability best practices guide, we will deep dive into the following topics:

* Unstructured vs structured logs
* CloudWatch Logs Insights
* Logging correlation Id
* Code Sample using Lambda Powertools
* Log visualization using CloudWatch Dashboards
* CloudWatch Logs Retention

Logs are discrete events that have occurred within your application. These can include events such as failures, errors, or execution paths. Logs can be recorded in unstructured, semi-structured, or structured formats.

### **Unstructured vs structured logs**

We often see developers start with simple log messages within their application using `print` or `console.log` statements. These are difficult to parse and analyze programmatically at scale, particularly in AWS Lambda based applications that can generate many lines of log messages across different log groups. As a result, consolidating these logs in CloudWatch becomes challenging, and they are hard to analyze. You would need to rely on text matching or regular expressions to find relevant information in the logs. Here is an example of what unstructured logging looks like:

```
[2023-07-19T19:59:07Z] INFO Request started
[2023-07-19T19:59:07Z] INFO AccessDenied: Could not access resource
[2023-07-19T19:59:08Z] INFO Request finished
```

As you can see, the log messages lack a consistent structure, making it challenging to get useful insights from them. It is also hard to add contextual information to them.

Structured logging, by contrast, is a way to log information in a consistent format, often JSON, that allows logs to be treated as data rather than text, which makes querying and filtering simple. It gives developers the ability to efficiently store, retrieve, and analyze logs programmatically, and it facilitates better debugging. Structured logging also provides a simpler way to modify the verbosity of logs across different environments through log levels. **Pay attention to logging levels.** Logging too much will increase costs and decrease application throughput. Ensure personally identifiable information is redacted before logging.
Here is an example of what structured logging looks like:

```
{
    "correlationId": "9ac54d82-75e0-4f0d-ae3c-e84ca400b3bd",
    "requestId": "58d9c96e-ae9f-43db-a353-c48e7a70bfa8",
    "level": "INFO",
    "message": "AccessDenied",
    "function-name": "demo-observability-function",
    "cold-start": true
}
```

**`Prefer structured and centralized logging into CloudWatch logs`** to emit operational information about transactions, correlation identifiers across different components, and business outcomes from your application.

### **CloudWatch Logs Insights**
Use CloudWatch Logs Insights, which can automatically discover fields in JSON formatted logs. In addition, JSON logs can be extended to log custom metadata specific to your application that can be used to search, filter, and aggregate your logs.

### **Logging correlation Id**

Distributed systems often involve multiple services and components working together to handle a request, so logging a correlation Id and passing it to downstream systems is crucial for end-to-end tracing and debugging. A correlation Id is a unique identifier assigned to a request at the very beginning. As the request moves through different services, the correlation Id is included in the logs, allowing you to trace the entire path of the request. For example, for an HTTP request coming in from API Gateway, the correlation Id is available at the `requestContext.requestId` path, from where it can be extracted and logged in the downstream Lambda functions. You can either manually insert the correlation Id into your AWS Lambda logs or use tools like [AWS Lambda powertools](https://docs.powertools.aws.dev/lambda/python/latest/core/logger/#setting-a-correlation-id) to grab the correlation Id from API Gateway and log it along with your application logs.

### **Code Sample using Lambda Powertools**
As a best practice, generate a correlation Id as early as possible in the request lifecycle, preferably at the entry point of your serverless application, such as API Gateway or an Application Load Balancer. Use UUIDs, request ids, or any other unique attribute that can be used to track the request across distributed systems. Pass the correlation Id along with each request, either as part of a custom header, the body, or metadata. Ensure that the correlation Id is included in all the log entries and traces in your downstream services.

You can either manually capture and include the correlation Id as part of your Lambda function logs or use tools like [AWS Lambda Powertools](https://docs.powertools.aws.dev/lambda/python/latest/core/logger/#setting-a-correlation-id). With Lambda Powertools, you can easily grab the correlation Id from the predefined request [path mapping](https://github.com/aws-powertools/powertools-lambda-python/blob/08a0a7b68d2844d36c33ab8156640f4ea9632d0c/aws_lambda_powertools/logging/correlation_paths.py) for supported upstream services and automatically add it alongside your application logs. Also, ensure that the correlation Id is added to all your error messages so you can easily debug, identify the root cause of failures, and tie them back to the original request.
Let's look at a code sample that demonstrates structured logging with a correlation Id, and then view it in CloudWatch, for the serverless architecture below:

![architecture](../../../images/Serverless/aws-native/apigw_lambda.png)

```
// Initializing Logger
Logger log = LogManager.getLogger();

// The @Logging annotation from Lambda Powertools takes an optional parameter correlationIdPath to extract the correlation Id from the API Gateway request and insert correlation_id into the Lambda function logs in a structured format.
@Logging(correlationIdPath = "/headers/path-to-correlation-id")
public APIGatewayProxyResponseEvent handleRequest(final APIGatewayProxyRequestEvent input, final Context context) {
    ...
    // The log statement below will also carry the additional correlation_id
    log.info("Success");
    ...
}
```

In this example, a Java based Lambda function uses the Lambda Powertools library to log the `correlation_id` coming in from the API Gateway request.

Sample CloudWatch logs for the code sample:

```
{
    "level": "INFO",
    "message": "Success",
    "function-name": "demo-observability-function",
    "cold-start": true,
    "lambda_request_id": "52fdfc07-2182-154f-163f-5f0f9a621d72",
    "correlation_id": ""
}
```

### **Log visualization using CloudWatch Dashboards**

Once you log the data in structured JSON format, [CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) automatically discovers values in the JSON output and parses the messages as fields. CloudWatch Logs Insights provides a purpose-built [SQL-like query](https://serverlessland.com/snippets?type=CloudWatch+Logs+Insights) language to search and filter multiple log streams. You can perform queries over multiple log groups using glob and regular expression pattern matching. In addition, you can write custom queries and save them, so you can re-run them without having to re-create them each time.

![CloudWatch Dashboard](../../../images/Serverless/aws-native/cw_dashboard.png)
In CloudWatch Logs Insights, you can generate visualizations like line charts, bar charts, and stacked area charts from your queries with one or more aggregation functions. You can then easily add these visualizations to CloudWatch Dashboards. The sample dashboard below shows a percentile report of a Lambda function’s execution duration. Such dashboards quickly show you where to focus to improve application performance. Average latency is a good metric to look at, but **`you should aim to optimize for p99 and not the average latency.`**

![CloudWatch Dashboard](../../../images/Serverless/aws-native/cw_percentile.png)
To send (platform, function, and extension) logs to locations other than CloudWatch, you can use the [Lambda Telemetry API](https://docs.aws.amazon.com/lambda/latest/dg/telemetry-api.html) with Lambda Extensions. A number of [partner solutions](https://docs.aws.amazon.com/lambda/latest/dg/extensions-api-partners.html) provide Lambda layers which use the Lambda Telemetry API and make integration with their systems easier.

To make the best use of CloudWatch Logs Insights, think about what data you should be ingesting into your logs in the form of structured logging, which will then help you better monitor the health of your application.


### **CloudWatch Logs Retention**

By default, all messages that are written to stdout in your Lambda function are saved to an Amazon CloudWatch log stream.
The Lambda function's execution role should have permission to create CloudWatch log streams and write log events to those streams. It is important to be aware that CloudWatch is billed by the amount of data ingested and the storage used, so reducing the amount of logging will help you minimize the associated cost. **`By default CloudWatch logs are kept indefinitely and never expire. It is recommended to configure a log retention policy to reduce log-storage costs`**, and apply it across all your log groups. You might want differing retention policies per environment. Log retention can be configured manually in the AWS console, but to ensure consistency and best practices, you should configure it as part of your Infrastructure as Code (IaC) deployments. Below is a sample CloudFormation template that demonstrates how to configure log retention for a Lambda function:

```
Resources:
  Function:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: .
      Runtime: python3.8
      Handler: main.handler
      Tracing: Active

  # Explicit log group that refers to the Lambda function
  LogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/lambda/${Function}"
      # Explicit retention time
      RetentionInDays: 7
```

In this example, we created a Lambda function and a corresponding log group. The **`RetentionInDays`** property is **set to 7 days**, meaning that logs in this log group will be retained for 7 days before they are automatically deleted, thus helping to control log storage cost.


## **Metrics**

In this section of the observability best practices guide, we will deep dive into the following topics:

* Monitor and alert on out-of-the-box metrics
* Publish custom metrics
* Use embedded metrics to auto-generate metrics from your logs
* Use CloudWatch Lambda Insights to monitor system-level metrics
* Creating CloudWatch Alarms

### **Monitor and alert on out-of-the-box metrics**

Metrics are numeric data measured at various time intervals (time series data) and service-level indicators (request rate, error rate, duration, CPU, etc.). AWS services provide a number of out-of-the-box standard metrics to help monitor the operational health of your application. Establish the key metrics applicable to your application and use them to monitor its performance. Examples of key metrics include function errors, queue depth, failed state machine executions, and API response times.

One challenge with out-of-the-box metrics is knowing how to analyze them in a CloudWatch dashboard. For example, when looking at concurrency, do I look at the max, average, or a percentile? The right statistic to monitor differs for each metric.

As a best practice, for the Lambda function’s `ConcurrentExecutions` metric, look at the `Count` statistic to check whether it is getting close to the account and regional limit, or close to the Lambda reserved concurrency limit if applicable.
For the `Duration` metric, which indicates how long your function takes to process an event, look at the `Average` or `Max` statistic. For measuring the latency of your API, look at the `Percentile` statistics for API Gateway’s `Latency` metric. P50, P90, and P99 are much better ways of monitoring latency than averages.

Once you know which metrics to monitor, configure alerts on these key metrics to engage you when components of your application are unhealthy. For example:

* For AWS Lambda, alert on Duration, Errors, Throttling, and ConcurrentExecutions.
For stream-based invocations, alert on IteratorAge. For Asynchronous invocations, alert on DeadLetterErrors. +* For Amazon API Gateway, alert on IntegrationLatency, Latency, 5XXError, 4XXError +* For Amazon SQS, alert on ApproximateAgeOfOldestMessage, ApproximateNumberOfMessageVisible +* For AWS Step Functions, alert on ExecutionThrottled, ExecutionsFailed, ExecutionsTimedOut + +### **Publish custom metrics** + +Identify key performance indicators (KPIs) based on desired business and customer outcomes for your application. Evaluate KPIs to determine application success and operational health. Key metrics may vary depending on the type of application, but examples include site visited, orders placed, flights purchased, page load time, unique visitors etc. + +One way to publish custom metrics to AWS CloudWatch is by calling CloudWatch metrics SDK’s `putMetricData` API. However, `putMetricData` API call is synchronous. It will increase the duration of your Lambda function and it can potentially block other API calls in your application, leading to performance bottlenecks. Also, longer execution duration of your Lambda function will attribute towards higher cost. Additionally you are charged for both the number of custom metrics that are sent to CloudWatch and the number of API calls (i.e. PutMetricData API calls) that are made. + +**`A more efficient and cost-effective way to publish custom metrics is with`** [CloudWatch Embedded Metrics Format](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html) (EMF). The CloudWatch Embedded Metric format allows you to generate custom metrics **`asynchronously`** as logs written to CloudWatch logs, resulting in improved performance of your application at a lower cost. With EMF, you can embed custom metrics alongside detailed log event data, and CloudWatch automatically extracts these custom metrics so that you can visualize and set alarm on them as you would do out-of-the-box metrics. By sending logs in the embedded metric format, you can query it using [CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html), and you only pay for the query, not the cost of the metrics. + +To achieve this, you can generate the logs using [EMF specification](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html), and send them to CloudWatch using `PutLogEvents` API. To simplify the process, there are **two client libraries that support the creation of metrics in the EMF** **format**. + +* Low level client libraries ([aws-embedded-metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Libraries.html)) +* Lambda Powertools [Metrics](https://docs.powertools.aws.dev/lambda/java/core/metrics/). + + +### **Use [CloudWatch Lambda Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Lambda-Insights.html) to monitor system-level metrics** + +CloudWatch Lambda insights provides you system-level metrics, including CPU time, memory usage, disk utilization, and network performance. Lambda Insights also collects, aggregates, and summarizes diagnostic information, such as **`cold starts`** and Lambda worker shutdowns. Lambda Insights leverages CloudWatch Lambda extension, which is packaged as a Lambda layer. 
Once enabled, it collects system-level metrics and emits a single performance log event to CloudWatch Logs for every invocation of that Lambda function in the embedded metrics format. + +:::note + CloudWatch Lambda Insights is not enabled by default and needs to be turned on per Lambda function. +::: + +You can enable it via AWS console or via Infrastructure as Code (IaC). Here is an example of how to enable it using the AWS serverless application model (SAM). You add `LambdaInsightsExtension` extension Layer to your Lambda function, and also add managed IAM policy `CloudWatchLambdaInsightsExecutionRolePolicy`, which gives permissions to your Lambda function to create log stream and call `PutLogEvents` API to be able to write logs to it. + +``` +// Add LambdaInsightsExtension Layer to your function resource +Resources: + MyFunction: + Type: AWS::Serverless::Function + Properties: + Layers: + - !Sub "arn:aws:lambda:${AWS::Region}:580247275435:layer:LambdaInsightsExtension:14" + +// Add IAM policy to enable Lambda function to write logs to CloudWatch +Resources: + MyFunction: + Type: AWS::Serverless::Function + Properties: + Policies: + - `CloudWatchLambdaInsightsExecutionRolePolicy` +``` + +You can then use CloudWatch console to view these system-level performance metrics under Lambda Insights. + + +![Lambda Insights](../../../images/Serverless/aws-native/lambda_insights.png) + +### **Creating CloudWatch Alarms** +Creating CloudWatch Alarms and take necessary actions when metrics go off is a critical part of observability. Amazon [CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) are used to alert you or automate remediation actions when application and infrastructure metrics exceed static or dynamically set thresholds. + +To set up an alarm for a metric, you select a threshold value that triggers a set of actions. A fixed threshold value is known as a static threshold. For instance, you can configure an alarm on `Throttles` metrics from Lambda function to activate if it exceeds 10% of the time within a 5-min period. This could potentially mean that Lambda function has reached its max concurrency for your account and region. + +In a serverless application, it is common to send an alert using SNS (Simple Notification Service). This enables users to receive alerts via email, SMS, or other channels. Additionally, you can subscribe a Lambda function to the SNS topic, allowing it to auto remediate any issues which caused the alarm to go off. + +For example, Let’s say you have a Lambda function A, which is polling an SQS queue and calling a downstream service. If downstream service is down and not responding, Lambda function will continue to poll from SQS and try calling downstream service with failures. While you can monitor these errors and generate a CloudWatch alarm using SNS to notify appropriate team, you can also call another Lambda function B (via SNS subscription), which can disable the event-source-mapping for the Lambda function A and thus stopping it from polling SQS queue, until the downstream service is back up and running. + +While setting up alarms on an individual metric is good, sometimes monitoring multiple metrics becomes necessary to better understand the operational health and performance of your application. In such a scenario, you should setup alarms based on multiple metrics using [metric math](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html) expression. 
For example, if you want to monitor AWS Lambda errors but allow a small number of errors without triggering your alarm, you can create an error rate expression in the form of a percentage, i.e., ErrorRate = errors / invocations * 100, then create an alarm to send an alert if the ErrorRate goes above 20% within the configured evaluation period.


## **Tracing**

In this section of the observability best practices guide, we will deep dive into the following topics:

* Introduction to distributed tracing and AWS X-Ray
* Apply appropriate sampling rule
* Use X-Ray SDK to trace interaction with other services
* Code Sample for tracing integrated services using X-Ray SDK

### Introduction to distributed tracing and AWS X-Ray

Most serverless applications consist of multiple microservices, each using multiple AWS services. Due to the nature of serverless architectures, it’s crucial to have distributed tracing. For effective performance monitoring and error tracking, it is important to trace transactions across the entire application flow, from the source caller through all the downstream services. While it’s possible to achieve this using individual services’ logs, it’s faster and more efficient to use a tracing tool like AWS X-Ray. See [Instrumenting your application with AWS X-Ray](https://docs.aws.amazon.com/xray/latest/devguide/xray-instrumenting-your-app.html) for more information.

AWS X-Ray enables you to trace requests as they flow through the involved microservices. X-Ray service maps enable you to understand the different integration points and identify any performance degradation of your application. You can quickly isolate which component of your application is causing errors, throttling, or latency issues with just a few clicks. Under the service graph, you can also inspect individual traces to pinpoint the exact duration taken by each microservice.

![X-Ray Trace](../../../images/Serverless/aws-native/xray_trace.png)

**`As a best practice, create custom subsegments in your code for downstream calls`** or any specific functionality that requires monitoring. For instance, you can create a subsegment to monitor a call to an external HTTP API, or an SQL database query.

For example, to create a custom subsegment for a function that makes calls to downstream services, use the `captureAsyncFunc` function (in Node.js):

```
var AWSXRay = require('aws-xray-sdk');

app.use(AWSXRay.express.openSegment('MyApp'));

app.get('/', function (req, res) {
  var host = 'api.example.com';

  // start of the subsegment
  AWSXRay.captureAsyncFunc('send', function(subsegment) {
    sendRequest(host, function() {
      console.log('rendering!');
      res.render('index');

      // end of the subsegment
      subsegment.close();
    });
  });
});
```

In this example, the application creates a custom subsegment named `send` for calls to the `sendRequest` function. `captureAsyncFunc` passes a subsegment that you must close within the callback function when the asynchronous calls that it makes are complete.


### **Apply appropriate sampling rule**

The AWS X-Ray SDK does not trace all requests by default. It applies a conservative sampling rule to provide a representative sample of the requests without incurring high costs. However, you can [customize](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-sampling.html#xray-console-config) the default sampling rule or disable sampling altogether and start tracing all your requests based on your specific requirements.
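As a rough sketch of what customized rules can look like, the following JSON uses the local sampling rules format accepted by the X-Ray SDKs (centralized rules configured in the X-Ray console take equivalent settings); the paths and rates are illustrative assumptions, not recommendations.

```json
{
  "version": 2,
  "rules": [
    {
      "description": "Do not sample high-volume health checks",
      "host": "*",
      "http_method": "GET",
      "url_path": "/health",
      "fixed_target": 0,
      "rate": 0.0
    }
  ],
  "default": {
    "fixed_target": 1,
    "rate": 0.05
  }
}
```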
+ +It’s important to note that AWS X-Ray is not intended to be used as an audit or compliance tool. You should consider having **`different sampling rate for different type of application`**. For instance, high-volume read-only calls, like background polling, or health checks can be sampled at a lower rate while still providing enough data to identify any potential issues that may arise. You may also want to have **`different sampling rate per environment`**. For instance, in your development environment, you may want all your requests to be traced to troubleshoot any errors or performance issues easily, whereas for production environment you may have lower number of traces. **`You should also keep in mind that extensive tracing can result in increased cost`**. For more information about sampling rules, see [_Configuring sampling rules in the X-Ray console_](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-sampling.html). + +### **Use X-Ray SDK to trace interaction with other AWS services** + +While X-Ray tracing can be easily enabled for services like AWS Lambda and Amazon API Gateway, with just few clicks or few lines on your IaC tool, other services require additional steps to instrument their code. Here is the complete [list of AWS Services integrated with X-Ray](https://docs.aws.amazon.com/xray/latest/devguide/xray-services.html). + +To instrument calls to the services which are not integrated with X-Ray, such as DynamoDB, you can capture traces by wrapping AWS SDK calls with the AWS X-Ray SDK. For instance, when using node.js, you can follow below code example to capture all AWS SDK calls: + +### **Code sample for tracing integrated services using X-Ray SDK** + +``` +//... FROM (old code) +const AWS = require('aws-sdk'); + +//... TO (new code) +const AWSXRay = require('aws-xray-sdk-core'); +const AWS = AWSXRay.captureAWS(require('aws-sdk')); +... +``` + +:::note + To instrument individual clients wrap your AWS SDK client in a call to `AWSXRay.captureAWSClient`. Do not use both `captureAWS` and `captureAWSClient` together. This will lead to duplicate traces. +::: + +## **Additional Resources** + +[CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) + +[CloudWatch Lambda Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Lambda-Insights.html) + +[Embedded Metrics Library](https://github.com/awslabs/aws-embedded-metrics-java) + + +## Summary + +In this observability best practice guide for AWS Lambda based serverless application, we highlighted critical aspects such as logging, metrics and tracing using Native AWS services such as Amazon CloudWatch and AWS X-Ray. We recommended using AWS Lambda Powertools library to easily add observability best practices to your application. By adopting these best practices, you can unlock valuable insights into your serverless application, enabling faster error detection and performance optimization. + +For further deep dive, we would highly recommend you to practice AWS Native Observability module of AWS [One Observability Workshop](https://catalog.workshops.aws/observability/en-US). 
+ + + + + + + diff --git a/docusaurus/observability-best-practices/docs/guides/serverless/oss/lambda-based-observability-adot.md b/docusaurus/observability-best-practices/docs/guides/serverless/oss/lambda-based-observability-adot.md new file mode 100644 index 000000000..63dc84817 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/serverless/oss/lambda-based-observability-adot.md @@ -0,0 +1,200 @@ +# AWS Lambda based Serverless Observability with OpenTelemetry + +This guide covers the best practices on configuring observability for Lambda based serverless applications using managed open-source tools and technologies together with the native AWS monitoring services such as AWS X-Ray, and Amazon CloudWatch. We will cover tools such as [AWS Distro for OpenTelemetry (ADOT)](https://aws-otel.github.io/docs/introduction), [AWS X-Ray](https://aws.amazon.com/xray), and [Amazon Managed Service for Prometheus (AMP)](https://aws.amazon.com/prometheus/) and how you can use these tools to gain actionable insights into your serverless applications, troubleshoot issues, and optimize application performance. + +## **Key topics covered** + +In this section of the observability best practices guide, we will deep dive on to following topics: + +* Introduction to AWS Distro for OpenTelemetry (ADOT) and ADOT Lambda Layer +* Auto-instrumentation Lambda function using ADOT Lambda Layer +* Custom configuration support for ADOT Collector +* Integration with Amazon Managed Service for Prometheus (AMP) +* Pros and cons of using ADOT Lambda Layer +* Managing cold start latency when using ADOT + + +## **Introduction to AWS Distro for OpenTelemetry (ADOT)** + +[AWS Distro for OpenTelemetry (ADOT)](https://aws-otel.github.io/docs/introduction) is a secure, production-ready, AWS-supported distribution of the Cloud Native Computing Foundation (CNCF) [OpenTelemetry (OTel)](https://opentelemetry.io/) project. Using ADOT, you can instrument your applications just once and send correlated metrics and traces to multiple monitoring solutions. + +AWS's managed [OpenTelemetry Lambda Layer](https://aws-otel.github.io/docs/getting-started/lambda) utilizes [OpenTelemetry Lambda Layer](https://github.com/open-telemetry/opentelemetry-lambda) to export telemetry data asynchronously from AWS Lambda. It provides plug-and-play user experience by wrapping an AWS Lambda function, and by packaging the OpenTelemetry runtime specific SDK, trimmed down version of ADOT collector together with an out-of-the-box configuration for auto-instrumenting AWS Lambda functions. ADOT Lambda Layer collector components, such as Receivers, Exporters, and Extensions support integration with Amazon CloudWatch, Amazon OpenSearch Service, Amazon Managed Service for Prometheus, AWS X-Ray, and others. Find the complete list [here](https://github.com/aws-observability/aws-otel-lambda). ADOT also supports integrations with [partner solutions](https://aws.amazon.com/otel/partners). + +ADOT Lambda Layer supports both auto-instrumentation (for Python, NodeJS, and Java) as well as custom instrumentation for any specific set of libraries and SDKs. With auto-instrumentation, by default, the Lambda Layer is configured to export traces to AWS X-Ray. For custom instrumentation, you will need to include the corresponding library instrumentation from the respective [OpenTelemetry runtime instrumentation repository](https://github.com/open-telemetry) and modify your code to initialize it in your function. 
+ +## **Auto-instrumentation using ADOT Lambda Layer with AWS Lambda** + +You can easily enable auto-instrumentation of Lambda function using ADOT Lambda Layer without any code changes. Let’s take an example of adding ADOT Lambda layer to your existing Java based Lambda function and view execution logs and traces in CloudWatch. + +1. Choose the ARN of the Lambda Layer based on the `runtime`, `region` and the `arch type` as per the [documentation](https://aws-otel.github.io/docs/getting-started/lambda). Make sure you use the Lambda Layer in the same region as your Lambda function. For example, Lambda Layer for java auto-instrumentation would be `arn:aws:lambda:us-east-1:901920570463:layer:aws-otel-java-agent-x86_64-ver-1-28-1:1` +2. Add Layer to your Lambda function either via Console of IaC of your choice. + * With AWS Console, follow the [instructions](https://docs.aws.amazon.com/lambda/latest/dg/adding-layers.html) to add Layer to your Lambda function. Under Specify an ARN paste the layer ARN selected above. + * With IaC option, SAM template for Lambda function would look like this: + ``` + Layers: + - !Sub arn:aws:lambda:${AWS::Region}:901920570463:layer:aws-otel-java-agent-arm64-ver-1-28-1:1 + ``` +3. Add an environment variable `AWS_LAMBDA_EXEC_WRAPPER=/opt/otel-handler` for Node.js or Java, and `AWS_LAMBDA_EXEC_WRAPPER=/opt/otel-instrument` for Python to your Lambda function. +4. Enable Active Tracing for your Lambda function. **`Note`** that by default, the layer is configured to export traces to AWS X-Ray. Make sure your Lambda function’s execution role has the required AWS X-Ray permissions. For more on AWS X-Ray permissions for AWS Lambda, see the [AWS Lambda documentation](https://docs.aws.amazon.com/lambda/latest/dg/services-xray.html#services-xray-permissions). + * `Tracing: Active` +5. Example SAM template with Lambda Layer configuration, Environment Variable, and X-Ray tracing would look something like this: +``` +Resources: + ListBucketsFunction: + Type: AWS::Serverless::Function + Properties: + Handler: com.example.App::handleRequest + ... + ProvisionedConcurrencyConfig: + ProvisionedConcurrentExecutions: 1 + Policies: + - AWSXrayWriteOnlyAccess + - AmazonS3ReadOnlyAccess + Environment: + Variables: + AWS_LAMBDA_EXEC_WRAPPER: /opt/otel-handler + Tracing: Active + Layers: + - !Sub arn:aws:lambda:${AWS::Region}:901920570463:layer:aws-otel-java-agent-amd64-ver-1-28-1:1 + Events: + HelloWorld: + Type: Api + Properties: + Path: /listBuckets + Method: get +``` +6. Testing and Visualizing traces in AWS X-Ray +Invoke your Lambda function either directly or via an API (if an API is configured as a trigger). For example, invoking Lambda function via API (using `curl`) would generate logs as below: +``` +curl -X GET https://XXXXXX.execute-api.us-east-1.amazonaws.com/Prod/listBuckets +``` +Lambda function logs: +

+OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
+[otel.javaagent 2023-09-24 15:28:16:862 +0000] [main] INFO io.opentelemetry.javaagent.tooling.VersionLogger - opentelemetry-javaagent - version: 1.28.0-adot-lambda1-aws
+EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]
+START RequestId: ed8f8444-3c29-40fe-a4a1-aca7af8cd940 Version: 3
+...
+END RequestId: ed8f8444-3c29-40fe-a4a1-aca7af8cd940
+REPORT RequestId: ed8f8444-3c29-40fe-a4a1-aca7af8cd940 Duration: 5144.38 ms Billed Duration: 5145 ms Memory Size: 1024 MB Max Memory Used: 345 MB Init Duration: 27769.64 ms
+XRAY TraceId: 1-65105691-384f7da75714148655fa631b SegmentId: 2c52a147021ebd20 Sampled: true
+
+ +As you can see from the logs, OpenTelemetry Lambda extension starts listening and instrumenting Lambda functions using opentelemetry-javaagent and generates traces in AWS X-Ray. + +To view the traces from the above Lambda function invocation, navigate to the AWS X-Ray console and select the trace id under Traces. You should see a Trace Map along with Segments Timeline as below: +![Lambda Insights](../../../images/Serverless/oss/xray-trace.png) + + +## **Custom configuration support for ADOT Collector** + +The ADOT Lambda Layer combines both OpenTelemetry SDK and the ADOT Collector components. The configuration of the ADOT Collector follows the OpenTelemetry standard. By default, the ADOT Lambda Layer uses [config.yaml](https://github.com/aws-observability/aws-otel-lambda/blob/main/adot/collector/config.yaml), which exports telemetry data to AWS X-Ray. However, ADOT Lambda Layer also supports other exporters, which enables you to send metrics and traces to other destinations. Find the complete list of available components supported for custom configuration [here](https://github.com/aws-observability/aws-otel-lambda/blob/main/README.md#adot-lambda-layer-available-components). + +## **Integration with Amazon Managed Service for Prometheus (AMP)** + +You can use custom collector configuration to export metrics from your Lambda function to Amazon Managed Prometheus (AMP). + +1. Follow the steps from auto-instrumentation above, to configure Lambda Layer, set Environment variable `AWS_LAMBDA_EXEC_WRAPPER`. +2. Follow the [instructions](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-onboard-create-workspace.html) to create Amazon Manager Prometheus workspace in your AWS account, where your Lambda function will be sending metrics to. Make a note of the `Endpoint - remote write URL` from the AMP workspace. You would need that to be configured on ADOT collector configuration. +3. Create a custom ADOT collector configuration file (say `collector.yaml`) in your Lambda function's root directory with details of AMP endpoint remote write URL from previous step. You can also load the configuration file from S3 bucket. +Sample ADOT collector configuration file: +``` +#collector.yaml in the root directory +#Set an environemnt variable 'OPENTELEMETRY_COLLECTOR_CONFIG_FILE' to '/var/task/collector.yaml' + +extensions: + sigv4auth: + service: "aps" + region: "" + +receivers: + otlp: + protocols: + grpc: + http: + +exporters: + logging: + prometheusremotewrite: + endpoint: "" + namespace: test + auth: + authenticator: sigv4auth + +service: + extensions: [sigv4auth] + pipelines: + traces: + receivers: [otlp] + exporters: [awsxray] + metrics: + receivers: [otlp] + exporters: [logging, prometheusremotewrite] +``` +Prometheus Remote Write Exporter can also be configured with retry, and timeout settings. For more information see the [documentation](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/prometheusremotewriteexporter/README.md). **`Note`** Service value for `sigv4auth` extension should be `aps` (amazon prometheus service). Also, Make sure your Lambda function execution role has the required AMP permissions. For more information on permissions and policies required on AMP for AWS Lambda, see the AWS Managed Service for Prometheus [documentation](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-and-IAM.html#AMP-IAM-policies-built-in). + +4. 
Add an environment variable `OPENTELEMETRY_COLLECTOR_CONFIG_FILE` and set value to the path of configuration file. E.g. /var/task/``.yaml. This will tell the Lambda Layer extension where to find the collector configuration. +``` +Function: + Type: AWS::Serverless::Function + Properties: + ... + Environment: + Variables: + OPENTELEMETRY_COLLECTOR_CONFIG_FILE: /var/task/collector.yaml +``` +5. Update your Lambda function code to add metrics using OpenTelemetry Metrics API. Check out examples here. +``` +// get meter +Meter meter = GlobalOpenTelemetry.getMeterProvider() + .meterBuilder("aws-otel") + .setInstrumentationVersion("1.0") + .build(); + +// Build counter e.g. LongCounter +LongCounter counter = meter + .counterBuilder("processed_jobs") + .setDescription("Processed jobs") + .setUnit("1") + .build(); + +// It is recommended that the API user keep a reference to Attributes they will record against +Attributes attributes = Attributes.of(stringKey("Key"), "SomeWork"); + +// Record data +counter.add(123, attributes); +``` + +## **Pros and Cons of using ADOT Lambda Layer** + +If you intend to send traces to AWS X-Ray from Lambda function, you can either use [X-Ray SDK](https://docs.aws.amazon.com/xray/latest/devguide/xray-sdk-nodejs.html) or [AWS Distro for OpenTelemetry (ADOT) Lambda Layer](https://aws-otel.github.io/docs/getting-started/lambda). While X-Ray SDK supports easy instrumentation of various AWS services, it can only send traces to X-Ray. Whereas, ADOT collector, which is included as part of the Lambda Layer supports large number of library instrumentations for each language. You can use it to collect and send metrics and traces to AWS X-Ray and other monitoring solutions, such as Amazon CloudWatch, Amazon OpenSearch Service, Amazon Managed Service for Prometheus and other [partner](https://aws-otel.github.io/docs/components/otlp-exporter#appdynamics) solutions. + +However, due to the flexibility ADOT offers, your Lambda function may require additional memory and can experience notable impact on cold start latency. So, if you are optimizing your Lambda function for low-latency and do not need advanced features of OpenTelemetry, using AWS X-Ray SDK over ADOT might be more suitable. For detailed comparison and guidance on choosing the right tracing tool, refer to AWS docs on [choosing between ADOT and X-Ray SDK](https://docs.aws.amazon.com/xray/latest/devguide/xray-instrumenting-your-app.html#xray-instrumenting-choosing). + + +## **Managing cold start latency when using ADOT** +ADOT Lambda Layer for Java is agent-based, which means that when you enable auto-instrumentation, Java Agent will try to instrument all the OTel [supported](https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation) libraries. This will increase the Lambda function cold start latency significantly. So, we recommend that you only enable auto-instrumentation for the libraries/frameworks that are used by your application. + +To enable only specific instrumentations, you can use the following environment variables: + +* `OTEL_INSTRUMENTATION_COMMON_DEFAULT_ENABLED`: when set to false, disables auto-instrumentation in the Layer, requiring each instrumentation to be enabled individually. +* `OTEL_INSTRUMENTATION__ENABLED`: set to true to enable auto-instrumentation for a specific library or framework. Replace "NAME" by the instrumentation that you want to enable. For the list of available instrumentations, see Suppressing specific agent instrumentation. 
+ +For example, to only enable auto-instrumentation for Lambda and the AWS SDK, you would set the following environment variables: +``` +OTEL_INSTRUMENTATION_COMMON_DEFAULT_ENABLED=false +OTEL_INSTRUMENTATION_AWS_LAMBDA_ENABLED=true +OTEL_INSTRUMENTATION_AWS_SDK_ENABLED=true +``` + +## **Additional Resources** + +* [OpenTelemetry](https://opentelemetry.io) +* [AWS Distro for OpenTelemetry (ADOT)](https://aws-otel.github.io/docs/introduction) +* [ADOT Lambda Layer](https://aws-otel.github.io/docs/getting-started/lambda) + +## **Summary** + +In this observability best practice guide for AWS Lambda based serverless application using Open Source technologies, we covered AWS Distro for OpenTelemetry (ADOT) and Lambda Layer and how you can use it instrument your AWS Lambda functions. We covered how you can easily enable auto-instrumentation as well as customize the ADOT collector with simple configuration to send observability signals to multiple destinations. We highlighted pros and cons of using ADOT and how it can impact cold start latency for your Lambda function and also recommended best practices to manage cold-start times. By adopting these best practices, you can instrument your applications just once to send logs, metrics and traces to multiple monitoring solutions in a vendor agnostic way. + +For further deep dive, we would highly recommend you to practice AWS managed open-source Observability module of [AWS One Observability Workshop](https://catalog.workshops.aws/observability/en-US). diff --git a/docusaurus/observability-best-practices/docs/guides/signal-collection/emf.md b/docusaurus/observability-best-practices/docs/guides/signal-collection/emf.md new file mode 100644 index 000000000..d338ce2c6 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/signal-collection/emf.md @@ -0,0 +1,137 @@ +# CloudWatch Embedded Metric Format + +## Introduction + +CloudWatch Embedded Metric Format (EMF) enables customers to ingest complex high-cardinality application data to Amazon CloudWatch in the form of logs and generate actionable metrics. With Embedded Metric Format customers do not have to rely on complex architecture or need to use any third party tools to gain insights into their environments. Although this feature can be used in all environments, it’s particularly useful in workloads that have ephemeral resources like AWS Lambda functions or containers in Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), or Kubernetes on EC2. Embedded Metric Format lets customers easily create custom metrics without having to instrument or maintain separate code, while gaining powerful analytical capabilities on log data. + +## How Embedded Metric Format (EMF) logs work + +Compute environments like Amazon EC2, On-premise Servers, containers in Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), or Kubernetes on EC2 can generate & send Embedded Metric Format (EMF) logs through the CloudWatch Agent to Amazon CloudWatch. + +AWS Lambda allows customers to easily generate custom metrics without requiring any custom code, making blocking network calls or relying on any third party software to generate and ingest Embedded Metric Format (EMF) logs to Amazon CloudWatch. 
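For reference, a minimal log event that follows the EMF specification looks roughly like the snippet below; the namespace, dimension, metric, and property names are placeholders rather than values prescribed by this guide.

```json
{
  "_aws": {
    "Timestamp": 1700000000000,
    "CloudWatchMetrics": [
      {
        "Namespace": "MyApplication",
        "Dimensions": [["Service"]],
        "Metrics": [{ "Name": "ProcessingLatency", "Unit": "Milliseconds" }]
      }
    ]
  },
  "Service": "checkout",
  "ProcessingLatency": 120,
  "requestId": "illustrative-request-id"
}
```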
+ +Customers can embed custom metrics alongside detailed log event data asynchronously, without having to provide a special header declaration, by publishing structured logs that align with the [EMF specification](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html). CloudWatch automatically extracts the custom metrics so that customers can visualize them and set alarms on them for real-time incident detection. The detailed log events and high-cardinality context associated with the extracted metrics can be queried using CloudWatch Logs Insights to provide deep insights into the root causes of operational events. + +The Amazon CloudWatch output plugin for [Fluent Bit](https://docs.fluentbit.io/manual/pipeline/outputs/cloudwatch) allows customers to ingest metrics and log data into Amazon CloudWatch, and includes support for [Embedded Metric Format](https://github.com/aws/aws-for-fluent-bit) (EMF). + +![CloudWatch EMF Architecture](../../images/EMF-Arch.png) + +## When to use Embedded Metric Format (EMF) logs + +Traditionally, monitoring has been structured into three categories. The first category is the classic health check of an application. The second category is 'metrics', through which customers instrument their application using models like counters, timers, and gauges. The third category is 'logs', which are invaluable for the overall observability of the application. Logs provide customers with continuous information about how their application is behaving. With Embedded Metric Format (EMF) logs, customers can now significantly improve the way they observe their application, unifying and simplifying all of its instrumentation and gaining powerful analytical capabilities, without sacrificing data granularity or richness. + +[Embedded Metric Format (EMF) logs](https://aws.amazon.com/blogs/mt/enhancing-workload-observability-using-amazon-cloudwatch-embedded-metric-format/) are ideal for environments that generate high-cardinality application data, which can be included in the EMF logs without having to increase metric dimensions. This still allows customers to slice and dice the application data by querying EMF logs through CloudWatch Logs Insights and CloudWatch Metrics Insights, without needing to make every attribute a metric dimension. + +Customers aggregating [telemetry data from millions of Telco or IoT devices](https://aws.amazon.com/blogs/mt/how-bt-uses-amazon-cloudwatch-to-monitor-millions-of-devices/) require insights into their devices' performance and the ability to quickly deep dive into the unique telemetry that the devices report. They also need to troubleshoot problems more easily and quickly, without having to dig through enormous amounts of data, to provide a quality service. By using Embedded Metric Format (EMF) logs, customers can achieve large-scale observability by combining metrics and logs into a single entity, improving troubleshooting while gaining cost efficiency and better performance. + +## Generating Embedded Metric Format (EMF) logs + +The following methods can be used to generate Embedded Metric Format logs: + +1. Generate and send the EMF logs through an agent (like the [CloudWatch agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Generation_CloudWatch_Agent.html), Fluent Bit, or FireLens) using open-sourced client libraries.
+ + - Open-sourced client libraries are available in the following languages and can be used to create EMF logs: + - [Node.js](https://github.com/awslabs/aws-embedded-metrics-node) + - [Python](https://github.com/awslabs/aws-embedded-metrics-python) + - [Java](https://github.com/awslabs/aws-embedded-metrics-java) + - [C#](https://github.com/awslabs/aws-embedded-metrics-dotnet) + - EMF logs can also be generated using AWS Distro for OpenTelemetry (ADOT). ADOT is a secure, production-ready, AWS-supported distribution of the OpenTelemetry project, which is part of the Cloud Native Computing Foundation (CNCF). OpenTelemetry is an open-source initiative that provides APIs, libraries, and agents to collect distributed traces, logs and metrics for application monitoring, and removes boundaries and restrictions between vendor-specific formats. Two components are required for this: an OpenTelemetry-compliant data source and the [ADOT Collector](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/awsemfexporter) enabled for use with [CloudWatch EMF](https://aws-otel.github.io/docs/getting-started/cloudwatch-metrics#cloudwatch-emf-exporter-awsemf) logs. + +2. Manually construct logs conforming to the [defined specification in JSON format](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html) and send them to CloudWatch through the [CloudWatch agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Generation_CloudWatch_Agent.html) or the [PutLogEvents API](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Generation_PutLogEvents.html). + +## Viewing Embedded Metric Format logs in CloudWatch console + +After generating Embedded Metric Format (EMF) logs, from which the metrics are extracted, customers can [view them in the CloudWatch console](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_View.html) under Metrics. Embedded metrics have the dimensions that are specified while generating the logs. Embedded metrics generated using the client libraries have ServiceType, ServiceName, and LogGroup as default dimensions; a brief client-library usage sketch follows the list below. + +- **ServiceName**: The name of the service can be overridden; however, for services where the name cannot be inferred (e.g. a Java process running on EC2), a default value of Unknown is used if not explicitly set. +- **ServiceType**: The type of the service can be overridden; however, for services where the type cannot be inferred (e.g. a Java process running on EC2), a default value of Unknown is used if not explicitly set. +- **LogGroupName**: For agent-based platforms, customers can optionally configure the destination log group that metrics should be delivered to. This value is passed from the library to the agent in the Embedded Metric payload. If a LogGroup is not provided, the default value is derived from the service name: `<service-name>-metrics` +- **LogStreamName**: For agent-based platforms, customers can optionally configure the destination log stream that metrics should be delivered to. This value is passed from the library to the agent in the Embedded Metric payload. If a LogStreamName is not provided, the default value is derived by the agent (this will likely be the hostname). +- **NameSpace**: Overrides the CloudWatch namespace. If not set, a default value of aws-embedded-metrics is used.
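The sketch below assumes the `aws-embedded-metrics` Node.js library linked above; `metricScope`, `putMetric`, `putDimensions`, `setProperty`, and the `Configuration` overrides are taken from that library's documentation, but treat the exact names and values here as illustrative rather than prescriptive.

```js
// Sketch: emitting an EMF log from a Node.js Lambda handler with aws-embedded-metrics.
const { metricScope, Unit, Configuration } = require("aws-embedded-metrics");

// Optional overrides for the defaults described in the list above.
Configuration.serviceName = "emfTestFunction";        // ServiceName
Configuration.serviceType = "AWS::Lambda::Function";  // ServiceType
Configuration.namespace = "aws-embedded-metrics";     // NameSpace

// metricScope creates a metrics logger and flushes the EMF log when the handler returns.
exports.handler = metricScope((metrics) => async (event, context) => {
  metrics.putDimensions({ Service: "Aggregator" });        // additional dimension
  metrics.setProperty("RequestId", context.awsRequestId);  // high-cardinality context
  metrics.putMetric("ProcessingLatency", 100, Unit.Milliseconds);
  return { statusCode: 200 };
});
```

Per the library's documented behavior, in Lambda it writes the EMF document to stdout, while on agent-based platforms (EC2, ECS, EKS) it sends the payload to the CloudWatch agent, which is where the LogGroupName and LogStreamName settings above come into play.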
+ +A sample EMF log looks like the following in the CloudWatch console logs: + +```json +2023-05-19T15:20:39.391Z 238196b6-c8da-4341-a4b7-0c322e0ef5bb INFO +{ + "LogGroup": "emfTestFunction", + "ServiceName": "emfTestFunction", + "ServiceType": "AWS::Lambda::Function", + "Service": "Aggregator", + "AccountId": "XXXXXXXXXXXX", + "RequestId": "422b1569-16f6-4a03-b8f0-fe3fd9b100f8", + "DeviceId": "61270781-c6ac-46f1-baf7-22c808af8162", + "Payload": { + "sampleTime": 123456789, + "temperature": 273, + "pressure": 101.3 + }, + "executionEnvironment": "AWS_Lambda_nodejs18.x", + "memorySize": "256", + "functionVersion": "$LATEST", + "logStreamId": "2023/05/19/[$LATEST]f3377848231140c185570caa9f97abc8", + "_aws": { + "Timestamp": 1684509639390, + "CloudWatchMetrics": [ + { + "Dimensions": [ + [ + "LogGroup", + "ServiceName", + "ServiceType", + "Service" + ] + ], + "Metrics": [ + { + "Name": "ProcessingLatency", + "Unit": "Milliseconds" + } + ], + "Namespace": "aws-embedded-metrics" + } + ] + }, + "ProcessingLatency": 100 +} +``` + +For the same EMF log, the extracted metrics look like the following, and can be queried in **CloudWatch Metrics**. + +![CloudWatch EMF Metrics](../../images/emf_extracted_metrics.png) + +Customers can query the detailed log events associated with the extracted metrics using **CloudWatch Logs Insights** to get deep insights into the root causes of operational events. One of the benefits of extracting metrics from EMF logs is that customers can filter logs by the unique metric (metric name plus unique dimension set) and metric values, to get context on the events that contributed to the aggregated metric value. + +For the same EMF log discussed above, a sample CloudWatch Logs Insights query that uses ProcessingLatency as a metric and Service as a dimension to find an impacted request ID or device ID is shown below. + +``` +filter ProcessingLatency < 200 and Service = "Aggregator" +| fields @requestId, @ingestionTime, @DeviceId +``` + +![CloudWatch EMF Logs](../../images/emf_extracted_CWLogs.png) + +## Alarms on metrics created with EMF logs + +Creating [alarms on metrics generated by EMF](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Alarms.html) follows the same pattern as creating alarms on any other metrics. The key thing to note here is that EMF metric generation depends on the log publishing flow, because CloudWatch Logs processes the EMF logs and transforms them into metrics. So it’s important to publish logs in a timely manner so that the metric datapoints are created within the period of time in which alarms are evaluated. + +For the same EMF log discussed above, an example alarm is created and shown below using the ProcessingLatency metric as a datapoint with a threshold. + +![CloudWatch EMF Alarm](../../images/EMF-Alarm.png) + +## Latest features of EMF Logs + +Customers can send EMF logs to CloudWatch Logs using the [PutLogEvents API](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Generation_PutLogEvents.html) and may optionally include the HTTP header `x-amzn-logs-format: json/emf` to instruct CloudWatch Logs that metrics should be extracted; this header is no longer required. + +Amazon CloudWatch supports [high resolution metric extraction](https://aws.amazon.com/about-aws/whats-new/2023/02/amazon-cloudwatch-high-resolution-metric-extraction-structured-logs/) with up to 1 second granularity from structured logs using Embedded Metric Format (EMF).
Customers can provide an optional [StorageResolution](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html#Resolution_definition) parameter within EMF specification logs with a value of 1 or 60 (default) to indicate the desired resolution (in seconds) of the metric. Customers can publish both standard resolution (60 seconds) and high resolution (1 second) metrics via EMF, enabling granular visibility into their applications’ health and performance. + +Amazon CloudWatch provides [enhanced visibility into errors](https://aws.amazon.com/about-aws/whats-new/2023/01/amazon-cloudwatch-enhanced-error-visibility-embedded-metric-format-emf/) in Embedded Metric Format (EMF) with two error metrics ([EMFValidationErrors & EMFParsingErrors](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CloudWatch-Logs-Monitoring-CloudWatch-Metrics.html)). This enhanced visibility helps customers quickly identify and remediate errors when leveraging EMF, thereby simplifying the instrumentation process. + +With the increased complexity of managing modern applications, customers need more flexibility when defining and analyzing custom metrics. Hence, the maximum number of metric dimensions has been increased from 10 to 30. Customers can create custom metrics using [EMF logs with up to 30 dimensions](https://aws.amazon.com/about-aws/whats-new/2022/08/amazon-cloudwatch-metrics-increases-throughput/). + +## Additional References: + +- One Observability Workshop on [Embedded Metric Format with an AWS Lambda function](https://catalog.workshops.aws/observability/en-US/aws-native/metrics/emf/clientlibrary) sample using the Node.js library. +- Serverless Observability Workshop on [Async metrics using Embedded Metrics Format](https://serverless-observability.workshop.aws/en/030_cloudwatch/async_metrics_emf.html) (EMF) +- [Java code sample using PutLogEvents API](https://catalog.workshops.aws/observability/en-US/aws-native/metrics/emf/putlogevents) to send EMF logs to CloudWatch Logs +- Blog article: [Lowering costs and focusing on our customers with Amazon CloudWatch embedded custom metrics](https://aws.amazon.com/blogs/mt/lowering-costs-and-focusing-on-our-customers-with-amazon-cloudwatch-embedded-custom-metrics/) diff --git a/docusaurus/observability-best-practices/docs/guides/signal-correlation/how-does-it-work.md b/docusaurus/observability-best-practices/docs/guides/signal-correlation/how-does-it-work.md new file mode 100644 index 000000000..e69de29bb diff --git a/docusaurus/observability-best-practices/docs/guides/strategy.md b/docusaurus/observability-best-practices/docs/guides/strategy.md new file mode 100644 index 000000000..636d5d979 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/strategy.md @@ -0,0 +1,2 @@ +# Creating an observability strategy + diff --git a/docusaurus/observability-best-practices/docs/guides/webpackConfig.js b/docusaurus/observability-best-practices/docs/guides/webpackConfig.js new file mode 100644 index 000000000..aeebda8da --- /dev/null +++ b/docusaurus/observability-best-practices/docs/guides/webpackConfig.js @@ -0,0 +1,12 @@ +module.exports = function(config) { + // Webpack Configuration + + // Ignore the "compiled with problems" error + config.webpack.cache = false; + config.webpack.ignoreWarnings = [/Failed to parse source map/]; + config.webpack.watchOptions = { + ignored: ['node_modules'], + }; + + return config; + }; \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/home.md
b/docusaurus/observability-best-practices/docs/home.md new file mode 100644 index 000000000..127d91cdb --- /dev/null +++ b/docusaurus/observability-best-practices/docs/home.md @@ -0,0 +1,34 @@ +# What is observability + +## What it is + +Observability is the capability to continuously generate and discover actionable insights based on signals from the system under observation. In other words, observability allows users to understand a system’s state from its external output and take (corrective) action. + +## Problem it addresses + +Computer systems are measured by observing low-level signals such as CPU time, memory, disk space, and higher-level and business signals, including API response times, errors, transactions per second, etc. + +The observability of a system has a significant impact on its operating and development costs. Observable systems yield meaningful, actionable data to their operators, allowing them to achieve favorable outcomes (faster incident response, increased developer productivity) and less toil and downtime. + +## How it helps + +Understanding that more information does not necessarily translate into a more observable system is pivotal. In fact, sometimes, the amount of information generated by a system can make it harder to identify valuable health signals from the noise generated by the application. Observability requires the right data at the right time for the right consumer (human or piece of software) to make the right decisions. + +## What you will find here + +This site contains our best practices for observability: what to do, what *not* to do, and a collection of recipes on how to do them. Most of the content here is vendor agnostic and represents what any good observability solution will provide. + +It is important that you consider observability as a *solution* though and not a *product*. Observability comes from your practices, and is integral to strong development and DevOps leadership. A well-observed application is one that places observability as a principle of operations, similar to how security must be at the forefront of how you organize a project. Attempting to “bolt on” observability after the fact is an anti-pattern and meets with less success. + +This site is organized into four categories: + +1. [Best practices by solution, such as for dashboarding, application performance monitoring, or containers](https://aws-observability.github.io/observability-best-practices/guides/) +1. [Best practices for the use of different data types, such as for logs or traces](https://aws-observability.github.io/observability-best-practices/signals/logs/) +1. [Best practices for specific AWS tools (though these are largely fungible to other vendor products as well)](https://aws-observability.github.io/observability-best-practices/tools/cloudwatch_agent/) +1. [Curated recipes for observability with AWS](https://aws-observability.github.io/observability-best-practices/recipes/) + +:::info + This site is based on real world use cases that AWS and our customers have solved for. + + Observability is at the heart of modern application development, and a critical consideration when operating distributed systems, such as microservices, or complex applications with many external integrations. We consider it to be a leading indicator of a healthy workload, and we are pleased to share our experiences with you here!
+::: \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/images/ADOT-central.png b/docusaurus/observability-best-practices/docs/images/ADOT-central.png new file mode 100644 index 000000000..0b9cbedc1 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/ADOT-central.png differ diff --git a/docusaurus/observability-best-practices/docs/images/ADOT-sidecar.png b/docusaurus/observability-best-practices/docs/images/ADOT-sidecar.png new file mode 100644 index 000000000..1d55523e8 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/ADOT-sidecar.png differ diff --git a/docusaurus/observability-best-practices/docs/images/AHA-Integration.jpg b/docusaurus/observability-best-practices/docs/images/AHA-Integration.jpg new file mode 100644 index 000000000..1555143cd Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/AHA-Integration.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/AMP_rules_namespaces.png b/docusaurus/observability-best-practices/docs/images/AMP_rules_namespaces.png new file mode 100644 index 000000000..0642857e3 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/AMP_rules_namespaces.png differ diff --git a/docusaurus/observability-best-practices/docs/images/AWS-Observability-maturity-model.png b/docusaurus/observability-best-practices/docs/images/AWS-Observability-maturity-model.png new file mode 100644 index 000000000..b26ce356f Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/AWS-Observability-maturity-model.png differ diff --git a/docusaurus/observability-best-practices/docs/images/AWS_O11y_Stack.png b/docusaurus/observability-best-practices/docs/images/AWS_O11y_Stack.png new file mode 100644 index 000000000..59fc1c744 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/AWS_O11y_Stack.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Application_Insights_CW_DB.png b/docusaurus/observability-best-practices/docs/images/Application_Insights_CW_DB.png new file mode 100644 index 000000000..439fe55be Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Application_Insights_CW_DB.png differ diff --git a/docusaurus/observability-best-practices/docs/images/CW-ApplicationInsights.jpg b/docusaurus/observability-best-practices/docs/images/CW-ApplicationInsights.jpg new file mode 100644 index 000000000..1da83e2e8 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/CW-ApplicationInsights.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/ContainerInsightsMetrics.png b/docusaurus/observability-best-practices/docs/images/ContainerInsightsMetrics.png new file mode 100644 index 000000000..d6fd3f946 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/ContainerInsightsMetrics.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Container_Insights_CW_Automatic_DB.png b/docusaurus/observability-best-practices/docs/images/Container_Insights_CW_Automatic_DB.png new file mode 100644 index 000000000..708032a68 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Container_Insights_CW_Automatic_DB.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-1.jpg 
b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-1.jpg new file mode 100644 index 000000000..70f2dd9db Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-1.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-10.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-10.jpg new file mode 100644 index 000000000..68b7fb1fc Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-10.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-11.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-11.jpg new file mode 100644 index 000000000..006bfd623 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-11.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-12.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-12.jpg new file mode 100644 index 000000000..3a5cffcc4 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-12.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-13.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-13.jpg new file mode 100644 index 000000000..9c6eba97e Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-13.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-2.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-2.jpg new file mode 100644 index 000000000..69aa3487f Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-2.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-3.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-3.jpg new file mode 100644 index 000000000..5bfc8c3aa Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-3.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-4.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-4.jpg new file mode 100644 index 000000000..dd7f47be2 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-4.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-5.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-5.jpg new file mode 100644 index 000000000..635eb77f9 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-5.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-6.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-6.jpg new file mode 100644 index 000000000..df6dd526c Binary files /dev/null 
and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-6.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-7.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-7.jpg new file mode 100644 index 000000000..1a27cfaa4 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-7.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-8.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-8.jpg new file mode 100644 index 000000000..be54fea8a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-8.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-9.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-9.jpg new file mode 100644 index 000000000..317b58dde Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/api-mon-9.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/cw-adot-collector-pipeline-eks.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/cw-adot-collector-pipeline-eks.jpg new file mode 100644 index 000000000..a0031041b Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/cw-adot-collector-pipeline-eks.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/cw-adot-collector-pipeline.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/cw-adot-collector-pipeline.jpg new file mode 100644 index 000000000..86eb83bed Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/cw-adot-collector-pipeline.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/cw-components.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/cw-components.jpg new file mode 100644 index 000000000..4fba29962 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/cw-components.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/cw-cost-explorer.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/cw-cost-explorer.jpg new file mode 100644 index 000000000..d661d4a19 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/cw-cost-explorer.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/log-aggreg-1.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/log-aggreg-1.jpg new file mode 100644 index 000000000..07fa4c88e Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/log-aggreg-1.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/log-aggreg-2.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/log-aggreg-2.jpg new file mode 100644 index 000000000..14a8d88d7 Binary files /dev/null and 
b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/log-aggreg-2.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/log-aggreg-3.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/log-aggreg-3.jpg new file mode 100644 index 000000000..883c71bde Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/log-aggreg-3.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/log-aggreg-4.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/log-aggreg-4.jpg new file mode 100644 index 000000000..2092047bf Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/log-aggreg-4.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/tracing-1.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/tracing-1.jpg new file mode 100644 index 000000000..f203343c8 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/tracing-1.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/tracing-2.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/tracing-2.jpg new file mode 100644 index 000000000..18664e3e7 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/tracing-2.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/tracing-3.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/tracing-3.jpg new file mode 100644 index 000000000..819da56f7 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/tracing-3.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/tracing-4.jpg b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/tracing-4.jpg new file mode 100644 index 000000000..6b2731542 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/aws-native/eks/tracing-4.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/arch.png b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/arch.png new file mode 100644 index 000000000..27aea7912 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/arch.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda1.png b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda1.png new file mode 100644 index 000000000..9e48f4eee Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda1.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda10.png b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda10.png new file mode 100644 index 000000000..0f05068a3 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda10.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda2.png 
b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda2.png new file mode 100644 index 000000000..867670006 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda2.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda3.png b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda3.png new file mode 100644 index 000000000..64c96a0c2 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda3.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda4.png b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda4.png new file mode 100644 index 000000000..1ec31f848 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda4.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda5.png b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda5.png new file mode 100644 index 000000000..0a215fb70 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda5.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda6.png b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda6.png new file mode 100644 index 000000000..b1f4eb38a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda6.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda7.png b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda7.png new file mode 100644 index 000000000..f16f6f5f0 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda7.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda8.png b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda8.png new file mode 100644 index 000000000..de41739e5 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda8.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda9.png b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda9.png new file mode 100644 index 000000000..349abe3df Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Containers/oss/eks/keda9.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Contributor_Insights_CW_DB.png b/docusaurus/observability-best-practices/docs/images/Contributor_Insights_CW_DB.png new file mode 100644 index 000000000..6909870c1 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Contributor_Insights_CW_DB.png differ diff --git a/docusaurus/observability-best-practices/docs/images/CustomDashboard.png b/docusaurus/observability-best-practices/docs/images/CustomDashboard.png new file mode 100644 index 000000000..4fdf7e93d Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/CustomDashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/images/EMF-Alarm.png b/docusaurus/observability-best-practices/docs/images/EMF-Alarm.png new file mode 100644 index 000000000..48484b9bd Binary files /dev/null and 
b/docusaurus/observability-best-practices/docs/images/EMF-Alarm.png differ diff --git a/docusaurus/observability-best-practices/docs/images/EMF-Arch.png b/docusaurus/observability-best-practices/docs/images/EMF-Arch.png new file mode 100644 index 000000000..38bfdeb1a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/EMF-Arch.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Lambda_Insights_CW_Automatic_DB.png b/docusaurus/observability-best-practices/docs/images/Lambda_Insights_CW_Automatic_DB.png new file mode 100644 index 000000000..6c4b08bf3 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Lambda_Insights_CW_Automatic_DB.png differ diff --git a/docusaurus/observability-best-practices/docs/images/LogInsights.png b/docusaurus/observability-best-practices/docs/images/LogInsights.png new file mode 100644 index 000000000..6cfbed843 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/LogInsights.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Metrics_Explorer_CW_DB.png b/docusaurus/observability-best-practices/docs/images/Metrics_Explorer_CW_DB.png new file mode 100644 index 000000000..cf2bae49e Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Metrics_Explorer_CW_DB.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Operational/gitops-with-amg/gitops-with-amg-1.jpg b/docusaurus/observability-best-practices/docs/images/Operational/gitops-with-amg/gitops-with-amg-1.jpg new file mode 100644 index 000000000..dc78feadf Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Operational/gitops-with-amg/gitops-with-amg-1.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Operational/gitops-with-amg/gitops-with-amg-2.jpg b/docusaurus/observability-best-practices/docs/images/Operational/gitops-with-amg/gitops-with-amg-2.jpg new file mode 100644 index 000000000..527ac7796 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Operational/gitops-with-amg/gitops-with-amg-2.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/Prometheus.png b/docusaurus/observability-best-practices/docs/images/Prometheus.png new file mode 100644 index 000000000..d3ca83021 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Prometheus.png differ diff --git a/docusaurus/observability-best-practices/docs/images/PushPullApproach.png b/docusaurus/observability-best-practices/docs/images/PushPullApproach.png new file mode 100644 index 000000000..c3321a87b Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/PushPullApproach.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/apigw_lambda.png b/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/apigw_lambda.png new file mode 100644 index 000000000..8273cb376 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/apigw_lambda.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/cw_dashboard.png b/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/cw_dashboard.png new file mode 100644 index 000000000..084215fda Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/cw_dashboard.png differ diff --git 
a/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/cw_percentile.png b/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/cw_percentile.png new file mode 100644 index 000000000..b98848490 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/cw_percentile.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/lambda_insights.png b/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/lambda_insights.png new file mode 100644 index 000000000..c2eea08f7 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/lambda_insights.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/xray_trace.png b/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/xray_trace.png new file mode 100644 index 000000000..4f000772c Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Serverless/aws-native/xray_trace.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Serverless/oss/xray-trace.png b/docusaurus/observability-best-practices/docs/images/Serverless/oss/xray-trace.png new file mode 100644 index 000000000..92ee92258 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Serverless/oss/xray-trace.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Service_Map_CW_DB.png b/docusaurus/observability-best-practices/docs/images/Service_Map_CW_DB.png new file mode 100644 index 000000000..c88a7b9b6 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Service_Map_CW_DB.png differ diff --git a/docusaurus/observability-best-practices/docs/images/Why_is_Observability_Important.png b/docusaurus/observability-best-practices/docs/images/Why_is_Observability_Important.png new file mode 100644 index 000000000..1b64136eb Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/Why_is_Observability_Important.png differ diff --git a/docusaurus/observability-best-practices/docs/images/adot-arch.png b/docusaurus/observability-best-practices/docs/images/adot-arch.png new file mode 100644 index 000000000..9116143d5 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/adot-arch.png differ diff --git a/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-deployment.png b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-deployment.png new file mode 100644 index 000000000..752b8f402 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-deployment.png differ diff --git a/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-ecs.png b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-ecs.png new file mode 100644 index 000000000..58606e5c5 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-ecs.png differ diff --git a/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-gateway-batching-pressure.png b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-gateway-batching-pressure.png new file mode 100644 index 000000000..120cd956e Binary files /dev/null and 
b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-gateway-batching-pressure.png differ diff --git a/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-gateway-batching.png b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-gateway-batching.png new file mode 100644 index 000000000..0f2db2ef6 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-gateway-batching.png differ diff --git a/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-gateway.png b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-gateway.png new file mode 100644 index 000000000..1029d67d2 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-gateway.png differ diff --git a/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-no-collector.png b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-no-collector.png new file mode 100644 index 000000000..42dc03567 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-no-collector.png differ diff --git a/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-simple-gateway-pressure.png b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-simple-gateway-pressure.png new file mode 100644 index 000000000..8b513799a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-simple-gateway-pressure.png differ diff --git a/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-simple-gateway.png b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-simple-gateway.png new file mode 100644 index 000000000..6f1664233 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/adot-collector-deployment-simple-gateway.png differ diff --git a/docusaurus/observability-best-practices/docs/images/adot-collector-eks-daemonset.png b/docusaurus/observability-best-practices/docs/images/adot-collector-eks-daemonset.png new file mode 100644 index 000000000..0b84a7f57 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/adot-collector-eks-daemonset.png differ diff --git a/docusaurus/observability-best-practices/docs/images/adot-collector-eks-sidecar.png b/docusaurus/observability-best-practices/docs/images/adot-collector-eks-sidecar.png new file mode 100644 index 000000000..2aec26cb5 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/adot-collector-eks-sidecar.png differ diff --git a/docusaurus/observability-best-practices/docs/images/adot-emf.png b/docusaurus/observability-best-practices/docs/images/adot-emf.png new file mode 100644 index 000000000..be780b3ba Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/adot-emf.png differ diff --git a/docusaurus/observability-best-practices/docs/images/adot-prom-arch.png b/docusaurus/observability-best-practices/docs/images/adot-prom-arch.png new file mode 100644 index 000000000..31b366587 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/adot-prom-arch.png differ diff --git a/docusaurus/observability-best-practices/docs/images/adot.png b/docusaurus/observability-best-practices/docs/images/adot.png new file mode 100644 index 
000000000..cf52e8472 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/adot.png differ diff --git a/docusaurus/observability-best-practices/docs/images/alarm3.png b/docusaurus/observability-best-practices/docs/images/alarm3.png new file mode 100644 index 000000000..b0b09d21e Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/alarm3.png differ diff --git a/docusaurus/observability-best-practices/docs/images/alarm4.png b/docusaurus/observability-best-practices/docs/images/alarm4.png new file mode 100644 index 000000000..55e420f46 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/alarm4.png differ diff --git a/docusaurus/observability-best-practices/docs/images/alarms.graffle b/docusaurus/observability-best-practices/docs/images/alarms.graffle new file mode 100644 index 000000000..5514336b4 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/alarms.graffle differ diff --git a/docusaurus/observability-best-practices/docs/images/allocations.png b/docusaurus/observability-best-practices/docs/images/allocations.png new file mode 100644 index 000000000..0a1f79360 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/allocations.png differ diff --git a/docusaurus/observability-best-practices/docs/images/amg-rds-aurora.png b/docusaurus/observability-best-practices/docs/images/amg-rds-aurora.png new file mode 100644 index 000000000..d6cfeedea Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/amg-rds-aurora.png differ diff --git a/docusaurus/observability-best-practices/docs/images/amp-alerting.png b/docusaurus/observability-best-practices/docs/images/amp-alerting.png new file mode 100644 index 000000000..a1a92ee40 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/amp-alerting.png differ diff --git a/docusaurus/observability-best-practices/docs/images/amp-overview.png b/docusaurus/observability-best-practices/docs/images/amp-overview.png new file mode 100644 index 000000000..d740e7ff0 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/amp-overview.png differ diff --git a/docusaurus/observability-best-practices/docs/images/ampmetricsingestionrate.png b/docusaurus/observability-best-practices/docs/images/ampmetricsingestionrate.png new file mode 100644 index 000000000..31a0d1abd Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/ampmetricsingestionrate.png differ diff --git a/docusaurus/observability-best-practices/docs/images/ampwsingestionrate-1.png b/docusaurus/observability-best-practices/docs/images/ampwsingestionrate-1.png new file mode 100644 index 000000000..e6c7dd22c Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/ampwsingestionrate-1.png differ diff --git a/docusaurus/observability-best-practices/docs/images/ampwsingestionrate-2.png b/docusaurus/observability-best-practices/docs/images/ampwsingestionrate-2.png new file mode 100644 index 000000000..12429e10b Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/ampwsingestionrate-2.png differ diff --git a/docusaurus/observability-best-practices/docs/images/ampwsingestionrate-3.png b/docusaurus/observability-best-practices/docs/images/ampwsingestionrate-3.png new file mode 100644 index 000000000..026d2be20 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/ampwsingestionrate-3.png differ diff 
--git a/docusaurus/observability-best-practices/docs/images/automatic-dashboard.png b/docusaurus/observability-best-practices/docs/images/automatic-dashboard.png new file mode 100644 index 000000000..6db1dc412 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/automatic-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/images/aws-logo.png b/docusaurus/observability-best-practices/docs/images/aws-logo.png new file mode 100644 index 000000000..d49129e3f Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/aws-logo.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cloudwatch-cost-1.PNG b/docusaurus/observability-best-practices/docs/images/cloudwatch-cost-1.PNG new file mode 100644 index 000000000..bd304d8d2 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cloudwatch-cost-1.PNG differ diff --git a/docusaurus/observability-best-practices/docs/images/cloudwatch-cost-2.PNG b/docusaurus/observability-best-practices/docs/images/cloudwatch-cost-2.PNG new file mode 100644 index 000000000..7be9b2fa9 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cloudwatch-cost-2.PNG differ diff --git a/docusaurus/observability-best-practices/docs/images/cloudwatch-intro.png b/docusaurus/observability-best-practices/docs/images/cloudwatch-intro.png new file mode 100644 index 000000000..0bee34796 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cloudwatch-intro.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cur-architecture.png b/docusaurus/observability-best-practices/docs/images/cur-architecture.png new file mode 100644 index 000000000..d09257ccb Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cur-architecture.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cw-agent.png b/docusaurus/observability-best-practices/docs/images/cw-agent.png new file mode 100644 index 000000000..438413678 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cw-agent.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cw-alarm.png b/docusaurus/observability-best-practices/docs/images/cw-alarm.png new file mode 100644 index 000000000..b20f025de Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cw-alarm.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cw-logs.png b/docusaurus/observability-best-practices/docs/images/cw-logs.png new file mode 100644 index 000000000..99a467b07 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cw-logs.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cw-metrics.png b/docusaurus/observability-best-practices/docs/images/cw-metrics.png new file mode 100644 index 000000000..52f6fae6e Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cw-metrics.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cw_dashboards_custom-widgets.png b/docusaurus/observability-best-practices/docs/images/cw_dashboards_custom-widgets.png new file mode 100644 index 000000000..2a24157c4 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cw_dashboards_custom-widgets.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cw_dashboards_widgets.png 
b/docusaurus/observability-best-practices/docs/images/cw_dashboards_widgets.png new file mode 100644 index 000000000..ac5b6c1f7 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cw_dashboards_widgets.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cwalarm1.png b/docusaurus/observability-best-practices/docs/images/cwalarm1.png new file mode 100644 index 000000000..1744ab59e Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cwalarm1.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cwalarm2.png b/docusaurus/observability-best-practices/docs/images/cwalarm2.png new file mode 100644 index 000000000..579b260d4 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cwalarm2.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cwl-dp-cred-sensitive.png b/docusaurus/observability-best-practices/docs/images/cwl-dp-cred-sensitive.png new file mode 100644 index 000000000..efb5fa4ef Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cwl-dp-cred-sensitive.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cwl-dp-credentials.png b/docusaurus/observability-best-practices/docs/images/cwl-dp-credentials.png new file mode 100644 index 000000000..418c44ba4 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cwl-dp-credentials.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cwl-dp-fin-info.png b/docusaurus/observability-best-practices/docs/images/cwl-dp-fin-info.png new file mode 100644 index 000000000..a1b30b540 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cwl-dp-fin-info.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cwl-dp-loggroup.png b/docusaurus/observability-best-practices/docs/images/cwl-dp-loggroup.png new file mode 100644 index 000000000..52c76ebf5 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cwl-dp-loggroup.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cwl-dp-masked.png b/docusaurus/observability-best-practices/docs/images/cwl-dp-masked.png new file mode 100644 index 000000000..3177666ef Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cwl-dp-masked.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cwl-dp-phi.png b/docusaurus/observability-best-practices/docs/images/cwl-dp-phi.png new file mode 100644 index 000000000..c41edcd9b Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cwl-dp-phi.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cwl-dp-pii.png b/docusaurus/observability-best-practices/docs/images/cwl-dp-pii.png new file mode 100644 index 000000000..b4517b32b Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cwl-dp-pii.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cwl1.png b/docusaurus/observability-best-practices/docs/images/cwl1.png new file mode 100644 index 000000000..04636e77a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/cwl1.png differ diff --git a/docusaurus/observability-best-practices/docs/images/cwl2.png b/docusaurus/observability-best-practices/docs/images/cwl2.png new file mode 100644 index 000000000..ba74d9752 Binary files /dev/null and 
b/docusaurus/observability-best-practices/docs/images/cwl2.png differ diff --git a/docusaurus/observability-best-practices/docs/images/dashboard1.png b/docusaurus/observability-best-practices/docs/images/dashboard1.png new file mode 100644 index 000000000..eadad2d64 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/dashboard1.png differ diff --git a/docusaurus/observability-best-practices/docs/images/database_cw_ds_amg.png b/docusaurus/observability-best-practices/docs/images/database_cw_ds_amg.png new file mode 100644 index 000000000..779b5749c Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/database_cw_ds_amg.png differ diff --git a/docusaurus/observability-best-practices/docs/images/database_performance_symptoms.png b/docusaurus/observability-best-practices/docs/images/database_performance_symptoms.png new file mode 100644 index 000000000..ebf113cd8 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/database_performance_symptoms.png differ diff --git a/docusaurus/observability-best-practices/docs/images/databricks_cw_arch.png b/docusaurus/observability-best-practices/docs/images/databricks_cw_arch.png new file mode 100644 index 000000000..eb93368b3 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/databricks_cw_arch.png differ diff --git a/docusaurus/observability-best-practices/docs/images/databricks_oss_diagram.png b/docusaurus/observability-best-practices/docs/images/databricks_oss_diagram.png new file mode 100644 index 000000000..257c421a5 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/databricks_oss_diagram.png differ diff --git a/docusaurus/observability-best-practices/docs/images/databricks_spark_config.png b/docusaurus/observability-best-practices/docs/images/databricks_spark_config.png new file mode 100644 index 000000000..ce62d77bf Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/databricks_spark_config.png differ diff --git a/docusaurus/observability-best-practices/docs/images/db_cw_alarm.png b/docusaurus/observability-best-practices/docs/images/db_cw_alarm.png new file mode 100644 index 000000000..f0c4253c7 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/db_cw_alarm.png differ diff --git a/docusaurus/observability-best-practices/docs/images/db_cw_metrics.png b/docusaurus/observability-best-practices/docs/images/db_cw_metrics.png new file mode 100644 index 000000000..adfaf7d38 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/db_cw_metrics.png differ diff --git a/docusaurus/observability-best-practices/docs/images/db_dgr_anomaly.png b/docusaurus/observability-best-practices/docs/images/db_dgr_anomaly.png new file mode 100644 index 000000000..892c3c22f Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/db_dgr_anomaly.png differ diff --git a/docusaurus/observability-best-practices/docs/images/db_dgr_recommendation.png b/docusaurus/observability-best-practices/docs/images/db_dgr_recommendation.png new file mode 100644 index 000000000..097b7fce2 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/db_dgr_recommendation.png differ diff --git a/docusaurus/observability-best-practices/docs/images/db_enhanced_monitoring.png b/docusaurus/observability-best-practices/docs/images/db_enhanced_monitoring.png new file mode 100644 index 000000000..b70faa15a Binary files /dev/null and 
b/docusaurus/observability-best-practices/docs/images/db_enhanced_monitoring.png differ diff --git a/docusaurus/observability-best-practices/docs/images/db_enhanced_monitoring_loggroup.png b/docusaurus/observability-best-practices/docs/images/db_enhanced_monitoring_loggroup.png new file mode 100644 index 000000000..a548273cc Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/db_enhanced_monitoring_loggroup.png differ diff --git a/docusaurus/observability-best-practices/docs/images/db_perf_insights.png b/docusaurus/observability-best-practices/docs/images/db_perf_insights.png new file mode 100644 index 000000000..a2859a322 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/db_perf_insights.png differ diff --git a/docusaurus/observability-best-practices/docs/images/db_performanceinsights_amg.png b/docusaurus/observability-best-practices/docs/images/db_performanceinsights_amg.png new file mode 100644 index 000000000..39b9a6d60 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/db_performanceinsights_amg.png differ diff --git a/docusaurus/observability-best-practices/docs/images/default-metrics.png b/docusaurus/observability-best-practices/docs/images/default-metrics.png new file mode 100644 index 000000000..f098da5c1 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/default-metrics.png differ diff --git a/docusaurus/observability-best-practices/docs/images/diff-query.png b/docusaurus/observability-best-practices/docs/images/diff-query.png new file mode 100644 index 000000000..7c1e66d11 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/diff-query.png differ diff --git a/docusaurus/observability-best-practices/docs/images/doge.jpg b/docusaurus/observability-best-practices/docs/images/doge.jpg new file mode 100644 index 000000000..a6e280b49 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/doge.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/ec2-auto-dashboard.png b/docusaurus/observability-best-practices/docs/images/ec2-auto-dashboard.png new file mode 100644 index 000000000..6db1dc412 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/ec2-auto-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/images/ec2-custom-dashboard.png b/docusaurus/observability-best-practices/docs/images/ec2-custom-dashboard.png new file mode 100644 index 000000000..4fdf7e93d Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/ec2-custom-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/images/ec2-resource-health.png b/docusaurus/observability-best-practices/docs/images/ec2-resource-health.png new file mode 100644 index 000000000..fe5adf333 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/ec2-resource-health.png differ diff --git a/docusaurus/observability-best-practices/docs/images/emf_extracted_CWLogs.png b/docusaurus/observability-best-practices/docs/images/emf_extracted_CWLogs.png new file mode 100644 index 000000000..0ff6c324a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/emf_extracted_CWLogs.png differ diff --git a/docusaurus/observability-best-practices/docs/images/emf_extracted_metrics.png b/docusaurus/observability-best-practices/docs/images/emf_extracted_metrics.png new file mode 100644 index 000000000..598f093ad Binary 
files /dev/null and b/docusaurus/observability-best-practices/docs/images/emf_extracted_metrics.png differ diff --git a/docusaurus/observability-best-practices/docs/images/fluent-arch.png b/docusaurus/observability-best-practices/docs/images/fluent-arch.png new file mode 100644 index 000000000..8edd0f667 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/fluent-arch.png differ diff --git a/docusaurus/observability-best-practices/docs/images/goldilocks-architecture.png b/docusaurus/observability-best-practices/docs/images/goldilocks-architecture.png new file mode 100644 index 000000000..67eb59655 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/goldilocks-architecture.png differ diff --git a/docusaurus/observability-best-practices/docs/images/goldilocks-dashboard.png b/docusaurus/observability-best-practices/docs/images/goldilocks-dashboard.png new file mode 100644 index 000000000..79d1a5af2 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/goldilocks-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/images/goldilocks-recommendation.png b/docusaurus/observability-best-practices/docs/images/goldilocks-recommendation.png new file mode 100644 index 000000000..025cca8f2 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/goldilocks-recommendation.png differ diff --git a/docusaurus/observability-best-practices/docs/images/grafana-dashboard.png b/docusaurus/observability-best-practices/docs/images/grafana-dashboard.png new file mode 100644 index 000000000..80bb07af7 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/grafana-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/images/grafana-overview.png b/docusaurus/observability-best-practices/docs/images/grafana-overview.png new file mode 100644 index 000000000..454c35382 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/grafana-overview.png differ diff --git a/docusaurus/observability-best-practices/docs/images/graph_widget_metrics.png b/docusaurus/observability-best-practices/docs/images/graph_widget_metrics.png new file mode 100644 index 000000000..4b6d48ea8 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/graph_widget_metrics.png differ diff --git a/docusaurus/observability-best-practices/docs/images/horizontal-annotation.png b/docusaurus/observability-best-practices/docs/images/horizontal-annotation.png new file mode 100644 index 000000000..05f398425 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/horizontal-annotation.png differ diff --git a/docusaurus/observability-best-practices/docs/images/internet_monitor.graffle b/docusaurus/observability-best-practices/docs/images/internet_monitor.graffle new file mode 100644 index 000000000..fd1d348d4 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/internet_monitor.graffle differ diff --git a/docusaurus/observability-best-practices/docs/images/internet_monitor.png b/docusaurus/observability-best-practices/docs/images/internet_monitor.png new file mode 100644 index 000000000..b1bbcdd39 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/internet_monitor.png differ diff --git a/docusaurus/observability-best-practices/docs/images/internet_monitor_2.png b/docusaurus/observability-best-practices/docs/images/internet_monitor_2.png new file mode 
100644 index 000000000..12bfcbcb4 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/internet_monitor_2.png differ diff --git a/docusaurus/observability-best-practices/docs/images/internet_monitor_3.png b/docusaurus/observability-best-practices/docs/images/internet_monitor_3.png new file mode 100644 index 000000000..95e457117 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/internet_monitor_3.png differ diff --git a/docusaurus/observability-best-practices/docs/images/kubecost-architecture.png b/docusaurus/observability-best-practices/docs/images/kubecost-architecture.png new file mode 100644 index 000000000..401b7c7f5 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/kubecost-architecture.png differ diff --git a/docusaurus/observability-best-practices/docs/images/logs-view.png b/docusaurus/observability-best-practices/docs/images/logs-view.png new file mode 100644 index 000000000..fa3a0b154 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/logs-view.png differ diff --git a/docusaurus/observability-best-practices/docs/images/metrics1.png b/docusaurus/observability-best-practices/docs/images/metrics1.png new file mode 100644 index 000000000..359f8931e Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/metrics1.png differ diff --git a/docusaurus/observability-best-practices/docs/images/metrics2.png b/docusaurus/observability-best-practices/docs/images/metrics2.png new file mode 100644 index 000000000..2d297a9c6 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/metrics2.png differ diff --git a/docusaurus/observability-best-practices/docs/images/metrics3.png b/docusaurus/observability-best-practices/docs/images/metrics3.png new file mode 100644 index 000000000..be02c0c95 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/metrics3.png differ diff --git a/docusaurus/observability-best-practices/docs/images/metrics4.png b/docusaurus/observability-best-practices/docs/images/metrics4.png new file mode 100644 index 000000000..e87ceda32 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/metrics4.png differ diff --git a/docusaurus/observability-best-practices/docs/images/o11y-virtuous-cycle.png b/docusaurus/observability-best-practices/docs/images/o11y-virtuous-cycle.png new file mode 100644 index 000000000..5254c2842 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/o11y-virtuous-cycle.png differ diff --git a/docusaurus/observability-best-practices/docs/images/o11y4AIOps.png b/docusaurus/observability-best-practices/docs/images/o11y4AIOps.png new file mode 100644 index 000000000..1238e6e76 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/o11y4AIOps.png differ diff --git a/docusaurus/observability-best-practices/docs/images/pattern_analysis.png b/docusaurus/observability-best-practices/docs/images/pattern_analysis.png new file mode 100644 index 000000000..cbd1b0607 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/pattern_analysis.png differ diff --git a/docusaurus/observability-best-practices/docs/images/percentiles-average.png b/docusaurus/observability-best-practices/docs/images/percentiles-average.png new file mode 100644 index 000000000..d5491dbef Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/percentiles-average.png differ diff 
--git a/docusaurus/observability-best-practices/docs/images/percentiles-comparison.png b/docusaurus/observability-best-practices/docs/images/percentiles-comparison.png new file mode 100644 index 000000000..3544e9a87 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/percentiles-comparison.png differ diff --git a/docusaurus/observability-best-practices/docs/images/percentiles-histogram.png b/docusaurus/observability-best-practices/docs/images/percentiles-histogram.png new file mode 100644 index 000000000..ab17dad78 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/percentiles-histogram.png differ diff --git a/docusaurus/observability-best-practices/docs/images/percentiles-p99.png b/docusaurus/observability-best-practices/docs/images/percentiles-p99.png new file mode 100644 index 000000000..071c8bb69 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/percentiles-p99.png differ diff --git a/docusaurus/observability-best-practices/docs/images/prom-metrics.png b/docusaurus/observability-best-practices/docs/images/prom-metrics.png new file mode 100644 index 000000000..512e9edb1 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/prom-metrics.png differ diff --git a/docusaurus/observability-best-practices/docs/images/prometheus-cost.png b/docusaurus/observability-best-practices/docs/images/prometheus-cost.png new file mode 100644 index 000000000..2d587b42d Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/prometheus-cost.png differ diff --git a/docusaurus/observability-best-practices/docs/images/right-sizing.png b/docusaurus/observability-best-practices/docs/images/right-sizing.png new file mode 100644 index 000000000..65deb37c1 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/right-sizing.png differ diff --git a/docusaurus/observability-best-practices/docs/images/rum1.png b/docusaurus/observability-best-practices/docs/images/rum1.png new file mode 100644 index 000000000..f068d804f Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/rum1.png differ diff --git a/docusaurus/observability-best-practices/docs/images/rum2.png b/docusaurus/observability-best-practices/docs/images/rum2.png new file mode 100644 index 000000000..f6d1f89bc Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/rum2.png differ diff --git a/docusaurus/observability-best-practices/docs/images/savings.png b/docusaurus/observability-best-practices/docs/images/savings.png new file mode 100644 index 000000000..e46e8a7ba Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/savings.png differ diff --git a/docusaurus/observability-best-practices/docs/images/service-map-trace.png b/docusaurus/observability-best-practices/docs/images/service-map-trace.png new file mode 100644 index 000000000..d74c0b1c0 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/service-map-trace.png differ diff --git a/docusaurus/observability-best-practices/docs/images/slo.png b/docusaurus/observability-best-practices/docs/images/slo.png new file mode 100644 index 000000000..b8fe313d1 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/slo.png differ diff --git a/docusaurus/observability-best-practices/docs/images/sns-alert.png b/docusaurus/observability-best-practices/docs/images/sns-alert.png new file mode 100644 index 
000000000..5c131b2c6 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/sns-alert.png differ diff --git a/docusaurus/observability-best-practices/docs/images/synthetics0.png b/docusaurus/observability-best-practices/docs/images/synthetics0.png new file mode 100644 index 000000000..9e48c6e7b Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/synthetics0.png differ diff --git a/docusaurus/observability-best-practices/docs/images/synthetics1.png b/docusaurus/observability-best-practices/docs/images/synthetics1.png new file mode 100644 index 000000000..31c4b4d1a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/synthetics1.png differ diff --git a/docusaurus/observability-best-practices/docs/images/synthetics2.png b/docusaurus/observability-best-practices/docs/images/synthetics2.png new file mode 100644 index 000000000..6ff877b0b Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/synthetics2.png differ diff --git a/docusaurus/observability-best-practices/docs/images/synthetics3.png b/docusaurus/observability-best-practices/docs/images/synthetics3.png new file mode 100644 index 000000000..98a2d1bf3 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/synthetics3.png differ diff --git a/docusaurus/observability-best-practices/docs/images/three-pillars.png b/docusaurus/observability-best-practices/docs/images/three-pillars.png new file mode 100644 index 000000000..43a5d5dfe Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/three-pillars.png differ diff --git a/docusaurus/observability-best-practices/docs/images/throttled-period.png b/docusaurus/observability-best-practices/docs/images/throttled-period.png new file mode 100644 index 000000000..bd231e3c1 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/throttled-period.png differ diff --git a/docusaurus/observability-best-practices/docs/images/time_series.png b/docusaurus/observability-best-practices/docs/images/time_series.png new file mode 100644 index 000000000..16a00704c Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/time_series.png differ diff --git a/docusaurus/observability-best-practices/docs/images/vs.jpg b/docusaurus/observability-best-practices/docs/images/vs.jpg new file mode 100644 index 000000000..6be6bd232 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/vs.jpg differ diff --git a/docusaurus/observability-best-practices/docs/images/waterfall-trace.png b/docusaurus/observability-best-practices/docs/images/waterfall-trace.png new file mode 100644 index 000000000..517c359e9 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/waterfall-trace.png differ diff --git a/docusaurus/observability-best-practices/docs/images/widget_alarms.png b/docusaurus/observability-best-practices/docs/images/widget_alarms.png new file mode 100644 index 000000000..31f7e7beb Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/widget_alarms.png differ diff --git a/docusaurus/observability-best-practices/docs/images/widget_logs_1.png b/docusaurus/observability-best-practices/docs/images/widget_logs_1.png new file mode 100644 index 000000000..5773c5893 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/widget_logs_1.png differ diff --git 
a/docusaurus/observability-best-practices/docs/images/widget_logs_2.png b/docusaurus/observability-best-practices/docs/images/widget_logs_2.png new file mode 100644 index 000000000..8e58c898c Binary files /dev/null and b/docusaurus/observability-best-practices/docs/images/widget_logs_2.png differ diff --git a/docusaurus/observability-best-practices/docs/intro.md b/docusaurus/observability-best-practices/docs/intro.md new file mode 100644 index 000000000..45e8604c8 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/intro.md @@ -0,0 +1,47 @@ +--- +sidebar_position: 1 +--- + +# Tutorial Intro + +Let's discover **Docusaurus in less than 5 minutes**. + +## Getting Started + +Get started by **creating a new site**. + +Or **try Docusaurus immediately** with **[docusaurus.new](https://docusaurus.new)**. + +### What you'll need + +- [Node.js](https://nodejs.org/en/download/) version 18.0 or above: + - When installing Node.js, you are recommended to check all checkboxes related to dependencies. + +## Generate a new site + +Generate a new Docusaurus site using the **classic template**. + +The classic template will automatically be added to your project after you run the command: + +```bash +npm init docusaurus@latest my-website classic +``` + +You can type this command into Command Prompt, Powershell, Terminal, or any other integrated terminal of your code editor. + +The command also installs all necessary dependencies you need to run Docusaurus. + +## Start your site + +Run the development server: + +```bash +cd my-website +npm run start +``` + +The `cd` command changes the directory you're working with. In order to work with your newly created Docusaurus site, you'll need to navigate the terminal there. + +The `npm run start` command builds your website locally and serves it through a development server, ready for you to view at http://localhost:3000/. + +Open `docs/intro.md` (this page) and edit some lines: the site **reloads automatically** and displays your changes. diff --git a/docusaurus/observability-best-practices/docs/patterns/Tracing/xrayec2.md b/docusaurus/observability-best-practices/docs/patterns/Tracing/xrayec2.md new file mode 100644 index 000000000..3728f2806 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/Tracing/xrayec2.md @@ -0,0 +1,39 @@ +# EC2 Tracing with AWS X-Ray + + +In the world of cloud computing, Amazon Elastic Compute Cloud (EC2) provides a highly scalable and flexible platform for running a wide range of applications. However, as applications become more distributed and complex, observability becomes crucial for ensuring the reliability, performance, and efficiency of these applications. + +AWS X-Ray addresses this challenge by offering a powerful distributed tracing service that enhances observability for applications running on EC2 instances. By integrating AWS X-Ray with your EC2-hosted applications, you can unlock a range of benefits and capabilities that enable you to gain deeper insights into your application's behavior and performance: + +1. **End-to-End Visibility**: AWS X-Ray traces requests as they flow through your applications running on EC2 instances and other AWS services, providing an end-to-end view of the complete lifecycle of a request. This visibility helps you understand the interactions between different components and identify potential bottlenecks or issues more effectively. + +2. 
**Performance Analysis**: X-Ray collects detailed performance metrics, such as request latencies, error rates, and resource utilization, for your EC2-hosted applications. These metrics allow you to analyze the performance of your applications, identify performance hotspots, and optimize resource allocation. + +3. **Distributed Tracing**: In modern distributed architectures, requests often traverse multiple services and components. AWS X-Ray provides a unified view of these distributed traces, enabling you to understand the interactions between different components and correlate performance data across your entire application. + +4. **Service Map Visualization**: X-Ray generates dynamic service maps that provide a visual representation of your application's components and their interactions. These service maps help you understand the complexity of your application architecture and identify potential areas for optimization or refactoring. + +5. **Integration with AWS Services**: AWS X-Ray seamlessly integrates with a wide range of AWS services, including AWS Lambda, API Gateway, Amazon ECS, and Amazon EKS. This integration allows you to trace requests across multiple services and correlate performance data with logs and metrics from other AWS services. + +6. **Custom Instrumentation**: While AWS X-Ray provides out-of-the-box instrumentation for many AWS services, you can also instrument your custom applications and services using the AWS X-Ray SDKs. This capability enables you to trace and analyze the performance of your custom code within your EC2-hosted applications, providing a more comprehensive view of your application's behavior. + +![EC2 X-Ray](../images/xrayec2.png) +*Figure 1: Applications running on EC2 sending traces to X-Ray* + +To leverage AWS X-Ray for enhanced observability of your EC2-hosted applications, you'll need to follow these general steps: + +1. **Instrument Custom Applications**: Use the AWS X-Ray SDKs to instrument your applications running on EC2 instances and emit trace data to X-Ray (a minimal example appears later on this page). + +2. **Deploy Instrumented Applications**: Deploy your instrumented applications to your EC2 instances. + +3. **Analyze Trace Data**: Use the AWS X-Ray console or APIs to analyze trace data, view service maps, and investigate performance issues or bottlenecks within your EC2-hosted applications. + +4. **Set Up Alerts and Notifications**: Configure CloudWatch alarms and notifications based on X-Ray metrics to receive alerts for performance degradation or anomalies in your EC2-hosted applications. + +5. **Integrate with Other Observability Tools**: Combine AWS X-Ray with other observability tools, such as AWS CloudWatch Logs, Amazon CloudWatch Metrics, and AWS Distro for OpenTelemetry, to gain a comprehensive view of your applications' performance, logs, and metrics. + +While AWS X-Ray provides powerful tracing capabilities for EC2-hosted applications, it's important to consider potential challenges such as trace data volume and cost management. As your applications scale and generate more trace data, you may need to implement sampling strategies or adjust trace data retention policies to manage costs effectively. + +Additionally, ensuring proper access control and data security for your trace data is crucial. AWS X-Ray provides encryption for trace data at rest and in transit, as well as granular access control mechanisms to protect the confidentiality and integrity of your trace data. 
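To make step 1 of the list above concrete, here is a minimal sketch of instrumenting a small Flask service with the X-Ray SDK for Python. It assumes the `aws-xray-sdk` and `flask` packages and an X-Ray daemon running on the instance at its default address (127.0.0.1:2000); the service name `sample-ec2-app` and the route are illustrative placeholders, not part of any published example.

```python
# Minimal sketch: instrumenting a Flask app on an EC2 instance with the
# X-Ray SDK for Python. Assumes an X-Ray daemon is running locally
# (default listener 127.0.0.1:2000); names are placeholders.
from flask import Flask
from aws_xray_sdk.core import xray_recorder, patch_all
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware

app = Flask(__name__)

# Name the service as it should appear on the X-Ray service map.
xray_recorder.configure(service="sample-ec2-app")

# Open and close a segment for every incoming HTTP request.
XRayMiddleware(app, xray_recorder)

# Patch supported libraries (boto3, requests, ...) so downstream calls
# show up as subsegments.
patch_all()

@app.route("/")
def index():
    return "Hello from an X-Ray instrumented EC2 application"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Because the SDK hands segments to the local daemon rather than calling the X-Ray API directly, the instance profile typically only needs the `xray:PutTraceSegments` and `xray:PutTelemetryRecords` permissions.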
+ +In conclusion, integrating AWS X-Ray with your applications running on EC2 instances is a powerful approach to enhancing observability for cloud-based applications. By tracing requests end-to-end and providing detailed performance metrics, AWS X-Ray empowers you to identify and troubleshoot issues more effectively, optimize resource utilization, and gain deeper insights into the behavior and performance of your applications. With the integration of AWS X-Ray and other AWS observability services, you can build and maintain highly observable, reliable, and performant applications in the cloud. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/Tracing/xrayecs.md b/docusaurus/observability-best-practices/docs/patterns/Tracing/xrayecs.md new file mode 100644 index 000000000..34337f750 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/Tracing/xrayecs.md @@ -0,0 +1,38 @@ +# ECS Tracing with AWS X-Ray + +In the world of modern application development, containerization has become the de facto standard for deploying and managing applications. Amazon Elastic Container Service (ECS) provides a highly scalable and reliable platform for deploying and managing containerized applications. However, as applications become more distributed and complex, observability becomes crucial for ensuring the reliability, performance, and efficiency of these applications. + +AWS X-Ray addresses this challenge by offering a powerful distributed tracing service that enhances observability for containerized applications running on ECS. By integrating AWS X-Ray with your ECS workloads, you can unlock a range of benefits and capabilities that enable you to gain deeper insights into your application's behavior and performance: + +1. **End-to-End Visibility**: AWS X-Ray traces requests as they flow through your containerized applications and other AWS services, providing an end-to-end view of the complete lifecycle of a request. This visibility helps you understand the interactions between different microservices and identify potential bottlenecks or issues more effectively. + +2. **Performance Analysis**: X-Ray collects detailed performance metrics, such as request latencies, error rates, and resource utilization, for your containerized applications. These metrics allow you to analyze the performance of your applications, identify performance hotspots, and optimize resource allocation. + +3. **Distributed Tracing**: In modern microservices architectures, requests often traverse multiple containers and services. AWS X-Ray provides a unified view of these distributed traces, enabling you to understand the interactions between different components and correlate performance data across your entire application. + +4. **Service Map Visualization**: X-Ray generates dynamic service maps that provide a visual representation of your application's components and their interactions. These service maps help you understand the complexity of your microservices architecture and identify potential areas for optimization or refactoring. + +5. **Integration with AWS Services**: AWS X-Ray seamlessly integrates with a wide range of AWS services, including AWS Lambda, API Gateway, Amazon ECS, and Amazon EKS. This integration allows you to trace requests across multiple services and correlate performance data with logs and metrics from other AWS services. + +6. 
**Custom Instrumentation**: While AWS X-Ray provides out-of-the-box instrumentation for many AWS services, you can also instrument your custom applications and services using the AWS X-Ray SDKs. This capability enables you to trace and analyze the performance of your custom code within your containerized applications, providing a more comprehensive view of your application's behavior. + +![ECS Tracing](../images/xrayecs.png) +*Figure 1: Sending traces from ECS to X-Ray* + +To leverage AWS X-Ray for enhanced observability of your ECS workloads, you'll need to follow these general steps: + +1. **Instrument Custom Applications**: Use the AWS X-Ray SDKs to instrument your containerized applications and emit trace data to X-Ray. + +2. **Deploy Instrumented Applications**: Deploy your instrumented containerized applications to your Amazon ECS cluster or service. + +3. **Analyze Trace Data**: Use the AWS X-Ray console or APIs to analyze trace data, view service maps, and investigate performance issues or bottlenecks within your containerized applications. + +4. **Set Up Alerts and Notifications**: Configure CloudWatch alarms and notifications based on X-Ray metrics to receive alerts for performance degradation or anomalies in your ECS workloads. + +5. **Integrate with Other Observability Tools**: Combine AWS X-Ray with other observability tools, such as AWS CloudWatch Logs, Amazon CloudWatch Metrics, and AWS Distro for OpenTelemetry, to gain a comprehensive view of your containerized applications' performance, logs, and metrics. + +While AWS X-Ray provides powerful tracing capabilities for ECS workloads, it's important to consider potential challenges such as trace data volume and cost management. As your containerized applications scale and generate more trace data, you may need to implement sampling strategies or adjust trace data retention policies to manage costs effectively. + +Additionally, ensuring proper access control and data security for your trace data is crucial. AWS X-Ray provides encryption for trace data at rest and in transit, as well as granular access control mechanisms to protect the confidentiality and integrity of your trace data. + +In conclusion, integrating AWS X-Ray with your Amazon ECS workloads is a powerful approach to enhancing observability for containerized applications. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/Tracing/xrayeks.md b/docusaurus/observability-best-practices/docs/patterns/Tracing/xrayeks.md new file mode 100644 index 000000000..e37a45947 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/Tracing/xrayeks.md @@ -0,0 +1,39 @@ +# EKS Tracing with AWS X-Ray + +In the world of modern application development, containerization has become the de facto standard for deploying and managing applications. Amazon Elastic Kubernetes Service (EKS) provides a robust and scalable platform for deploying and managing containerized applications using Kubernetes. However, as applications become more distributed and complex, observability becomes crucial for ensuring the reliability, performance, and efficiency of these applications. + +AWS X-Ray addresses this challenge by offering a powerful distributed tracing service that enhances observability for containerized applications running on EKS. By integrating AWS X-Ray with your EKS workloads, you can unlock a range of benefits and capabilities that enable you to gain deeper insights into your application's behavior and performance: + +1. 
**End-to-End Visibility**: AWS X-Ray traces requests as they flow through your containerized applications and other AWS services, providing an end-to-end view of the complete lifecycle of a request. This visibility helps you understand the interactions between different microservices and identify potential bottlenecks or issues more effectively. + +2. **Performance Analysis**: X-Ray collects detailed performance metrics, such as request latencies, error rates, and resource utilization, for your containerized applications. These metrics allow you to analyze the performance of your applications, identify performance hotspots, and optimize resource allocation. + +3. **Distributed Tracing**: In modern microservices architectures, requests often traverse multiple containers and services. AWS X-Ray provides a unified view of these distributed traces, enabling you to understand the interactions between different components and correlate performance data across your entire application. + +4. **Service Map Visualization**: X-Ray generates dynamic service maps that provide a visual representation of your application's components and their interactions. These service maps help you understand the complexity of your microservices architecture and identify potential areas for optimization or refactoring. + +5. **Integration with AWS Services**: AWS X-Ray seamlessly integrates with a wide range of AWS services, including AWS Lambda, API Gateway, Amazon EKS, and Amazon ECS. This integration allows you to trace requests across multiple services and correlate performance data with logs and metrics from other AWS services. + +6. **Custom Instrumentation**: While AWS X-Ray provides out-of-the-box instrumentation for many AWS services, you can also instrument your custom applications and services using the AWS X-Ray SDKs. This capability enables you to trace and analyze the performance of your custom code within your containerized applications, providing a more comprehensive view of your application's behavior. + +![EKS Tracing](../images/xrayeks.png) +*Figure 1: Sending traces from EKS to X-Ray* + + +To leverage AWS X-Ray for enhanced observability of your EKS workloads, you'll need to follow these general steps: + +1. **Instrument Custom Applications**: Use the AWS X-Ray SDKs to instrument your containerized applications and emit trace data to X-Ray. + +2. **Deploy Instrumented Applications**: Deploy your instrumented containerized applications to your Amazon EKS cluster. + +3. **Analyze Trace Data**: Use the AWS X-Ray console or APIs to analyze trace data, view service maps, and investigate performance issues or bottlenecks within your containerized applications. + +4. **Set Up Alerts and Notifications**: Configure CloudWatch alarms and notifications based on X-Ray metrics to receive alerts for performance degradation or anomalies in your EKS workloads. + +5. **Integrate with Other Observability Tools**: Combine AWS X-Ray with other observability tools, such as AWS CloudWatch Logs, Amazon CloudWatch Metrics, and AWS Distro for OpenTelemetry, to gain a comprehensive view of your containerized applications' performance, logs, and metrics. + +While AWS X-Ray provides powerful tracing capabilities for EKS workloads, it's important to consider potential challenges such as trace data volume and cost management. As your containerized applications scale and generate more trace data, you may need to implement sampling strategies or adjust trace data retention policies to manage costs effectively. 
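Returning to step 1 of the list above: a containerized worker that is not behind a web framework has to open and close segments itself. A minimal sketch in Python with the `aws-xray-sdk` package, assuming an X-Ray daemon (or an ADOT collector exposing the same UDP interface) is reachable through a cluster Service; the Service name `xray-service.default`, the worker name, and the traced function are hypothetical placeholders.

```python
# Minimal sketch: emitting a custom subsegment from a worker pod on EKS.
# Assumes an X-Ray daemon (or compatible collector) reachable in-cluster;
# all names below are placeholders.
import time

from aws_xray_sdk.core import xray_recorder

xray_recorder.configure(
    service="sample-eks-worker",
    daemon_address="xray-service.default:2000",  # hypothetical in-cluster Service
)

@xray_recorder.capture("process_order")
def process_order(order_id: str) -> None:
    # Annotations are indexed by X-Ray, so traces can later be filtered by order_id.
    xray_recorder.current_subsegment().put_annotation("order_id", order_id)
    time.sleep(0.1)  # placeholder for real work

if __name__ == "__main__":
    # Outside a web framework the segment must be managed explicitly.
    xray_recorder.begin_segment("sample-eks-worker")
    process_order("42")
    xray_recorder.end_segment()
```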
+ +Additionally, ensuring proper access control and data security for your trace data is crucial. AWS X-Ray provides encryption for trace data at rest and in transit, as well as granular access control mechanisms to protect the confidentiality and integrity of your trace data. + +In conclusion, integrating AWS X-Ray with your Amazon EKS workloads is a powerful approach to enhancing observability for containerized applications. By tracing requests end-to-end and providing detailed performance metrics, AWS X-Ray empowers you to identify and troubleshoot issues more effectively, optimize resource utilization, and gain deeper insights into the behavior and performance of your containerized applications. With the integration of AWS X-Ray and other AWS observability services, you can build and maintain highly observable, reliable, and performant containerized applications in the cloud. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/Tracing/xraylambda.md b/docusaurus/observability-best-practices/docs/patterns/Tracing/xraylambda.md new file mode 100644 index 000000000..3b33d0aca --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/Tracing/xraylambda.md @@ -0,0 +1,38 @@ +# Lambda Tracing with AWS X-Ray + +In the world of serverless computing, observability is crucial for ensuring the reliability, performance, and efficiency of your applications. AWS Lambda, the cornerstone of serverless architectures, provides a powerful and scalable platform for running event-driven code without the need to manage underlying infrastructure. However, as applications become more distributed and complex, traditional logging and monitoring techniques often fall short in providing a comprehensive view of the end-to-end request flow and performance. + +AWS X-Ray addresses this challenge by offering a powerful distributed tracing service that enhances observability for serverless applications built with AWS Lambda. By integrating AWS X-Ray with your Lambda functions, you can unlock a range of benefits and capabilities that enable you to gain deeper insights into your application's behavior and performance: + +1. **End-to-End Visibility**: AWS X-Ray traces requests as they flow through your Lambda functions and other AWS services, providing an end-to-end view of the complete lifecycle of a request. This visibility helps you understand the interactions between different components and identify potential bottlenecks or issues more effectively. + +2. **Performance Analysis**: X-Ray collects detailed performance metrics, such as execution times, cold start latencies, and error rates, for your Lambda functions. These metrics allow you to analyze the performance of your serverless applications, identify performance hotspots, and optimize resource utilization. + +3. **Distributed Tracing**: In serverless architectures, requests often traverse multiple Lambda functions and other AWS services. AWS X-Ray provides a unified view of these distributed traces, enabling you to understand the interactions between different components and correlate performance data across your entire application. + +4. **Service Map Visualization**: X-Ray generates dynamic service maps that provide a visual representation of your application's components and their interactions. These service maps help you understand the complexity of your serverless architecture and identify potential areas for optimization or refactoring. + +5. 
**Integration with AWS Services**: AWS X-Ray seamlessly integrates with a wide range of AWS services, including AWS Lambda, API Gateway, Amazon DynamoDB, and Amazon SQS. This integration allows you to trace requests across multiple services and correlate performance data with logs and metrics from other AWS services. + +6. **Custom Instrumentation**: While AWS X-Ray provides out-of-the-box instrumentation for AWS Lambda functions, you can also instrument your custom code within Lambda functions using the AWS X-Ray SDKs. This capability enables you to trace and analyze the performance of your custom logic, providing a more comprehensive view of your application's behavior. + +![Lambda X-Ray](../images/xraylambda.png) +*Figure 1: Sending traces from Lambda to X-Ray* + +To leverage AWS X-Ray for enhanced observability of your Lambda functions, you'll need to follow these general steps: + +1. **Enable X-Ray Tracing**: Configure your AWS Lambda functions to enable active tracing by updating the function configuration or using the AWS Lambda console or AWS Serverless Application Model (SAM); a scripted example appears at the end of this page. + +2. **Instrument Custom Code (Optional)**: If you have custom code within your Lambda functions, you can use the AWS X-Ray SDKs to instrument your code and emit additional trace data to X-Ray. + +3. **Analyze Trace Data**: Use the AWS X-Ray console or APIs to analyze trace data, view service maps, and investigate performance issues or bottlenecks within your Lambda functions and serverless applications. + +4. **Set Up Alerts and Notifications**: Configure CloudWatch alarms and notifications based on X-Ray metrics to receive alerts for performance degradation or anomalies in your Lambda functions. + +5. **Integrate with Other Observability Tools**: Combine AWS X-Ray with other observability tools, such as AWS CloudWatch Logs, Amazon CloudWatch Metrics, and AWS Lambda Insights, to gain a comprehensive view of your Lambda functions' performance, logs, and metrics. + +While AWS X-Ray provides powerful tracing capabilities for Lambda functions, it's important to consider potential challenges such as trace data volume and cost management. As your serverless applications scale and generate more trace data, you may need to implement sampling strategies or adjust trace data retention policies to manage costs effectively. + +Additionally, ensuring proper access control and data security for your trace data is crucial. AWS X-Ray provides encryption for trace data at rest and in transit, as well as granular access control mechanisms to protect the confidentiality and integrity of your trace data. + +In conclusion, integrating AWS X-Ray with your AWS Lambda functions is a powerful approach to enhancing observability for serverless applications. By tracing requests end-to-end and providing detailed performance metrics, AWS X-Ray empowers you to identify and troubleshoot issues more effectively, optimize resource utilization, and gain deeper insights into the behavior and performance of your serverless applications. With the integration of AWS X-Ray and other AWS observability services, you can build and maintain highly observable, reliable, and performant serverless applications in the cloud. 
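Step 1 of the list above can also be scripted rather than done in the console. A minimal sketch with boto3, assuming credentials and a region are configured in the environment and that `sample-function` stands in for a real function name:

```python
# Minimal sketch: enable active tracing on an existing Lambda function.
# Equivalent to the console toggle or `Tracing: Active` in a SAM template.
import boto3

lambda_client = boto3.client("lambda")

response = lambda_client.update_function_configuration(
    FunctionName="sample-function",  # placeholder
    TracingConfig={"Mode": "Active"},
)
print(response["TracingConfig"])  # expected: {'Mode': 'Active'}
```

For step 2, calling `patch_all()` from the `aws-xray-sdk` package at the top of the function code is usually enough to surface downstream AWS SDK and HTTP calls as subsegments under the segment Lambda creates for each invocation.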
\ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/adoteksfargate.md b/docusaurus/observability-best-practices/docs/patterns/adoteksfargate.md new file mode 100644 index 000000000..2d192fb6b --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/adoteksfargate.md @@ -0,0 +1,49 @@ +# CloudWatch Container Insights + +## Introduction + +Amazon CloudWatch Container Insights is a powerful tool for collecting, aggregating, and summarizing metrics and logs from containerized applications and microservices. This document provides an overview of the integration between ADOT and CloudWatch Container Insights for EKS Fargate workloads, including its design, deployment process, and benefits. + +## ADOT Collector Design for EKS Fargate + +The ADOT Collector uses a pipeline architecture consisting of three main components: + +1. Receiver: Accepts data in a specified format and translates it into an internal format. +2. Processor: Performs tasks such as batching, filtering, and transformations on the data. +3. Exporter: Determines the destination for sending metrics, logs, or traces. + +For EKS Fargate, the ADOT Collector uses a Prometheus Receiver to scrape metrics from the Kubernetes API server, which acts as a proxy for the kubelet on worker nodes. This approach is necessary due to the networking limitations in EKS Fargate that prevent direct access to the kubelet. The collected metrics go through a series of processors for filtering, renaming, data aggregation, and conversion. Finally, the AWS CloudWatch EMF Exporter converts the metrics to the embedded metric format (EMF) and sends them to CloudWatch Logs. + +![CI EKS fargate with ADOT](./images/cieksfargateadot.png) +*Figure 1: Container Insights with ADOT on EKS Fargate* + +## Deployment Process + +To deploy the ADOT Collector on an EKS Fargate cluster: + +1. Create an EKS cluster with Kubernetes +2. Set up a Fargate pod execution role. +3. Define Fargate profiles for the necessary namespaces. +4. Create an IAM role for the ADOT Collector with the required permissions. +5. Deploy the ADOT Collector as a Kubernetes StatefulSet using the provided manifest. +6. Deploy sample workloads to test the metrics collection. + + +## Pros and Cons + +### Pros: + +1. Unified Monitoring: Provides a consistent monitoring experience across EKS EC2 and Fargate workloads. +2. Scalability: A single ADOT Collector instance can discover and collect metrics from all worker nodes in an EKS cluster. +3. Rich Metrics: Collects a comprehensive set of system metrics, including CPU, memory, disk, and network usage. +4. Easy Integration: Seamlessly integrates with existing CloudWatch dashboards and alarms. +5. Cost-Effective: Enables monitoring of Fargate workloads without the need for additional monitoring infrastructure. + +### Cons: + +1. Configuration Complexity: Setting up the ADOT Collector requires careful configuration of IAM roles, Fargate profiles, and Kubernetes resources. +2. Resource Overhead: The ADOT Collector itself consumes resources on the Fargate cluster, which needs to be accounted for in capacity planning. + + +The integration of AWS Distro for OpenTelemetry with CloudWatch Container Insights for EKS Fargate workloads provides a powerful solution for monitoring containerized applications. It offers a unified monitoring experience across different EKS deployment options and leverages the scalability and flexibility of the OpenTelemetry framework. 
By enabling the collection of system metrics from Fargate workloads, this integration allows customers to gain deeper insights into their application performance, make informed scaling decisions, and optimize resource utilization. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/ampagentless.md b/docusaurus/observability-best-practices/docs/patterns/ampagentless.md new file mode 100644 index 000000000..81e7ec50c --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/ampagentless.md @@ -0,0 +1,50 @@ +# Pushing Metrics from EKS to Prometheus + +When running containerized workloads on Amazon Elastic Kubernetes Service (EKS), you can leverage Amazon Managed Service for Prometheus (AMP) to collect and analyze metrics from your applications and infrastructure. AMP is a fully managed, Prometheus-compatible monitoring solution, so you avoid deploying, managing, and scaling Prometheus servers yourself. + +To push metrics from your EKS containerized workloads to AMP, you can use the Managed Prometheus Collector configuration. The Managed Prometheus Collector is a component of AMP that scrapes metrics from your applications and services and sends them to the AMP workspace for storage and analysis. + +![EKS AMP](./images/eksamp.png) +*Figure 1: Sending metrics from EKS to AMP* + +## Configuring Managed Prometheus Collector + +1. **Enable AMP Workspace**: First, ensure that you have an AMP workspace created in your AWS account. If you haven't set up an AMP workspace yet, follow the AWS documentation to create one. + +2. **Configure Managed Prometheus Collector**: Within your AMP workspace, navigate to the "Managed Prometheus Collectors" section and create a new collector configuration. + +3. **Define Scrape Configuration**: In the collector configuration, specify the targets from which the collector should scrape metrics. For EKS workloads, you can define a Kubernetes service discovery configuration that allows the collector to dynamically discover and scrape metrics from your Kubernetes Pods and Services. + + Example Kubernetes service discovery configuration: + + ```yaml + kubernetes_sd_configs: + - role: pod + namespaces: + names: + - namespace1 + - namespace2 +``` +This configuration instructs the collector to scrape metrics from Pods running in the namespace1 and namespace2 Kubernetes namespaces. + +4. **Configure Prometheus Annotations**: To enable metric collection from your containerized workloads, you need to annotate your Kubernetes Pods or Services with the appropriate Prometheus annotations. These annotations provide information about the metrics endpoint and other configuration settings. +Example Prometheus annotations: +```yaml +annotations: + prometheus.io/scrape: "true" + prometheus.io/port: "8080" + prometheus.io/path: "/metrics" +``` +These annotations indicate that the Prometheus collector should scrape metrics from the /metrics endpoint on port 8080 of the Pod or Service. + +5. **Deploy Workloads with Instrumentation**: Deploy your containerized workloads to EKS, ensuring that they expose the appropriate metrics endpoints and include the necessary Prometheus annotations. You can use tools like kubectl, Helm, or the AWS Cloud Development Kit (CDK) to deploy and manage your EKS workloads. + +6. **Verify Metric Collection**: Once the Managed Prometheus Collector is configured and your workloads are deployed, you should see the collected metrics appearing in the AMP workspace. 
You can use the AMP query editor to explore and visualize the metrics from your EKS workloads. + +## Additional Considerations + +- Authentication and Authorization: AMP supports various authentication and authorization mechanisms, including IAM roles and service accounts, to secure access to your monitoring data. + +- Integration with AWS Observability Services: You can integrate AMP with other AWS observability services, such as Amazon CloudWatch and AWS X-Ray, for comprehensive observability across your AWS environment. + +By leveraging the Managed Prometheus Collector in AMP, you can efficiently collect and analyze metrics from your EKS containerized workloads without the need to manage and scale the underlying Prometheus infrastructure. AMP provides a fully managed and scalable solution for monitoring your EKS applications and infrastructure. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/apmappsignals.md b/docusaurus/observability-best-practices/docs/patterns/apmappsignals.md new file mode 100644 index 000000000..cc30147e9 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/apmappsignals.md @@ -0,0 +1,40 @@ +# APM with Application Signals + +In the ever-evolving world of modern application development, ensuring optimal performance and meeting service level objectives (SLOs) is crucial for providing a seamless user experience and maintaining business continuity. Amazon CloudWatch Application Signals, an OpenTelemetry (OTel) compatible application performance monitoring (APM) feature, revolutionizes the way organizations monitor and troubleshoot their applications running on AWS. + +CloudWatch Application Signals takes a holistic approach to application performance monitoring by seamlessly correlating telemetry data across multiple sources, including metrics, traces, logs, real-user monitoring, and synthetic monitoring. This integrated approach enables organizations to gain comprehensive insights into their applications' performance, pinpoint root causes of issues, and proactively address potential disruptions. + +One of the key advantages of CloudWatch Application Signals is its automatic instrumentation and tracking capabilities. With no manual effort or custom code required, Application Signals provides a pre-built, standardized dashboard that displays the most critical metrics for application performance – volume, availability, latency, faults, and errors – for each application running on AWS. This streamlined approach eliminates the need for custom dashboards, enabling service operators to quickly assess application health and performance against their defined SLOs. + +![APM](./images/apm.png) +*Figure 1: CloudWatch Application Signals sending metrics, logs, and traces* + +CloudWatch Application Signals empowers organizations with the following capabilities: + +1. **Comprehensive Application Performance Monitoring**: Application Signals provides a unified view of application performance, combining insights from metrics, traces, logs, real-user monitoring, and synthetic monitoring. This holistic approach enables organizations to identify performance bottlenecks, pinpoint root causes, and take proactive measures to ensure optimal application performance. + +2. **Automatic Instrumentation and Tracking**: With no manual effort or custom code required, Application Signals automatically instruments and tracks application performance against defined SLOs. 
This streamlined approach reduces the overhead associated with manual instrumentation and configuration, enabling organizations to focus on application development and optimization. + +3. **Standardized Dashboard and Visualization**: Application Signals offers a pre-built, standardized dashboard that displays the most critical metrics for application performance, including volume, availability, latency, faults, and errors. This standardized view enables service operators to quickly assess application health and performance, facilitating informed decision-making and proactive issue resolution. + +4. **Seamless Correlation and Troubleshooting**: By correlating telemetry data across multiple sources, Application Signals simplifies the troubleshooting process. Service operators can seamlessly drill down into correlated traces, logs, and metrics to identify the root cause of performance issues or anomalies, reducing the mean time to resolution (MTTR) and minimizing application disruptions. + +5. **Integration with Container Insights**: For applications running in containerized environments, CloudWatch Application Signals seamlessly integrates with Container Insights, enabling organizations to identify infrastructure-related issues that may impact application performance, such as memory shortages or high CPU utilization on container pods. + +To leverage CloudWatch Application Signals for application performance monitoring, organizations can follow these general steps: + +1. **Enable Application Signals**: Enable CloudWatch Application Signals for your applications running on AWS, either through the AWS Management Console, AWS Command Line Interface (CLI), or programmatically using AWS SDKs. + +2. **Define Service Level Objectives (SLOs)**: Establish and configure the desired SLOs for your applications, such as target availability, maximum latency, or error thresholds, to align with business requirements and customer expectations. + +3. **Monitor and Analyze Performance**: Utilize the pre-built, standardized dashboard provided by Application Signals to monitor application performance against defined SLOs. Analyze metrics, traces, logs, real-user monitoring, and synthetic monitoring data to identify performance issues or anomalies. + +4. **Troubleshoot and Resolve Issues**: Leverage the seamless correlation capabilities of Application Signals to drill down into correlated traces, logs, and metrics, enabling rapid identification and resolution of performance issues or root causes. + +5. **Integrate with Container Insights (if applicable)**: For containerized applications, integrate CloudWatch Application Signals with Container Insights to identify infrastructure-related issues that may impact application performance. + +While CloudWatch Application Signals offers powerful application performance monitoring capabilities, it's important to consider potential challenges such as data volume and cost management. As application complexity and scale increase, the volume of telemetry data generated can grow significantly, potentially impacting performance and incurring additional costs. Implementing data sampling strategies, retention policies, and cost optimization techniques may be necessary to ensure an efficient and cost-effective monitoring solution. + +Additionally, ensuring proper access control and data security for your application performance data is crucial. 
CloudWatch Application Signals leverages AWS Identity and Access Management (IAM) for granular access control, and data encryption is applied to telemetry data at rest and in transit, protecting the confidentiality and integrity of your application performance data. + +In conclusion, CloudWatch Application Signals revolutionizes application performance monitoring for applications running on AWS. By providing automatic instrumentation, standardized dashboards, and seamless correlation of telemetry data, Application Signals empowers organizations to proactively monitor application performance, ensure SLO adherence, and rapidly troubleshoot and resolve performance issues. With its integration capabilities and OpenTelemetry compatibility, CloudWatch Application Signals offers a comprehensive and future-proof solution for application performance monitoring in the cloud. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/ecsampamg.md b/docusaurus/observability-best-practices/docs/patterns/ecsampamg.md new file mode 100644 index 000000000..9b3858a82 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/ecsampamg.md @@ -0,0 +1,54 @@ +# Monitoring ECS Workloads + + +## Introduction + +In the world of containerized applications, effective monitoring is crucial for maintaining reliability and performance. This document outlines an advanced monitoring solution for Amazon Elastic Container Service (ECS) workloads, leveraging AWS Distro for OpenTelemetry (ADOT), AWS X-Ray, and Amazon Managed Service for Prometheus. + +## Architecture Overview + +The monitoring architecture centers around an ECS task that hosts both the application and an ADOT collector. This setup enables comprehensive data collection directly from the application environment. + +![ECS AMP](./images/ecs.png) +*Figure 1: Sending metrics and traces from ECS to AMP and X-Ray* + +## Key Components + +### ECS Task +The ECS task serves as the foundational unit, encapsulating the application and monitoring components. + +### Sample Application +A containerized application runs within the ECS task, representing the workload to be monitored. + +### AWS Distro for OpenTelemetry (ADOT) Collector +The ADOT collector, deployed alongside the application, acts as a central aggregation point for telemetry data. It collects both metrics and traces from the application. + +### AWS X-Ray +X-Ray receives trace data from the ADOT collector, providing detailed insights into request flows and service dependencies. + +### Amazon Managed Service for Prometheus +This service stores and manages the metrics collected by the ADOT collector, offering a scalable solution for metric storage and querying. + +## Data Flow + +1. The sample application generates telemetry data during its operation. +2. The ADOT collector, running in the same ECS task, collects this data from the application (see the sketch below). +3. Trace data is forwarded to AWS X-Ray for distributed tracing analysis. +4. Metrics are sent to Amazon Managed Service for Prometheus for storage and later analysis. + +## Benefits + +- **Comprehensive Monitoring**: Captures both metrics and traces, providing a holistic view of application performance. +- **Scalability**: Leverages managed services to handle large volumes of telemetry data. +- **Integration**: Seamlessly works with ECS and other AWS services. +- **Reduced Operational Overhead**: Utilizes managed services, minimizing the need for infrastructure management. 
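The data flow above begins inside the application container, which hands telemetry to the ADOT collector in the same task. A minimal sketch of that first hop with the OpenTelemetry SDK for Python, assuming the `opentelemetry-sdk`, `opentelemetry-exporter-otlp-proto-grpc`, and `opentelemetry-sdk-extension-aws` packages and a collector sidecar listening on `localhost:4317`; the service name and span name are placeholders.

```python
# Minimal sketch: export spans over OTLP/gRPC to an ADOT collector sidecar
# in the same ECS task. Names and the endpoint are assumptions for illustration.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.extension.aws.trace import AwsXRayIdGenerator

provider = TracerProvider(
    resource=Resource.create({"service.name": "sample-ecs-app"}),
    # X-Ray expects trace IDs that embed a recent timestamp.
    id_generator=AwsXRayIdGenerator(),
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("handle-request"):
    pass  # placeholder for real work
```

The collector sidecar then fans the data out, forwarding traces to X-Ray and metrics to Amazon Managed Service for Prometheus as described above.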
+ +## Implementation Considerations + +- Proper IAM roles and permissions must be configured for the ECS task to allow data transmission to X-Ray and Prometheus. +- Resource allocation within the ECS task should account for both the application and the ADOT collector. +- Consider implementing log collection alongside metrics and traces for a complete observability solution. + +## Conclusion + +This architecture provides a robust monitoring solution for ECS workloads, combining the power of OpenTelemetry with AWS managed services. It enables deep insights into application performance and behavior, facilitating quick problem resolution and informed decision-making for containerized environments. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/eksampamg.md b/docusaurus/observability-best-practices/docs/patterns/eksampamg.md new file mode 100644 index 000000000..3febe5764 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/eksampamg.md @@ -0,0 +1,39 @@ +# EKS Monitoring with AWS Open Source Services + +In the world of containerized applications and Kubernetes, monitoring and observability are crucial for ensuring the reliability, performance, and efficiency of your workloads. Amazon Elastic Kubernetes Service (EKS) provides a powerful and scalable platform for deploying and managing containerized applications, and when combined with tools like Node Exporter, Amazon Managed Prometheus, and Grafana, you can unlock a comprehensive monitoring solution for your EKS workloads. + +Node Exporter is a Prometheus exporter that exposes a wide range of hardware and kernel-related metrics from a host machine. By deploying Node Exporter as a DaemonSet in your EKS cluster, you can collect valuable metrics from each worker node, including CPU, memory, disk, and network usage, as well as various system-level metrics. + +Amazon Managed Prometheus is a fully managed service provided by AWS that simplifies the deployment, management, and scaling of Prometheus monitoring infrastructure. By integrating Node Exporter with Amazon Managed Prometheus, you can collect and store node-level metrics in a highly available and scalable manner, without the overhead of managing and scaling Prometheus instances yourself. + +Grafana is a powerful open-source data visualization and monitoring tool that seamlessly integrates with Prometheus. By configuring Grafana to connect to your Amazon Managed Prometheus instance, you can create rich and customizable dashboards that provide real-time insights into the health and performance of your EKS workloads and underlying infrastructure. + +![EKS AMP AMG](./images/eksnodeexporterampamg.png) +*Figure 1: EKS node metrics sent to AMP and visualized with AMG* + + +Deploying this monitoring stack in your EKS cluster offers several benefits: + +1. Comprehensive Visibility: By collecting metrics from Node Exporter and visualizing them in Grafana, you gain end-to-end visibility into your EKS workloads, from the application level down to the underlying infrastructure, enabling you to proactively identify and address issues. + +2. Scalability and Reliability: Amazon Managed Prometheus and Grafana are designed to be highly scalable and reliable, ensuring that your monitoring solution can grow seamlessly as your EKS workloads scale, without compromising performance or availability. + +3. 
Centralized Monitoring: With Amazon Managed Prometheus acting as a centralized monitoring platform, you can consolidate metrics from multiple EKS clusters, enabling you to monitor and compare workloads across different environments or regions. + +4. Custom Dashboards and Alerts: Grafana's powerful dashboard and alerting capabilities allow you to create custom visualizations tailored to your specific monitoring needs, enabling you to surface relevant metrics and set up alerts for critical events or thresholds. + +5. Integration with AWS Services: Amazon Managed Prometheus seamlessly integrates with other AWS services, such as Amazon CloudWatch and AWS X-Ray, enabling you to correlate and visualize metrics from various sources within a unified monitoring solution. + +To implement this monitoring stack in your EKS cluster, you'll need to follow these general steps: + +1. Deploy Node Exporter as a DaemonSet on your EKS worker nodes to collect node-level metrics. +2. Set up an Amazon Managed Prometheus workspace and configure it to scrape metrics from Node Exporter. +3. Install and configure Grafana, either within your EKS cluster or as a separate service, and connect it to your Amazon Managed Prometheus workspace. +4. Create custom Grafana dashboards and configure alerts based on your monitoring requirements. + +While this monitoring solution provides powerful capabilities, it's important to consider the potential overhead and resource consumption introduced by Node Exporter, Prometheus, and Grafana. Careful planning and resource allocation are necessary to ensure that your monitoring components do not compete with your application workloads for resources. + +Additionally, you should ensure that your monitoring solution adheres to best practices for data security, access control, and retention policies. Implementing secure communication channels, authentication mechanisms, and data encryption is crucial to maintain the confidentiality and integrity of your monitoring data. + +In conclusion, deploying Node Exporter, Amazon Managed Prometheus, and Grafana in your EKS cluster provides a comprehensive monitoring solution for your containerized workloads. By leveraging these tools, you can gain deep insights into the performance and health of your applications, enabling proactive issue detection, efficient resource utilization, and informed decision-making. However, it's essential to carefully plan and implement this monitoring stack, considering resource consumption, security, and compliance requirements to ensure an effective and robust monitoring solution for your EKS workloads. 
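As a quick start for steps 1 and 2 in the list above, the sketch below deploys Node Exporter with the community Helm chart and creates an Amazon Managed Service for Prometheus workspace; the release name, namespace, and workspace alias are placeholders:

```
# Step 1: deploy Node Exporter as a DaemonSet via the community Helm chart
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install node-exporter prometheus-community/prometheus-node-exporter \
  --namespace monitoring --create-namespace

# Step 2: create an Amazon Managed Service for Prometheus workspace to receive the metrics
aws amp create-workspace --alias eks-node-metrics
```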
\ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/images/apm.png b/docusaurus/observability-best-practices/docs/patterns/images/apm.png new file mode 100644 index 000000000..49bcac6c3 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/apm.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/cieksfargateadot.png b/docusaurus/observability-best-practices/docs/patterns/images/cieksfargateadot.png new file mode 100644 index 000000000..ba6495d47 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/cieksfargateadot.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/crossaccountmonitoring.png b/docusaurus/observability-best-practices/docs/patterns/images/crossaccountmonitoring.png new file mode 100644 index 000000000..ce258c567 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/crossaccountmonitoring.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/ecs.png b/docusaurus/observability-best-practices/docs/patterns/images/ecs.png new file mode 100644 index 000000000..80068cb13 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/ecs.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/eksamp.png b/docusaurus/observability-best-practices/docs/patterns/images/eksamp.png new file mode 100644 index 000000000..8fba4793b Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/eksamp.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/eksnodeexporterampamg.png b/docusaurus/observability-best-practices/docs/patterns/images/eksnodeexporterampamg.png new file mode 100644 index 000000000..4e06dfbdf Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/eksnodeexporterampamg.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/ekstracing.png b/docusaurus/observability-best-practices/docs/patterns/images/ekstracing.png new file mode 100644 index 000000000..3786c923e Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/ekstracing.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/lambdalogging.png b/docusaurus/observability-best-practices/docs/patterns/images/lambdalogging.png new file mode 100644 index 000000000..caace2f77 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/lambdalogging.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/multiaccountoss.png b/docusaurus/observability-best-practices/docs/patterns/images/multiaccountoss.png new file mode 100644 index 000000000..610f2ca67 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/multiaccountoss.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/os.png b/docusaurus/observability-best-practices/docs/patterns/images/os.png new file mode 100644 index 000000000..02e31b984 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/os.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/otel.png b/docusaurus/observability-best-practices/docs/patterns/images/otel.png new file mode 100644 index 000000000..f1018ceb0 Binary files 
/dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/otel.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/otelpipeline.png b/docusaurus/observability-best-practices/docs/patterns/images/otelpipeline.png new file mode 100644 index 000000000..bc2dc5b9f Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/otelpipeline.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/spark.png b/docusaurus/observability-best-practices/docs/patterns/images/spark.png new file mode 100644 index 000000000..3d0a46b45 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/spark.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/vpcflowlogs.png b/docusaurus/observability-best-practices/docs/patterns/images/vpcflowlogs.png new file mode 100644 index 000000000..882dfa1a9 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/vpcflowlogs.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/xrayec2.png b/docusaurus/observability-best-practices/docs/patterns/images/xrayec2.png new file mode 100644 index 000000000..2ae01dd9a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/xrayec2.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/xrayecs.png b/docusaurus/observability-best-practices/docs/patterns/images/xrayecs.png new file mode 100644 index 000000000..9fa3665c5 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/xrayecs.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/xrayeks.png b/docusaurus/observability-best-practices/docs/patterns/images/xrayeks.png new file mode 100644 index 000000000..34491e518 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/xrayeks.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/images/xraylambda.png b/docusaurus/observability-best-practices/docs/patterns/images/xraylambda.png new file mode 100644 index 000000000..acd86516a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/patterns/images/xraylambda.png differ diff --git a/docusaurus/observability-best-practices/docs/patterns/lambdalogging.md b/docusaurus/observability-best-practices/docs/patterns/lambdalogging.md new file mode 100644 index 000000000..6ba8f2cc0 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/lambdalogging.md @@ -0,0 +1,34 @@ +# Lambda Logging + + +In the world of serverless computing, observability is a critical aspect of ensuring the reliability, performance, and efficiency of your applications. AWS Lambda, a cornerstone of serverless architectures, provides a powerful and scalable platform for running event-driven code without the need to manage underlying infrastructure. However, as with any application, logging is essential for monitoring, troubleshooting, and gaining insights into the behavior and health of your Lambda functions. + +AWS Lambda seamlessly integrates with Amazon CloudWatch Logs, a fully-managed log management service, allowing you to centralize and analyze logs from your Lambda functions. By configuring your Lambda functions to log to CloudWatch Logs, you can unlock a range of benefits and capabilities that enhance the observability of your serverless applications. + +1. 
Centralized Log Management: CloudWatch Logs consolidates log data from multiple Lambda functions, providing a centralized location for log management and analysis. This centralization simplifies the process of monitoring and troubleshooting across distributed serverless applications. + +2. Real-time Log Streaming: CloudWatch Logs supports real-time log streaming, enabling you to view and analyze log data as it is generated by your Lambda functions. This real-time visibility ensures that you can quickly detect and respond to issues or errors, minimizing potential downtime or performance degradation. + +3. Log Retention and Archiving: CloudWatch Logs allows you to define retention policies for your log data, ensuring that logs are retained for the desired duration to meet compliance requirements or facilitate long-term analysis and auditing. + +4. Log Filtering and Searching: CloudWatch Logs provides powerful log filtering and searching capabilities, enabling you to quickly locate and analyze relevant log entries based on specific criteria or patterns. This feature streamlines the troubleshooting process and helps you quickly identify the root cause of issues. + +5. Monitoring and Alerting: By integrating CloudWatch Logs with other AWS services like Amazon CloudWatch, you can set up custom metrics, alarms, and triggers based on log data. This integration enables proactive monitoring and alerting, ensuring that you are notified of critical events or deviations from expected behavior. + +6. Integration with AWS Services: CloudWatch Logs seamlessly integrates with other AWS services, such as AWS Lambda Insights, AWS X-Ray, and AWS CloudTrail, enabling you to correlate log data with application performance metrics, distributed tracing, and security auditing, providing a comprehensive view of your serverless applications. +![Lambda logging](./images/lambdalogging.png) +*Figure 1: Lambda logging showing the events from S3 captured to AWS Cloudwatch* + +To leverage Lambda logging with CloudWatch Logs, you'll need to follow these general steps: + +1. Configure your Lambda functions to log to CloudWatch Logs by specifying the appropriate log group and log stream settings. +2. Define log retention policies according to your organization's requirements and compliance regulations. +3. Utilize CloudWatch Logs Insights to analyze and query log data, enabling you to identify patterns, trends, and potential issues. +4. Optionally, integrate CloudWatch Logs with other AWS services like CloudWatch, X-Ray, or CloudTrail to enhance monitoring, tracing, and security auditing capabilities. +5. Set up custom metrics, alarms, and notifications based on log data to enable proactive monitoring and alerting. + +While CloudWatch Logs provides robust logging capabilities for Lambda functions, it's important to consider potential challenges such as log data volume and cost management. As your serverless applications scale, the volume of log data can increase significantly, potentially impacting performance and incurring additional costs. Implementing log rotation, compression, and retention policies can help mitigate these challenges. + +Additionally, ensuring proper access control and data security for your log data is crucial. CloudWatch Logs provides granular access control mechanisms and encryption capabilities to protect the confidentiality and integrity of your log data. 
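As an illustration of steps 2 and 3 above, a minimal AWS CLI sketch; the function name, retention period, and query are placeholders:

```
# Step 2: retain the function's log group data for 30 days
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30

# Step 3: query the last hour of logs for errors with CloudWatch Logs Insights (GNU date shown)
aws logs start-query \
  --log-group-name /aws/lambda/my-function \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20'
```

`start-query` returns a query ID that can be passed to `aws logs get-query-results` to retrieve the matching events.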
+ +In conclusion, configuring Lambda functions to log to CloudWatch Logs is a fundamental practice for ensuring observability in serverless applications. By centralizing and analyzing log data, you can gain valuable insights, streamline troubleshooting processes, and maintain a robust and secure serverless infrastructure. With the integration of CloudWatch Logs and other AWS services, you can unlock advanced monitoring, tracing, and security capabilities, enabling you to build and maintain highly observable and reliable serverless applications. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/multiaccount.md b/docusaurus/observability-best-practices/docs/patterns/multiaccount.md new file mode 100644 index 000000000..c12a26487 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/multiaccount.md @@ -0,0 +1,46 @@ +# Cross account Monitoring with AWS Native services + +With the increasing complexity of modern cloud environments, managing and monitoring multiple AWS accounts has become a critical aspect of efficient cloud operations. AWS multi-account monitoring provides a centralized approach to monitoring and managing resources across multiple AWS accounts, enabling organizations to gain better visibility, enhance security, and streamline operations. + +In today's rapidly evolving digital landscape, organizations are under constant pressure to maintain a competitive edge and drive growth. Cloud computing has emerged as a game-changer, offering scalability, agility, and cost-effectiveness. However, as cloud adoption continues to accelerate, the complexity of managing and monitoring these environments also increases exponentially. This is where AWS multi-account monitoring comes into play, providing a powerful solution for efficiently managing resources across multiple AWS accounts. + +AWS multi-account monitoring offers a range of benefits that can significantly enhance an organization's cloud operations. One of the primary advantages is centralized visibility, which consolidates monitoring data from multiple AWS accounts into a single pane of glass. This comprehensive view of the cloud infrastructure enables organizations to gain a holistic understanding of their resources, enabling better decision-making and resource optimization. Moreover, AWS multi-account monitoring plays a crucial role in improving security and compliance. By enforcing consistent security policies and enabling the detection of potential threats across all accounts, organizations can proactively address vulnerabilities and mitigate risks. Compliance requirements can also be effectively monitored and adhered to, ensuring that the organization operates within regulatory frameworks and industry standards. + + +## Stats: + +According to Gartner, by 2025, more than 95% of new digital workloads will be deployed on cloud-native platforms, emphasizing the need for robust multi-account monitoring solutions. A study by Cloud Conformity revealed that organizations with more than 25 AWS accounts experienced an average of 223 high-risk security incidents per month, highlighting the importance of centralized monitoring and governance. Forrester Research estimates that organizations with effective cloud governance and monitoring strategies can reduce operational costs by up to 30%. + +![Multi account monitoring](./images/crossaccountmonitoring.png) + *Figure 1: Cross account monitoring with AWS Cloudwatch* + +## Pros of AWS Multi-Account Monitoring: + +1. 
**Centralized Visibility**: Consolidate monitoring data from multiple AWS accounts into a single pane of glass, providing a comprehensive view of your cloud infrastructure. +2. **Improved Security and Compliance**: Enforce consistent security policies, detect potential threats, and ensure compliance across all accounts. +3. **Cost Optimization**: Identify and eliminate underutilized or redundant resources, optimizing cloud spending and reducing waste. +4. **Streamlined Operations**: Automate monitoring and alerting processes, reducing manual effort and improving operational efficiency. +5. **Scalability**: Easily onboard new AWS accounts and resources without compromising monitoring capabilities. + +## Cons of AWS Multi-Account Monitoring: + +1. **Implementation Complexity**: Setting up and configuring multi-account monitoring can be challenging, especially in large-scale environments. +2. **Data Aggregation Overhead**: Collecting and aggregating data from multiple accounts can introduce performance overhead and latency. +3. **Access Management**: Managing access and permissions across multiple accounts can become complex and error-prone. +4. **Cost Implications**: Implementing and maintaining a comprehensive multi-account monitoring solution may incur additional costs, if not done right. + +## Key AWS Services and Tools for Multi-Account Monitoring: + +1. **AWS Organizations**: Centrally manage and govern multiple AWS accounts, enabling consolidated billing, policy-based management, and account creation/management. +2. **AWS Config**: Continuously monitor and record resource configurations, enabling compliance auditing and change tracking across accounts. +3. **AWS CloudTrail**: Log and monitor API activity and user actions across multiple AWS accounts for security and operational purposes. +4. **Amazon CloudWatch**: Monitor and collect metrics, logs, and events from various AWS resources across multiple accounts for centralized monitoring and alerting. +5. **AWS Security Hub**: Centrally view and manage security findings across multiple AWS accounts, enabling comprehensive security monitoring and compliance tracking. + +## References: + +1. AWS Documentation: "Monitoring Multiple AWS Accounts" (https://docs.aws.amazon.com/solutions/latest/multi-account-monitoring/welcome.html) +2. Gartner Research: "Cloud Adoption Trends and Key Considerations for 2023" (https://www.gartner.com/en/documents/4009858) +3. Cloud Conformity Report: "The State of AWS Security and Compliance in the Cloud" (https://www.cloudconformity.com/knowledge-base/the-state-of-aws-security-and-compliance-in-the-cloud.html) +4. Forrester Research: "The Total Economic Impact™ Of AWS Cloud Governance Solutions" (https://d1.awsstatic.com/ +5. How Audible used Amazon CloudWatch cross-account observability to resolve severity tickets faster (https://aws.amazon.com/blogs/mt/how-audible-used-amazon-cloudwatch-cross-account-observability-to-resolve-severity-tickets-faster/) \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/multiaccountoss.md b/docusaurus/observability-best-practices/docs/patterns/multiaccountoss.md new file mode 100644 index 000000000..a2e609c76 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/multiaccountoss.md @@ -0,0 +1,38 @@ +# Cross account monitoring with AWS Open source service + +## Introduction + +Modern cloud environments often span multiple accounts and include on-premises infrastructure, creating complex monitoring challenges. 
To address these challenges, a sophisticated monitoring architecture can be implemented using AWS services and industry-standard tools. This architecture enables comprehensive visibility across diverse environments, facilitating efficient management and quick issue resolution. + +## Core Components + +At the heart of this monitoring solution is AWS Distro for OpenTelemetry (ADOT), which serves as a centralized collection point for metrics from various sources. ADOT is deployed in a dedicated central AWS account, forming the hub of the monitoring infrastructure. This central deployment allows for streamlined data aggregation and processing. + +Amazon Managed Service for Prometheus is another crucial component, providing a scalable and managed time-series database for storing the collected metrics. This service eliminates the need for self-managed Prometheus instances, reducing operational overhead and ensuring high availability of metric data. + +For visualization and analysis, Grafana is integrated into the architecture. Grafana connects to the Amazon Managed Service for Prometheus, offering powerful querying capabilities and customizable dashboards. This allows teams to create insightful visualizations and set up alerting based on the collected metrics. + +![multiaccount AMP](./images/multiaccountoss.png) +*Figure 1: Multi account monitoring with AWS Open source services* + +## Data Collection and Flow + +The architecture supports data collection from multiple AWS accounts, referred to as monitored accounts. These accounts use the OpenTelemetry Protocol (OTLP) to export their metrics to the central ADOT instance. This standardized approach ensures consistency in data format and facilitates easy integration of new accounts into the monitoring setup. + +On-premises infrastructure is also incorporated into this monitoring solution. These systems send their metrics data to the central ADOT instance using secure HTTPS POST requests. This method allows for the inclusion of legacy or non-cloud systems in the overall monitoring strategy, providing a truly comprehensive view of the entire IT environment. + +Once the data reaches the central ADOT instance, it is processed and forwarded to the Amazon Managed Service for Prometheus using the Prometheus remote write protocol. This step ensures that all collected metrics are stored in a format optimized for time-series data, enabling efficient querying and analysis. + +## Benefits and Considerations + +This architecture offers several key benefits. It provides a centralized view of metrics from diverse sources, enabling holistic monitoring of complex environments. The use of managed services reduces the operational burden on teams, allowing them to focus on analysis rather than infrastructure maintenance. Additionally, the architecture is highly scalable, capable of accommodating growth in both the number of monitored systems and the volume of metrics collected. + +However, implementing this architecture also comes with considerations. The centralized nature of the solution means that the monitoring infrastructure in the central account becomes critical, requiring careful planning for high availability and disaster recovery. There may also be cost implications associated with data transfer between accounts and the usage of managed services, which should be factored into budgeting decisions. + +Security is another important aspect to consider. Proper IAM roles and permissions must be set up to allow secure cross-account metric collection. 
For on-premises systems, ensuring secure and authenticated HTTPS connections is crucial to maintain the integrity and confidentiality of the monitoring data. + +## Conclusion + +This advanced AWS cloud monitoring architecture provides a robust solution for organizations with complex, multi-account, and hybrid infrastructure environments. By leveraging AWS managed services and industry-standard tools like OpenTelemetry and Grafana, it offers a scalable and powerful monitoring solution. While it requires careful planning and management to implement effectively, the benefits of comprehensive visibility and centralized monitoring make it a valuable approach for modern cloud-native and hybrid environments. + +The flexibility of this architecture allows it to adapt to various organizational needs and to evolve as monitoring requirements change. As cloud environments continue to grow in complexity, having such a centralized and comprehensive monitoring solution becomes increasingly critical for maintaining operational excellence and ensuring optimal performance across all infrastructure components. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/o11ypipeline.md b/docusaurus/observability-best-practices/docs/patterns/o11ypipeline.md new file mode 100644 index 000000000..bc42601cc --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/o11ypipeline.md @@ -0,0 +1,95 @@ +# ADOT Observability Pipeline + +The observability pipeline consists of several components that work together to collect, manage, and analyze observability data from various sources. + +## EKS Cluster + +The EKS (Elastic Kubernetes Service) cluster hosts the main components of the observability pipeline. + +### Install ADOT Operator Helm Chart + +The ADOT (AWS Distro for OpenTelemetry) Operator is installed using a Helm chart. It manages the deployment and configuration of the observability pipeline components. + +### User Configured Collector + +The user-configured collector is managed by the operator and consists of the following components: + +- Collector as Deployment: The collector is deployed as a Kubernetes deployment, which ensures high availability and scalability. +- Collector-0, Collector-1, Collector-2: Multiple collector instances are deployed to handle the incoming observability data. They work together to distribute the workload and ensure reliable data collection. + +![OTEL pipeline](./images/otelpipeline.png) +*Figure 1: OpenTelemetry Pipeline* + +### Persistent Volume + +The persistent volume is used to store the collected observability data. It ensures data durability and allows for long-term storage and analysis. + +### Kubernetes Node + +The Kubernetes node hosts the application pods and the collector as a sidecar. + +- Application Container: The application container runs the actual application code and generates observability data. +- Collector as Sidecar: The collector runs as a sidecar container alongside the application container. It collects the observability data generated by the application. + +## Scrape Targets + +The observability pipeline collects data from various scrape targets, such as: + +- Scrape traces/metrics: The pipeline scrapes traces and metrics from the application and infrastructure components. + +## AWS Prometheus Remote Write Exporter + +The AWS Prometheus Remote Write Exporter is used to export the collected observability data to AWS services. 
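The remote write exporter needs an Amazon Managed Service for Prometheus workspace to target. A minimal AWS CLI sketch; the alias and workspace ID are placeholders, and the exporter endpoint is the returned `prometheusEndpoint` with `api/v1/remote_write` appended:

```
# Create a workspace for the pipeline's metrics
aws amp create-workspace --alias adot-pipeline

# Look up the workspace endpoint used by the remote write exporter
aws amp describe-workspace --workspace-id ws-12345678-1234-1234-1234-123456789012
```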
+ +## AWS CloudWatch EMF Exporter + +The AWS CloudWatch EMF (Embedded Metric Format) Exporter is used to export metrics to AWS CloudWatch. + +## AWS X-Ray Tracing Exporter + +The AWS X-Ray Tracing Exporter is used to export tracing data to AWS X-Ray for distributed tracing and performance analysis. + +The observability pipeline collects data from the scrape targets, processes it using the collectors, and exports it to various AWS services for further analysis and visualization. + + +## Collecting Metrics and Insights with ADOT + +1. **Instrumentation**: Similar to OpenTelemetry, ADOT provides libraries and SDKs to instrument your applications and services, capturing telemetry data such as metrics, traces, and logs. + +2. **Metrics Collection**: ADOT supports collecting and exporting system and application-level metrics, including AWS service metrics, providing insights into resource utilization and application performance. + +3. **Distributed Tracing**: ADOT enables distributed tracing across AWS services, containers, and on-premises environments, allowing you to trace requests and operations end-to-end. + +4. **Logging**: ADOT includes support for structured logging, correlating log data with other telemetry signals for comprehensive observability. + +5. **AWS Service Integrations**: ADOT provides tight integrations with AWS services like AWS X-Ray, AWS CloudWatch, Amazon Managed Service for Prometheus, and AWS Distro for OpenTelemetry Operator, enabling seamless telemetry collection and analysis within the AWS ecosystem. + +6. **Automatic Instrumentation**: ADOT offers automatic instrumentation capabilities for popular frameworks and libraries, simplifying the process of instrumenting existing applications. + +7. **Data Processing and Analysis**: Telemetry data collected by ADOT can be exported to AWS observability services like AWS X-Ray, Amazon Managed Service for Prometheus, and AWS CloudWatch, leveraging AWS-native analysis and visualization tools. + +## Benefits of Using ADOT + +1. **AWS-Native Integration**: ADOT is designed to seamlessly integrate with AWS services and infrastructure, providing a cohesive observability experience within the AWS ecosystem. + +2. **Performance and Scalability**: ADOT is optimized for performance and scalability, enabling efficient telemetry collection and analysis in large-scale AWS environments. + +3. **Security and Compliance**: ADOT adheres to AWS security best practices and is compliant with various industry standards, ensuring secure and compliant observability practices. + +4. **AWS Support**: As an AWS-supported distribution, ADOT benefits from AWS's extensive documentation, support channels, and long-term commitment to the OpenTelemetry project. + +## Difference between OpenTelemetry and ADOT + +While ADOT and OpenTelemetry share many core capabilities, there are some key differences: + +1. **AWS Integration**: ADOT is designed specifically for AWS environments and provides tight integrations with AWS services, while OpenTelemetry is a vendor-neutral project. + +2. **AWS Optimization**: ADOT is optimized for performance, scalability, and security within AWS environments, leveraging AWS-native services and best practices. + +3. **AWS Support**: ADOT benefits from official AWS support, documentation, and long-term commitment, while OpenTelemetry relies on community support. + +4. 
**AWS-Specific Features**: ADOT includes AWS-specific features and automatic instrumentation for AWS services, while OpenTelemetry focuses on general-purpose observability. + +5. **Vendor Neutrality**: OpenTelemetry is a vendor-neutral project, allowing integration with various observability platforms, while ADOT is primarily focused on AWS observability services. + +By leveraging ADOT, organizations can achieve comprehensive observability within the AWS ecosystem, benefiting from AWS-native integrations, optimized performance, and AWS support, while still maintaining the flexibility to leverage OpenTelemetry capabilities and community contributions. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/opensearch.md b/docusaurus/observability-best-practices/docs/patterns/opensearch.md new file mode 100644 index 000000000..2aaa102db --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/opensearch.md @@ -0,0 +1,41 @@ +# OpenSearch Logging on AWS + +## Introduction +OpenSearch is a popular open-source search and analytics engine that enables log aggregation, analysis and visualization. AWS provides several compute services like ECS (Elastic Container Service), EKS (Elastic Kubernetes Service) and EC2 (Elastic Compute Cloud) that can be used to deploy and run applications generating logs. Integrating OpenSearch with these compute services allows centralized logging to monitor applications and infrastructure effectively. + +![OpenSearch pipeline](./images/os.png) +*Figure 1: OpenSearch Pipeline* + +## Architecture Overview +Here is a high-level architecture of OpenSearch logging on AWS using ECS, EKS and EC2: + +1. Applications running on ECS, EKS or EC2 generate logs +2. A log agent (e.g., Fluentd, Fluent Bit, or Logstash) collects logs from the compute services +3. The log agent sends the logs to Amazon OpenSearch Service, a managed OpenSearch cluster +4. OpenSearch indexes and stores the log data +5. OpenSearch Dashboards (derived from Kibana), integrated with OpenSearch, is used to search, analyze and visualize the log data + +Some key components: +- Amazon OpenSearch Service: Managed OpenSearch cluster for log aggregation and analytics +- Compute Services (ECS, EKS, EC2): Where applications generating logs are deployed +- Log Agents: Collect logs from compute and send to OpenSearch +- OpenSearch Index: Stores the log data +- OpenSearch Dashboards: Visualization and analysis of log data + +## Pros +1. **Centralized Logging**: Aggregates logs from all compute services into OpenSearch, enabling a single pane for log analysis +2. **Scalability**: Amazon OpenSearch Service scales to ingest and analyze high volumes of log data +3. **Fully Managed**: OpenSearch Service eliminates operational overhead of managing OpenSearch +4. **Real-time Monitoring**: Ingest and visualize logs in near real-time for proactive monitoring +5. **Rich Analytics**: OpenSearch Dashboards provides powerful tools to search, filter, analyze and visualize logs +6. **Extensibility**: Flexible to integrate with various log agents and AWS services + +## Cons +1. **Cost**: Log aggregation at scale to OpenSearch can incur significant data transfer and storage costs +2. **Complex Setup**: Initial setup to stream logs from compute services to OpenSearch can be involved +3. **Learning Curve**: Requires knowledge of OpenSearch and OpenSearch Dashboards for efficient utilization +4. **Large-scale Limitations**: For very large log volumes, OpenSearch can face scalability and performance challenges +5. 
**Security Overhead**: Ensuring secure log transmission and access to OpenSearch requires careful configuration + +## Conclusion +Integrating OpenSearch with AWS compute services like ECS, EKS and EC2 enables powerful log aggregation and analytics capabilities. While it provides a scalable, centralized and near real-time logging solution, it's important to design the architecture carefully considering costs, security, scalability and performance. With the right implementation, OpenSearch logging on AWS can greatly enhance observability into applications and infrastructure. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/otel.md b/docusaurus/observability-best-practices/docs/patterns/otel.md new file mode 100644 index 000000000..ee4bca1fd --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/otel.md @@ -0,0 +1,34 @@ +# Observability with OpenTelemetry + +OpenTelemetry is an open-source, vendor-neutral observability framework that provides a standardized way to collect and export telemetry data, including logs, metrics, and traces. By leveraging OpenTelemetry, organizations can implement a comprehensive observability pipeline while ensuring vendor independence and future-proofing their observability strategy. + +## Collecting Metrics and Insights with OpenTelemetry + +1. **Instrumentation**: The first step in using OpenTelemetry is to instrument your applications and services with the OpenTelemetry libraries or SDKs. These libraries automatically capture and export telemetry data, such as metrics, traces, and logs, from your application code. + +2. **Metrics Collection**: OpenTelemetry provides a standardized way to collect and export metrics from your application. These metrics can include system metrics (CPU, memory, disk usage), application-level metrics (request rates, error rates, latency), and custom business metrics specific to your application. + +3. **Distributed Tracing**: OpenTelemetry supports distributed tracing, enabling you to trace requests and operations as they propagate through your distributed system. This provides valuable insights into the end-to-end flow of requests, helping you identify bottlenecks and troubleshoot performance issues. + +4. **Logging**: While OpenTelemetry's primary focus is on metrics and traces, it also provides a structured logging API that can be used to capture and export log data. This ensures that logs are correlated with other telemetry data, providing a holistic view of your system's behavior. + +5. **Exporters**: OpenTelemetry supports various exporters that allow you to send telemetry data to different backends or observability platforms. Popular exporters include Prometheus, Jaeger, Zipkin, and cloud-native observability solutions like Amazon CloudWatch, Azure Monitor, and Google Cloud Operations. + +6. **Data Processing and Analysis**: Once the telemetry data is exported, you can leverage observability platforms, monitoring tools, or custom data processing pipelines to analyze and visualize the collected metrics, traces, and logs. This analysis can provide insights into system performance, identify bottlenecks, and aid in troubleshooting and root cause analysis. + +![Otel](./images/otel.png) +*Figure 1: EKS Cluster sending observability signals with ADOT and FluentBit* + + +## Benefits of Using OpenTelemetry + +1. **Vendor Neutrality**: OpenTelemetry is an open-source, vendor-neutral project, ensuring that your observability strategy is not tied to a specific vendor or platform. 
This flexibility allows you to switch between observability backends or combine multiple solutions as needed. + +2. **Standardization**: OpenTelemetry provides a standardized way of collecting and exporting telemetry data, enabling consistent data formats and interoperability across different components and systems. + +3. **Future-Proofing**: By adopting OpenTelemetry, you can future-proof your observability strategy. As the project evolves and new features and integrations are added, your existing instrumentation can be easily updated without the need for significant code changes. + +4. **Comprehensive Observability**: OpenTelemetry supports multiple telemetry signals (metrics, traces, and logs), providing a comprehensive view of your system's behavior and performance. + +5. **Ecosystem and Community Support**: OpenTelemetry has a growing ecosystem of integrations, tools, and a vibrant community of contributors, ensuring continued development and support. + +By leveraging OpenTelemetry for observability, organizations can gain deep insights into their systems, enabling proactive monitoring, efficient troubleshooting, and data-driven decision-making, while maintaining flexibility and vendor independence in their observability strategy. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/sparkbigdata.md b/docusaurus/observability-best-practices/docs/patterns/sparkbigdata.md new file mode 100644 index 000000000..e2227c8a2 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/sparkbigdata.md @@ -0,0 +1,45 @@ +# Big Data Observability on AWS + +This diagram illustrates a best practice pattern for implementing observability in a Spark big data workflow on AWS. The pattern leverages various AWS services to collect, process, and analyze logs and metrics generated by Spark jobs. + +![Spark Bigdata](./images/spark.png) +*Figure 1: Spark Big Data observability* + +## Workflow + +1. **Users** submit Spark jobs to an **Amazon EMR** cluster. +2. The **Amazon EMR** cluster runs the Spark job, which distributes the workload across the cluster using **Apache Spark**. +3. During the execution of the Spark job, logs and metrics are generated and collected by **Amazon CloudWatch** and **Amazon EMR**. + +## Observability Components + +### Amazon EMR + +Amazon EMR is a managed service that simplifies running big data frameworks like Apache Spark on AWS. It provides a scalable and cost-effective platform for processing large volumes of data. + +### Amazon CloudWatch + +Amazon CloudWatch is a monitoring and observability service that collects and tracks metrics, logs, and events from various AWS resources and applications. In this pattern, CloudWatch is used to: + +1. Collect logs and metrics from the **EMR EC2 instances** running the Spark job. +2. Publish the collected logs to **Amazon CloudWatch Logs** for centralized log management and analysis. + +### EMR EC2 Instances + +The Spark job runs on EMR EC2 instances, which are the compute nodes of the EMR cluster. These instances generate logs and metrics that are collected by the **CloudWatch Agent** and sent to Amazon CloudWatch. + +## Best Practices + +To ensure effective observability of Spark big data workloads on AWS, consider the following best practices: + +1. **Centralized Log Management**: Use Amazon CloudWatch Logs to centralize the collection, storage, and analysis of logs generated by Spark jobs and EMR instances. This allows for easy troubleshooting and monitoring of the Spark workflow. 
+ +2. **Metrics Collection**: Leverage the CloudWatch Agent to collect relevant metrics from the EMR EC2 instances, such as CPU utilization, memory usage, and disk I/O. These metrics provide insights into the performance and health of the Spark job. + +3. **Dashboards and Alarms**: Create CloudWatch dashboards to visualize key metrics and logs in real-time. Set up CloudWatch alarms to notify and alert when specific thresholds or anomalies are detected, enabling proactive monitoring and incident response. + +4. **Log Analytics**: Utilize Amazon CloudWatch Logs Insights or integrate with other log analytics tools to perform ad-hoc queries, troubleshoot issues, and gain valuable insights from the collected logs. + +5. **Performance Optimization**: Continuously monitor and analyze the performance of Spark jobs using the collected metrics and logs. Identify bottlenecks, optimize resource allocation, and tune Spark configurations to improve the efficiency and performance of the big data workload. + +By implementing this observability pattern and following best practices, organizations can effectively monitor, troubleshoot, and optimize their Spark big data workloads on AWS, ensuring reliable and efficient data processing at scale. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/patterns/vpcflowlogs.md b/docusaurus/observability-best-practices/docs/patterns/vpcflowlogs.md new file mode 100644 index 000000000..f8b11afb7 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/patterns/vpcflowlogs.md @@ -0,0 +1,36 @@ +# VPC Flow Logs for Network Observability + +In modern cloud environments, network observability plays a crucial role in ensuring the security, performance, and reliability of your applications and infrastructure. Amazon Virtual Private Cloud (VPC) Flow Logs, a feature provided by Amazon Web Services (AWS), offers a powerful tool for gaining visibility into network traffic within your VPCs, enabling effective troubleshooting and security analysis. + +VPC Flow Logs capture metadata about the IP traffic flowing in and out of your VPC, providing valuable insights into network communication patterns, potential security threats, and performance bottlenecks. By leveraging VPC Flow Logs, organizations can achieve the following benefits: + +1. **Network Traffic Visibility**: VPC Flow Logs record detailed information about network traffic, including source and destination IP addresses, ports, protocols, packet sizes, and flow directions. This comprehensive visibility into network traffic patterns enables organizations to identify anomalies, detect potential security threats, and optimize network configurations. + +2. **Security Monitoring and Threat Detection**: By analyzing VPC Flow Logs, security teams can monitor network traffic for suspicious activities, such as unauthorized access attempts, port scanning, or data exfiltration attempts. This proactive monitoring approach helps organizations detect and respond to potential security threats more effectively. + +3. **Compliance and Auditing**: VPC Flow Logs provide a detailed audit trail of network traffic, enabling organizations to meet compliance requirements and demonstrate adherence to security policies and industry regulations. This audit trail can also aid in forensic investigations and incident response efforts. + +4. **Application Performance Troubleshooting**: Network bottlenecks or connectivity issues can significantly impact application performance. 
VPC Flow Logs allow organizations to identify and troubleshoot network-related performance issues by analyzing traffic patterns, identifying potential bottlenecks, and optimizing network configurations accordingly. + +5. **Cost Optimization**: By analyzing VPC Flow Logs, organizations can gain insights into network traffic patterns and resource utilization. This information can be used to optimize network configurations, rightsizing network resources, and potentially reducing unnecessary costs associated with over-provisioning or underutilized resources. + +![VPC flow logs](./images/vpcflowlogs.png) +*Figure 1: VPC flow logs visualization with Grafana* + +To leverage VPC Flow Logs for network observability and troubleshooting, organizations can follow these general steps: + +1. **Enable VPC Flow Logs**: Configure VPC Flow Logs for your VPCs or specific network interfaces within your VPCs, specifying the desired log destination (e.g., Amazon CloudWatch Logs, Amazon S3, or a third-party log management solution). + +2. **Analyze Log Data**: Utilize log analysis tools or custom scripts to parse and analyze the VPC Flow Log data, identifying patterns, anomalies, or potential security threats based on the recorded network traffic information. + +3. **Integrate with Security and Monitoring Tools**: Incorporate VPC Flow Log data into your existing security and monitoring solutions, such as Security Information and Event Management (SIEM) systems, to correlate network traffic data with other security events and alerts. + +4. **Set Up Alerts and Notifications**: Configure alerts and notifications based on specific patterns or thresholds detected in VPC Flow Logs, enabling proactive response to potential security threats or performance issues. + +5. **Optimize Network Configurations**: Leverage insights from VPC Flow Logs to optimize network configurations, fine-tune security group rules, and implement traffic shaping or filtering mechanisms to enhance network performance and security posture. + +While VPC Flow Logs provide valuable network observability and troubleshooting capabilities, it's important to consider potential challenges such as log data volume and cost management. As the volume of network traffic increases, the amount of log data generated can grow significantly, potentially impacting storage costs and performance. Implementing log data retention policies, sampling strategies, and cost optimization techniques may be necessary to ensure an efficient and cost-effective logging solution. + +Additionally, ensuring proper access control and data security for your VPC Flow Logs is crucial. AWS provides granular access control mechanisms and encryption capabilities to protect the confidentiality and integrity of your log data. + +In conclusion, VPC Flow Logs are a powerful tool for achieving network observability and enabling effective troubleshooting in AWS environments. By providing detailed insights into network traffic patterns, VPC Flow Logs empower organizations to monitor security threats, optimize network configurations, troubleshoot performance issues, and maintain compliance. With the integration of VPC Flow Logs into existing security and monitoring solutions, organizations can enhance their overall observability and maintain a secure, high-performing, and reliable cloud infrastructure. 
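As a concrete starting point for step 1 above (enabling flow logs with CloudWatch Logs as the destination), a minimal AWS CLI sketch; the VPC ID, log group name, and IAM role ARN are placeholders:

```
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0123456789abcdef0 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name vpc-flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::111122223333:role/vpc-flow-logs-role
```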
\ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/recipes/aes.md b/docusaurus/observability-best-practices/docs/recipes/aes.md new file mode 100644 index 000000000..dbc0a9824 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/aes.md @@ -0,0 +1,32 @@ +# Amazon OpenSearch Service + +[Amazon OpenSearch Service][aes-main] (AOS), successor to Amazon Elasticsearch Service, +makes it easy for you to perform interactive log analytics, real-time application +monitoring, website search, and more. OpenSearch is an open source, distributed +search and analytics suite derived from Elasticsearch. It offers the latest +versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), +and visualization capabilities powered by OpenSearch Dashboards and Kibana +(1.5 to 7.10 versions). + +Check out the following recipes: + +- [AOS tutorial: a quick start guide][aos-tut] +- [Get started with AOS: T-shirt-size your domain][aos-gs] +- [Getting started with AOS][aes-ws] +- [Log Analytics with AOS][loan-ws] +- [Getting started with Open Distro for Elasticsearch][od-ws] +- [Know your data with Machine Learning][ml-ws] +- [Send CloudTrail Logs to AOS][ct-ws] +- [Searching DynamoDB Data with AOS][bs-ws] +- [Getting Started with Trace Analytics in AOS][tracing-aes] + +[aes-main]: https://aws.amazon.com/opensearch-service/ +[aos-gs]: https://aws.amazon.com/blogs/big-data/get-started-with-amazon-opensearch-service-t-shirt-size-your-domain/ +[aos-tut]: https://aws.amazon.com/blogs/big-data/amazon-opensearch-tutorial-a-quick-start-guide/ +[aes-ws]: https://intro.aesworkshops.com/ +[loan-ws]: https://aesworkshops.com/log-analytics/mainlab/ +[od-ws]: https://od4es.aesworkshops.com/ +[ml-ws]: https://reinvent.aesworkshops.com/ant346/ +[ct-ws]: https://cloudtrail.aesworkshops.com/ +[bs-ws]: https://bookstore.aesworkshops.com/ +[tracing-aes]: https://aws.amazon.com/blogs/big-data/getting-started-with-trace-analytics-in-amazon-elasticsearch-service/ diff --git a/docusaurus/observability-best-practices/docs/recipes/alerting.md b/docusaurus/observability-best-practices/docs/recipes/alerting.md new file mode 100644 index 000000000..992e493a7 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/alerting.md @@ -0,0 +1,7 @@ +# Alerting + +This section has a selection of recipes for various alerting systems and scenarios. + +- [Build proactive database monitoring for RDS with CW Logs, Lambda, and SNS][rds-cw-sns] + +[rds-cw-sns]: https://aws.amazon.com/blogs/database/build-proactive-database-monitoring-for-amazon-rds-with-amazon-cloudwatch-logs-aws-lambda-and-amazon-sns/ diff --git a/docusaurus/observability-best-practices/docs/recipes/amg.md b/docusaurus/observability-best-practices/docs/recipes/amg.md new file mode 100644 index 000000000..2208c01e8 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/amg.md @@ -0,0 +1,55 @@ +# Amazon Managed Grafana + +[Amazon Managed Grafana][amg-main] is a fully managed service based on open +source Grafana, enabling you to analyze your metrics, logs, and traces without +having to provision servers, configure and update software, or do the heavy +lifting involved in securing and scaling Grafana in production. You can create, +explore, and share observability dashboards with your team, connecting to +multiple data sources. 
+ +Check out the following recipes: + +## Basics + +- [Getting Started][amg-gettingstarted] +- [Using Terraform for automation][amg-tf-automation] + +## Authentication and Access Control + +- [Direct SAML integration with identity providers][amg-saml] +- [Integrating identity providers (OneLogin, Ping Identity, Okta, and Azure AD) to SSO][amg-idps] +- [Integrating Google authentication via SAMLv2][amg-google-idps] +- [Setting up Amazon Managed Grafana cross-account data source using customer managed IAM roles][amg-cross-account-access] +- [Fine-grained access control in Amazon Managed Grafana using Grafana Teams][amg-grafana-teams] + +## Data sources and Visualizations + +- [Using Athena in Amazon Managed Grafana][amg-plugin-athena] +- [Using Redshift in Amazon Managed Grafana][amg-plugin-redshift] +- [Viewing custom metrics from statsd with Amazon Managed Service for Prometheus and Amazon Managed Grafana][amg-amp-statsd] +- [Setting up cross-account data source using customer managed IAM roles][amg-xacc-ds] + +## Others +- [Monitoring hybrid environments][amg-hybridenvs] +- [Managing Grafana and Loki in a regulated multitenant environment][grafana-loki-regenv] +- [Monitoring Amazon EKS Anywhere using Amazon Managed Service for Prometheus and Amazon Managed Grafana][amg-anywhere-monitoring] +- [Workshop for Getting Started][amg-oow] + + +[amg-main]: https://aws.amazon.com/grafana/ +[amg-gettingstarted]: https://aws.amazon.com/blogs/mt/amazon-managed-grafana-getting-started/ +[amg-saml]: https://aws.amazon.com/blogs/mt/amazon-managed-grafana-supports-direct-saml-integration-with-identity-providers/ +[amg-idps]: https://aws.amazon.com/blogs/opensource/integrating-identity-providers-such-as-onelogin-ping-identity-okta-and-azure-ad-to-sso-into-aws-managed-service-for-grafana/ +[amg-google-idps]: recipes/amg-google-auth-saml.md +[amg-hybridenvs]: https://aws.amazon.com/blogs/mt/monitoring-hybrid-environments-using-amazon-managed-service-for-grafana/ +[amg-xacc-ds]: https://aws.amazon.com/blogs/opensource/setting-up-amazon-managed-grafana-cross-account-data-source-using-customer-managed-iam-roles/ +[grafana-loki-regenv]: https://aws.amazon.com/blogs/opensource/how-to-manage-grafana-and-loki-in-a-regulated-multitenant-environment/ +[amg-oow]: https://observability.workshop.aws/en/amg.html +[amg-tf-automation]: recipes/amg-automation-tf.md +[amg-plugin-athena]: recipes/amg-athena-plugin.md +[amg-plugin-redshift]: recipes/amg-redshift-plugin.md +[amg-cross-account-access]: https://aws.amazon.com/blogs/opensource/setting-up-amazon-managed-grafana-cross-account-data-source-using-customer-managed-iam-roles/ +[amg-anywhere-monitoring]: https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-anywhere-using-amazon-managed-service-for-prometheus-and-amazon-managed-grafana/ +[amg-amp-statsd]: https://aws.amazon.com/blogs/mt/viewing-custom-metrics-from-statsd-with-amazon-managed-service-for-prometheus-and-amazon-managed-grafana/ +[amg-grafana-teams]: https://aws.amazon.com/blogs/mt/fine-grained-access-control-in-amazon-managed-grafana-using-grafana-teams/ + diff --git a/docusaurus/observability-best-practices/docs/recipes/amp.md b/docusaurus/observability-best-practices/docs/recipes/amp.md new file mode 100644 index 000000000..114b6f43a --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/amp.md @@ -0,0 +1,37 @@ +# Amazon Managed Service for Prometheus + +[Amazon Managed Service for Prometheus][amp-main] (AMP) is a Prometheus-compatible +monitoring service that makes it easy to 
monitor containerized applications at scale. +With AMP, you can use the Prometheus query language (PromQL) to monitor the +performance of containerized workloads without having to manage the underlying +infrastructure required for the ingestion, storage, and querying of operational +metrics. + +Check out the following recipes: + +- [Getting Started with AMP][amp-gettingstarted] +- [Using ADOT in EKS on EC2 to ingest to AMP and visualize in AMG](recipes/ec2-eks-metrics-go-adot-ampamg.md) +- [Setting up cross-account ingestion into AMP][amp-xaccount] +- [Metrics collection from ECS using AMP][amp-ecs-metrics] +- [Configuring Grafana Cloud Agent for AMP][amp-gcwa] +- [Set up cross-region metrics collection for AMP workspaces][amp-xregion-metrics] +- [Best practices for migrating self-hosted Prometheus on EKS to AMP][amp-migration] +- [Workshop for Getting Started with AMP][amp-oow] +- [Exporting CloudWatch Metric Streams via Firehose and AWS Lambda to Amazon Managed Service for Prometheus](recipes/lambda-cw-metrics-go-amp.md) +- [Terraform as Infrastructure as Code to deploy Amazon Managed Service for Prometheus and configure Alertmanager](recipes/amp-alertmanager-terraform.md) +- [Monitor Istio on EKS using Amazon Managed Prometheus and Amazon Managed Grafana][amp-istio-monitoring] +- [Monitoring Amazon EKS Anywhere using Amazon Managed Service for Prometheus and Amazon Managed Grafana][amp-anywhere-monitoring] +- [Introducing Amazon EKS Observability Accelerator][eks-accelerator] +- [Installing the Prometheus mixin dashboards with AMP and Amazon Managed Grafana](recipes/amp-mixin-dashboards.md) +- [Auto-scaling Amazon EC2 using Amazon Managed Service for Prometheus and Alertmanager](recipes/as-ec2-using-amp-and-alertmanager.md) + +[amp-main]: https://aws.amazon.com/prometheus/ +[amp-gettingstarted]: https://aws.amazon.com/blogs/mt/getting-started-amazon-managed-service-for-prometheus/ +[amp-xaccount]: https://aws.amazon.com/blogs/opensource/setting-up-cross-account-ingestion-into-amazon-managed-service-for-prometheus/ +[amp-ecs-metrics]: https://aws.amazon.com/blogs/opensource/metrics-collection-from-amazon-ecs-using-amazon-managed-service-for-prometheus/ +[amp-gcwa]: https://aws.amazon.com/blogs/opensource/configuring-grafana-cloud-agent-for-amazon-managed-service-for-prometheus/ +[amp-xregion-metrics]: https://aws.amazon.com/blogs/opensource/set-up-cross-region-metrics-collection-for-amazon-managed-service-for-prometheus-workspaces/ +[amp-migration]: https://aws.amazon.com/blogs/opensource/best-practices-for-migrating-self-hosted-prometheus-on-amazon-eks-to-amazon-managed-service-for-prometheus/ +[amp-oow]: https://observability.workshop.aws/en/amp.html +[amp-istio-monitoring]: https://aws.amazon.com/blogs/mt/monitor-istio-on-eks-using-amazon-managed-prometheus-and-amazon-managed-grafana/ +[amp-anywhere-monitoring]: https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-anywhere-using-amazon-managed-service-for-prometheus-and-amazon-managed-grafana/ +[eks-accelerator]: recipes/eks-observability-accelerator.md diff --git a/docusaurus/observability-best-practices/docs/recipes/anomaly-detection.md b/docusaurus/observability-best-practices/docs/recipes/anomaly-detection.md new file mode 100644 index 000000000..285c1361f --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/anomaly-detection.md @@ -0,0 +1,8 @@ +# Anomaly Detection + +This section contains recipes for anomaly detection.
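+
+Before diving into the recipe below, here is a minimal sketch of what enabling anomaly detection programmatically can look like, assuming the AWS SDK for Python (boto3), a placeholder EC2 instance ID, and example alarm settings; the recipe walks through the same capability in the CloudWatch console.
+
+```python
+import boto3
+
+cloudwatch = boto3.client("cloudwatch")
+
+# Train an anomaly detection model on an example metric (the instance ID is a placeholder).
+cloudwatch.put_anomaly_detector(
+    Namespace="AWS/EC2",
+    MetricName="CPUUtilization",
+    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
+    Stat="Average",
+)
+
+# Alarm when the metric breaches the upper bound of the anomaly detection band.
+cloudwatch.put_metric_alarm(
+    AlarmName="cpu-anomaly-example",  # example alarm name
+    ComparisonOperator="GreaterThanUpperThreshold",
+    EvaluationPeriods=3,
+    ThresholdMetricId="ad1",
+    TreatMissingData="ignore",
+    Metrics=[
+        {
+            "Id": "m1",
+            "ReturnData": True,
+            "MetricStat": {
+                "Metric": {
+                    "Namespace": "AWS/EC2",
+                    "MetricName": "CPUUtilization",
+                    "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
+                },
+                "Period": 300,
+                "Stat": "Average",
+            },
+        },
+        {
+            "Id": "ad1",
+            "ReturnData": True,
+            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
+            "Label": "CPUUtilization (expected)",
+        },
+    ],
+)
+```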
+ +- [Enabling Anomaly Detection for a CloudWatch Metric][am-oow] + +[am-oow]: https://observability.workshop.aws/en/anomalydetection.html + diff --git a/docusaurus/observability-best-practices/docs/recipes/apprunner.md b/docusaurus/observability-best-practices/docs/recipes/apprunner.md new file mode 100644 index 000000000..7b75f83f7 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/apprunner.md @@ -0,0 +1,35 @@ +# AWS App Runner + +[AWS App Runner][apprunner-main] is a fully managed service that makes it easy for developers to quickly deploy containerized web applications and APIs, at scale and with no prior infrastructure experience required. Start with your source code or a container image. App Runner builds and deploys the web application automatically, load balances traffic with encryption, scales to meet your traffic needs, and makes it easy for your services to communicate with other AWS services and applications that run in a private Amazon VPC. With App Runner, rather than thinking about servers or scaling, you have more time to focus on your applications. + + + + +Check out the following recipes: + +## General +- [Container Day - Docker Con | How Developers can get to production web applications at scale easily](https://www.youtube.com/watch?v=Iyp9Ugk9oRs) +- [AWS Blog | Centralized observability for AWS App Runner services](https://aws.amazon.com/blogs/containers/centralized-observability-for-aws-app-runner-services/) +- [AWS Blog | Observability for AWS App Runner VPC networking](https://aws.amazon.com/blogs/containers/observability-for-aws-app-runner-vpc-networking/) +- [AWS Blog | Controlling and monitoring AWS App Runner applications with Amazon EventBridge](https://aws.amazon.com/blogs/containers/controlling-and-monitoring-aws-app-runner-applications-with-amazon-eventbridge/) + + +## Logs + +- [Viewing App Runner logs streamed to CloudWatch Logs][apprunner-cwl] + +## Metrics + +- [Viewing App Runner service metrics reported to CloudWatch][apprunner-cwm] + + +## Traces +- [Getting Started with AWS X-Ray tracing for App Runner using AWS Distro for OpenTelemetry](https://aws-otel.github.io/docs/getting-started/apprunner) +- [Containers from the Couch | AWS App Runner X-Ray Integration](https://youtu.be/cVr8N7enCMM) +- [AWS Blog | Tracing an AWS App Runner service using AWS X-Ray with OpenTelemetry](https://aws.amazon.com/blogs/containers/tracing-an-aws-app-runner-service-using-aws-x-ray-with-opentelemetry/) +- [AWS Blog | Enabling AWS X-Ray tracing for AWS App Runner service using AWS Copilot CLI](https://aws.amazon.com/blogs/containers/enabling-aws-x-ray-tracing-for-aws-app-runner-service-using-aws-copilot-cli/) + +[apprunner-main]: https://aws.amazon.com/apprunner/ +[aes-ws]: https://bookstore.aesworkshops.com/ +[apprunner-cwl]: https://docs.aws.amazon.com/apprunner/latest/dg/monitor-cwl.html +[apprunner-cwm]: https://docs.aws.amazon.com/apprunner/latest/dg/monitor-cw.html diff --git a/docusaurus/observability-best-practices/docs/recipes/cw.md b/docusaurus/observability-best-practices/docs/recipes/cw.md new file mode 100644 index 000000000..1d6478911 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/cw.md @@ -0,0 +1,40 @@ +# Amazon CloudWatch + +[Amazon CloudWatch][cw-main] (CW) is a monitoring and observability service built +for DevOps engineers, developers, site reliability engineers (SREs), and IT managers. 
+ +CloudWatch collects monitoring and operational data in the form of logs, metrics, +and events, providing you with a unified view of AWS resources, applications, +and services that run on AWS and on-premises servers. + +Check out the following recipes: + +- [Build proactive database monitoring for RDS with CW Logs, Lambda, and SNS][rds-cw] +- [Implementing CloudWatch-centric observability for Kubernetes-native developers in EKS][swa-eks-cw] +- [Create Canaries via CW Synthetics][cw-synths] +- [CloudWatch Logs Insights for Querying Logs][cw-logsi] +- [Lambda Insights][cw-lambda] +- [Anomaly Detection via CloudWatch][cw-am] +- [Metrics Alarms via CloudWatch][cw-alarms] +- [Choosing container logging options to avoid backpressure][cw-fluentbit] +- [Introducing CloudWatch Container Insights Prometheus Support with AWS Distro for OpenTelemetry on ECS and EKS][cwci-adot] +- [Monitoring ECS containerized Applications and Microservices using CW Container Insights][cwci-ecs] +- [Monitoring EKS containerized Applications and Microservices using CW Container Insights][cwci-eks] +- [Exporting CloudWatch Metric Streams via Firehose and AWS Lambda to Amazon Managed Service for Prometheus](recipes/lambda-cw-metrics-go-amp.md) +- [Proactive autoscaling of Kubernetes workloads with KEDA and Amazon CloudWatch][cw-keda-eks-scaling] +- [Using Amazon CloudWatch Metrics Explorer to aggregate and visualize metrics filtered by resource tags][metrics-explorer-filter-by-tags] + + +[cw-main]: https://aws.amazon.com/cloudwatch/ +[rds-cw]: https://aws.amazon.com/blogs/database/build-proactive-database-monitoring-for-amazon-rds-with-amazon-cloudwatch-logs-aws-lambda-and-amazon-sns/ +[swa-eks-cw]: https://aws.amazon.com/blogs/opensource/implementing-cloudwatch-centric-observability-for-kubernetes-native-developers-in-amazon-elastic-kubernetes-service/ +[cw-synths]: https://observability.workshop.aws/en/synthetics.html +[cw-logsi]: https://observability.workshop.aws/en/logsinsights.html +[cw-lambda]: https://observability.workshop.aws/en/logsinsights.html +[cw-am]: https://observability.workshop.aws/en/anomalydetection.html +[cw-alarms]: https://observability.workshop.aws/en/alarms/_mericalarm.html +[cw-fluentbit]: https://aws.amazon.com/blogs/containers/choosing-container-logging-options-to-avoid-backpressure/ +[cwci-adot]: https://aws.amazon.com/blogs/containers/introducing-cloudwatch-container-insights-prometheus-support-with-aws-distro-for-opentelemetry-on-amazon-ecs-and-amazon-eks/ +[cwci-ecs]: https://observability.workshop.aws/en/containerinsights/ecs.html +[cwci-eks]: https://observability.workshop.aws/en/containerinsights/eks.html +[cw-keda-eks-scaling]: https://aws.amazon.com/blogs/mt/proactive-autoscaling-of-kubernetes-workloads-with-keda-using-metrics-ingested-into-amazon-cloudwatch/ +[metrics-explorer-filter-by-tags]: recipes/metrics-explorer-filter-by-tags.md diff --git a/docusaurus/observability-best-practices/docs/recipes/dimensions.md b/docusaurus/observability-best-practices/docs/recipes/dimensions.md new file mode 100644 index 000000000..41c833f23 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/dimensions.md @@ -0,0 +1,128 @@ +# Dimensions + +In the context of this site we consider the o11y space along six dimensions.
+Looking at each dimension independently is beneficial from a synthesis point of view, that is, when you're trying to build out a concrete o11y solution for a given workload, spanning developer-related aspects such as the programming language used as well as operational topics, for example the runtime environment like containers or Lambda functions. + +![o11y space](images/o11y-space.png) + + +:::note What is a signal? +When we say signal here, we mean any kind of o11y data and metadata point, including log entries, metrics, and traces. Unless we want to or have to be more specific, we use "signal" and it should be clear from the context what restrictions may apply. +::: + +Let's now have a look at each of the six dimensions one by one: + +## Destinations + +In this dimension we consider all kinds of signal destinations, including long-term storage and graphical interfaces that let you consume signals. As a developer, you want access to a UI or an API that allows you to discover, look up, and correlate signals to troubleshoot your service. In an infrastructure or platform role, you want access to a UI or an API that allows you to manage, discover, look up, and correlate signals to understand the state of the infrastructure. + +![Grafana screenshot](images/grafana.png) + +Ultimately, this is the most interesting dimension from a human point of view. However, in order to reap the benefits we first have to invest a bit of work: we need to instrument our software and external dependencies and ingest the signals into the destinations. + +So, how do the signals arrive in the destinations? Glad you asked, it's … + +## Agents + +This dimension covers how the signals are collected and routed to the destinations. The signals can come from two sources: either your application source code (see also the language section) or from things your application depends on, such as state managed in datastores as well as infrastructure like VPCs (see also the infra & data section). + +Agents are one part of the telemetry pipeline that you use to collect and ingest signals. The other part is the instrumented applications and infra pieces like databases. + +## Languages + +This dimension is concerned with the programming language you use for writing your service or application. Here, we're dealing with SDKs and libraries, such as the [X-Ray SDKs][xraysdks] or what OpenTelemetry provides in the context of [instrumentation][otelinst]. You want to make sure that an o11y solution supports your programming language of choice for a given signal type such as logs or metrics. + +## Infrastructure & databases + +With this dimension we mean any sort of application-external dependency, be it infrastructure like the VPC the service is running in, a datastore like RDS or DynamoDB, or a queue like SQS. + +:::tip Commonalities +One thing all the sources in this dimension have in common is that they are located outside of your application (as well as the compute environment your app runs in), so you have to treat them as an opaque box. +::: + +This dimension includes but is not limited to: + +- AWS infrastructure, for example [VPC flow logs][vpcfl]. +- Secondary APIs such as [Kubernetes control plane logs][kubecpl]. +- Signals from datastores, such as [S3][s3mon], [RDS][rdsmon], or [SQS][sqstrace]. + + +## Compute unit + +The way you package, schedule, and run your code.
For example, in Lambda that's a function, and in [ECS][ecs] and [EKS][eks] that unit is a container running in a task (ECS) or a pod (EKS), respectively. Containerized environments like Kubernetes often allow for two options concerning telemetry deployments: as sidecars or as per-node (instance) daemon processes. + +## Compute engine + +This dimension refers to the base runtime environment, which may (in the case of an EC2 instance, for example) or may not (serverless offerings such as Fargate or Lambda) be your responsibility to provision and patch. Depending on the compute engine you use, the telemetry part might already be included in the offering; for example, [EKS on Fargate][firelensef] has log routing via Fluent Bit integrated. + + +[aes]: https://aws.amazon.com/elasticsearch-service/ "Amazon Elasticsearch Service" +[adot]: https://aws-otel.github.io/ "AWS Distro for OpenTelemetry" +[amg]: https://aws.amazon.com/grafana/ "Amazon Managed Grafana" +[amp]: https://aws.amazon.com/prometheus/ "Amazon Managed Service for Prometheus" +[batch]: https://aws.amazon.com/batch/ "AWS Batch" +[beans]: https://aws.amazon.com/elasticbeanstalk/ "AWS Elastic Beanstalk" +[cw]: https://aws.amazon.com/cloudwatch/ "Amazon CloudWatch" +[dimensions]: ../dimensions +[ec2]: https://aws.amazon.com/ec2/ "Amazon EC2" +[ecs]: https://aws.amazon.com/ecs/ "Amazon Elastic Container Service" +[eks]: https://aws.amazon.com/eks/ "Amazon Elastic Kubernetes Service" +[fargate]: https://aws.amazon.com/fargate/ "AWS Fargate" +[fluentbit]: https://fluentbit.io/ "Fluent Bit" +[firelensef]: https://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/ "Fluent Bit for Amazon EKS on AWS Fargate is here" +[jaeger]: https://www.jaegertracing.io/ "Jaeger" +[kafka]: https://kafka.apache.org/ "Apache Kafka" +[kubecpl]: https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html "Amazon EKS control plane logging" +[lambda]: https://aws.amazon.com/lambda/ "AWS Lambda" +[lightsail]: https://aws.amazon.com/lightsail/ "Amazon Lightsail" +[otel]: https://opentelemetry.io/ "OpenTelemetry" +[otelinst]: https://opentelemetry.io/docs/concepts/instrumenting/ +[promex]: https://prometheus.io/docs/instrumenting/exporters/ "Prometheus exporters and integrations" +[rdsmon]: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.LoggingAndMonitoring.html "Logging and monitoring in Amazon RDS" +[s3]: https://aws.amazon.com/s3/ "Amazon S3" +[s3mon]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-incident-response.html "Logging and monitoring in Amazon S3" +[sqstrace]: https://docs.aws.amazon.com/xray/latest/devguide/xray-services-sqs.html "Amazon SQS and AWS X-Ray" +[vpcfl]: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html "VPC Flow Logs" +[xray]: https://aws.amazon.com/xray/ "AWS X-Ray" +[xraysdks]: https://docs.aws.amazon.com/xray/index.html diff --git a/docusaurus/observability-best-practices/docs/recipes/dynamodb.md b/docusaurus/observability-best-practices/docs/recipes/dynamodb.md new file mode 100644 index 000000000..2b4ced471 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/dynamodb.md @@ -0,0 +1,17 @@ +# Amazon DynamoDB + +[Amazon DynamoDB][ddb-main] is a key-value and document database that delivers single-digit millisecond performance at any scale. It's a fully managed, multi-region, multi-active, durable database with built-in security, backup and restore, and in-memory caching for internet-scale applications.
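+
+As a small, illustrative example ahead of the recipes below, the following Python sketch (assuming boto3 and a placeholder table name "orders") pulls an hour of consumed read capacity and read-throttle events for a table from CloudWatch, which is the kind of operational-awareness signal the first recipe covers in much more depth.
+
+```python
+import boto3
+from datetime import datetime, timedelta, timezone
+
+# Illustrative sketch: pull the last hour of DynamoDB health metrics for a
+# placeholder table named "orders" from CloudWatch.
+cloudwatch = boto3.client("cloudwatch")
+end = datetime.now(timezone.utc)
+start = end - timedelta(hours=1)
+
+for metric in ("ConsumedReadCapacityUnits", "ReadThrottleEvents"):
+    resp = cloudwatch.get_metric_statistics(
+        Namespace="AWS/DynamoDB",
+        MetricName=metric,
+        Dimensions=[{"Name": "TableName", "Value": "orders"}],  # table name is a placeholder
+        StartTime=start,
+        EndTime=end,
+        Period=300,          # 5-minute buckets
+        Statistics=["Sum"],
+    )
+    total = sum(dp["Sum"] for dp in resp["Datapoints"])
+    print(f"{metric}: {total} over the last hour")
+```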
+ +Check out the following recipes: + +- [Monitoring Amazon DynamoDB for operational awareness][ddb-opawa] +- [Searching DynamoDB data with Amazon Elasticsearch Service][ddb-aes-ws] +- [DynamoDB Contributor Insights][cwci-oow] + +[ddb-main]: https://aws.amazon.com/dynamodb/ +[ddb-opawa]: https://aws.amazon.com/blogs/database/monitoring-amazon-dynamodb-for-operational-awareness/ +[ddb-aes-ws]: https://search-ddb.aesworkshops.com/ +[cwci-oow]: https://observability.workshop.aws/en/contributorinsights/explore diff --git a/docusaurus/observability-best-practices/docs/recipes/ecs.md b/docusaurus/observability-best-practices/docs/recipes/ecs.md new file mode 100644 index 000000000..c0f30e601 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/ecs.md @@ -0,0 +1,40 @@ +# Amazon Elastic Container Service + +[Amazon Elastic Container Service][ecs-main] (ECS) is a fully managed container +orchestration service that helps you easily deploy, manage, and scale +containerized applications, deeply integrating with the rest of AWS. + +Check out the following recipes, grouped by compute engine: + +## General + +- [Deployment patterns for the AWS Distro for OpenTelemetry Collector with ECS][adot-patterns-ecs] +- [Simplifying Amazon ECS monitoring set up with AWS Distro for OpenTelemetry][ecs-adot-integration] + +## ECS on EC2 + +### Logs + +- [Under the hood: FireLens for Amazon ECS Tasks][firelens-uth] + +### Metrics + +- [Using AWS Distro for OpenTelemetry collector for cross-account metrics collection on Amazon ECS][adot-xaccount-metrics] +- [Metrics collection from ECS using Amazon Managed Service for Prometheus][ecs-amp] +- [Sending Envoy metrics from AWS App Mesh to Amazon CloudWatch][ecs-appmesh-cw] + +## ECS on Fargate + +### Logs + +- [Sample logging architectures for FireLens on Amazon ECS and AWS Fargate using Fluent Bit][firelens-fb] + + +[ecs-main]: https://aws.amazon.com/ecs/ +[adot-patterns-ecs]: https://aws.amazon.com/blogs/opensource/deployment-patterns-for-the-aws-distro-for-opentelemetry-collector-with-amazon-elastic-container-service/ +[firelens-uth]: https://aws.amazon.com/blogs/containers/under-the-hood-firelens-for-amazon-ecs-tasks/ +[adot-xaccount-metrics]: https://aws.amazon.com/blogs/opensource/using-aws-distro-for-opentelemetry-collector-for-cross-account-metrics-collection-on-amazon-ecs/ +[ecs-amp]: https://aws.amazon.com/blogs/opensource/metrics-collection-from-amazon-ecs-using-amazon-managed-service-for-prometheus/ +[firelens-fb]: https://github.com/aws-samples/amazon-ecs-firelens-examples#fluent-bit-examples +[ecs-adot-integration]: https://aws.amazon.com/blogs/opensource/simplifying-amazon-ecs-monitoring-set-up-with-aws-distro-for-opentelemetry/ +[ecs-appmesh-cw]: https://aws.amazon.com/blogs/containers/sending-envoy-metrics-from-aws-app-mesh-to-amazon-cloudwatch/ \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/recipes/eks.md b/docusaurus/observability-best-practices/docs/recipes/eks.md new file mode 100644 index 000000000..afcceb944 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/eks.md @@ -0,0 +1,74 @@ +# Amazon Elastic Kubernetes Service + +[Amazon Elastic Kubernetes Service][eks-main] (EKS) gives you the flexibility to +start, run, and scale Kubernetes applications in the AWS Cloud or on-premises. 
+ +Check out the following recipes, grouped by compute engine: + +## EKS on EC2 + +### Logs + +- [Fluent Bit Integration in CloudWatch Container Insights for EKS][eks-cw-fb] +- [Logging with EFK Stack][eks-ws-efk] +- [Sample logging architectures for Fluent Bit and FluentD on EKS][eks-logging] + +### Metrics + +- [Getting Started with Amazon Managed Service for Prometheus][amp-gettingstarted] +- [Using ADOT in EKS on EC2 to ingest metrics to AMP and visualize in AMG][ec2-eks-metrics-go-adot-ampamg] +- [Configuring Grafana Cloud Agent for Amazon Managed Service for Prometheus][gcwa-amp] +- [Monitoring cluster using Prometheus and Grafana][eks-ws-prom-grafana] +- [Monitoring with Managed Prometheus and Managed Grafana][eks-ws-amp-amg] +- [CloudWatch Container Insights][eks-ws-cw-ci] +- [Set up cross-region metrics collection for AMP workspaces][amp-xregion] +- [Monitoring App Mesh environment on EKS using Amazon Managed Service for Prometheus][eks-am-amp-amg] +- [Monitor Istio on EKS using Amazon Managed Prometheus and Amazon Managed Grafana][eks-istio-monitoring] +- [Proactive autoscaling of Kubernetes workloads with KEDA and Amazon CloudWatch][eks-keda-cloudwatch-scaling] +- [Monitoring Amazon EKS Anywhere using Amazon Managed Service for Prometheus and Amazon Managed Grafana][eks-anywhere-monitoring] + +### Traces + +- [Migrating X-Ray tracing to AWS Distro for OpenTelemetry][eks-otel-xray] +- [Tracing with X-Ray][eks-ws-xray] + +## EKS on Fargate + +### Logs + +- [Fluent Bit for Amazon EKS on AWS Fargate is here][eks-fargate-logging] +- [Sample logging architectures for Fluent Bit and FluentD on EKS][eks-fb-example] + +### Metrics + +- [Using ADOT in EKS on Fargate to ingest metrics to AMP and visualize in AMG][fargate-eks-metrics-go-adot-ampamg] +- [CloudWatch Container Insights][eks-ws-cw-ci] +- [Set up cross-region metrics collection for AMP workspaces][amp-xregion] + +### Traces + +- [Using ADOT in EKS on Fargate with AWS X-Ray][fargate-eks-xray-go-adot-amg] +- [Tracing with X-Ray][eks-ws-xray] + + +[eks-main]: https://aws.amazon.com/eks/ +[eks-cw-fb]: https://aws.amazon.com/blogs/containers/fluent-bit-integration-in-cloudwatch-container-insights-for-eks/ +[eks-ws-efk]: https://www.eksworkshop.com/intermediate/230_logging/ +[eks-logging]: https://github.com/aws-samples/amazon-eks-fluent-logging-examples +[amp-gettingstarted]: https://aws.amazon.com/blogs/mt/getting-started-amazon-managed-service-for-prometheus/ +[ec2-eks-metrics-go-adot-ampamg]: recipes/ec2-eks-metrics-go-adot-ampamg.md +[gcwa-amp]: https://aws.amazon.com/blogs/opensource/configuring-grafana-cloud-agent-for-amazon-managed-service-for-prometheus/ +[eks-ws-prom-grafana]: https://www.eksworkshop.com/intermediate/240_monitoring/ +[eks-ws-amp-amg]: https://www.eksworkshop.com/intermediate/246_monitoring_amp_amg/ +[eks-ws-cw-ci]: https://www.eksworkshop.com/intermediate/250_cloudwatch_container_insights/ +[fargate-eks-metrics-go-adot-ampamg]: recipes/fargate-eks-metrics-go-adot-ampamg.md +[amp-xregion]: https://aws.amazon.com/blogs/opensource/set-up-cross-region-metrics-collection-for-amazon-managed-service-for-prometheus-workspaces/ +[eks-otel-xray]: https://aws.amazon.com/blogs/opensource/migrating-x-ray-tracing-to-aws-distro-for-opentelemetry/ +[eks-ws-xray]: https://www.eksworkshop.com/intermediate/245_x-ray/x-ray/ +[eks-fargate-logging]: https://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/ +[eks-fb-example]: https://github.com/aws-samples/amazon-eks-fluent-logging-examples 
+[eks-am-amp-amg]: recipes/servicemesh-monitoring-ampamg.md +[fargate-eks-xray-go-adot-amg]: recipes/fargate-eks-xray-go-adot-amg.md +[eks-istio-monitoring]: https://aws.amazon.com/blogs/mt/monitor-istio-on-eks-using-amazon-managed-prometheus-and-amazon-managed-grafana/ +[eks-keda-cloudwatch-scaling]: https://aws.amazon.com/blogs/mt/proactive-autoscaling-of-kubernetes-workloads-with-keda-using-metrics-ingested-into-amazon-cloudwatch/ +[eks-anywhere-monitoring]: https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-anywhere-using-amazon-managed-service-for-prometheus-and-amazon-managed-grafana/ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/adot-default-pipeline.png b/docusaurus/observability-best-practices/docs/recipes/images/adot-default-pipeline.png new file mode 100644 index 000000000..18a4076ab Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/adot-default-pipeline.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/adot-metrics-pipeline.png b/docusaurus/observability-best-practices/docs/recipes/images/adot-metrics-pipeline.png new file mode 100644 index 000000000..fe4ceccc6 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/adot-metrics-pipeline.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/alert-config.png b/docusaurus/observability-best-practices/docs/recipes/images/alert-config.png new file mode 100644 index 000000000..b23464775 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/alert-config.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/alert-configuration.png b/docusaurus/observability-best-practices/docs/recipes/images/alert-configuration.png new file mode 100644 index 000000000..467f087e1 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/alert-configuration.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-console-create-workspace-managed-permissions.jpg b/docusaurus/observability-best-practices/docs/recipes/images/amg-console-create-workspace-managed-permissions.jpg new file mode 100644 index 000000000..d304bd06f Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-console-create-workspace-managed-permissions.jpg differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-osm-dashboard.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-osm-dashboard.png new file mode 100644 index 000000000..a88b9d525 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-osm-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-plugin-athena-ds.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-plugin-athena-ds.png new file mode 100644 index 000000000..d2fd65f88 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-plugin-athena-ds.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-plugin-redshift-ds.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-plugin-redshift-ds.png new file mode 100644 index 000000000..c95c4e819 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-plugin-redshift-ds.png differ diff --git 
a/docusaurus/observability-best-practices/docs/recipes/images/amg-prom-ds-with-tf.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-prom-ds-with-tf.png new file mode 100644 index 000000000..52b75355b Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-prom-ds-with-tf.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-prom-sample-app-dashboard.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-prom-sample-app-dashboard.png new file mode 100644 index 000000000..85e8fdc65 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-prom-sample-app-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-redshift-mon-dashboard.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-redshift-mon-dashboard.png new file mode 100644 index 000000000..ef7b1bf67 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-redshift-mon-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/1.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/1.png new file mode 100644 index 000000000..7c6a677ba Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/1.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/10.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/10.png new file mode 100644 index 000000000..04d811b7c Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/10.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/11.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/11.png new file mode 100644 index 000000000..1fad46e82 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/11.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/12.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/12.png new file mode 100644 index 000000000..9cd050507 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/12.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/13.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/13.png new file mode 100644 index 000000000..f436cbcf4 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/13.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/2.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/2.png new file mode 100644 index 000000000..3673106bc Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/2.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/3.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/3.png new file mode 100644 index 000000000..bb10a239a Binary files /dev/null and 
b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/3.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/4.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/4.png new file mode 100644 index 000000000..962f1b04d Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/4.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/5.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/5.png new file mode 100644 index 000000000..631d8b877 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/5.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/6.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/6.png new file mode 100644 index 000000000..3180a6cb8 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/6.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/7.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/7.png new file mode 100644 index 000000000..8febbad35 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/7.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/8.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/8.png new file mode 100644 index 000000000..b4941cdf0 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/8.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/9.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/9.png new file mode 100644 index 000000000..6485684a8 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-saml-google-auth/9.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/amg-vpcfl-dashboard.png b/docusaurus/observability-best-practices/docs/recipes/images/amg-vpcfl-dashboard.png new file mode 100644 index 000000000..62f23e60d Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/amg-vpcfl-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/api-key-creation.png b/docusaurus/observability-best-practices/docs/recipes/images/api-key-creation.png new file mode 100644 index 000000000..a82a9a7cc Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/api-key-creation.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/api-key-result.png b/docusaurus/observability-best-practices/docs/recipes/images/api-key-result.png new file mode 100644 index 000000000..98b79056c Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/api-key-result.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/api-keys-menu-item.png b/docusaurus/observability-best-practices/docs/recipes/images/api-keys-menu-item.png new file mode 100644 index 000000000..f30deacec Binary files /dev/null and 
b/docusaurus/observability-best-practices/docs/recipes/images/api-keys-menu-item.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/azure-dashboard.png b/docusaurus/observability-best-practices/docs/recipes/images/azure-dashboard.png new file mode 100644 index 000000000..bf066d304 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/azure-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/azure-monitor-grafana.png b/docusaurus/observability-best-practices/docs/recipes/images/azure-monitor-grafana.png new file mode 100644 index 000000000..9f285965c Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/azure-monitor-grafana.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/azure-monitor-metrics.png b/docusaurus/observability-best-practices/docs/recipes/images/azure-monitor-metrics.png new file mode 100644 index 000000000..61806d27d Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/azure-monitor-metrics.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/cdk-amp-iam-changes.png b/docusaurus/observability-best-practices/docs/recipes/images/cdk-amp-iam-changes.png new file mode 100644 index 000000000..0b5305ff9 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/cdk-amp-iam-changes.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/cloudwatch-metric-stream-configuration.png b/docusaurus/observability-best-practices/docs/recipes/images/cloudwatch-metric-stream-configuration.png new file mode 100644 index 000000000..9999cc42d Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/cloudwatch-metric-stream-configuration.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/configuring-amp-datasource.png b/docusaurus/observability-best-practices/docs/recipes/images/configuring-amp-datasource.png new file mode 100644 index 000000000..47905122f Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/configuring-amp-datasource.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/datasource-addition.png b/docusaurus/observability-best-practices/docs/recipes/images/datasource-addition.png new file mode 100644 index 000000000..1a47f2eb1 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/datasource-addition.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/datasource.png b/docusaurus/observability-best-practices/docs/recipes/images/datasource.png new file mode 100644 index 000000000..fe554b202 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/datasource.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/downstream-latency.png b/docusaurus/observability-best-practices/docs/recipes/images/downstream-latency.png new file mode 100644 index 000000000..e164cce95 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/downstream-latency.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/ec2-autoscaling-amp-alertmgr/as-ec2-amp-alertmanager3.png b/docusaurus/observability-best-practices/docs/recipes/images/ec2-autoscaling-amp-alertmgr/as-ec2-amp-alertmanager3.png new 
file mode 100644 index 000000000..0250b5092 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/ec2-autoscaling-amp-alertmgr/as-ec2-amp-alertmanager3.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/ec2-autoscaling-amp-alertmgr/as-ec2-amp-alertmanager4.png b/docusaurus/observability-best-practices/docs/recipes/images/ec2-autoscaling-amp-alertmgr/as-ec2-amp-alertmanager4.png new file mode 100644 index 000000000..c4c95c176 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/ec2-autoscaling-amp-alertmgr/as-ec2-amp-alertmanager4.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/ec2-autoscaling-amp-alertmgr/as-ec2-amp-alertmanager5.png b/docusaurus/observability-best-practices/docs/recipes/images/ec2-autoscaling-amp-alertmgr/as-ec2-amp-alertmanager5.png new file mode 100644 index 000000000..6662d4e7d Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/ec2-autoscaling-amp-alertmgr/as-ec2-amp-alertmanager5.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/ec2-vpc-flowlogs-creation.png b/docusaurus/observability-best-practices/docs/recipes/images/ec2-vpc-flowlogs-creation.png new file mode 100644 index 000000000..913a13419 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/ec2-vpc-flowlogs-creation.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/APIKeycreated2.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/APIKeycreated2.png new file mode 100644 index 000000000..05d2294e4 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/APIKeycreated2.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/GrafanaConnectionVerify1.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/GrafanaConnectionVerify1.png new file mode 100644 index 000000000..05c1a8a0a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/GrafanaConnectionVerify1.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/GrafanaConnectionVerify2.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/GrafanaConnectionVerify2.png new file mode 100644 index 000000000..bc4a9d19f Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/GrafanaConnectionVerify2.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/GrafanaConnectionVerify3.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/GrafanaConnectionVerify3.png new file mode 100644 index 000000000..3f559d665 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/GrafanaConnectionVerify3.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/GrafanaConnectionVerify4.png 
b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/GrafanaConnectionVerify4.png new file mode 100644 index 000000000..01b2e158b Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/GrafanaConnectionVerify4.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/JMXMetrics1.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/JMXMetrics1.png new file mode 100644 index 000000000..61e965363 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/JMXMetrics1.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/JMXMetrics2.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/JMXMetrics2.png new file mode 100644 index 000000000..ec8d3fa8c Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/JMXMetrics2.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/JMXMetrics3.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/JMXMetrics3.png new file mode 100644 index 000000000..96f7ad75d Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/JMXMetrics3.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/JavaJMXImage1.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/JavaJMXImage1.png new file mode 100644 index 000000000..c11444d69 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/JavaJMXImage1.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/Terraform3.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/Terraform3.png new file mode 100644 index 000000000..c23509a71 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/Terraform3.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/Terraform4.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/Terraform4.png new file mode 100644 index 000000000..409d4a2ba Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/Terraform4.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/Terraform5.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/Terraform5.png new file mode 100644 index 000000000..d63569276 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/Terraform5.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/TerraformModules1.png 
b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/TerraformModules1.png new file mode 100644 index 000000000..caed4d147 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/TerraformModules1.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/TerraformModules2.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/TerraformModules2.png new file mode 100644 index 000000000..f8265466a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/TerraformModules2.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/Welcom_Amazon_Grafana.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/Welcom_Amazon_Grafana.png new file mode 100644 index 000000000..8325b6757 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/Welcom_Amazon_Grafana.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/addAPIKey3.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/addAPIKey3.png new file mode 100644 index 000000000..8a0d807ae Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/addAPIKey3.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/grafanaAPIKey1.png b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/grafanaAPIKey1.png new file mode 100644 index 000000000..365a65c56 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/eks-observability-accelerator-images/grafanaAPIKey1.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/grafana.png b/docusaurus/observability-best-practices/docs/recipes/images/grafana.png new file mode 100644 index 000000000..80ee5ffed Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/grafana.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/import-dashboard.png b/docusaurus/observability-best-practices/docs/recipes/images/import-dashboard.png new file mode 100644 index 000000000..d23a4696a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/import-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-CPUUtilization-TeamX.png b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-CPUUtilization-TeamX.png new file mode 100644 index 000000000..fe196dc8c Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-CPUUtilization-TeamX.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-CPUUtilization-Webserver-lab.png 
b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-CPUUtilization-Webserver-lab.png new file mode 100644 index 000000000..e07e528b4 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-CPUUtilization-Webserver-lab.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-conf-example.png b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-conf-example.png new file mode 100644 index 000000000..c8b2725c4 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-conf-example.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-create-new-dashboard.png b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-create-new-dashboard.png new file mode 100644 index 000000000..b8c440a00 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-create-new-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-cw-menu.png b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-cw-menu.png new file mode 100644 index 000000000..719177632 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-cw-menu.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-by-tag-name-dashboard.png b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-by-tag-name-dashboard.png new file mode 100644 index 000000000..17d4bbf16 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-by-tag-name-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-by-tag-team-dashboard.png b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-by-tag-team-dashboard.png new file mode 100644 index 000000000..1fd5b752c Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-by-tag-team-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-cpu-utilization-dashboard.png b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-cpu-utilization-dashboard.png new file mode 100644 index 000000000..c0dc92f7e Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-cpu-utilization-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-metrics.png 
b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-metrics.png new file mode 100644 index 000000000..07665aefa Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-metrics.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-teamx-name-tag.png b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-teamx-name-tag.png new file mode 100644 index 000000000..97fc6874a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-teamx-name-tag.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-teamx-tag.png b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-teamx-tag.png new file mode 100644 index 000000000..3be2831b8 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-teamx-tag.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-teamy-tag.png b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-teamy-tag.png new file mode 100644 index 000000000..24be2c41e Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-teamy-tag.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-templates-ec2-by-type.png b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-templates-ec2-by-type.png new file mode 100644 index 000000000..ec335f21f Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-templates-ec2-by-type.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-templates-ec2-tag-team-x-y.png b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-templates-ec2-tag-team-x-y.png new file mode 100644 index 000000000..186d35632 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/metrics-explorer-filter-by-tags/metrics-explorer-templates-ec2-tag-team-x-y.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/monitoring-appmesh-environment.png b/docusaurus/observability-best-practices/docs/recipes/images/monitoring-appmesh-environment.png new file mode 100644 index 000000000..0f1954e9d Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/monitoring-appmesh-environment.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/notification-channels.png b/docusaurus/observability-best-practices/docs/recipes/images/notification-channels.png new file mode 100644 index 000000000..91905a07a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/notification-channels.png differ diff --git 
a/docusaurus/observability-best-practices/docs/recipes/images/o11y-space.png b/docusaurus/observability-best-practices/docs/recipes/images/o11y-space.png new file mode 100644 index 000000000..53b30d6cd Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/o11y-space.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/placeholder-grafana-dashboard.png b/docusaurus/observability-best-practices/docs/recipes/images/placeholder-grafana-dashboard.png new file mode 100644 index 000000000..20d12578a Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/placeholder-grafana-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/search.png b/docusaurus/observability-best-practices/docs/recipes/images/search.png new file mode 100644 index 000000000..333f5e01b Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/search.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/slack-notification.png b/docusaurus/observability-best-practices/docs/recipes/images/slack-notification.png new file mode 100644 index 000000000..b704e6711 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/slack-notification.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/telemetry.png b/docusaurus/observability-best-practices/docs/recipes/images/telemetry.png new file mode 100644 index 000000000..5cbaca44d Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/telemetry.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/workspace-creation.png b/docusaurus/observability-best-practices/docs/recipes/images/workspace-creation.png new file mode 100644 index 000000000..3ad768cb6 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/workspace-creation.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/x-ray-amg-ho11y-dashboard.png b/docusaurus/observability-best-practices/docs/recipes/images/x-ray-amg-ho11y-dashboard.png new file mode 100644 index 000000000..a40001c00 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/x-ray-amg-ho11y-dashboard.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/x-ray-amg-ho11y-explore.png b/docusaurus/observability-best-practices/docs/recipes/images/x-ray-amg-ho11y-explore.png new file mode 100644 index 000000000..60292a5cf Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/x-ray-amg-ho11y-explore.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/images/x-ray-cw-ho11y.png b/docusaurus/observability-best-practices/docs/recipes/images/x-ray-cw-ho11y.png new file mode 100644 index 000000000..604c5f96f Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/images/x-ray-cw-ho11y.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/index.md b/docusaurus/observability-best-practices/docs/recipes/index.md new file mode 100644 index 000000000..5bbe491cc --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/index.md @@ -0,0 +1,116 @@ +# Recipes + +In here you will find curated guidance, how-to's, and links to other resources that help with the application of observability (o11y) to various use cases. 
This includes managed services such as [Amazon Managed Service for Prometheus][amp]
+and [Amazon Managed Grafana][amg] as well as agents, for example [OpenTelemetry][otel]
+and [Fluent Bit][fluentbit]. Content here is not restricted to AWS tools alone, though, and many open source projects are referenced here.
+
+We want to address the needs of both developers and infrastructure folks equally, so many of the recipes "cast a wide net". We encourage you to explore and find the solutions that work best for what you are seeking to accomplish.
+
+:::info
+    The content here is derived from actual customer engagements by our Solutions Architects and Professional Services teams, as well as from feedback from other customers. Everything you will find here has been implemented by our actual customers in their own environments.
+:::
+
+The way we think about the o11y space is as follows: we decompose it into
+[six dimensions][dimensions] which you can then combine to arrive at a specific solution:
+
+| dimension | examples |
+|---------------|--------------|
+| Destinations | [Prometheus][amp] · [Grafana][amg] · [OpenSearch][aes] · [CloudWatch][cw] · [Jaeger][jaeger] |
+| Agents | [ADOT][adot] · [Fluent Bit][fluentbit] · CW agent · X-Ray agent |
+| Languages | [Java][java] · Python · .NET · [JavaScript][nodejs] · Go · Rust |
+| Infra & databases | [RDS][rds] · [DynamoDB][dynamodb] · [MSK][msk] |
+| Compute unit | [Batch][batch] · [ECS][ecs] · [EKS][eks] · [AEB][beans] · [Lambda][lambda] · [AppRunner][apprunner] |
+| Compute engine | [Fargate][fargate] · [EC2][ec2] · [Lightsail][lightsail] |
+
+:::note
+    **Example solution requirement**
+
+    I need a logging solution for a Python app I'm running on EKS on Fargate,
+    with the goal of storing the logs in an S3 bucket for further consumption.
+:::
+
+One stack that would fit this need is the following:
+
+1. *Destination*: An S3 bucket for further consumption of data
+1. *Agent*: Fluent Bit to emit log data from EKS
+1. *Language*: Python
+1. *Infra & DB*: N/A
+1. *Compute unit*: Kubernetes (EKS)
+1. *Compute engine*: Fargate
+
+Not every dimension needs to be specified, and sometimes it's hard to decide where
+to start. Try different paths and compare the pros and cons of certain recipes.
+
+To simplify navigation, we're grouping the six dimensions into the following
+categories:
+
+- **By Compute**: covering compute engines and units
+- **By Infra & Data**: covering infrastructure and databases
+- **By Language**: covering languages
+- **By Destination**: covering telemetry and analytics
+- **Tasks**: covering anomaly detection, alerting, troubleshooting, and more
+
+[Learn more about dimensions …](https://aws-observability.github.io/observability-best-practices/recipes/dimensions/)
+
+## How to use
+
+You can use the top navigation menu to browse to a specific index page,
+starting with a rough selection. For example, `By Compute` -> `EKS` ->
+`Fargate` -> `Logs`.
+
+Alternatively, you can search the site by pressing `/` or the `s` key:
+
+![o11y space](images/search.png)
+
+:::info
+    **License**
+
+    All recipes published on this site are available via the
+    [MIT-0][mit0] license, a modification to the usual MIT license
+    that removes the requirement for attribution.
+:::
+
+## How to contribute
+
+Start a [discussion][discussion] on what you plan to do and we'll take it from there.
+
+## Learn more
+
+The recipes on this site are a collection of good practices.
In addition, there +are a number of places where you can learn more about the status of open source +projects we use as well as about the managed services from the recipes, so +check out: + +- [observability @ aws][o11yataws], a playlist of AWS folks talking about + their projects and services. +- [AWS observability workshops](https://aws-observability.github.io/observability-best-practices/recipes/workshops/), to try out the offerings in a + structured manner. +- The [AWS monitoring and observability][o11yhome] homepage with pointers + to case studies and partners. + +[aes]: aes.md "Amazon Elasticsearch Service" +[adot]: https://aws-otel.github.io/ "AWS Distro for OpenTelemetry" +[amg]: amg.md "Amazon Managed Grafana" +[amp]: amp.md "Amazon Managed Service for Prometheus" +[batch]: https://aws.amazon.com/batch/ "AWS Batch" +[beans]: https://aws.amazon.com/elasticbeanstalk/ "AWS Elastic Beanstalk" +[cw]: cw.md "Amazon CloudWatch" +[dimensions]: dimensions.md +[dynamodb]: dynamodb.md "Amazon DynamoDB" +[ec2]: https://aws.amazon.com/ec2/ "Amazon EC2" +[ecs]: ecs.md "Amazon Elastic Container Service" +[eks]: eks.md "Amazon Elastic Kubernetes Service" +[fargate]: https://aws.amazon.com/fargate/ "AWS Fargate" +[fluentbit]: https://fluentbit.io/ "Fluent Bit" +[jaeger]: https://www.jaegertracing.io/ "Jaeger" +[kafka]: https://kafka.apache.org/ "Apache Kafka" +[apprunner]: apprunner.md "AWS App Runner" +[lambda]: lambda.md "AWS Lambda" +[lightsail]: https://aws.amazon.com/lightsail/ "Amazon Lightsail" +[otel]: https://opentelemetry.io/ "OpenTelemetry" +[java]: java.md +[nodejs]: nodejs.md +[rds]: rds.md "Amazon Relational Database Service" +[msk]: msk.md "Amazon Managed Streaming for Apache Kafka" +[mit0]: https://github.com/aws/mit-0 "MIT-0" +[discussion]: https://github.com/aws-observability/observability-best-practices/discussions "Discussions" +[o11yataws]: https://www.youtube.com/playlist?list=PLaiiCkpc1U7Wy7XwkpfgyOhIf_06IK3U_ "Observability @ AWS YouTube playlist" +[o11yhome]: https://aws.amazon.com/products/management-and-governance/use-cases/monitoring-and-observability/ "AWS Observability home" diff --git a/docusaurus/observability-best-practices/docs/recipes/infra.md b/docusaurus/observability-best-practices/docs/recipes/infra.md new file mode 100644 index 000000000..a4542f8cf --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/infra.md @@ -0,0 +1,38 @@ +# Infrastructure & Databases + +## Networking + +- [Monitor your Application Load Balancers][alb-docs] +- [Monitor your Network Load Balancers][nlb-docs] +- [VPC Flow Logs][vpcfl] +- [VPC Flow logs analysis using Amazon Elasticsearch Service][vpcf-ws] + +## Compute + +- [Amazon EKS control plane logging][eks-cp] +- [AWS Lambda monitoring and observability][lambda-docs] + +## Databases, storage and queues + +- [Amazon Relational Database Service][rds] +- [Amazon DynamoDB][ddb] +- [Amazon Managed Streaming for Apache Kafka][msk] +- [Logging and monitoring in Amazon S3][s3mon] +- [Amazon SQS and AWS X-Ray][sqstrace] + + +## Others + +- [Prometheus exporters][prometheus-exporters] + +[alb-docs]: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-monitoring.html +[nlb-docs]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-monitoring.html +[vpcfl]: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html +[eks-cp]: https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html +[lambda-docs]: 
https://docs.aws.amazon.com/lambda/latest/operatorguide/monitoring-observability.html +[rds]: rds.md +[ddb]: dynamodb.md +[msk]: msk.md +[s3mon]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-incident-response.html +[sqstrace]: https://docs.aws.amazon.com/xray/latest/devguide/xray-services-sqs.html +[prometheus-exporters]: https://prometheus.io/docs/instrumenting/exporters/ diff --git a/docusaurus/observability-best-practices/docs/recipes/java.md b/docusaurus/observability-best-practices/docs/recipes/java.md new file mode 100644 index 000000000..0648e0ca4 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/java.md @@ -0,0 +1,6 @@ +# Java + + +- [StatsD and Java Support in AWS Distro for OpenTelemetry][statsd-adot] + +[statsd-adot]: https://aws.amazon.com/blogs/opensource/aws-distro-for-opentelemetry-adds-statsd-and-java-support/ diff --git a/docusaurus/observability-best-practices/docs/recipes/lambda.md b/docusaurus/observability-best-practices/docs/recipes/lambda.md new file mode 100644 index 000000000..d3e1b4458 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/lambda.md @@ -0,0 +1,27 @@ +# AWS Lambda + +[AWS Lambda][lambda-main] is a serverless compute service that lets you run +code without provisioning or managing servers, creating workload-aware cluster +scaling logic, maintaining event integrations, or managing runtimes. + +Check out the following recipes: + +## Logs + +- [Deploy and Monitor a Serverless Application][aes-ws] + +## Metrics + +- [Introducing CloudWatch Lambda Insights][lambda-cwi] +- [Exporting Cloudwatch Metric Streams via Firehose and AWS Lambda to Amazon Managed Service for Prometheus](recipes/lambda-cw-metrics-go-amp.md) + +## Traces + +- [Auto-instrumenting a Python application with an AWS Distro for OpenTelemetry Lambda layer][lambda-layer-python-xray-adot] +- [Tracing AWS Lambda functions in AWS X-Ray with OpenTelemetry][lambda-xray-adot] + +[lambda-main]: https://aws.amazon.com/lambda/ +[aes-ws]: https://bookstore.aesworkshops.com/ +[lambda-cwi]: https://aws.amazon.com/blogs/mt/introducing-cloudwatch-lambda-insights/ +[lambda-xray-adot]: https://aws.amazon.com/blogs/opensource/tracing-aws-lambda-functions-in-aws-x-ray-with-opentelemetry/ +[lambda-layer-python-xray-adot]: https://aws.amazon.com/blogs/opensource/auto-instrumenting-a-python-application-with-an-aws-distro-for-opentelemetry-lambda-layer/ diff --git a/docusaurus/observability-best-practices/docs/recipes/msk.md b/docusaurus/observability-best-practices/docs/recipes/msk.md new file mode 100644 index 000000000..410252b77 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/msk.md @@ -0,0 +1,14 @@ +# Amazon Managed Streaming for Apache Kafka + +[Amazon Managed Streaming for Apache Kafka][msk-main] (MSK) is a fully managed service that makes it +easy for you to build and run applications that use Apache Kafka to process +streaming data. Amazon MSK continuously monitors cluster health and automatically +replaces unhealthy nodes with no downtime to your application. In addition, +Amazon MSK secures your Apache Kafka cluster by encrypting data at rest. 
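+
+For example, the open monitoring feature covered in the recipe below can also be
+enabled on an existing cluster from the AWS CLI. The following is only a rough
+sketch: the cluster ARN is a placeholder you would replace with your own, and the
+cluster's current version string is looked up first because the update call requires it.
+
+```bash
+# Placeholder: the ARN of your MSK cluster
+CLUSTER_ARN="arn:aws:kafka:eu-west-1:111122223333:cluster/demo-cluster/REPLACE-ME"
+
+# The update call needs the cluster's current version string
+CURRENT_VERSION=$(aws kafka describe-cluster \
+  --cluster-arn "$CLUSTER_ARN" \
+  --query 'ClusterInfo.CurrentVersion' --output text)
+
+# Enable the Prometheus JMX and node exporters on the brokers (open monitoring)
+aws kafka update-monitoring \
+  --cluster-arn "$CLUSTER_ARN" \
+  --current-version "$CURRENT_VERSION" \
+  --open-monitoring '{"Prometheus":{"JmxExporter":{"EnabledInBroker":true},"NodeExporter":{"EnabledInBroker":true}}}'
+```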
+ +Check out the following recipes: + +- [Amazon Managed Streaming for Apache Kafka: Open Monitoring with Prometheus][msk-prom] + +[msk-main]: https://aws.amazon.com/msk/ +[msk-prom]: https://docs.aws.amazon.com/msk/latest/developerguide/open-monitoring.html diff --git a/docusaurus/observability-best-practices/docs/recipes/nodejs.md b/docusaurus/observability-best-practices/docs/recipes/nodejs.md new file mode 100644 index 000000000..0ead7e569 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/nodejs.md @@ -0,0 +1,8 @@ +# Node.js + + +- [NodeJS library to generate embedded CloudWatch metrics][node-cw] + + +[node-cw]: https://catalog.workshops.aws/observability/en-US/aws-native/metrics/emf/clientlibrary + diff --git a/docusaurus/observability-best-practices/docs/recipes/rds.md b/docusaurus/observability-best-practices/docs/recipes/rds.md new file mode 100644 index 000000000..7d1f3a4dd --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/rds.md @@ -0,0 +1,20 @@ +# Amazon Relational Database Service + +[Amazon Relational Database Service][rds-main] (RDS) makes it easy to set up, +operate, and scale a relational database in the cloud. It provides cost-efficient +and resizable capacity while automating time-consuming administration tasks such +as hardware provisioning, database setup, patching and backups. + +Check out the following recipes: + +- [Build proactive database monitoring for RDS with CloudWatch Logs, Lambda, and SNS][rds-cw-sns] +- [Monitor RDS for PostgreSQL and Aurora for PostgreSQL database log errors and set up notifications using CloudWatch][rds-pg-au] +- [Logging and monitoring in Amazon RDS][rds-mon] +- [Performance Insights metrics published to CloudWatch][rds-pi-cw] + +[rds-main]: https://aws.amazon.com/rds/ +[rds-cw-sns]: https://aws.amazon.com/blogs/database/build-proactive-database-monitoring-for-amazon-rds-with-amazon-cloudwatch-logs-aws-lambda-and-amazon-sns/ +[rds-pg-au]: https://aws.amazon.com/blogs/database/monitor-amazon-rds-for-postgresql-and-amazon-aurora-for-postgresql-database-log-errors-and-set-up-notifications-using-amazon-cloudwatch/ +[rds-mon]: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.LoggingAndMonitoring.html +[rds-pi-cw]: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.Cloudwatch.html + diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/README.md b/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/README.md new file mode 100644 index 000000000..689ea00d2 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/README.md @@ -0,0 +1,15 @@ +Organizations have started adopting [Amazon Workspaces](https://docs.aws.amazon.com/workspaces/latest/adminguide/amazon-workspaces.html) as virtual cloud based desktop as a solution (DAAS) to replace their existing traditional desktop solution to shift the cost and effort of maintaining laptops and desktops to a cloud pay-as-you-go model. Organizations using Amazon Workspaces would need support of these managed services to monitor their workspaces environment for Day 2 operations. A cloud based managed open source monitoring solution such as Amazon Managed Service for Prometheus and Amazon Managed Grafana helps IT teams to quickly setup and operate a monitoring solution to save cost. 
Monitoring CPU, memory, network, or disk activity from an Amazon Workspace eliminates guesswork when troubleshooting your Amazon Workspaces environment.
+
+A managed monitoring solution for your Amazon Workspaces environments yields the following organizational benefits:
+
+* Service desk staff can quickly identify and drill down into Amazon Workspace issues that need investigation, without guesswork, by leveraging managed monitoring services such as Amazon Managed Service for Prometheus and Amazon Managed Grafana
+* Service desk staff can investigate Amazon Workspace issues after the event using the historical data in Amazon Managed Service for Prometheus
+* Long calls that waste time questioning business users about Amazon Workspaces issues are eliminated
+
+
+In this recipe, we will set up Amazon Managed Service for Prometheus, Amazon Managed Grafana, and a Prometheus server on Amazon Elastic Compute Cloud (EC2) to provide a monitoring solution for Amazon Workspaces. We will automate the deployment of Prometheus agents on any new Amazon Workspace using Active Directory Group Policy Objects (GPO).
+
+**Solution Architecture**
+
+The following diagram shows the solution for monitoring your Amazon Workspaces environment using AWS native managed services such as Amazon Managed Service for Prometheus and Amazon Managed Grafana. This solution deploys a Prometheus server on an Amazon Elastic Compute Cloud (EC2) instance, which periodically polls the Prometheus agents on your Amazon Workspaces and remote-writes the metrics to Amazon Managed Service for Prometheus. We will use Amazon Managed Grafana to query and visualize metrics from your Amazon Workspaces infrastructure.
+![Screenshot](prometheus.drawio-dotted.drawio.png)
\ No newline at end of file
diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/cleanup.sh b/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/cleanup.sh
new file mode 100644
index 000000000..c3d12bcdb
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/cleanup.sh
@@ -0,0 +1,82 @@
+#!/usr/bin/env bash
+# This script will perform the following:
+# 1. Clean up the EC2
+# 2. Clean up networking
+# 3. Clean up the workspace
+# 4. Clean up the IAM resources
+# 5. Clean up the AMP workspace
+# This script should be run using the command ". ./cleanup.sh" to preserve the environment variables.
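+#
+# Note: the commands below depend on environment variables exported by the companion
+# setup script prometheusmonitor.sh (e.g. WORKSPACES_USER, WORKSPACES_DIRECTORY and
+# AMP_WORKSPACE_ID), so source this script from the same shell session in which the
+# setup script was run, or re-export those variables first.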
+ +printf "Starting Cleanup.\n" + +aws ec2 stop-instances --instance-ids $(aws ec2 describe-instances \ + --filters Name=tag:Name,Values=PROMETHEUSFORWARDERSERVER \ + "Name=instance-state-name,Values=running" --query \ + Reservations[].Instances[].InstanceId --output text) + +sleep 60 + +aws ec2 terminate-instances --instance-ids $(aws ec2 describe-instances \ + --filters Name=tag:Name,Values=PROMETHEUSFORWARDERSERVER --query \ + Reservations[].Instances[].InstanceId --output text) + +sleep 60 + +export PROM_VPCID=$(aws ec2 describe-vpcs --filter \ + Name=tag:Name,Values=PROMETHEUS_VPC --query 'Vpcs[*].VpcId' --output text) + +aws ec2 detach-internet-gateway --vpc-id $PROM_VPCID \ + --internet-gateway-id $(aws ec2 describe-internet-gateways --filter \ + Name=tag:Name,Values=PROMETHEUS_IGW --query \ + 'InternetGateways[*].InternetGatewayId' --output text) + +aws ec2 delete-internet-gateway --internet-gateway-id \ + $(aws ec2 describe-internet-gateways --filter Name=tag:Name,Values=PROMETHEUS_IGW \ + --query 'InternetGateways[*].InternetGatewayId' --output text) + +aws ec2 delete-vpc-peering-connection --vpc-peering-connection-id \ + $(aws ec2 describe-vpc-peering-connections --query \ + "VpcPeeringConnections[*].VpcPeeringConnectionId" --filters \ + "Name=tag:Name,Values=PROMETHEUS_WKSPACE_PEERING" --output text) + +aws ec2 delete-subnet --subnet-id $(aws ec2 describe-subnets --filter \ + Name=tag:Name,Values=PROMETHEUS_PUBSUBNET --query \ + 'Subnets[*].SubnetId' --output text) + +aws ec2 delete-route-table --route-table-id $(aws ec2 describe-route-tables --filter \ + Name=tag:Name,Values=PROMETHEUS_ROUTE --query \ + 'RouteTables[*].RouteTableId' --output text) + +aws ec2 delete-security-group --group-id \ +$(aws ec2 describe-security-groups \ + --group-ids --filter \ + Name=group-name,Values=PROMETHEUS_SERVER_SG --query \ + 'SecurityGroups[*].GroupId' --output text) + +aws ec2 delete-security-group --group-id \ +$(aws ec2 describe-security-groups \ + --group-ids --filter \ + Name=group-name,Values=PROMETHEUS_TO_WORKSPACES_SG --query \ + 'SecurityGroups[*].GroupId' --output text) + +aws ec2 delete-vpc --vpc-id $PROM_VPCID + +aws workspaces terminate-workspaces --terminate-workspace-requests \ + $(aws workspaces describe-workspaces --user-name $WORKSPACES_USER \ + --directory-id $WORKSPACES_DIRECTORY --query 'Workspaces[*].WorkspaceId' \ + --output text) + +aws iam detach-role-policy --role-name PromWrite \ + --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess + +aws iam remove-role-from-instance-profile --instance-profile-name \ + PromWrite --role-name PromWrite + +aws iam delete-instance-profile --instance-profile-name PromWrite + +aws iam delete-role --role-name PromWrite + +aws amp delete-workspace --workspace-id $AMP_WORKSPACE_ID + +printf "Automation Complete!!!\n" + diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/prometheus.drawio-dotted.drawio.png b/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/prometheus.drawio-dotted.drawio.png new file mode 100644 index 000000000..8e2faf7c2 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/prometheus.drawio-dotted.drawio.png differ diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/prometheusmonitor.sh 
b/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/prometheusmonitor.sh new file mode 100644 index 000000000..3d7dd5e18 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/prometheusmonitor.sh @@ -0,0 +1,257 @@ +#!/usr/bin/env bash +# This script will do the following: +# 1. Create networking +# 2. Create AMP +# 3. Create EC2 & SG +# 4. Create IAM +# 5. Create Workspaces +# This script should be run using the command ". ./prometheusmonitor.sh" to preserve the environment variables. + +printf "Starting Automation --- Setting up network.\n" + +PROMETHEUS_CIDR=192.168.100.0/24 +PROMETHEUS_VPCID=$(aws ec2 create-vpc \ + --cidr-block $PROMETHEUS_CIDR \ + --tag-specification ResourceType=vpc,Tags=['{Key=Name,Value=PROMETHEUS_VPC}'] \ + --query Vpc.VpcId \ + --output text) + +PROMETHEUS_PUBSUBNET_CIDR=192.168.100.0/25 +PROMETHEUS_PUBSUBNET_ID=$(aws ec2 create-subnet \ + --vpc-id $PROMETHEUS_VPCID \ + --cidr-block $PROMETHEUS_CIDR \ + --tag-specification ResourceType=subnet,Tags=['{Key=Name,Value=PROMETHEUS_PUBSUBNET}'] \ + --query Subnet.SubnetId --output text) + +PROMETHEUS_IGW_ID=$(aws ec2 create-internet-gateway \ + --tag-specifications ResourceType=internet-gateway,Tags=['{Key=Name,Value=PROMETHEUS_IGW}'] \ + --query InternetGateway.InternetGatewayId \ + --output text) + +aws ec2 attach-internet-gateway \ + --vpc-id $PROMETHEUS_VPCID \ + --internet-gateway-id $PROMETHEUS_IGW_ID + +PROMETHEUS_RT_ID=$(aws ec2 create-route-table \ + --vpc-id $PROMETHEUS_VPCID \ + --tag-specifications ResourceType=route-table,Tags=['{Key=Name,Value=PROMETHEUS_ROUTE}'] \ + --query RouteTable.RouteTableId \ + --output text) + +aws ec2 associate-route-table \ + --route-table-id $PROMETHEUS_RT_ID \ + --subnet-id $PROMETHEUS_PUBSUBNET_ID + +aws ec2 create-route \ + --route-table-id $PROMETHEUS_RT_ID \ + --destination-cidr-block 0.0.0.0/0 \ + --gateway-id $PROMETHEUS_IGW_ID + + aws ec2 modify-subnet-attribute --subnet-id $PROMETHEUS_PUBSUBNET_ID \ + --map-public-ip-on-launch + +PROMETHEUS_WKSPACE_PEER=$(aws ec2 create-vpc-peering-connection --vpc-id \ + "$PROMETHEUS_VPCID" --peer-vpc-id "$WORKSPACES_VPCID" --query \ + VpcPeeringConnection.VpcPeeringConnectionId --output text) + +aws ec2 accept-vpc-peering-connection --vpc-peering-connection-id \ + "$PROMETHEUS_WKSPACE_PEER" + +aws ec2 create-tags --resources "$PROMETHEUS_WKSPACE_PEER" --tags \ + 'Key=Name,Value=PROMETHEUS_WKSPACE_PEERING' + +aws ec2 create-route --route-table-id $(aws ec2 describe-route-tables \ + --filter Name=vpc-id,Values=$WORKSPACES_VPCID Name=association.main,Values=false\ + --query RouteTables[].RouteTableId --output text) --destination-cidr-block \ + $PROMETHEUS_CIDR --vpc-peering-connection-id "$PROMETHEUS_WKSPACE_PEER" + +aws ec2 create-route --route-table-id $(aws ec2 describe-route-tables --filter \ + Name=vpc-id,Values=$PROMETHEUS_VPCID Name=tag:Name,Values=PROMETHEUS_ROUTE \ + --query RouteTables[].RouteTableId --output text) --destination-cidr-block \ + $(aws ec2 describe-vpcs --vpc-ids $WORKSPACES_VPCID \ + --query Vpcs[].CidrBlock --output text) --vpc-peering-connection-id \ + "$PROMETHEUS_WKSPACE_PEER" + +printf "Setting up AMP workspace.\n" + +aws amp create-workspace \ + --alias $AMP_WORKSPACE_NAME \ + --region $AWS_REGION + +AMP_WORKSPACE_ID=$(aws amp list-workspaces \ + --alias $AMP_WORKSPACE_NAME \ + --region=${AWS_REGION} \ + --query 'workspaces[0].[workspaceId]' \ + --output text) + +# Be sure that the status code is ACTIVE with 
the below commands and it takes couple of minutes for status code to become ACTIVE. + +aws amp describe-workspace \ + --workspace-id $AMP_WORKSPACE_ID + +printf "Setting up Prometheus EC2.\n" + +PROMETHEUS_IMAGEID=$(aws ssm get-parameters --names \ + /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2 \ + --query 'Parameters[0].[Value]' --output text) + +aws ec2 create-key-pair \ + --key-name MyKeyPair \ + --output text > MyKeyPair.pem + +# Now, we will create the EC2 which will run the Prometheus Server to send Workspaces metrics to the AWS AMP Service. + +aws ec2 run-instances \ + --image-id $PROMETHEUS_IMAGEID \ + --no-cli-pager \ + --count 1 --instance-type t2.medium \ + --key-name MyKeyPair \ + --subnet-id $PROMETHEUS_PUBSUBNET_ID \ + --security-group-ids $(aws ec2 describe-security-groups \ + --filter Name=vpc-id,Values=$PROMETHEUS_VPCID \ + --query SecurityGroups[].GroupId \ + --output text) \ + --user-data file://workspacesprometheus.txt \ + --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=PROMETHEUSFORWARDERSERVER}]' + +printf "waiting for EC2 to finish configuration. \n" + +sleep 600 + +printf "Setting up security groups.\n" + +aws ec2 create-security-group \ + --group-name PROMETHEUS_TO_WORKSPACES_SG \ + --vpc-id $WORKSPACES_VPCID \ + --description "Security Group for Workspace instances to allow ports 9182" + +aws ec2 authorize-security-group-ingress \ + --group-id $(aws ec2 describe-security-groups --filters \ + Name=vpc-id,Values=String,$WORKSPACES_VPCID \ + Name=group-name,Values=Name,PROMETHEUS_TO_WORKSPACES_SG \ + --query SecurityGroups[].GroupId --output text) \ + --protocol tcp \ + --port 9182 \ + --cidr $(aws ec2 describe-instances --filters Name=tag:Name,Values=PROMETHEUSFORWARDERSERVER \ + --query Reservations[].Instances[].PrivateIpAddress --output text)/32 + +aws ec2 create-security-group \ + --group-name PROMETHEUS_SERVER_SG \ + --vpc-id $PROMETHEUS_VPCID \ + --description "Security Group for Prometheus EC2 instances to allow ports 9090 and 22" + +PROMETHEUS_SERVER_SG_GID=$(aws ec2 describe-security-groups --filters \ + Name=vpc-id,Values=String,$PROMETHEUS_VPCID \ + Name=group-name,Values=Name,PROMETHEUS_SERVER_SG \ + --query SecurityGroups[].GroupId --output text) + +aws ec2 authorize-security-group-ingress \ + --group-id $PROMETHEUS_SERVER_SG_GID \ + --protocol tcp --port 9090 --cidr $(aws ec2 describe-vpcs \ + --vpc-ids $PROMETHEUS_VPCID --query Vpcs[].CidrBlock --output text) + +aws ec2 authorize-security-group-ingress \ + --group-id $PROMETHEUS_SERVER_SG_GID \ + --protocol tcp --port 9090 --cidr $(aws ec2 describe-vpcs \ + --vpc-ids $WORKSPACES_VPCID --query Vpcs[].CidrBlock --output text) + +aws ec2 authorize-security-group-ingress \ + --group-id $PROMETHEUS_SERVER_SG_GID \ + --protocol tcp --port 22 --cidr $(aws ec2 describe-vpcs --vpc-ids \ + $WORKSPACES_VPCID --query Vpcs[].CidrBlock --output text) + +aws ec2 authorize-security-group-ingress \ + --group-id $PROMETHEUS_SERVER_SG_GID \ + --protocol tcp --port 22 --cidr $(aws ec2 describe-vpcs --vpc-ids \ + $PROMETHEUS_VPCID --query Vpcs[].CidrBlock --output text) + + aws ec2 describe-security-groups \ + --filters Name=vpc-id,Values=String,$PROMETHEUS_VPCID \ + Name=group-name,Values=String,PROMETHEUS_SERVER_SG + +WORKSPACES_SG_GID=$(aws ec2 describe-security-groups --filters \ + Name=vpc-id,Values=String,$WORKSPACES_VPCID \ + Name=group-name,Values=Name,PROMETHEUS_TO_WORKSPACES_SG \ + --query SecurityGroups[].GroupId --output text) + +aws workspaces 
modify-workspace-creation-properties \ +--resource-id $(aws workspaces describe-workspace-directories \ + --query Directories[].DirectoryId \ + --output text) \ +--workspace-creation-properties CustomSecurityGroupId="$WORKSPACES_SG_GID" + +PROMETHEUS_INSTANCE_ID=$(aws ec2 describe-instances --query \ + "Reservations[*].Instances[*].InstanceId" --filters \ + "Name=tag:Name,Values=*PROMETHEUSFORWARDERSERVER*" \ + "Name=instance-state-name,Values=running" --output text) + +PROMETHEUS_SG_ID=$(aws ec2 describe-security-groups \ + --filters Name=vpc-id,Values=String,$PROMETHEUS_VPCID Name=group-name,Values=Name,PROMETHEUS_SERVER_SG \ + --query SecurityGroups[].GroupId --output text) + +aws ec2 modify-instance-attribute \ + --instance-id $PROMETHEUS_INSTANCE_ID \ + --groups $PROMETHEUS_SG_ID + +printf "Setting up IAM.\n" + +cat > AmazonPrometheusRemoteWriteTrust.json << EOF +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "Service": "ec2.amazonaws.com" + }, + "Action": "sts:AssumeRole" + } + ] +} +EOF + +aws iam create-role \ + --role-name PromWrite \ + --assume-role-policy-document file://AmazonPrometheusRemoteWriteTrust.json + +aws iam attach-role-policy \ + --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \ + --role-name PromWrite + +#This Role must be attached to an instance profile so Amazon EC2 can use the Role + +aws iam create-instance-profile \ + --instance-profile-name PromWrite + +aws iam add-role-to-instance-profile \ + --instance-profile-name PromWrite \ + --role-name PromWrite + +printf "waiting five minutes for the instance profile creation and assign it to EC2. \n" +sleep 300 + +#Now the the role & instance profile is created successfully, it must be attached to the PrometheusServer on EC2. + +aws ec2 associate-iam-instance-profile \ +--iam-instance-profile Name=PromWrite \ +--instance-id $PROMETHEUS_INSTANCE_ID + +printf "creating workspace.\n" + +cat > create-workspaces.json << EOF + +{ + "Workspaces" : [ + { + "DirectoryId" : "$WORKSPACES_DIRECTORY", + "UserName" : "$WORKSPACES_USER", + "BundleId" : "$WORKSPACES_BUNDLE" + } + ] +} +EOF + +aws workspaces create-workspaces \ + --cli-input-json file://create-workspaces.json + +printf "Automation Complete!!!\n" \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/workspacesprometheus.txt b/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/workspacesprometheus.txt new file mode 100644 index 000000000..fad2d3cf9 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/Workspaces-Monitoring-AMP-AMG/workspacesprometheus.txt @@ -0,0 +1,166 @@ +#!/bin/bash +#user script for prometheus server +#runs as root + + +# The userdata script configures the following elements +# +# * Automatic installation of a Prometheus server +# * A systemd service for Prometheus that is enabled for automatic start on reboot +# * A retention of four hours of Prometheus data on the EC2, since the AMP service retains 150 days +# * User accounts, without logon rights, that run the Prometheus service and Linux agent exporter +# * A starting Prometheus configuration file in /etc/prometheus that needs configuration for your AMP service and DHCP clients +# * An example script that updates Prometheus server with DHCP client addresses every four hours, this will need additional configuration for operational use. 
The example script shows an Active Directory Zone Export into Prometheus format. +# * A backup script for the /etc/prometheus/prometheus.yml configuration file +# * A Linux exporter is installed so the server could be monitored, with additional configuration +# * systemd start of the SSM agent + + + + + +useradd --no-create-home --shell /bin/false prometheus +useradd -m --shell /bin/false node_exporter +mkdir /var/lib/prometheus +mkdir /etc/prometheus +chown prometheus:prometheus /etc/prometheus +chown prometheus:prometheus /var/lib/prometheus +chown prometheus:prometheus /usr/local/bin/prometheus + +cd /home/ec2-user +#server +#ADJUST FOR LATEST VERSION PATH + +wget https://github.com/prometheus/prometheus/releases/download/v2.34.0/prometheus-2.34.0.linux-amd64.tar.gz + +tar -xf prometheus-2.34.0.linux-amd64.tar.gz + +cp prometheus-2.34.0.linux-amd64/prometheus /usr/local/bin/ +cp prometheus-2.34.0.linux-amd64/promtool /usr/local/bin/ +chown prometheus:prometheus /usr/local/bin/prometheus +chown prometheus:prometheus /usr/local/bin/promtool +cp -r prometheus-2.34.0.linux-amd64/consoles /etc/prometheus/ +cp -r prometheus-2.34.0.linux-amd64/console_libraries /etc/prometheus/ +chown -R prometheus:prometheus /etc/prometheus/consoles +chown -R prometheus:prometheus /etc/prometheus/console_libraries + + +#node_exporter +#ADJUST FOR LATEST VERSION PATH + +cd /home/ec2-user +wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz + +tar -xzvf node_exporter-1.3.1.linux-amd64.tar.gz +mv node_exporter-1.3.1.linux-amd64 /home/node_exporter +mv /home/node_exporter/node_exporter-1.3.1.linux-amd64 /home/node_exporter/agent + +chown -R node_exporter:node_exporter /home/node_exporter + + +#Add node_exporter as systemd service +tee -a /etc/systemd/system/node_exporter.service << END +[Unit] +Description=Node Exporter +Wants=network-online.target +After=network-online.target +[Service] +User=node_exporter +ExecStart=/home/node_exporter/agent/node_exporter —web.listen-address=:9182 +[Install] +WantedBy=default.target +END + +systemctl daemon-reload +systemctl enable node_exporter +systemctl start node_exporter + + + +tee -a /etc/systemd/system/prometheus.service << END + +#insert this config: +[Unit] +Description=Prometheus +Wants=network-online.target +After=network-online.target + +[Service] +User=prometheus +Group=prometheus +Type=simple +ExecStart=/usr/local/bin/prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus/ --storage.tsdb.retention.time 4h --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries + +[Install] +WantedBy=multi-user.target +END + + +#Set the service to automatically run and check status + +systemctl daemon-reload +systemctl enable prometheus +systemctl start prometheus + +#create backup folder +mkdir /home/ec2-user/bak +tee -a /home/ec2-user/zone-transfer-to-prometheus.sh << END + +#this script needs to pull DNS from route53, +#Active Directory, etc into Prometheus format + +#cron job is set to run every 4 hours +#customize for route53 DHCP, Active Directory DHCP, or whatever DHCP is used for your target monitoring systems +#backup the /etc/prometheus/prometheus.yml #to /home/ec2-user/bak as part of the script +#!/bin/bash + +#clear any variable for rawexp before we pull in #the DNS zone info +rawexp='' + +#zone transfer from Active Directory to sanitized format that only has A records, strips out others, adds the +#port format 
correctly for prometheus file, puts in quoted hostname in front and back of name +#this example uses a test source Bind domain transfer that is not Route 53 format +#this is commented, modify for your specific DNS data +#rawexp=/usr/bin/dig axfr @dnsserver.name.com zonetransfer.com |/usr/bin/sed "s/.name.com/.name.com:9182', /" |/usr/bin/awk $1 '/[Dd]E/{print}'| /usr/bin/awk '$4 ~ /A/ && !/AAA/ && !/SOA/ && !/AFSDB/ && !/NAPTR/ && !/DHCID/ && !/CNAME/ { printf "\x27" "%s ", $1 }' + +#echo $rawexp +#take off the last comma on the target string, create yaml format of line +newtargets=/usr/bin/echo $rawexp |/usr/bin/sed "s/..$//g" |/usr/bin/sed "s/$/]/" | /usr/bin/sed "s/^/ - targets: [/" + +#sanity check to make sure the targets look good +#/usr/bin/echo $newtargets + +#backup the file before mod with date and time in bak folder +/usr/bin/cp /etc/prometheus/prometheus.yml /home/ec2-user/bak/prometheus.yml-date +%F_%R.bak + +#clear the targets line so appending isn't forced into file +#this is commented, uncomment to start replacing prometheus DHCP targets +#/usr/bin/sed -i "/targets/d" /etc/prometheus/prometheus.yml + +#replace the target line in the prometheus.yml file +#this is commented, uncomment to start replacing prometheus DHCP targets +#/usr/bin/sed -i "/static_configs:/a\ $newtargets" /etc/prometheus/prometheus.yml + + +#new config loaded, restart the service to read the new DHCP addresses +/usr/bin/systemctl restart prometheus +/usr/bin/systemctl status prometheus + +END + + +chown ec2-user.ec2-user /home/ec2-user/zone-transfer-to-prometheus.sh + +chmod 750 /home/ec2-user/zone-transfer-to-prometheus.sh + +#set DHCP input of targets to prometheus.yml is every 4 hours +#crontab <, + lat DECIMAL(9,7), + lon DECIMAL(10,7), + nds ARRAY>, + members ARRAY>, + changeset BIGINT, + timestamp TIMESTAMP, + uid BIGINT, + user STRING, + version BIGINT +) +STORED AS ORCFILE +LOCATION 's3://osm-pds/planet/'; +``` + +Query 2: + +```sql +CREATE EXTERNAL TABLE planet_history ( + id BIGINT, + type STRING, + tags MAP, + lat DECIMAL(9,7), + lon DECIMAL(10,7), + nds ARRAY>, + members ARRAY>, + changeset BIGINT, + timestamp TIMESTAMP, + uid BIGINT, + user STRING, + version BIGINT, + visible BOOLEAN +) +STORED AS ORCFILE +LOCATION 's3://osm-pds/planet-history/'; +``` + +Query 3: + +```sql +CREATE EXTERNAL TABLE changesets ( + id BIGINT, + tags MAP, + created_at TIMESTAMP, + open BOOLEAN, + closed_at TIMESTAMP, + comments_count BIGINT, + min_lat DECIMAL(9,7), + max_lat DECIMAL(9,7), + min_lon DECIMAL(10,7), + max_lon DECIMAL(10,7), + num_changes BIGINT, + uid BIGINT, + user STRING +) +STORED AS ORCFILE +LOCATION 's3://osm-pds/changesets/'; +``` + +#### Load VPC flow logs data + +The second use case is a security-motivated one: analyzing network traffic +using [VPC Flow Logs][vpcflowlogs]. + +First, we need to tell EC2 to generate VPC Flow Logs for us. So, if you have +not done this already, you go ahead now and [create VPC flow logs][createvpcfl] +either on the network interfaces level, subnet level, or VPC level. + +:::note + To improve query performance and minimize the storage footprint, we store + the VPC flow logs in [Parquet][parquet], a columnar storage format + that supports nested data. 
+::: + +For our setup it doesn't matter what option you choose (network interfaces, +subnet, or VPC), as long as you publish them to an S3 bucket in Parquet format +as shown below: + +![Screen shot of the EC2 console "Create flow log" panel](../images/ec2-vpc-flowlogs-creation.png) + +Now, again via the [Athena console][athena-console], create the table for the +VPC flow logs data in the same database you imported the OSM data, or create a new one, +if you prefer to do so. + +Use the following SQL query and make sure that you're replacing +`VPC_FLOW_LOGS_LOCATION_IN_S3` with your own bucket/folder: + + +```sql +CREATE EXTERNAL TABLE vpclogs ( + `version` int, + `account_id` string, + `interface_id` string, + `srcaddr` string, + `dstaddr` string, + `srcport` int, + `dstport` int, + `protocol` bigint, + `packets` bigint, + `bytes` bigint, + `start` bigint, + `end` bigint, + `action` string, + `log_status` string, + `vpc_id` string, + `subnet_id` string, + `instance_id` string, + `tcp_flags` int, + `type` string, + `pkt_srcaddr` string, + `pkt_dstaddr` string, + `region` string, + `az_id` string, + `sublocation_type` string, + `sublocation_id` string, + `pkt_src_aws_service` string, + `pkt_dst_aws_service` string, + `flow_direction` string, + `traffic_path` int +) +STORED AS PARQUET +LOCATION 'VPC_FLOW_LOGS_LOCATION_IN_S3' +``` + +For example, `VPC_FLOW_LOGS_LOCATION_IN_S3` could look something like the +following if you're using the S3 bucket `allmyflowlogs`: + +``` +s3://allmyflowlogs/AWSLogs/12345678901/vpcflowlogs/eu-west-1/2021/ +``` + +Now that the datasets are available in Athena, let's move on to Grafana. + +### Set up Grafana + +We need a Grafana instance, so go ahead and set up a new [Amazon Managed Grafana +workspace][amg-workspace], for example by using the [Getting Started][amg-getting-started] guide, +or use an existing one. + +:::warning + To use AWS data source configuration, first go to the Amazon Managed Grafana + console to enable service-mananged IAM roles that grant the workspace the + IAM policies necessary to read the Athena resources. + Further, note the following: + + 1. The Athena workgroup you plan to use needs to be tagged with the key + `GrafanaDataSource` and value `true` for the service managed permissions + to be permitted to use the workgroup. + 1. The service-managed IAM policy only grants access to query result buckets + that start with `grafana-athena-query-results-`, so for any other bucket + you MUST add permissions manually. + 1. You have to add the `s3:Get*` and `s3:List*` permissions for the underlying data source + being queried manually. +::: + + + + +To set up the Athena data source, use the left-hand toolbar and choose the +lower AWS icon and then choose "Athena". Select your default region you want +the plugin to discover the Athena data source to use, and then select the +accounts that you want, and finally choose "Add data source". + +Alternatively, you can manually add and configure the Athena data source by +following these steps: + +1. Click on the "Configurations" icon on the left-hand toolbar and then on "Add data source". +1. Search for "Athena". +1. [OPTIONAL] Configure the authentication provider (recommended: workspace IAM + role). +1. Select your targeted Athena data source, database, and workgroup. +1. If your workgroup doesn't have an output location configured already, + specify the S3 bucket and folder to use for query results. 
Note that the + bucket has to start with `grafana-athena-query-results-` if you want to + benefit from the service-managed policy. +1. Click "Save & test". + +You should see something like the following: + +![Screen shot of the Athena data source config](../images/amg-plugin-athena-ds.png) + + + + +## Usage + +And now let's look at how to use our Athena datasets from Grafana. + +### Use geographical data + +The [OpenStreetMap][osm] (OSM) data in Athena can answer a number of questions, +such as "where are certain amenities". Let's see that in action. + +For example, a SQL query against the OSM dataset to list places that offer food +in the Las Vegas region is as follows: + +```sql +SELECT +tags['amenity'] AS amenity, +tags['name'] AS name, +tags['website'] AS website, +lat, lon +FROM planet +WHERE type = 'node' + AND tags['amenity'] IN ('bar', 'pub', 'fast_food', 'restaurant') + AND lon BETWEEN -115.5 AND -114.5 + AND lat BETWEEN 36.1 AND 36.3 +LIMIT 500; +``` + +:::info + The Las Vegas region in above query is defined as everything with a latitude + between `36.1` and `36.3` as well as a longitude between `-115.5` and `-114.5`. + You could turn that into a set of variables (one for each corner) and make + the Geomap plugin adaptable to other regions. +::: +To visualize the OSM data using above query, you can import an example dashboard, +available via [osm-sample-dashboard.json](./amg-athena-plugin/osm-sample-dashboard.json) +that looks as follows: + +![Screen shot of the OSM dashboard in AMG](../images/amg-osm-dashboard.png) + +:::note + In above screen shot we use the Geomap visualization (in the left panel) to + plot the data points. +::: +### Use VPC flow logs data + +To analyze the VPC flow log data, detecting SSH and RDP traffic, use the +following SQL queries. + +Getting a tabular overview on SSH/RDP traffic: + +```sql +SELECT +srcaddr, dstaddr, account_id, action, protocol, bytes, log_status +FROM vpclogs +WHERE +srcport in (22, 3389) +OR +dstport IN (22, 3389) +ORDER BY start ASC; +``` + +Getting a time series view on bytes accepted and rejected: + +```sql +SELECT +from_unixtime(start), sum(bytes), action +FROM vpclogs +WHERE +srcport in (22,3389) +OR +dstport IN (22, 3389) +GROUP BY start, action +ORDER BY start ASC; +``` + +:::tip + If you want to limit the amount of data queried in Athena, consider using + the `$__timeFilter` macro. +::: + +To visualize the VPC flow log data, you can import an example dashboard, +available via [vpcfl-sample-dashboard.json](./amg-athena-plugin/vpcfl-sample-dashboard.json) +that looks as follows: + +![Screen shot of the VPC flow logs dashboard in AMG](../images/amg-vpcfl-dashboard.png) + +From here, you can use the following guides to create your own dashboard in +Amazon Managed Grafana: + +* [User Guide: Dashboards](https://docs.aws.amazon.com/grafana/latest/userguide/dashboard-overview.html) +* [Best practices for creating dashboards](https://grafana.com/docs/grafana/latest/best-practices/best-practices-for-creating-dashboards/) + +That's it, congratulations you've learned how to use Athena from Grafana! + +## Cleanup + +Remove the OSM data from the Athena database you've been using and then +the Amazon Managed Grafana workspace by removing it from the console. 
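+
+If you prefer to script the cleanup, a minimal sketch with the AWS CLI could look
+like the following. It assumes the `sampledb` database used in the example dashboards,
+the default `primary` Athena workgroup, and a placeholder Grafana workspace ID; adjust
+all three to your environment:
+
+```bash
+# Drop the example tables created in this recipe
+for table in planet planet_history changesets vpclogs; do
+  aws athena start-query-execution \
+    --work-group primary \
+    --query-execution-context Database=sampledb \
+    --query-string "DROP TABLE IF EXISTS ${table};"
+done
+
+# Delete the Amazon Managed Grafana workspace (placeholder ID)
+aws grafana delete-workspace --workspace-id g-abcd1234ef
+```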
+ +[athena]: https://aws.amazon.com/athena/ +[amg]: https://aws.amazon.com/grafana/ +[athena-ds]: https://grafana.com/grafana/plugins/grafana-athena-datasource/ +[aws-cli]: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html +[aws-cli-conf]: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html +[amg-getting-started]: https://aws.amazon.com/blogs/mt/amazon-managed-grafana-getting-started/ +[awsod]: https://registry.opendata.aws/ +[osm]: https://aws.amazon.com/blogs/big-data/querying-openstreetmap-with-amazon-athena/ +[vpcflowlogs]: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html +[createvpcfl]: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-s3.html#flow-logs-s3-create-flow-log +[athena-console]: https://console.aws.amazon.com/athena/ +[amg-workspace]: https://console.aws.amazon.com/grafana/home#/workspaces +[parquet]: https://github.com/apache/parquet-format diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/amg-athena-plugin/osm-sample-dashboard.json b/docusaurus/observability-best-practices/docs/recipes/recipes/amg-athena-plugin/osm-sample-dashboard.json new file mode 100644 index 000000000..91f564208 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/amg-athena-plugin/osm-sample-dashboard.json @@ -0,0 +1,256 @@ +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": "-- Grafana --", + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "target": { + "limit": 100, + "matchAny": false, + "tags": [], + "type": "dashboard" + }, + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "gnetId": null, + "graphTooltip": 0, + "id": 2, + "links": [ + { + "asDropdown": false, + "icon": "external link", + "includeVars": false, + "keepTime": false, + "tags": [], + "targetBlank": false, + "title": "Source", + "tooltip": "", + "type": "link", + "url": "https://aws.amazon.com/blogs/big-data/querying-openstreetmap-with-amazon-athena/" + } + ], + "liveNow": false, + "panels": [ + { + "datasource": null, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 42, + "w": 9, + "x": 0, + "y": 0 + }, + "id": 2, + "options": { + "basemap": { + "config": {}, + "type": "default" + }, + "controls": { + "mouseWheelZoom": true, + "showAttribution": true, + "showDebug": false, + "showScale": false, + "showZoom": true + }, + "layers": [ + { + "config": { + "color": { + "fixed": "red" + }, + "fillOpacity": 0.4, + "shape": "circle", + "showLegend": true, + "size": { + "fixed": 5, + "max": 15, + "min": 2 + } + }, + "location": { + "mode": "auto" + }, + "type": "markers" + } + ], + "view": { + "id": "coords", + "lat": 36.186461, + "lon": -115.223865, + "zoom": 10.5 + } + }, + "targets": [ + { + "connectionArgs": { + "catalog": "__default", + "database": "sampledb", + "region": "__default" + }, + "format": 1, + "rawSQL": "SELECT tags['amenity'] as amenity, tags['name'] as name, tags['website'] as website, lat, lon from planet\nWHERE type = 'node'\n AND tags['amenity'] IN ('bar', 'pub', 'fast_food', 'restaurant')\n AND lon BETWEEN -115.5 AND -114.5\n AND lat BETWEEN 36.1 AND 36.3\nLIMIT 500;", + "refId": "A", + "table": "planet" + } + ], + "title": "OpenStreetMap", + "type": 
"geomap" + }, + { + "datasource": null, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "displayMode": "auto", + "filterable": true + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "address" + }, + "properties": [ + { + "id": "custom.width", + "value": 484 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "tags" + }, + "properties": [ + { + "id": "custom.width", + "value": 1076 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "name" + }, + "properties": [ + { + "id": "custom.width", + "value": 257 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "amenity" + }, + "properties": [ + { + "id": "custom.width", + "value": 178 + } + ] + } + ] + }, + "gridPos": { + "h": 42, + "w": 11, + "x": 9, + "y": 0 + }, + "id": 3, + "options": { + "showHeader": true, + "sortBy": [ + { + "desc": true, + "displayName": "website" + } + ] + }, + "pluginVersion": "8.2.2", + "targets": [ + { + "connectionArgs": { + "catalog": "__default", + "database": "sampledb", + "region": "__default" + }, + "format": 1, + "rawSQL": "SELECT tags['amenity'] as amenity, tags['name'] as name, concat(tags['addr:housenumber'], ', ', tags['addr:street']) as address, tags['website'] as website from planet\nWHERE type = 'node'\n AND tags['amenity'] IN ('bar', 'pub', 'fast_food', 'restaurant')\n AND lon BETWEEN -115.5 AND -114.5\n AND lat BETWEEN 36.1 AND 36.3\nLIMIT 500;\n", + "refId": "A", + "table": "planet" + } + ], + "title": "OpenStreetMap", + "type": "table" + } + ], + "schemaVersion": 31, + "style": "dark", + "tags": [], + "templating": { + "list": [] + }, + "time": { + "from": "now-1y", + "to": "now" + }, + "timepicker": {}, + "timezone": "", + "title": "Athena: OpenStreetMap about Las Vegas", + "uid": "Tja0ElF7k", + "version": 14 +} diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/amg-athena-plugin/vpcfl-sample-dashboard.json b/docusaurus/observability-best-practices/docs/recipes/recipes/amg-athena-plugin/vpcfl-sample-dashboard.json new file mode 100644 index 000000000..911a7cb13 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/amg-athena-plugin/vpcfl-sample-dashboard.json @@ -0,0 +1,562 @@ +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": "-- Grafana --", + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "target": { + "limit": 100, + "matchAny": false, + "tags": [], + "type": "dashboard" + }, + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "gnetId": null, + "graphTooltip": 0, + "id": 6, + "links": [], + "liveNow": false, + "panels": [ + { + "datasource": null, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "displayMode": "auto" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "action" + }, + "properties": [ + { + "id": "custom.width", + "value": 113 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "protocol" + }, + "properties": [ + { + "id": "custom.width", + "value": 89 + } + ] + }, + { + "matcher": 
{ + "id": "byName", + "options": "bytes" + }, + "properties": [ + { + "id": "custom.width", + "value": 99 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "srcaddr" + }, + "properties": [ + { + "id": "custom.width", + "value": 182 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "dstaddr" + }, + "properties": [ + { + "id": "custom.width", + "value": 157 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "account_id" + }, + "properties": [ + { + "id": "custom.width", + "value": 173 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "log_status" + }, + "properties": [ + { + "id": "custom.width", + "value": 106 + } + ] + } + ] + }, + "gridPos": { + "h": 20, + "w": 7, + "x": 0, + "y": 0 + }, + "id": 2, + "options": { + "showHeader": true, + "sortBy": [] + }, + "pluginVersion": "8.2.5", + "targets": [ + { + "connectionArgs": { + "catalog": "__default", + "database": "sampledb", + "region": "__default" + }, + "format": 1, + "hide": false, + "rawSQL": "SELECT\nsrcaddr, dstaddr, account_id, action, protocol, bytes, log_status\nFROM vpclogs\nWHERE\nsrcport in (22,3389) \nOR\ndstport IN (22, 3389)\nORDER BY start ASC;\n", + "refId": "A" + } + ], + "title": "SSH and RDP traffic", + "type": "table" + }, + { + "datasource": null, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "bytes REJECT" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "semi-dark-red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "_col1 REJECT" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "red", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 20, + "w": 17, + "x": 7, + "y": 0 + }, + "id": 3, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "single" + } + }, + "pluginVersion": "8.2.2", + "targets": [ + { + "connectionArgs": { + "catalog": "__default", + "database": "sampledb", + "region": "__default" + }, + "format": 0, + "hide": false, + "rawSQL": "SELECT\nfrom_unixtime(start), sum(bytes), action\nFROM vpclogs\nWHERE\nsrcport in (22,3389)\nOR\ndstport IN (22, 3389)\nGROUP BY start, action\nORDER BY start ASC;", + "refId": "A" + } + ], + "title": "SSH and RDP traffic bytes", + "type": "timeseries" + }, + { + "datasource": null, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "displayMode": "auto" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "action" + }, + "properties": [ + { + "id": "custom.width", + 
"value": 113 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "protocol" + }, + "properties": [ + { + "id": "custom.width", + "value": 89 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "bytes" + }, + "properties": [ + { + "id": "custom.width", + "value": 99 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "srcaddr" + }, + "properties": [ + { + "id": "custom.width", + "value": 182 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "dstaddr" + }, + "properties": [ + { + "id": "custom.width", + "value": 157 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "account_id" + }, + "properties": [ + { + "id": "custom.width", + "value": 173 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "log_status" + }, + "properties": [ + { + "id": "custom.width", + "value": 106 + } + ] + } + ] + }, + "gridPos": { + "h": 20, + "w": 7, + "x": 0, + "y": 20 + }, + "id": 4, + "options": { + "showHeader": true, + "sortBy": [] + }, + "pluginVersion": "8.2.5", + "targets": [ + { + "connectionArgs": { + "catalog": "__default", + "database": "sampledb", + "region": "__default" + }, + "format": 1, + "hide": false, + "rawSQL": "SELECT\nsrcaddr, dstaddr, account_id, action, protocol, bytes, log_status\nFROM vpclogs\nWHERE\nsrcport = 53 \nOR\ndstport = 53\nORDER BY start ASC;\n", + "refId": "A" + } + ], + "title": "DNS traffic", + "type": "table" + }, + { + "datasource": null, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "bytes REJECT" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "semi-dark-red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "_col1 REJECT" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-red", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 20, + "w": 17, + "x": 7, + "y": 20 + }, + "id": 5, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "single" + } + }, + "pluginVersion": "8.2.2", + "targets": [ + { + "connectionArgs": { + "catalog": "__default", + "database": "sampledb", + "region": "__default" + }, + "format": 0, + "hide": false, + "rawSQL": "SELECT\nfrom_unixtime(start), sum(bytes), action\nFROM vpclogs\nWHERE\nsrcport = 53\nOR\ndstport = 53\nGROUP BY start, action\nORDER BY start ASC;\n", + "refId": "A" + } + ], + "title": "DNS traffic bytes", + "type": "timeseries" + } + ], + "refresh": false, + "schemaVersion": 32, + "style": "dark", + "tags": [], + "templating": { + "list": [] + }, + "time": { + "from": "now-6h", + "to": "now" + }, + "timepicker": {}, + "timezone": "", + "title": "Amazon VPC Flow Logs", + "uid": "k6E7fq5nz", + "version": 14 +} diff --git 
a/docusaurus/observability-best-practices/docs/recipes/recipes/amg-automation-tf.md b/docusaurus/observability-best-practices/docs/recipes/recipes/amg-automation-tf.md new file mode 100644 index 000000000..aa1566576 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/amg-automation-tf.md @@ -0,0 +1,279 @@ +# Using Terraform for Amazon Managed Grafana automation + +In this recipe we show you how use Terraform to automate Amazon Managed Grafana, +for example to add datasources or dashboards consistently across a number of workspaces. + +:::note + This guide will take approximately 30 minutes to complete. +::: +## Prerequisites + +* The [AWS command line][aws-cli] is installed and [configured][aws-cli-conf] in your local environment. +* You have the [Terraform][tf] command line installed in your local environment. +* You have an Amazon Managed Service for Prometheus workspace ready to use. +* You have an Amazon Managed Grafana workspace ready to use. + +## Set up Amazon Managed Grafana + +In order for Terraform to [authenticate][grafana-authn] against Grafana, we are +using an API Key, which acts as a kind of password. + +:::info + The API key is an [RFC 6750][rfc6750] HTTP Bearer header + with a 51 character long alpha-numeric value authenticating the caller with + every request against the Grafana API. +::: + +So, before we can set up the Terraform manifest, we first need to create an +API key. You do this via the Grafana UI as follows. + +First, select from the left-hand side menu in the `Configuration` section +the `API keys` menu item: + +![Configuration, API keys menu item](../images/api-keys-menu-item.png) + +Now create a new API key, give it a name that makes sense for your task at +hand, assign it `Admin` role and set the duration time to, for example, one day: + +![API key creation](../images/api-key-creation.png) + +:::note + The API key is valid for a limited time, in AMG you can use values up to 30 days. +::: +Once you hit the `Add` button you should see a pop-up dialog that contains the +API key: + +![API key result](../images/api-key-result.png) + +:::warning + This is the only time you will see the API key, so store it from here + in a safe place, we will need it in the Terraform manifest later. +::: +With this we've set up everything we need in Amazon Managed Grafana in order to +use Terraform for automation, so let's move on to this step. + +## Automation with Terraform + +### Preparing Terraform + +For Terraform to be able to interact with Grafana, we're using the official +[Grafana provider][tf-grafana-provider] in version 1.13.3 or above. + +In the following, we want to automate the creation of a data source, in our +case we want to add a Prometheus [data source][tf-ds], to be exact, an +AMP workspace. + +First, create a file called `main.tf` with the following content: + +``` +terraform { + required_providers { + grafana = { + source = "grafana/grafana" + version = ">= 1.13.3" + } + } +} + +provider "grafana" { + url = "INSERT YOUR GRAFANA WORKSPACE URL HERE" + auth = "INSERT YOUR API KEY HERE" +} + +resource "grafana_data_source" "prometheus" { + type = "prometheus" + name = "amp" + is_default = true + url = "INSERT YOUR AMP WORKSPACE URL HERE " + json_data { + http_method = "POST" + sigv4_auth = true + sigv4_auth_type = "workspace-iam-role" + sigv4_region = "eu-west-1" + } +} +``` +In above file you need to insert three values that depend on your environment. 
+ +In the Grafana provider section: + +* `url` … the Grafana workspace URL which looks something like the following: + `https://xxxxxxxx.grafana-workspace.eu-west-1.amazonaws.com`. +* `auth` … the API key you have created in the previous step. + +In the Prometheus resource section, insert the `url` which is the AMP +workspace URL in the form of +`https://aps-workspaces.eu-west-1.amazonaws.com/workspaces/ws-xxxxxxxxx`. + +:::note + If you're using Amazon Managed Grafana in a different region than the one + shown in the file, you will have to, in addition to above, also set the + `sigv4_region` to your region. +::: +To wrap up the preparation phase, let's now initialize Terraform: + +``` +$ terraform init +Initializing the backend... + +Initializing provider plugins... +- Finding grafana/grafana versions matching ">= 1.13.3"... +- Installing grafana/grafana v1.13.3... +- Installed grafana/grafana v1.13.3 (signed by a HashiCorp partner, key ID 570AA42029AE241A) + +Partner and community providers are signed by their developers. +If you'd like to know more about provider signing, you can read about it here: +https://www.terraform.io/docs/cli/plugins/signing.html + +Terraform has created a lock file .terraform.lock.hcl to record the provider +selections it made above. Include this file in your version control repository +so that Terraform can guarantee to make the same selections by default when +you run "terraform init" in the future. + +Terraform has been successfully initialized! + +You may now begin working with Terraform. Try running "terraform plan" to see +any changes that are required for your infrastructure. All Terraform commands +should now work. + +If you ever set or change modules or backend configuration for Terraform, +rerun this command to reinitialize your working directory. If you forget, other +commands will detect it and remind you to do so if necessary. +``` + +With that, we're all set and can use Terraform to automate the data source +creation as explained in the following. + +### Using Terraform + +Usually, you would first have a look what Terraform's plan is, like so: + +``` +$ terraform plan + +Terraform used the selected providers to generate the following execution plan. +Resource actions are indicated with the following symbols: + + create + +Terraform will perform the following actions: + + # grafana_data_source.prometheus will be created + + resource "grafana_data_source" "prometheus" { + + access_mode = "proxy" + + basic_auth_enabled = false + + id = (known after apply) + + is_default = true + + name = "amp" + + type = "prometheus" + + url = "https://aps-workspaces.eu-west-1.amazonaws.com/workspaces/ws-xxxxxx/" + + + json_data { + + http_method = "POST" + + sigv4_auth = true + + sigv4_auth_type = "workspace-iam-role" + + sigv4_region = "eu-west-1" + } + } + +Plan: 1 to add, 0 to change, 0 to destroy. + +─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── + +Note: You didn't use the -out option to save this plan, so Terraform can't guarantee to take exactly these actions if you run "terraform apply" now. + +``` + +If you're happy with what you see there, you can apply the plan: + +``` +$ terraform apply + +Terraform used the selected providers to generate the following execution plan. 
+Resource actions are indicated with the following symbols: + + create + +Terraform will perform the following actions: + + # grafana_data_source.prometheus will be created + + resource "grafana_data_source" "prometheus" { + + access_mode = "proxy" + + basic_auth_enabled = false + + id = (known after apply) + + is_default = true + + name = "amp" + + type = "prometheus" + + url = "https://aps-workspaces.eu-west-1.amazonaws.com/workspaces/ws-xxxxxxxxx/" + + + json_data { + + http_method = "POST" + + sigv4_auth = true + + sigv4_auth_type = "workspace-iam-role" + + sigv4_region = "eu-west-1" + } + } + +Plan: 1 to add, 0 to change, 0 to destroy. + +Do you want to perform these actions? + Terraform will perform the actions described above. + Only 'yes' will be accepted to approve. + + Enter a value: yes + +grafana_data_source.prometheus: Creating... +grafana_data_source.prometheus: Creation complete after 1s [id=10] + +Apply complete! Resources: 1 added, 0 changed, 0 destroyed. + +``` + +When you now go to the data source list in Grafana you should see something +like the following: + +![AMP as data source in AMG](../images/amg-prom-ds-with-tf.png) + +To verify if your newly created data source works, you can hit the blue `Save & +test` button at the bottom and you should see a `Data source is working` +confirmation message as a result here. + +You can use Terraform also to automate other things, for example, the [Grafana +provider][tf-grafana-provider] supports managing folders and dashboards. + +Let's say you want to create a folder to organize your dashboards, for example: + +``` +resource "grafana_folder" "examplefolder" { + title = "devops" +} +``` + +Further, say you have a dashboard called `example-dashboard.json`, and you want +to create it in the folder from above, then you would use the following snippet: + +``` +resource "grafana_dashboard" "exampledashboard" { + folder = grafana_folder.examplefolder.id + config_json = file("example-dashboard.json") +} +``` + +Terraform is a powerful tool for automation and you can use it as shown here +to manage your Grafana resources. + +:::note + Keep in mind, though, that the [state in Terraform][tf-state] is, by default, + managed locally. This means, if you plan to collaboratively work with Terraform, + you need to pick one of the options available that allow you to share the state across a team. +::: +## Cleanup + +Remove the Amazon Managed Grafana workspace by removing it from the console. 
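+
+If you also created the data source, folder, or dashboard with Terraform as shown above, you may
+want to run `terraform destroy` before removing the workspace; it tears down everything tracked in
+your local state and asks for confirmation first. A minimal sketch, run from the directory that
+contains `main.tf`:
+
+```
+# removes all resources tracked in this Terraform state (review the plan it prints before confirming)
+terraform destroy
+```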
+ +[aws-cli]: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html +[aws-cli-conf]: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html +[tf]: https://www.terraform.io/downloads.html +[grafana-authn]: https://grafana.com/docs/grafana/latest/http_api/auth/ +[rfc6750]: https://datatracker.ietf.org/doc/html/rfc6750 +[tf-grafana-provider]: https://registry.terraform.io/providers/grafana/grafana/latest/docs +[tf-ds]: https://registry.terraform.io/providers/grafana/grafana/latest/docs/resources/data_source +[tf-state]: https://www.terraform.io/docs/language/state/remote.html diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/amg-google-auth-saml.md b/docusaurus/observability-best-practices/docs/recipes/recipes/amg-google-auth-saml.md new file mode 100644 index 000000000..60a03f081 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/amg-google-auth-saml.md @@ -0,0 +1,91 @@ +# Configure Google Workspaces authentication with Amazon Managed Grafana using SAML + +In this guide, we will walk through how you can setup Google Workspaces as an +identity provider (IdP) for Amazon Managed Grafana using SAML v2.0 protocol. + +In order to follow this guide you need to create a paid [Google Workspaces][google-workspaces] +account in addition to having an [Amazon Managed Grafana workspace][amg-ws] created. + +### Create Amazon Managed Grafana workspace + +Log into the Amazon Managed Grafana console and click **Create workspace.** In the following screen, +provide a workspace name as shown below. Then click **Next**: + +![Create Workspace - Specify workspace details](../images/amg-saml-google-auth/1.png) + +In the **Configure settings** page, select **Security Assertion Markup Language (SAML)** +option so you can configure a SAML based Identity Provider for users to log in: + +![Create Workspace - Configure settings](../images/amg-saml-google-auth/2.png) + +Select the data sources you want to choose and click **Next**: +![Create Workspace - Permission settings](../images/amg-saml-google-auth/3.png) + +Click on **Create workspace** button in the **Review and create** screen: +![Create Workspace - Review settings](../images/amg-saml-google-auth/4.png) + +This will create a new Amazon Managed Grafana workspace as shown below: + +![Create Workspace - Create AMG workspace](../images/amg-saml-google-auth/5.png) + +### Configure Google Workspaces + +Login to Google Workspaces with Super Admin permissions and go +to **Web and mobile apps** under **Apps** section. There, click on **Add App** +and select **Add custom SAML app.** Now give the app a name as shown below. +Click **CONTINUE.**: + +![Google Workspace - Add custom SAML app - App details](../images/amg-saml-google-auth/6.png) + + +On the next screen, click on **DOWNLOAD METADATA** button to download the SAML metadata file. Click **CONTINUE.** + +![Google Workspace - Add custom SAML app - Download Metadata](../images/amg-saml-google-auth/7.png) + +On the next screen, you will see the ACS URL, Entity ID and Start URL fields. +You can get the values for these fields from the Amazon Managed Grafana console. + +Select **EMAIL** from the drop down in the **Name ID format** field and select **Basic Information > Primary email** in the **Name ID** field. 
+ +Click **CONTINUE.** +![Google Workspace - Add custom SAML app - Service provider details](../images/amg-saml-google-auth/8.png) + +![AMG - SAML Configuration details](../images/amg-saml-google-auth/9.png) + +In the **Attribute mapping** screen, make the mapping between **Google Directory attributes** and **App attributes** as shown in the screenshot below + +![Google Workspace - Add custom SAML app - Attribute mapping](../images/amg-saml-google-auth/10.png) + +For users logging in through Google authentication to have **Admin** privileges +in **Amazon Managed Grafana**, set the **Department** field’s value as ***monitoring*.** You can choose any field and any value for this. Whatever you choose to use on the Google Workspaces side, make sure you make the mapping on Amazon Managed Grafana SAML settings to reflect that. + +### Upload SAML metadata into Amazon Managed Grafana + +Now in the Amazon Managed Grafana console, click **Upload or copy/paste** option +and select **Choose file** button to upload the SAML metadata file downloaded +from Google Workspaces, earlier. + +In the **Assertion mapping** section, type in **Department** in the +**Assertion attribute role** field and **monitoring** in the **Admin role values** field. +This will allow users logging in with **Department** as **monitoring** to +have **Admin** privileges in Grafana so they can perform administrator duties +such as creating dashboards and datasources. + +Set values under **Additional settings - optional** section as shown in the +screenshot below. Click on **Save SAML configuration**: + +![AMG SAML - Assertion mapping](../images/amg-saml-google-auth/11.png) + +Now Amazon Managed Grafana is set up to authenticate users using Google Workspaces. + +When users login, they will be redirected to the Google login page like so: + +![Google Workspace - Google sign in](../images/amg-saml-google-auth/12.png) + +After entering their credentials, they will be logged into Grafana as shown in the screenshot below. +![AMG - Grafana user settings page](../images/amg-saml-google-auth/13.png) + +As you can see, the user was able to successfully login to Grafana using Google Workspaces authentication. + +[google-workspaces]: https://workspace.google.com/ +[amg-ws]: https://docs.aws.amazon.com/grafana/latest/userguide/getting-started-with-AMG.html#AMG-getting-started-workspace diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/amg-redshift-plugin.md b/docusaurus/observability-best-practices/docs/recipes/recipes/amg-redshift-plugin.md new file mode 100644 index 000000000..d74cd8c94 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/amg-redshift-plugin.md @@ -0,0 +1,91 @@ +# Using Redshift in Amazon Managed Grafana + +In this recipe we show you how to use [Amazon Redshift][redshift]—a petabyte-scale data +warehouse service using standard SQL—in [Amazon Managed Grafana][amg]. This integration +is enabled by the [Redshift data source for Grafana][redshift-ds], an open source +plugin available for you to use in any DIY Grafana instance as well as +pre-installed in Amazon Managed Grafana. + +:::note + This guide will take approximately 10 minutes to complete. +::: +## Prerequisites + +1. You have admin access to Amazon Redshift from your account. +1. Tag your Amazon Redshift cluster with `GrafanaDataSource: true`. +1. In order to benefit from the service-managed policies, create the database + credentials in one of the following ways: + 1. 
If you want to use the default mechanism, that is, the temporary credentials
+      option, to authenticate against the Redshift database, you must create a database
+      user named `redshift_data_api_user`.
+   1. If you want to use the credentials from Secrets Manager, you must tag the
+      secret with `RedshiftQueryOwner: true`.
+
+:::tip
+  For more information on how to work with the service-managed or custom policies,
+  see the [examples in the Amazon Managed Grafana docs][svpolicies].
+:::
+
+## Infrastructure
+We need a Grafana instance, so go ahead and set up a new [Amazon Managed Grafana
+workspace][amg-workspace], for example by using the [Getting Started][amg-getting-started] guide,
+or use an existing one.
+
+:::note
+  To use the AWS data source configuration, first go to the Amazon Managed Grafana
+  console and enable the service-managed IAM roles that grant the workspace the
+  IAM policies necessary to read the Redshift resources.
+:::
+
+To set up the Redshift data source, use the left-hand toolbar, choose the
+lower AWS icon, and then choose "Redshift". Select the default region in which you want
+the plugin to discover Redshift data sources, then select the
+accounts that you want, and finally choose "Add data source".
+
+Alternatively, you can manually add and configure the Redshift data source by
+following these steps:
+
+1. Click on the "Configurations" icon on the left-hand toolbar and then on "Add data source".
+1. Search for "Redshift".
+1. [OPTIONAL] Configure the authentication provider (recommended: workspace IAM
+   role).
+1. Provide the "Cluster Identifier", "Database", and "Database User" values.
+1. Click "Save & test".
+
+You should see something like the following:
+
+![Screen shot of the Redshift data source config](../images/amg-plugin-redshift-ds.png)
+
+## Usage
+We will be using the [Redshift Advanced Monitoring][redshift-mon] setup.
+Since everything is available out of the box, there's nothing else to configure at
+this point.
+
+You can import the Redshift monitoring dashboard, which is included in the Redshift
+plugin. Once imported, you should see something like this:
+
+![Screen shot of the Redshift dashboard in AMG](../images/amg-redshift-mon-dashboard.png)
+
+From here, you can use the following guides to create your own dashboard in
+Amazon Managed Grafana:
+
+* [User Guide: Dashboards](https://docs.aws.amazon.com/grafana/latest/userguide/dashboard-overview.html)
+* [Best practices for creating dashboards](https://grafana.com/docs/grafana/latest/best-practices/best-practices-for-creating-dashboards/)
+
+That's it! Congratulations, you've learned how to use Redshift from Grafana.
+
+## Cleanup
+
+Remove the Redshift database you've been using, and then remove
+the Amazon Managed Grafana workspace from the console.
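+
+If you prefer the CLI for this cleanup and your test cluster can go away entirely,
+a sketch of an equivalent command is shown below. The cluster identifier is a
+placeholder, and skipping the final snapshot permanently deletes the data:
+
+```
+# delete the Redshift cluster used for this recipe (irreversible without a final snapshot)
+aws redshift delete-cluster \
+    --cluster-identifier <YOUR_CLUSTER_ID> \
+    --skip-final-cluster-snapshot
+```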
+ +[redshift]: https://aws.amazon.com/redshift/ +[amg]: https://aws.amazon.com/grafana/ +[svpolicies]: https://docs.aws.amazon.com/grafana/latest/userguide/security_iam_id-based-policy-examples.html +[redshift-ds]: https://grafana.com/grafana/plugins/grafana-redshift-datasource/ +[aws-cli]: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html +[aws-cli-conf]: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html +[amg-getting-started]: https://aws.amazon.com/blogs/mt/amazon-managed-grafana-getting-started/ +[redshift-console]: https://console.aws.amazon.com/redshift/ +[redshift-mon]: https://github.com/awslabs/amazon-redshift-monitoring +[amg-workspace]: https://console.aws.amazon.com/grafana/home#/workspaces diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/amp-alertmanager-terraform.md b/docusaurus/observability-best-practices/docs/recipes/recipes/amp-alertmanager-terraform.md new file mode 100644 index 000000000..e60a8b0ec --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/amp-alertmanager-terraform.md @@ -0,0 +1,127 @@ +# Terraform as Infrastructure as a Code to deploy Amazon Managed Service for Prometheus and configure Alert manager + +In this recipe, we will demonstrate how you can use [Terraform](https://www.terraform.io/) to provision [Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/) and configure rules management and alert manager to send notification to a [SNS](https://docs.aws.amazon.com/sns/) topic if a certain condition is met. + + +:::note + This guide will take approximately 30 minutes to complete. +::: +## Prerequisites + +You will need the following to complete the setup: + +* [Amazon EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html) +* [AWS CLI version 2](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) +* [Terraform CLI](https://www.terraform.io/downloads) +* [AWS Distro for OpenTelemetry(ADOT)](https://aws-otel.github.io/) +* [eksctl](https://eksctl.io/) +* [kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html) +* [jq](https://stedolan.github.io/jq/download/) +* [helm](https://helm.sh/) +* [SNS topic](https://docs.aws.amazon.com/sns/latest/dg/sns-create-topic.html) +* [awscurl](https://github.com/okigan/awscurl) + +In the recipe, we will use a sample application in order to demonstrate the metric scraping using ADOT and remote write the metrics to the Amazon Managed Service for Prometheus workspace. Fork and clone the sample app from the repository at [aws-otel-community](https://github.com/aws-observability/aws-otel-community). + +This Prometheus sample app generates all 4 Prometheus metric types (counter, gauge, histogram, summary) and exposes them at the /metrics endpoint + +A health check endpoint also exists at / + +The following is a list of optional command line flags for configuration: + +listen_address: (default = 0.0.0.0:8080) defines the address and port that the sample app is exposed to. This is primarily to conform with the test framework requirements. + +metric_count: (default=1) the amount of each type of metric to generate. The same amount of metrics is always generated per metric type. + +label_count: (default=1) the amount of labels per metric to generate. + + +datapoint_count: (default=1) the number of data-points per metric to generate. + +### Enabling Metric collection using AWS Distro for Opentelemetry +1. 
Fork and clone the sample app from the repository at aws-otel-community.
+Then run the following commands:
+
+```
+cd ./sample-apps/prometheus
+docker build . -t prometheus-sample-app:latest
+```
+2. Push this image to a registry such as Amazon ECR. You can use the following command to create a new ECR repository in your account. Make sure to set "YOUR_REGION" as well.
+
+```
+aws ecr create-repository \
+    --repository-name prometheus-sample-app \
+    --image-scanning-configuration scanOnPush=true \
+    --region <YOUR_REGION>
+```
+3. Deploy the sample app in the cluster by copying this Kubernetes configuration and applying it. Change the image to the image that you just pushed by replacing `PUBLIC_SAMPLE_APP_IMAGE` in the prometheus-sample-app.yaml file.
+
+```
+curl https://raw.githubusercontent.com/aws-observability/aws-otel-collector/main/examples/eks/aws-prometheus/prometheus-sample-app.yaml -o prometheus-sample-app.yaml
+kubectl apply -f prometheus-sample-app.yaml
+```
+4. Start a default instance of the ADOT Collector. To do so, first enter the following command to pull the Kubernetes configuration for the ADOT Collector.
+
+```
+curl https://raw.githubusercontent.com/aws-observability/aws-otel-collector/main/examples/eks/aws-prometheus/prometheus-daemonset.yaml -o prometheus-daemonset.yaml
+```
+Then edit the template file, substituting the remote_write endpoint of your Amazon Managed Service for Prometheus workspace for `YOUR_ENDPOINT` and your Region for `YOUR_REGION`.
+Use the remote_write endpoint that is displayed in the Amazon Managed Service for Prometheus console when you look at your workspace details.
+You'll also need to change `YOUR_ACCOUNT_ID` in the service account section of the Kubernetes configuration to your AWS account ID.
+
+In this recipe, the ADOT Collector configuration uses an annotation (`scrape=true`) to indicate which target endpoints to scrape. This allows the ADOT Collector to distinguish the sample app endpoint from kube-system endpoints in your cluster. You can remove this from the re-label configurations if you want to scrape a different sample app.
+5. Enter the following command to deploy the ADOT Collector, using the configuration file you downloaded and edited above:
+```
+kubectl apply -f prometheus-daemonset.yaml
+```
+
+### Configure the workspace with Terraform
+
+Now we will provision an Amazon Managed Service for Prometheus workspace and define an alerting rule that causes the Alert Manager to send a notification if a certain condition (defined in `expr`) holds true for a specified time period (`for`). Code in the Terraform language is stored in plain text files with the .tf file extension. There is also a JSON-based variant of the language that is named with the .tf.json file extension.
+
+We will now use [main.tf](./amp-alertmanager-terraform/main.tf) to deploy the resources with Terraform. Before running the Terraform commands, export the `region` and `sns_topic` variables:
+
+```
+export TF_VAR_region=<YOUR_REGION>
+export TF_VAR_sns_topic=<YOUR_SNS_TOPIC_ARN>
+```
+
+Now execute the commands below to provision the workspace:
+
+```
+terraform init
+terraform plan
+terraform apply
+```
+
+Once the above steps are complete, verify the setup end-to-end by using awscurl to query the endpoint. Ensure the `WORKSPACE_ID` variable is replaced with the appropriate Amazon Managed Service for Prometheus workspace ID.
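+
+If you kept the workspace alias from `main.tf` (`amp-terraform-ws`), one way to look
+up the workspace ID and export it is with the AWS CLI, for example:
+
+```
+# look up the workspace created by Terraform and export its ID for the queries below
+export WORKSPACE_ID=$(aws amp list-workspaces \
+    --alias amp-terraform-ws \
+    --query 'workspaces[0].workspaceId' --output text)
+```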
+ +On running the below command, look for the metric “metric:recording_rule”, and, if you successfully find the metric, then you’ve successfully created a recording rule: + +``` +awscurl https://aps-workspaces.us-east-1.amazonaws.com/workspaces/$WORKSPACE_ID/api/v1/rules --service="aps" +``` +Sample Output: +``` +"status":"success","data":{"groups":[{"name":"alert-test","file":"rules","rules":[{"state":"firing","name":"metric:alerting_rule","query":"rate(adot_test_counter0[5m]) \u003e 5","duration":0,"labels":{},"annotations":{},"alerts":[{"labels":{"alertname":"metric:alerting_rule"},"annotations":{},"state":"firing","activeAt":"2021-09-16T13:20:35.9664022Z","value":"6.96890019778219e+01"}],"health":"ok","lastError":"","type":"alerting","lastEvaluation":"2021-09-16T18:41:35.967122005Z","evaluationTime":0.018121408}],"interval":60,"lastEvaluation":"2021-09-16T18:41:35.967104769Z","evaluationTime":0.018142997},{"name":"test","file":"rules","rules":[{"name":"metric:recording_rule","query":"rate(adot_test_counter0[5m])","labels":{},"health":"ok","lastError":"","type":"recording","lastEvaluation":"2021-09-16T18:40:44.650001548Z","evaluationTime":0.018381387}],"interval":60,"lastEvaluation":"2021-09-16T18:40:44.649986468Z","evaluationTime":0.018400463}]},"errorType":"","error":""} +``` + +We can further query the alertmanager endpoint to confirm the same +``` +awscurl https://aps-workspaces.us-east-1.amazonaws.com/workspaces/$WORKSPACE_ID/alertmanager/api/v2/alerts --service="aps" -H "Content-Type: application/json" +``` +Sample Output: +``` +[{"annotations":{},"endsAt":"2021-09-16T18:48:35.966Z","fingerprint":"114212a24ca97549","receivers":[{"name":"default"}],"startsAt":"2021-09-16T13:20:35.966Z","status":{"inhibitedBy":[],"silencedBy":[],"state":"active"},"updatedAt":"2021-09-16T18:44:35.984Z","generatorURL":"/graph?g0.expr=sum%28rate%28envoy_http_downstream_rq_time_bucket%5B1m%5D%29%29+%3E+5\u0026g0.tab=1","labels":{"alertname":"metric:alerting_rule"}}] +``` +This confirms the alert was triggered and sent to SNS via the SNS receiver + +## Clean up + +Run the following command to terminate the Amazon Managed Service for Prometheus workspace. Make sure you delete the EKS Cluster that was created as well: + + +``` +terraform destroy +``` + diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/amp-alertmanager-terraform/main.tf b/docusaurus/observability-best-practices/docs/recipes/recipes/amp-alertmanager-terraform/main.tf new file mode 100644 index 000000000..711ece9d9 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/amp-alertmanager-terraform/main.tf @@ -0,0 +1,46 @@ +provider "aws" { + profile = "default" + region = us-east-1 +} +variable "region" { +} +variable "sns_topic" { +} +resource "aws_prometheus_workspace" "amp-terraform-ws" { + alias = "amp-terraform-ws" +} + +resource "aws_prometheus_rule_group_namespace" "amp-terraform-ws" { + name = "rules" + workspace_id = aws_prometheus_workspace.amp-terraform-ws.id + data = < 0.014 + for: 5m +EOF +} + +resource "aws_prometheus_alert_manager_definition" "amp-terraform-ws" { + workspace_id = aws_prometheus_workspace.amp-terraform-ws.id + definition = < prometheus_rules.b64 +aws amp create-rule-groups-namespace --data file://prometheus_rules.b64 --name kubernetes-mixin --workspace-id < --region <> +``` + + + +Download the contents of the ‘dashboard_out’ folder from the Cloud9 environment and upload them using the Grafana web UI. 
diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/as-ec2-using-amp-and-alertmanager.md b/docusaurus/observability-best-practices/docs/recipes/recipes/as-ec2-using-amp-and-alertmanager.md new file mode 100644 index 000000000..b1901e6fe --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/as-ec2-using-amp-and-alertmanager.md @@ -0,0 +1,158 @@ +# Auto-scaling Amazon EC2 using Amazon Managed Service for Prometheus and alert manager + +Customers want to migrate their existing Prometheus workloads to the cloud and utilize all that the cloud offers. AWS has services like Amazon [EC2 Auto Scaling](https://aws.amazon.com/ec2/autoscaling/), which lets you scale out [Amazon Elastic Compute Cloud (Amazon EC2)](https://aws.amazon.com/pm/ec2/) instances based on metrics like CPU or memory utilization. Applications that use Prometheus metrics can easily integrate into EC2 Auto Scaling without needing to replace their monitoring stack. In this post, I will walk you through configuring Amazon EC2 Auto Scaling to work with [Amazon Managed Service for Prometheus Alert Manager](https://aws.amazon.com/prometheus/). This approach lets you move a Prometheus-based workload to the cloud while taking advantage of services like autoscaling. + +Amazon Managed Service for Prometheus provides support for [alerting rules](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-Ruler.html) that use [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/). The [Prometheus alerting rules documentation](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) provides the syntax and examples of valid alerting rules. Likewise, the Prometheus alert manager documentation references both the [syntax](https://prometheus.io/docs/prometheus/latest/configuration/template_reference/) and [examples](https://prometheus.io/docs/prometheus/latest/configuration/template_examples/) of valid alert manager configurations. + +## Solution overview + +First, let’s briefly review Amazon EC2 Auto Scaling‘s concept of an [Auto Scaling group](https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html) which is a logical collection of Amazon EC2 instances. An Auto Scaling group can launch EC2 instances based on a predefined launch template. The [launch template](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-launch-templates.html) contains information used to launch the Amazon EC2 instance, including the AMI ID, the instance type, network settings, and [AWS Identity and Access Management (IAM)](https://aws.amazon.com/iam/) instance profile. + +Amazon EC2 Auto Scaling groups have a [minimum size, maximum size, and desired capacity](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html) concepts. When Amazon EC2 Auto Scaling detects that the current running capacity of the Auto Scaling group is above or below the desired capacity, it will automatically scale out or scale in as needed. This scaling approach lets you utilize elasticity within your workload while still keeping bounds on both capacity and costs. + +To demonstrate this solution, I have created an Amazon EC2 Auto Scaling group that contains two Amazon EC2 instances. These instances [remote write instance metrics](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-onboard-ingest-metrics-remote-write-EC2.html) to an Amazon Managed Service for Prometheus workspace. 
I have set the Auto Scaling group’s minimum size to two (to maintain high availability), and I’ve set the group’s maximum size to 10 (to help control costs). As more traffic hits the solution, additional Amazon EC2 instances are automatically added to support the load, up to the Amazon EC2 Auto Scaling group’s maximum size. As the load decreases, those Amazon EC2 instances are terminated until the Amazon EC2 Auto Scaling group reaches the group’s minimum size. This approach lets you have a performant application by utilizing the elasticity of the cloud. + +Note that as you scrape more and more resources, you could quickly overwhelm the capabilities of a single Prometheus server. You can avoid this situation by scaling Prometheus servers linearly with the workload. This approach ensures that you can collect metric data at the granularity that you want. + +To support the Auto Scaling of a Prometheus workload, I have created an Amazon Managed Service for Prometheus workspace with the following rules: + +` YAML ` +``` +groups: +- name: example + rules: + - alert: HostHighCpuLoad + expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 60 + for: 5m + labels: + severity: warning + event_type: scale_up + annotations: + summary: Host high CPU load (instance {{ $labels.instance }}) + description: "CPU load is > 60%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" + - alert: HostLowCpuLoad + expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) < 30 + for: 5m + labels: + severity: warning + event_type: scale_down + annotations: + summary: Host low CPU load (instance {{ $labels.instance }}) + description: "CPU load is < 30%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" + +``` + +This rules set creates a ` HostHighCpuLoad ` and a ` HostLowCpuLoad ` rules. These alerts trigger when the CPU is greater than 60% or less than 30% utilization over a five-minute period. + +After raising an alert, the alert manager will forward the message into an Amazon SNS topic, passing an ` alert_type ` (the alert name) and ` event_type ` (scale_down or scale_up). + +` YAML ` +``` +alertmanager_config: | + route: + receiver: default_receiver + repeat_interval: 5m + + receivers: + - name: default_receiver + sns_configs: + - topic_arn: + send_resolved: false + sigv4: + region: us-east-1 + message: | + alert_type: {{ .CommonLabels.alertname }} + event_type: {{ .CommonLabels.event_type }} + +``` + +An AWS [Lambda](https://aws.amazon.com/lambda/) function is subscribed to the Amazon SNS topic. I have written logic in the Lambda function to inspect the Amazon SNS message and determine if a ` scale_up ` or ` scale_down ` event should happen. Then, the Lambda function increments or decrements the desired capacity of the Amazon EC2 Auto Scaling group. The Amazon EC2 Auto Scaling group detects a requested change in capacity, and then invokes or deallocates Amazon EC2 instances. 
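+
+For reference, the scaling action the Lambda function performs corresponds roughly to the
+following AWS CLI call (the Python code below does the same thing through boto3); the Auto
+Scaling group name and capacity value here are placeholders for illustration:
+
+```
+# manually set the desired capacity of an Auto Scaling group, ignoring cooldowns
+aws autoscaling set-desired-capacity \
+    --auto-scaling-group-name <YOUR_ASG_NAME> \
+    --desired-capacity 3 \
+    --no-honor-cooldown
+```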
+ +The Lambda code to support Auto Scaling is as follows: + +` Python ` +``` +import json +import boto3 +import os + +def lambda_handler(event, context): + print(event) + msg = event['Records'][0]['Sns']['Message'] + + scale_type = '' + if msg.find('scale_up') > -1: + scale_type = 'scale_up' + else: + scale_type = 'scale_down' + + get_desired_instance_count(scale_type) + +def get_desired_instance_count(scale_type): + + client = boto3.client('autoscaling') + asg_name = os.environ['ASG_NAME'] + response = client.describe_auto_scaling_groups(AutoScalingGroupNames=[ asg_name]) + + minSize = response['AutoScalingGroups'][0]['MinSize'] + maxSize = response['AutoScalingGroups'][0]['MaxSize'] + desiredCapacity = response['AutoScalingGroups'][0]['DesiredCapacity'] + + if scale_type == "scale_up": + desiredCapacity = min(desiredCapacity+1, maxSize) + if scale_type == "scale_down": + desiredCapacity = max(desiredCapacity - 1, minSize) + + print('Scale type: {}; new capacity: {}'.format(scale_type, desiredCapacity)) + response = client.set_desired_capacity(AutoScalingGroupName=asg_name, DesiredCapacity=desiredCapacity, HonorCooldown=False) + +``` + +The full architecture can be reviewed in the following figure. + +![Architecture](../images/ec2-autoscaling-amp-alertmgr/as-ec2-amp-alertmanager3.png) + +## Testing out the solution + +You can launch an AWS CloudFormation template to automatically provision this solution. + +Stack prerequisites: + +* An [Amazon Virtual Private Cloud (Amazon VPC)](https://aws.amazon.com/vpc/) +* An AWS Security Group that allows outbound traffic + +Select the Download Launch Stack Template link to download and set up the template in your account. As part of the configuration process, you must specify the subnets and the security groups that you want associated with the Amazon EC2 instances. See the following figure for details. + +[## Download Launch Stack Template ](https://prometheus-autoscale.s3.amazonaws.com/prometheus-autoscale.template) + +![Launch Stack](../images/ec2-autoscaling-amp-alertmgr/as-ec2-amp-alertmanager4.png) + +This is the CloudFormation stack details screen, where the stack name has been set as prometheus-autoscale. The stack parameters include a URL of the Linux installer for Prometheus, the URL for the Linux Node Exporter for Prometheus, the subnets and security groups used in the solution, the AMI and instance type to use, and the maximum capacity of the Amazon EC2 Auto Scaling group. + +The stack will take approximately eight minutes to deploy. Once complete, you will find two Amazon EC2 instances that have been deployed and are running in the Amazon EC2 Auto Scaling group that has been created for you. To validate that this solution auto-scales via Amazon Managed Service for Prometheus Alert Manager, you apply load to the Amazon EC2 instances using the [AWS Systems Manager Run Command](https://docs.aws.amazon.com/systems-manager/latest/userguide/execute-remote-commands.html) and the [AWSFIS-Run-CPU-Stress automation document](https://docs.aws.amazon.com/fis/latest/userguide/actions-ssm-agent.html#awsfis-run-cpu-stress). + +As stress is applied to the CPUs in the Amazon EC2 Auto Scaling group, alert manager publishes these alerts, which the Lambda function responds to by scaling up the Auto Scaling group. 
As CPU consumption decreases, the low CPU alert in the Amazon Managed Service for Prometheus workspace fires, alert manager publishes the alert to the Amazon SNS topic, and the Lambda function responds by scaling down the Auto Scaling group, as demonstrated in the following figure.
+
+![Dashboard](../images/ec2-autoscaling-amp-alertmgr/as-ec2-amp-alertmanager5.png)
+
+The Grafana dashboard has a line showing that CPU has spiked to 100%. Although the CPU is high, another line shows that the number of instances has stepped up from 2 to 10. Once CPU has decreased, the number of instances slowly decreases back down to 2.
+
+## Costs
+
+Amazon Managed Service for Prometheus is priced based on the metrics ingested, metrics stored, and metrics queried. Visit the [Amazon Managed Service for Prometheus pricing page](https://aws.amazon.com/prometheus/pricing/) for the latest pricing and pricing examples.
+
+Amazon SNS is priced based on the number of monthly API requests made. Message delivery between Amazon SNS and Lambda is free, but you are charged for the amount of data transferred between Amazon SNS and Lambda. See the [latest Amazon SNS pricing details](https://aws.amazon.com/sns/pricing/).
+
+Lambda is priced based on the duration of your function execution and the number of requests made to the function. See the latest [AWS Lambda pricing details](https://aws.amazon.com/lambda/pricing/).
+
+There are [no additional charges for using](https://aws.amazon.com/ec2/autoscaling/pricing/) Amazon EC2 Auto Scaling.
+
+## Conclusion
+
+By using Amazon Managed Service for Prometheus, alert manager, Amazon SNS, and Lambda, you can control the scaling activities of an Amazon EC2 Auto Scaling group. The solution in this post demonstrates how you can move existing Prometheus workloads to AWS, while also utilizing Amazon EC2 Auto Scaling. As load on the application increases, it seamlessly scales to meet demand.
+
+In this example, the Amazon EC2 Auto Scaling group scaled based on CPU, but you can follow a similar approach for any Prometheus metric from your workload. This approach provides fine-grained control over scaling actions, thereby making sure that you can scale your workload on the metric that provides the most business value.
+
+In previous blog posts, we've also demonstrated how you can use [Amazon Managed Service for Prometheus Alert Manager to receive alerts with PagerDuty](https://aws.amazon.com/blogs/mt/using-amazon-managed-service-for-prometheus-alert-manager-to-receive-alerts-with-pagerduty/) and [how to integrate Amazon Managed Service for Prometheus with Slack](https://aws.amazon.com/blogs/mt/how-to-integrate-amazon-managed-service-for-prometheus-with-slack/). These solutions show how you can receive alerts from your workspace in the way that is most useful to you.
+
+For next steps, see how to [create your own rules configuration file](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-rules-upload.html) for Amazon Managed Service for Prometheus, and set up your own [alert receiver](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-alertmanager-receiver.html). Moreover, check out [Awesome Prometheus alerts](https://awesome-prometheus-alerts.grep.to/alertmanager) for some good examples of alerting rules that can be used within alert manager.
\ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg.md b/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg.md new file mode 100644 index 000000000..28f91ddba --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg.md @@ -0,0 +1,317 @@ +# Using AWS Distro for OpenTelemetry in EKS on EC2 with Amazon Managed Service for Prometheus + +In this recipe we show you how to instrument a [sample Go application](https://github.com/aws-observability/aws-otel-community/tree/master/sample-apps/prometheus-sample-app) and +use [AWS Distro for OpenTelemetry (ADOT)](https://aws.amazon.com/otel) to ingest metrics into +[Amazon Managed Service for Prometheus (AMP)](https://aws.amazon.com/prometheus/) . +Then we're using [Amazon Managed Grafana (AMG)](https://aws.amazon.com/grafana/) to visualize the metrics. + +We will be setting up an [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/) +on EC2 cluster and [Amazon Elastic Container Registry (ECR)](https://aws.amazon.com/ecr/) +repository to demonstrate a complete scenario. + +:::note + This guide will take approximately 1 hour to complete. +::: +## Infrastructure +In the following section we will be setting up the infrastructure for this recipe. + +### Architecture + + +The ADOT pipeline enables us to use the +[ADOT Collector](https://github.com/aws-observability/aws-otel-collector) to +scrape a Prometheus-instrumented application, and ingest the scraped metrics to +Amazon Managed Service for Prometheus. + +![Architecture](../images/adot-metrics-pipeline.png) + +The ADOT Collector includes two components specific to Prometheus: + +* the Prometheus Receiver, and +* the AWS Prometheus Remote Write Exporter. + +:::info + For more information on Prometheus Remote Write Exporter check out: + [Getting Started with Prometheus Remote Write Exporter for AMP](https://aws-otel.github.io/docs/getting-started/prometheus-remote-write-exporter) +::: + +### Prerequisites + +* The AWS CLI is [installed](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) and [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) in your environment. +* You need to install the [eksctl](https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html) command in your environment. +* You need to install [kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html) in your environment. +* You have [docker](https://docs.docker.com/get-docker/) installed into your environment. + +### Create EKS on EC2 cluster + +Our demo application in this recipe will be running on top of EKS. +You can either use an existing EKS cluster or create one using [cluster-config.yaml](./ec2-eks-metrics-go-adot-ampamg/cluster-config.yaml). + +This template will create a new cluster with two EC2 `t2.large` nodes. + +Edit the template file and set `` to one of the +[supported regions for AMP](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html#AMP-supported-Regions). + +Make sure to overwrite `` in your session, for example in bash: +``` +export AWS_DEFAULT_REGION= +``` + +Create your cluster using the following command. +``` +eksctl create cluster -f cluster-config.yaml +``` + +### Set up an ECR repository + +In order to deploy our application to EKS we need a container registry. 
+You can use the following command to create a new ECR registry in your account. +Make sure to set `` as well. + +``` +aws ecr create-repository \ + --repository-name prometheus-sample-app \ + --image-scanning-configuration scanOnPush=true \ + --region +``` + +### Set up AMP + + +create a workspace using the AWS CLI +``` +aws amp create-workspace --alias prometheus-sample-app +``` + +Verify the workspace is created using: +``` +aws amp list-workspaces +``` + +:::info + For more details check out the [AMP Getting started](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-getting-started.html) guide. +::: + +### Set up ADOT Collector + +Download [adot-collector-ec2.yaml](./ec2-eks-metrics-go-adot-ampamg/adot-collector-ec2.yaml) +and edit this YAML doc with the parameters described in the next steps. + +In this example, the ADOT Collector configuration uses an annotation `(scrape=true)` +to tell which target endpoints to scrape. This allows the ADOT Collector to distinguish +the sample app endpoint from `kube-system` endpoints in your cluster. +You can remove this from the re-label configurations if you want to scrape a different sample app. + +Use the following steps to edit the downloaded file for your environment: + +1\. Replace `` with your current region. + +2\. Replace `` with the remote write URL of your workspace. + +Get your AMP remote write URL endpoint by executing the following queries. + +First, get the workspace ID like so: + +``` +YOUR_WORKSPACE_ID=$(aws amp list-workspaces \ + --alias prometheus-sample-app \ + --query 'workspaces[0].workspaceId' --output text) +``` + +Now get the remote write URL endpoint URL for your workspace using: + +``` +YOUR_ENDPOINT=$(aws amp describe-workspace \ + --workspace-id $YOUR_WORKSPACE_ID \ + --query 'workspace.prometheusEndpoint' --output text)api/v1/remote_write +``` + +:::warning + Make sure that `YOUR_ENDPOINT` is in fact the remote write URL, that is, + the URL should end in `/api/v1/remote_write`. +::: +After creating deployment file we can now apply this to our cluster by using the following command: + +``` +kubectl apply -f adot-collector-ec2.yaml +``` + +:::info + For more information check out the [AWS Distro for OpenTelemetry (ADOT) + Collector Setup](https://aws-otel.github.io/docs/getting-started/prometheus-remote-write-exporter/eks#aws-distro-for-opentelemetry-adot-collector-setup). +::: + +### Set up AMG + +Setup a new AMG workspace using the [Amazon Managed Grafana – Getting Started](https://aws.amazon.com/blogs/mt/amazon-managed-grafana-getting-started/) guide. + +Make sure to add "Amazon Managed Service for Prometheus" as a datasource during creation. + +![Service managed permission settings](https://d2908q01vomqb2.cloudfront.net/972a67c48192728a34979d9a35164c1295401b71/2020/12/09/image008-1024x870.jpg) + + +## Application + +In this recipe we will be using a +[sample application](https://github.com/aws-observability/aws-otel-community/tree/master/sample-apps/prometheus) +from the AWS Observability repository. + +This Prometheus sample app generates all four Prometheus metric types +(counter, gauge, histogram, summary) and exposes them at the `/metrics` endpoint. 
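+
+As an optional sanity check, once you have built the container image in the next section you can
+run it locally and inspect the endpoint yourself. This sketch assumes the app's default listen
+address of 0.0.0.0:8080 and the ECR image tag used in the build step:
+
+```
+# run the image built below, list a few of the generated test_* metrics, then stop the container
+CID=$(docker run --rm -d -p 8080:8080 \
+  "$ACCOUNTID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/prometheus-sample-app:latest")
+sleep 3
+curl -s http://localhost:8080/metrics | grep '^test_' | head
+docker stop "$CID"
+```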
+ +### Build container image + +To build the container image, first clone the Git repository and change +into the directory as follows: + +``` +git clone https://github.com/aws-observability/aws-otel-community.git && \ +cd ./aws-otel-community/sample-apps/prometheus +``` + +First, set the region (if not already done above) and account ID to what is applicable in your case. +Replace `` with your current region. For +example, in the Bash shell this would look as follows: + +``` +export AWS_DEFAULT_REGION= +export ACCOUNTID=`aws sts get-caller-identity --query Account --output text` +``` + +Next, build the container image: + +``` +docker build . -t "$ACCOUNTID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/prometheus-sample-app:latest" +``` + +:::note + If `go mod` fails in your environment due to a proxy.golang.or i/o timeout, + you are able to bypass the go mod proxy by editing the Dockerfile. + + Change the following line in the Docker file: + ``` + RUN GO111MODULE=on go mod download + ``` + to: + ``` + RUN GOPROXY=direct GO111MODULE=on go mod download + ``` +::: +Now you can push the container image to the ECR repo you created earlier on. + +For that, first log in to the default ECR registry: + +``` +aws ecr get-login-password --region $AWS_DEFAULT_REGION | \ + docker login --username AWS --password-stdin \ + "$ACCOUNTID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" +``` + +And finally, push the container image to the ECR repository you created, above: + +``` +docker push "$ACCOUNTID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/prometheus-sample-app:latest" +``` + +### Deploy sample app + +Edit [prometheus-sample-app.yaml](./ec2-eks-metrics-go-adot-ampamg/prometheus-sample-app.yaml) +to contain your ECR image path. That is, replace `ACCOUNTID` and `AWS_DEFAULT_REGION` in the +file with your own values: + +``` + # change the following to your container image: + image: "ACCOUNTID.dkr.ecr.AWS_DEFAULT_REGION.amazonaws.com/prometheus-sample-app:latest" +``` + +Now you can deploy the sample app to your cluster using: + +``` +kubectl apply -f prometheus-sample-app.yaml +``` + +## End-to-end + +Now that you have the infrastructure and the application in place, we will +test out the setup, sending metrics from the Go app running in EKS to AMP and +visualize it in AMG. + +### Verify your pipeline is working + +To verify if the ADOT collector is scraping the pod of the sample app and +ingests the metrics into AMP, we look at the collector logs. + +Enter the following command to follow the ADOT collector logs: + +``` +kubectl -n adot-col logs adot-collector -f +``` + +One example output in the logs of the scraped metrics from the sample app +should look like the following: + +``` +... +Resource labels: + -> service.name: STRING(kubernetes-service-endpoints) + -> host.name: STRING(192.168.16.238) + -> port: STRING(8080) + -> scheme: STRING(http) +InstrumentationLibraryMetrics #0 +Metric #0 +Descriptor: + -> Name: test_gauge0 + -> Description: This is my gauge + -> Unit: + -> DataType: DoubleGauge +DoubleDataPoints #0 +StartTime: 0 +Timestamp: 1606511460471000000 +Value: 0.000000 +... +``` + +:::tip + To verify if AMP received the metrics, you can use [awscurl](https://github.com/okigan/awscurl). + This tool enables you to send HTTP requests from the command line with AWS Sigv4 authentication, + so you must have AWS credentials set up locally with the correct permissions to query from AMP. 
+ In the following command replace `$AMP_ENDPOINT` with the endpoint for your AMP workspace: + + ``` + $ awscurl --service="aps" \ + --region="$AWS_DEFAULT_REGION" "https://$AMP_ENDPOINT/api/v1/query?query=adot_test_gauge0" + {"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"adot_test_gauge0"},"value":[1606512592.493,"16.87214000011479"]}]}} + ``` +::: +### Create a Grafana dashboard + +You can import an example dashboard, available via +[prometheus-sample-app-dashboard.json](./ec2-eks-metrics-go-adot-ampamg/prometheus-sample-app-dashboard.json), +for the sample app that looks as follows: + +![Screen shot of the Prometheus sample app dashboard in AMG](../images/amg-prom-sample-app-dashboard.png) + +Further, use the following guides to create your own dashboard in Amazon Managed Grafana: + +* [User Guide: Dashboards](https://docs.aws.amazon.com/grafana/latest/userguide/dashboard-overview.html) +* [Best practices for creating dashboards](https://grafana.com/docs/grafana/latest/best-practices/best-practices-for-creating-dashboards/) + +That's it, congratulations you've learned how to use ADOT in EKS on EC2 to +ingest metrics. + +## Cleanup + +1. Remove the resources and cluster +``` +kubectl delete all --all +eksctl delete cluster --name amp-eks-ec2 +``` +2. Remove the AMP workspace +``` +aws amp delete-workspace --workspace-id `aws amp list-workspaces --alias prometheus-sample-app --query 'workspaces[0].workspaceId' --output text` +``` +3. Remove the amp-iamproxy-ingest-role IAM role +``` +aws delete-role --role-name amp-iamproxy-ingest-role +``` +4. Remove the AMG workspace by removing it from the console. diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg/adot-collector-ec2.yaml b/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg/adot-collector-ec2.yaml new file mode 100644 index 000000000..ee32e716a --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg/adot-collector-ec2.yaml @@ -0,0 +1,154 @@ +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: adot-collector-conf + namespace: adot-col + labels: + app: aws-adot + component: adot-collector-conf +data: + adot-collector-config: | + receivers: + prometheus: + config: + global: + scrape_interval: 15s + scrape_timeout: 10s + scrape_configs: + - job_name: "kubernetes-service-endpoints" + kubernetes_sd_configs: + - role: endpoints + relabel_configs: + - source_labels: [__meta_kubernetes_service_annotation_scrape] + action: keep + regex: true + exporters: + prometheusremotewrite: + # replace this with your endpoint, in double quotes: + endpoint: + auth: + authenticator: sigv4auth + logging: + loglevel: debug + extensions: + health_check: + pprof: + endpoint: :1888 + zpages: + endpoint: :55679 + sigv4auth: + # replace this with your region, in double quotes: + region: + service: "aps" + service: + extensions: [pprof, zpages, health_check, sigv4auth] + pipelines: + metrics: + receivers: [prometheus] + exporters: [logging, prometheusremotewrite] +--- +kind: ClusterRole +apiVersion: rbac.authorization.k8s.io/v1 +metadata: + name: adotcol-admin-role +rules: + - apiGroups: [""] + resources: + - nodes + - nodes/proxy + - services + - endpoints + - pods + verbs: ["get", "list", "watch"] + - apiGroups: + - extensions + resources: + - ingresses + verbs: ["get", "list", "watch"] + - nonResourceURLs: ["/metrics"] + verbs: ["get"] +--- +kind: ClusterRoleBinding 
+apiVersion: rbac.authorization.k8s.io/v1 +metadata: + name: adotcol-admin-role-binding +subjects: + - kind: ServiceAccount + name: adot-collector + namespace: adot-col +roleRef: + kind: ClusterRole + name: adotcol-admin-role + apiGroup: rbac.authorization.k8s.io +--- +apiVersion: v1 +kind: Service +metadata: + name: adot-collector + namespace: adot-col + labels: + app: aws-adot + component: adot-collector +spec: + ports: + - name: metrics + port: 8888 + selector: + component: adot-collector + type: NodePort +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: adot-collector + namespace: adot-col + labels: + app: aws-adot + component: adot-collector +spec: + selector: + matchLabels: + app: aws-adot + component: adot-collector + minReadySeconds: 5 + template: + metadata: + labels: + app: aws-adot + component: adot-collector + spec: + serviceAccountName: adot-collector + containers: + - command: + - "/awscollector" + - "--config=/conf/adot-collector-config.yaml" + image: public.ecr.aws/aws-observability/aws-otel-collector:latest + name: adot-collector + resources: + limits: + cpu: 1 + memory: 2Gi + requests: + cpu: 200m + memory: 400Mi + ports: + - containerPort: 8888 + volumeMounts: + - name: adot-collector-config-vol + mountPath: /conf + livenessProbe: + httpGet: + path: / + port: 13133 + readinessProbe: + httpGet: + path: / + port: 13133 + volumes: + - configMap: + name: adot-collector-conf + items: + - key: adot-collector-config + path: adot-collector-config.yaml + name: adot-collector-config-vol diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg/cluster-config.yaml b/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg/cluster-config.yaml new file mode 100644 index 000000000..3e65da974 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg/cluster-config.yaml @@ -0,0 +1,44 @@ + +apiVersion: eksctl.io/v1alpha5 +kind: ClusterConfig +metadata: + name: amp-eks-ec2 + region: + version: '1.21' +iam: + withOIDC: true + serviceAccounts: + - metadata: + name: adot-collector + namespace: adot-col + labels: {aws-usage: "application"} + attachPolicy: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: + - "aps:RemoteWrite" + - "aps:GetSeries" + - "aps:GetLabels" + - "aps:GetMetricMetadata" + - "aps:QueryMetrics" + - "logs:PutLogEvents" + - "logs:CreateLogGroup" + - "logs:CreateLogStream" + - "logs:DescribeLogStreams" + - "logs:DescribeLogGroups" + - "xray:PutTraceSegments" + - "xray:PutTelemetryRecord" + - "xray:GetSamplingRules" + - "xray:GetSamplingTargets" + - "xray:GetSamplingStatisticSummaries" + - "ssm:GetParameters" + Resource: "*" +nodeGroups: + - name: ng-1 + instanceType: t2.large + desiredCapacity: 2 + volumeSize: 80 +cloudWatch: + clusterLogging: + enableTypes: ["*"] diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg/prometheus-sample-app-dashboard.json b/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg/prometheus-sample-app-dashboard.json new file mode 100644 index 000000000..9be23393f --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg/prometheus-sample-app-dashboard.json @@ -0,0 +1,524 @@ +{ + "__inputs": [ + { + "name": "DS_PROMETHEUS_RECIPES", + "label": "Prometheus Recipes", + "description": "", + "type": "datasource", + "pluginId": "prometheus", + 
"pluginName": "Prometheus" + }, + { + "name": "DS_AMP", + "label": "AMP", + "description": "", + "type": "datasource", + "pluginId": "prometheus", + "pluginName": "Prometheus" + } + ], + "__requires": [ + { + "type": "grafana", + "id": "grafana", + "name": "Grafana", + "version": "8.0.5" + }, + { + "type": "panel", + "id": "heatmap", + "name": "Heatmap", + "version": "" + }, + { + "type": "datasource", + "id": "prometheus", + "name": "Prometheus", + "version": "1.0.0" + }, + { + "type": "panel", + "id": "text", + "name": "Text", + "version": "" + }, + { + "type": "panel", + "id": "timeseries", + "name": "Time series", + "version": "" + } + ], + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": "-- Grafana --", + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, + "editable": true, + "gnetId": null, + "graphTooltip": 0, + "id": null, + "iteration": 1633721800673, + "links": [], + "panels": [ + { + "datasource": null, + "gridPos": { + "h": 3, + "w": 5, + "x": 0, + "y": 0 + }, + "id": 2, + "options": { + "content": "For the source of the sample app, see the [aws-observability/aws-otel-community](https://github.com/aws-observability/aws-otel-community/tree/master/sample-apps/prometheus) repository.", + "mode": "markdown" + }, + "pluginVersion": "8.0.5", + "title": "Source", + "type": "text" + }, + { + "datasource": "${DS_PROMETHEUS_RECIPES}", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 12, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "smooth", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "line" + } + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "blue", + "value": null + }, + { + "color": "purple", + "value": 50 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 12, + "w": 24, + "x": 0, + "y": 3 + }, + "id": 4, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "right" + }, + "tooltip": { + "mode": "single" + } + }, + "targets": [ + { + "exemplar": true, + "expr": "rate(test_counter1[5m])", + "interval": "", + "legendFormat": "rate of counter 1", + "refId": "counter 1" + }, + { + "exemplar": true, + "expr": "rate(test_counter8[5m])", + "hide": false, + "interval": "", + "legendFormat": "rate of counter 8", + "refId": "counter 8" + } + ], + "title": "Counters", + "type": "timeseries" + }, + { + "datasource": "${DS_PROMETHEUS_RECIPES}", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 12, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "stepBefore", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "line" + } + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "blue", + 
"value": null + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "gauge 0/gauge 2" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "purple", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "average of gauge 5" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "semi-dark-blue", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 12, + "w": 24, + "x": 0, + "y": 15 + }, + "id": 5, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "right" + }, + "tooltip": { + "mode": "single" + } + }, + "targets": [ + { + "exemplar": true, + "expr": "test_gauge0/test_gauge2*100", + "interval": "", + "legendFormat": "gauge 0/gauge 2", + "refId": "gauge ratio" + }, + { + "exemplar": true, + "expr": "avg(test_gauge5)/2", + "hide": false, + "interval": "", + "legendFormat": "average of gauge 5", + "refId": "A" + } + ], + "title": "Gauges", + "type": "timeseries" + }, + { + "datasource": "${DS_PROMETHEUS_RECIPES}", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 3, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "stepAfter", + "lineWidth": 1, + "pointSize": 2, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "p90" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "orange", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "p50" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "semi-dark-red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "p95" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-yellow", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 14, + "w": 12, + "x": 0, + "y": 27 + }, + "id": 7, + "options": { + "legend": { + "calcs": [ + "min", + "max" + ], + "displayMode": "list", + "placement": "right" + }, + "tooltip": { + "mode": "single" + } + }, + "targets": [ + { + "exemplar": true, + "expr": "histogram_quantile(0.5, sum(rate(test_histogram0_bucket[5m])) by (le))", + "format": "time_series", + "interval": "", + "legendFormat": "p50", + "refId": "p50" + } + ], + "title": "Histograms (percentiles)", + "type": "timeseries" + }, + { + "cards": { + "cardPadding": null, + "cardRound": null + }, + "color": { + "cardColor": "#b4ff00", + "colorScale": "sqrt", + "colorScheme": "interpolateViridis", + "exponent": 0.5, + "mode": "spectrum" + }, + "dataFormat": "timeseries", + "datasource": "${DS_PROMETHEUS_RECIPES}", + "gridPos": { + "h": 14, + "w": 12, + "x": 12, + "y": 27 + }, + "heatmap": {}, + "hideZeroBuckets": false, + "highlightCards": true, + "id": 8, + "legend": { + "show": false + }, + "reverseYBuckets": false, + "targets": [ + { + "exemplar": true, + "expr": "histogram_quantile(0.5, sum(rate(test_histogram1_bucket[5m])) by (le))", + "interval": "", + "legendFormat": "", + 
"refId": "A" + } + ], + "title": "Histograms (heatmap)", + "tooltip": { + "show": true, + "showHistogram": false + }, + "type": "heatmap", + "xAxis": { + "show": true + }, + "xBucketNumber": null, + "xBucketSize": null, + "yAxis": { + "decimals": null, + "format": "short", + "logBase": 1, + "max": null, + "min": null, + "show": true, + "splitFactor": null + }, + "yBucketBound": "auto", + "yBucketNumber": null, + "yBucketSize": null + } + ], + "refresh": false, + "schemaVersion": 30, + "style": "dark", + "tags": [], + "templating": { + "list": [ + { + "allValue": null, + "current": {}, + "datasource": "${DS_AMP}", + "definition": "", + "description": null, + "error": null, + "hide": 0, + "includeAll": false, + "label": null, + "multi": false, + "name": "query0", + "options": [], + "query": "", + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + } + ] + }, + "time": { + "from": "now-3h", + "to": "now" + }, + "timepicker": {}, + "timezone": "", + "title": "Prometheus sample app", + "uid": "JNG5OaDnk", + "version": 6 +} diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg/prometheus-sample-app.yaml b/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg/prometheus-sample-app.yaml new file mode 100644 index 000000000..91e7d376c --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/ec2-eks-metrics-go-adot-ampamg/prometheus-sample-app.yaml @@ -0,0 +1,52 @@ +--- +apiVersion: v1 +kind: Namespace +metadata: + name: prom-sample-app + labels: + name: prometheus-sample-app +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: prometheus-sample-app + namespace: prom-sample-app + labels: + app: prometheus-sample-app +spec: + replicas: 1 + selector: + matchLabels: + app: prometheus-sample-app + template: + metadata: + labels: + app: prometheus-sample-app + spec: + containers: + - name: prometheus-sample-app + # change the following to your container image: + image: "ACCOUNTID.dkr.ecr.REGION.amazonaws.com/prometheus-sample-app:latest" + command: ["/bin/main", "-listen_address=0.0.0.0:8080", "-metric_count=10"] + ports: + - name: web + containerPort: 8080 +--- +apiVersion: v1 +kind: Service +metadata: + name: prometheus-sample-app-service + namespace: prom-sample-app + labels: + app: prometheus-sample-app + annotations: + scrape: "true" +spec: + ports: + - name: web + port: 8080 + targetPort: 8080 + protocol: TCP + selector: + app: prometheus-sample-app +--- diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg.md b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg.md new file mode 100644 index 000000000..fa7eb6a4a --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg.md @@ -0,0 +1,325 @@ +# Using AWS Distro for OpenTelemetry in EKS on Fargate with Amazon Managed Service for Prometheus + +In this recipe we show you how to instrument a [sample Go application](https://github.com/aws-observability/aws-otel-community/tree/master/sample-apps/prometheus-sample-app) and +use [AWS Distro for OpenTelemetry (ADOT)](https://aws.amazon.com/otel) to ingest metrics into +[Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/) . +Then we're using [Amazon Managed Grafana](https://aws.amazon.com/grafana/) to visualize the metrics. 
+ +We will be setting up an [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/) +on [AWS Fargate](https://aws.amazon.com/fargate/) cluster and use an +[Amazon Elastic Container Registry (ECR)](https://aws.amazon.com/ecr/) repository +to demonstrate a complete scenario. + +:::note + This guide will take approximately 1 hour to complete. +::: +## Infrastructure +In the following section we will be setting up the infrastructure for this recipe. + +### Architecture + +The ADOT pipeline enables us to use the +[ADOT Collector](https://github.com/aws-observability/aws-otel-collector) to +scrape a Prometheus-instrumented application, and ingest the scraped metrics to +Amazon Managed Service for Prometheus. + +![Architecture](../images/adot-metrics-pipeline.png) + +The ADOT Collector includes two components specific to Prometheus: + +* the Prometheus Receiver, and +* the AWS Prometheus Remote Write Exporter. + +:::info + For more information on Prometheus Remote Write Exporter check out: + [Getting Started with Prometheus Remote Write Exporter for AMP](https://aws-otel.github.io/docs/getting-started/prometheus-remote-write-exporter). +::: + +### Prerequisites + +* The AWS CLI is [installed](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) and [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) in your environment. +* You need to install the [eksctl](https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html) command in your environment. +* You need to install [kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html) in your environment. +* You have [Docker](https://docs.docker.com/get-docker/) installed into your environment. + +### Create EKS on Fargate cluster + +Our demo application is a Kubernetes app that we will run in an EKS on Fargate +cluster. So, first create an EKS cluster using the +provided [cluster-config.yaml](./fargate-eks-metrics-go-adot-ampamg/cluster-config.yaml) +template file by changing `` to one of the +[supported regions for AMP](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html#AMP-supported-Regions). + +Make sure to set `` in your shell session, for example, in Bash: + +``` +export AWS_DEFAULT_REGION= +``` + +Create your cluster using the following command: + +``` +eksctl create cluster -f cluster-config.yaml +``` + +### Create ECR repository + +In order to deploy our application to EKS we need a container repository. +You can use the following command to create a new ECR repository in your account. +Make sure to set `` as well. + +``` +aws ecr create-repository \ + --repository-name prometheus-sample-app \ + --image-scanning-configuration scanOnPush=true \ + --region +``` + +### Set up AMP + +First, create an Amazon Managed Service for Prometheus workspace using the AWS CLI with: + +``` +aws amp create-workspace --alias prometheus-sample-app +``` + +Verify the workspace is created using: + +``` +aws amp list-workspaces +``` + +:::info + For more details check out the [AMP Getting started](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-getting-started.html) guide. +::: + +### Set up ADOT Collector + +Download [adot-collector-fargate.yaml](./fargate-eks-metrics-go-adot-ampamg/adot-collector-fargate.yaml) +and edit this YAML doc with the parameters described in the next steps. + +In this example, the ADOT Collector configuration uses an annotation `(scrape=true)` +to tell which target endpoints to scrape. 
This allows the ADOT Collector to distinguish +the sample app endpoint from `kube-system` endpoints in your cluster. +You can remove this from the re-label configurations if you want to scrape a different sample app. + +Use the following steps to edit the downloaded file for your environment: + +1\. Replace `` with your current region. + +2\. Replace `` with the remote write URL of your workspace. + +Get your AMP remote write URL endpoint by executing the following queries. + +First, get the workspace ID like so: + +``` +YOUR_WORKSPACE_ID=$(aws amp list-workspaces \ + --alias prometheus-sample-app \ + --query 'workspaces[0].workspaceId' --output text) +``` + +Now get the remote write URL endpoint URL for your workspace using: + +``` +YOUR_ENDPOINT=$(aws amp describe-workspace \ + --workspace-id $YOUR_WORKSPACE_ID \ + --query 'workspace.prometheusEndpoint' --output text)api/v1/remote_write +``` + +:::warning + Make sure that `YOUR_ENDPOINT` is in fact the remote write URL, that is, + the URL should end in `/api/v1/remote_write`. +::: +After creating deployment file we can now apply this to our cluster by using the following command: + +``` +kubectl apply -f adot-collector-fargate.yaml +``` + +:::info + For more information check out the [AWS Distro for OpenTelemetry (ADOT) + Collector Setup](https://aws-otel.github.io/docs/getting-started/prometheus-remote-write-exporter/eks#aws-distro-for-opentelemetry-adot-collector-setup). +::: +### Set up AMG + +Set up a new AMG workspace using the +[Amazon Managed Grafana – Getting Started](https://aws.amazon.com/blogs/mt/amazon-managed-grafana-getting-started/) guide. + +Make sure to add "Amazon Managed Service for Prometheus" as a datasource during creation. + +![Service managed permission settings](../images/amg-console-create-workspace-managed-permissions.jpg) + +## Application + +In this recipe we will be using a +[sample application](https://github.com/aws-observability/aws-otel-community/tree/master/sample-apps/prometheus-sample-app) +from the AWS Observability repository. + +This Prometheus sample app generates all four Prometheus metric types +(counter, gauge, histogram, summary) and exposes them at the `/metrics` endpoint. + +### Build container image + +To build the container image, first clone the Git repository and change +into the directory as follows: + +``` +git clone https://github.com/aws-observability/aws-otel-community.git && \ +cd ./aws-otel-community/sample-apps/prometheus +``` + +First, set the region (if not already done above) and account ID to what is applicable in your case. +Replace `` with your current region. For +example, in the Bash shell this would look as follows: + +``` +export AWS_DEFAULT_REGION= +export ACCOUNTID=`aws sts get-caller-identity --query Account --output text` +``` + +Next, build the container image: + +``` +docker build . -t "$ACCOUNTID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/prometheus-sample-app:latest" +``` + +:::note + If `go mod` fails in your environment due to a proxy.golang.or i/o timeout, + you are able to bypass the go mod proxy by editing the Dockerfile. + + Change the following line in the Docker file: + ``` + RUN GO111MODULE=on go mod download + ``` + to: + ``` + RUN GOPROXY=direct GO111MODULE=on go mod download + ``` +::: + +Now you can push the container image to the ECR repo you created earlier on. 
+ +For that, first log in to the default ECR registry: + +``` +aws ecr get-login-password --region $AWS_DEFAULT_REGION | \ + docker login --username AWS --password-stdin \ + "$ACCOUNTID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" +``` + +And finally, push the container image to the ECR repository you created, above: + +``` +docker push "$ACCOUNTID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/prometheus-sample-app:latest" +``` + +### Deploy sample app + +Edit [prometheus-sample-app.yaml](./fargate-eks-metrics-go-adot-ampamg/prometheus-sample-app.yaml) +to contain your ECR image path. That is, replace `ACCOUNTID` and `AWS_DEFAULT_REGION` in the +file with your own values: + +``` + # change the following to your container image: + image: "ACCOUNTID.dkr.ecr.AWS_DEFAULT_REGION.amazonaws.com/prometheus-sample-app:latest" +``` + +Now you can deploy the sample app to your cluster using: + +``` +kubectl apply -f prometheus-sample-app.yaml +``` + +## End-to-end + +Now that you have the infrastructure and the application in place, we will +test out the setup, sending metrics from the Go app running in EKS to AMP and +visualize it in AMG. + +### Verify your pipeline is working + +To verify if the ADOT collector is scraping the pod of the sample app and +ingests the metrics into AMP, we look at the collector logs. + +Enter the following command to follow the ADOT collector logs: + +``` +kubectl -n adot-col logs adot-collector -f +``` + +One example output in the logs of the scraped metrics from the sample app +should look like the following: + +``` +... +Resource labels: + -> service.name: STRING(kubernetes-service-endpoints) + -> host.name: STRING(192.168.16.238) + -> port: STRING(8080) + -> scheme: STRING(http) +InstrumentationLibraryMetrics #0 +Metric #0 +Descriptor: + -> Name: test_gauge0 + -> Description: This is my gauge + -> Unit: + -> DataType: DoubleGauge +DoubleDataPoints #0 +StartTime: 0 +Timestamp: 1606511460471000000 +Value: 0.000000 +... +``` + +:::tip + To verify if AMP received the metrics, you can use [awscurl](https://github.com/okigan/awscurl). + This tool enables you to send HTTP requests from the command line with AWS Sigv4 authentication, + so you must have AWS credentials set up locally with the correct permissions to query from AMP. + In the following command replace `$AMP_ENDPOINT` with the endpoint for your AMP workspace: + + ``` + $ awscurl --service="aps" \ + --region="$AWS_DEFAULT_REGION" "https://$AMP_ENDPOINT/api/v1/query?query=adot_test_gauge0" + {"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"adot_test_gauge0"},"value":[1606512592.493,"16.87214000011479"]}]}} + ``` +::: +### Create a Grafana dashboard + +You can import an example dashboard, available via +[prometheus-sample-app-dashboard.json](./fargate-eks-metrics-go-adot-ampamg/prometheus-sample-app-dashboard.json), +for the sample app that looks as follows: + +![Screen shot of the Prometheus sample app dashboard in AMG](../images/amg-prom-sample-app-dashboard.png) + +Further, use the following guides to create your own dashboard in Amazon Managed Grafana: + +* [User Guide: Dashboards](https://docs.aws.amazon.com/grafana/latest/userguide/dashboard-overview.html) +* [Best practices for creating dashboards](https://grafana.com/docs/grafana/latest/best-practices/best-practices-for-creating-dashboards/) + +That's it, congratulations you've learned how to use ADOT in EKS on Fargate to +ingest metrics. 
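+
+:::tip
+    If the metrics ever stop showing up in the collector logs or in AMP, a quick check is to
+    confirm that the sample app itself exposes them. The following is an optional sketch that
+    uses the service and namespace names from `prometheus-sample-app.yaml` above:
+
+    ```
+    kubectl -n prom-sample-app port-forward svc/prometheus-sample-app-service 8080:8080 &
+    sleep 2 && curl -s localhost:8080/metrics | head
+    ```
+:::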
+ +## Cleanup + +First remove the Kubernetes resources and destroy the EKS cluster: + +``` +kubectl delete all --all && \ +eksctl delete cluster --name amp-eks-fargate +``` + +Remove the Amazon Managed Service for Prometheus workspace: + +``` +aws amp delete-workspace --workspace-id \ + `aws amp list-workspaces --alias prometheus-sample-app --query 'workspaces[0].workspaceId' --output text` +``` + +Remove the IAM role: + +``` +aws delete-role --role-name adot-collector-role +``` + +Finally, remove the Amazon Managed Grafana workspace by removing it via the AWS console. diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg/adot-collector-fargate.yaml b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg/adot-collector-fargate.yaml new file mode 100644 index 000000000..ee32e716a --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg/adot-collector-fargate.yaml @@ -0,0 +1,154 @@ +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: adot-collector-conf + namespace: adot-col + labels: + app: aws-adot + component: adot-collector-conf +data: + adot-collector-config: | + receivers: + prometheus: + config: + global: + scrape_interval: 15s + scrape_timeout: 10s + scrape_configs: + - job_name: "kubernetes-service-endpoints" + kubernetes_sd_configs: + - role: endpoints + relabel_configs: + - source_labels: [__meta_kubernetes_service_annotation_scrape] + action: keep + regex: true + exporters: + prometheusremotewrite: + # replace this with your endpoint, in double quotes: + endpoint: + auth: + authenticator: sigv4auth + logging: + loglevel: debug + extensions: + health_check: + pprof: + endpoint: :1888 + zpages: + endpoint: :55679 + sigv4auth: + # replace this with your region, in double quotes: + region: + service: "aps" + service: + extensions: [pprof, zpages, health_check, sigv4auth] + pipelines: + metrics: + receivers: [prometheus] + exporters: [logging, prometheusremotewrite] +--- +kind: ClusterRole +apiVersion: rbac.authorization.k8s.io/v1 +metadata: + name: adotcol-admin-role +rules: + - apiGroups: [""] + resources: + - nodes + - nodes/proxy + - services + - endpoints + - pods + verbs: ["get", "list", "watch"] + - apiGroups: + - extensions + resources: + - ingresses + verbs: ["get", "list", "watch"] + - nonResourceURLs: ["/metrics"] + verbs: ["get"] +--- +kind: ClusterRoleBinding +apiVersion: rbac.authorization.k8s.io/v1 +metadata: + name: adotcol-admin-role-binding +subjects: + - kind: ServiceAccount + name: adot-collector + namespace: adot-col +roleRef: + kind: ClusterRole + name: adotcol-admin-role + apiGroup: rbac.authorization.k8s.io +--- +apiVersion: v1 +kind: Service +metadata: + name: adot-collector + namespace: adot-col + labels: + app: aws-adot + component: adot-collector +spec: + ports: + - name: metrics + port: 8888 + selector: + component: adot-collector + type: NodePort +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: adot-collector + namespace: adot-col + labels: + app: aws-adot + component: adot-collector +spec: + selector: + matchLabels: + app: aws-adot + component: adot-collector + minReadySeconds: 5 + template: + metadata: + labels: + app: aws-adot + component: adot-collector + spec: + serviceAccountName: adot-collector + containers: + - command: + - "/awscollector" + - "--config=/conf/adot-collector-config.yaml" + image: public.ecr.aws/aws-observability/aws-otel-collector:latest + name: adot-collector 
+ resources: + limits: + cpu: 1 + memory: 2Gi + requests: + cpu: 200m + memory: 400Mi + ports: + - containerPort: 8888 + volumeMounts: + - name: adot-collector-config-vol + mountPath: /conf + livenessProbe: + httpGet: + path: / + port: 13133 + readinessProbe: + httpGet: + path: / + port: 13133 + volumes: + - configMap: + name: adot-collector-conf + items: + - key: adot-collector-config + path: adot-collector-config.yaml + name: adot-collector-config-vol diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg/cluster-config.yaml b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg/cluster-config.yaml new file mode 100644 index 000000000..7e46df154 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg/cluster-config.yaml @@ -0,0 +1,46 @@ + +apiVersion: eksctl.io/v1alpha5 +kind: ClusterConfig +metadata: + name: amp-eks-fargate + region: + version: '1.21' +iam: + withOIDC: true + serviceAccounts: + - metadata: + name: adot-collector + namespace: adot-col + labels: {aws-usage: "application"} + attachPolicy: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: + - "aps:RemoteWrite" + - "aps:GetSeries" + - "aps:GetLabels" + - "aps:GetMetricMetadata" + - "aps:QueryMetrics" + - "logs:PutLogEvents" + - "logs:CreateLogGroup" + - "logs:CreateLogStream" + - "logs:DescribeLogStreams" + - "logs:DescribeLogGroups" + - "xray:PutTraceSegments" + - "xray:PutTelemetryRecord" + - "xray:GetSamplingRules" + - "xray:GetSamplingTargets" + - "xray:GetSamplingStatisticSummaries" + - "ssm:GetParameters" + Resource: "*" +fargateProfiles: + - name: defaultfp + selectors: + - namespace: prometheus + - namespace: kube-system + - namespace: adot-col + - namespace: prom-sample-app +cloudWatch: + clusterLogging: + enableTypes: ["*"] diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg/prometheus-sample-app-dashboard.json b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg/prometheus-sample-app-dashboard.json new file mode 100644 index 000000000..9be23393f --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg/prometheus-sample-app-dashboard.json @@ -0,0 +1,524 @@ +{ + "__inputs": [ + { + "name": "DS_PROMETHEUS_RECIPES", + "label": "Prometheus Recipes", + "description": "", + "type": "datasource", + "pluginId": "prometheus", + "pluginName": "Prometheus" + }, + { + "name": "DS_AMP", + "label": "AMP", + "description": "", + "type": "datasource", + "pluginId": "prometheus", + "pluginName": "Prometheus" + } + ], + "__requires": [ + { + "type": "grafana", + "id": "grafana", + "name": "Grafana", + "version": "8.0.5" + }, + { + "type": "panel", + "id": "heatmap", + "name": "Heatmap", + "version": "" + }, + { + "type": "datasource", + "id": "prometheus", + "name": "Prometheus", + "version": "1.0.0" + }, + { + "type": "panel", + "id": "text", + "name": "Text", + "version": "" + }, + { + "type": "panel", + "id": "timeseries", + "name": "Time series", + "version": "" + } + ], + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": "-- Grafana --", + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, + "editable": true, + "gnetId": null, + "graphTooltip": 0, + "id": null, + "iteration": 1633721800673, + 
"links": [], + "panels": [ + { + "datasource": null, + "gridPos": { + "h": 3, + "w": 5, + "x": 0, + "y": 0 + }, + "id": 2, + "options": { + "content": "For the source of the sample app, see the [aws-observability/aws-otel-community](https://github.com/aws-observability/aws-otel-community/tree/master/sample-apps/prometheus) repository.", + "mode": "markdown" + }, + "pluginVersion": "8.0.5", + "title": "Source", + "type": "text" + }, + { + "datasource": "${DS_PROMETHEUS_RECIPES}", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 12, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "smooth", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "line" + } + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "blue", + "value": null + }, + { + "color": "purple", + "value": 50 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 12, + "w": 24, + "x": 0, + "y": 3 + }, + "id": 4, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "right" + }, + "tooltip": { + "mode": "single" + } + }, + "targets": [ + { + "exemplar": true, + "expr": "rate(test_counter1[5m])", + "interval": "", + "legendFormat": "rate of counter 1", + "refId": "counter 1" + }, + { + "exemplar": true, + "expr": "rate(test_counter8[5m])", + "hide": false, + "interval": "", + "legendFormat": "rate of counter 8", + "refId": "counter 8" + } + ], + "title": "Counters", + "type": "timeseries" + }, + { + "datasource": "${DS_PROMETHEUS_RECIPES}", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 12, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "stepBefore", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "line" + } + }, + "mappings": [], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "blue", + "value": null + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "gauge 0/gauge 2" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "purple", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "average of gauge 5" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "semi-dark-blue", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 12, + "w": 24, + "x": 0, + "y": 15 + }, + "id": 5, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "right" + }, + "tooltip": { + "mode": "single" + } + }, + "targets": [ + { + "exemplar": true, + "expr": "test_gauge0/test_gauge2*100", + "interval": "", + "legendFormat": "gauge 0/gauge 2", + "refId": "gauge ratio" + }, + { + "exemplar": true, + "expr": "avg(test_gauge5)/2", + "hide": false, + "interval": "", + "legendFormat": "average of gauge 5", + "refId": "A" + } + ], + "title": 
"Gauges", + "type": "timeseries" + }, + { + "datasource": "${DS_PROMETHEUS_RECIPES}", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 3, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "stepAfter", + "lineWidth": 1, + "pointSize": 2, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "p90" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "orange", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "p50" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "semi-dark-red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "p95" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-yellow", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 14, + "w": 12, + "x": 0, + "y": 27 + }, + "id": 7, + "options": { + "legend": { + "calcs": [ + "min", + "max" + ], + "displayMode": "list", + "placement": "right" + }, + "tooltip": { + "mode": "single" + } + }, + "targets": [ + { + "exemplar": true, + "expr": "histogram_quantile(0.5, sum(rate(test_histogram0_bucket[5m])) by (le))", + "format": "time_series", + "interval": "", + "legendFormat": "p50", + "refId": "p50" + } + ], + "title": "Histograms (percentiles)", + "type": "timeseries" + }, + { + "cards": { + "cardPadding": null, + "cardRound": null + }, + "color": { + "cardColor": "#b4ff00", + "colorScale": "sqrt", + "colorScheme": "interpolateViridis", + "exponent": 0.5, + "mode": "spectrum" + }, + "dataFormat": "timeseries", + "datasource": "${DS_PROMETHEUS_RECIPES}", + "gridPos": { + "h": 14, + "w": 12, + "x": 12, + "y": 27 + }, + "heatmap": {}, + "hideZeroBuckets": false, + "highlightCards": true, + "id": 8, + "legend": { + "show": false + }, + "reverseYBuckets": false, + "targets": [ + { + "exemplar": true, + "expr": "histogram_quantile(0.5, sum(rate(test_histogram1_bucket[5m])) by (le))", + "interval": "", + "legendFormat": "", + "refId": "A" + } + ], + "title": "Histograms (heatmap)", + "tooltip": { + "show": true, + "showHistogram": false + }, + "type": "heatmap", + "xAxis": { + "show": true + }, + "xBucketNumber": null, + "xBucketSize": null, + "yAxis": { + "decimals": null, + "format": "short", + "logBase": 1, + "max": null, + "min": null, + "show": true, + "splitFactor": null + }, + "yBucketBound": "auto", + "yBucketNumber": null, + "yBucketSize": null + } + ], + "refresh": false, + "schemaVersion": 30, + "style": "dark", + "tags": [], + "templating": { + "list": [ + { + "allValue": null, + "current": {}, + "datasource": "${DS_AMP}", + "definition": "", + "description": null, + "error": null, + "hide": 0, + "includeAll": false, + "label": null, + "multi": false, + "name": "query0", + "options": [], + "query": "", + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + } + ] + }, + "time": { + "from": "now-3h", + "to": "now" + }, + "timepicker": {}, + "timezone": 
"", + "title": "Prometheus sample app", + "uid": "JNG5OaDnk", + "version": 6 +} diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg/prometheus-sample-app.yaml b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg/prometheus-sample-app.yaml new file mode 100644 index 000000000..91e7d376c --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-metrics-go-adot-ampamg/prometheus-sample-app.yaml @@ -0,0 +1,52 @@ +--- +apiVersion: v1 +kind: Namespace +metadata: + name: prom-sample-app + labels: + name: prometheus-sample-app +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: prometheus-sample-app + namespace: prom-sample-app + labels: + app: prometheus-sample-app +spec: + replicas: 1 + selector: + matchLabels: + app: prometheus-sample-app + template: + metadata: + labels: + app: prometheus-sample-app + spec: + containers: + - name: prometheus-sample-app + # change the following to your container image: + image: "ACCOUNTID.dkr.ecr.REGION.amazonaws.com/prometheus-sample-app:latest" + command: ["/bin/main", "-listen_address=0.0.0.0:8080", "-metric_count=10"] + ports: + - name: web + containerPort: 8080 +--- +apiVersion: v1 +kind: Service +metadata: + name: prometheus-sample-app-service + namespace: prom-sample-app + labels: + app: prometheus-sample-app + annotations: + scrape: "true" +spec: + ports: + - name: web + port: 8080 + targetPort: 8080 + protocol: TCP + selector: + app: prometheus-sample-app +--- diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg.md b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg.md new file mode 100644 index 000000000..201788cc8 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg.md @@ -0,0 +1,220 @@ +# Using AWS Distro for OpenTelemetry in EKS on Fargate with AWS X-Ray + +In this recipe we show you how to instrument a sample Go application +and use [AWS Distro for OpenTelemetry (ADOT)](https://aws.amazon.com/otel) to +ingest traces into [AWS X-Ray](https://aws.amazon.com/xray/) and visualize +the traces in [Amazon Managed Grafana](https://aws.amazon.com/grafana/). + +We will be setting up an [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/) +on [AWS Fargate](https://aws.amazon.com/fargate/) cluster and use an +[Amazon Elastic Container Registry (ECR)](https://aws.amazon.com/ecr/) repository +to demonstrate a complete scenario. + +:::note + This guide will take approximately 1 hour to complete. +::: +## Infrastructure +In the following section we will be setting up the infrastructure for this recipe. + +### Architecture + +The ADOT pipeline enables us to use the +[ADOT Collector](https://github.com/aws-observability/aws-otel-collector) to +collect traces from an instrumented app and ingest them into X-Ray: + +![ADOT default pipeline](../images/adot-default-pipeline.png) + + +### Prerequisites + +* The AWS CLI is [installed](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) and [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) in your environment. +* You need to install the [eksctl](https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html) command in your environment. +* You need to install [kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html) in your environment. 
+* You have [Docker](https://docs.docker.com/get-docker/) installed into your environment. +* You have the [aws-observability/aws-o11y-recipes](https://github.com/aws-observability/aws-o11y-recipes/) + repo cloned into your local environment. + +### Create EKS on Fargate cluster + +Our demo application is a Kubernetes app that we will run in an EKS on Fargate +cluster. So, first create an EKS cluster using the +provided [cluster_config.yaml](./fargate-eks-xray-go-adot-amg/cluster-config.yaml). + +Create your cluster using the following command: + +``` +eksctl create cluster -f cluster-config.yaml +``` + +### Create ECR repository + +In order to deploy our application to EKS we need a container repository. We +will use the private ECR registry, but you can also use ECR Public, if you +want to share the container image. + +First, set the environment variables, such as shown here (substitute for your +region): + +``` +export REGION="eu-west-1" +export ACCOUNTID=`aws sts get-caller-identity --query Account --output text` +``` + +You can use the following command to create a new ECR repository in your account: + +``` +aws ecr create-repository \ + --repository-name ho11y \ + --image-scanning-configuration scanOnPush=true \ + --region $REGION +``` + +### Set up ADOT Collector + +Download [adot-collector-fargate.yaml](./fargate-eks-xray-go-adot-amg/adot-collector-fargate.yaml) +and edit this YAML doc with the parameters described in the next steps. + + +``` +kubectl apply -f adot-collector-fargate.yaml +``` + +### Set up Managed Grafana + +Set up a new workspace using the +[Amazon Managed Grafana – Getting Started](https://aws.amazon.com/blogs/mt/amazon-managed-grafana-getting-started/) guide +and add [X-Ray as a data source](https://docs.aws.amazon.com/grafana/latest/userguide/x-ray-data-source.html). + +## Signal generator + +We will be using `ho11y`, a synthetic signal generator available +via the [sandbox](https://github.com/aws-observability/observability-best-practices/tree/main/sandbox/ho11y) +of the recipes repository. So, if you haven't cloned the repo into your local +environment, do now: + +``` +git clone https://github.com/aws-observability/aws-o11y-recipes.git +``` + +### Build container image +Make sure that your `ACCOUNTID` and `REGION` environment variables are set, +for example: + +``` +export REGION="eu-west-1" +export ACCOUNTID=`aws sts get-caller-identity --query Account --output text` +``` +To build the `ho11y` container image, first change into the `./sandbox/ho11y/` +directory and build the container image : + +:::note + The following build step assumes that the Docker daemon or an equivalent OCI image + build tool is running. +::: + +``` +docker build . -t "$ACCOUNTID.dkr.ecr.$REGION.amazonaws.com/ho11y:latest" +``` + +### Push container image +Next, you can push the container image to the ECR repo you created earlier on. +For that, first log in to the default ECR registry: + +``` +aws ecr get-login-password --region $REGION | \ + docker login --username AWS --password-stdin \ + "$ACCOUNTID.dkr.ecr.$REGION.amazonaws.com" +``` + +And finally, push the container image to the ECR repository you created, above: + +``` +docker push "$ACCOUNTID.dkr.ecr.$REGION.amazonaws.com/ho11y:latest" +``` + +### Deploy signal generator + +Edit [x-ray-sample-app.yaml](./fargate-eks-xray-go-adot-amg/x-ray-sample-app.yaml) +to contain your ECR image path. 
That is, replace `ACCOUNTID` and `REGION` in the +file with your own values (overall, in three locations): + +``` + # change the following to your container image: + image: "ACCOUNTID.dkr.ecr.REGION.amazonaws.com/ho11y:latest" +``` + +Now you can deploy the sample app to your cluster using: + +``` +kubectl -n example-app apply -f x-ray-sample-app.yaml +``` + +## End-to-end + +Now that you have the infrastructure and the application in place, we will +test out the setup, sending traces from `ho11y` running in EKS to X-Ray and +visualize it in AMG. + +### Verify pipeline + +To verify if the ADOT collector is ingesting traces from `ho11y`, we make +one of the services available locally and invoke it. + +First, let's forward traffic as so: + +``` +kubectl -n example-app port-forward svc/frontend 8765:80 +``` + +With above command, the `frontend` microservice (a `ho11y` instance configured +to talk to two other `ho11y` instances) is available in your local environment +and you can invoke it as follows (triggering the creation of traces): + +``` +$ curl localhost:8765/ +{"traceId":"1-6193a9be-53693f29a0119ee4d661ba0d"} +``` + +:::tip + If you want to automate the invocation, you can wrap the `curl` call into + a `while true` loop. +::: +To verify our setup, visit the [X-Ray view in CloudWatch](https://console.aws.amazon.com/cloudwatch/home#xray:service-map/) +where you should see something like shown below: + +![Screen shot of the X-Ray console in CW](../images/x-ray-cw-ho11y.png) + +Now that we have the signal generator set up and active and the OpenTelemetry +pipeline set up, let's see how to consume the traces in Grafana. + +### Grafana dashboard + +You can import an example dashboard, available via +[x-ray-sample-dashboard.json](./fargate-eks-xray-go-adot-amg/x-ray-sample-dashboard.json) +that looks as follows: + +![Screen shot of the X-Ray dashboard in AMG](../images/x-ray-amg-ho11y-dashboard.png) + +Further, when you click on any of the traces in the lower `downstreams` panel, +you can dive into it and view it in the "Explore" tab like so: + +![Screen shot of the X-Ray dashboard in AMG](../images/x-ray-amg-ho11y-explore.png) + +From here, you can use the following guides to create your own dashboard in +Amazon Managed Grafana: + +* [User Guide: Dashboards](https://docs.aws.amazon.com/grafana/latest/userguide/dashboard-overview.html) +* [Best practices for creating dashboards](https://grafana.com/docs/grafana/latest/best-practices/best-practices-for-creating-dashboards/) + +That's it, congratulations you've learned how to use ADOT in EKS on Fargate to +ingest traces. + +## Cleanup + +First remove the Kubernetes resources and destroy the EKS cluster: + +``` +kubectl delete all --all && \ +eksctl delete cluster --name xray-eks-fargate +``` +Finally, remove the Amazon Managed Grafana workspace by removing it via the AWS console. 
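+
+You may also want to remove the ECR repository if you no longer need the `ho11y` container
+image. The following is a sketch that assumes the `REGION` environment variable is still set
+from the build steps; `--force` deletes the repository even if it still contains images:
+
+```
+aws ecr delete-repository \
+    --repository-name ho11y \
+    --region $REGION \
+    --force
+```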
diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg/adot-collector-fargate.yaml b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg/adot-collector-fargate.yaml new file mode 100644 index 000000000..83bee0b7d --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg/adot-collector-fargate.yaml @@ -0,0 +1,58 @@ +--- +apiVersion: v1 +kind: Service +metadata: + name: adot-collector + namespace: adot-col + labels: + app: aws-adot + component: adot-collector +spec: + ports: + - name: metrics + port: 8888 + selector: + component: adot-collector + type: NodePort +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: adot-collector + namespace: adot-col + labels: + app: aws-adot + component: adot-collector +spec: + selector: + matchLabels: + app: aws-adot + component: adot-collector + minReadySeconds: 5 + template: + metadata: + labels: + app: aws-adot + component: adot-collector + spec: + serviceAccountName: adot-collector + containers: + - image: public.ecr.aws/aws-observability/aws-otel-collector:latest + name: adot-collector + resources: + limits: + cpu: 1 + memory: 2Gi + requests: + cpu: 200m + memory: 400Mi + ports: + - containerPort: 8888 + livenessProbe: + httpGet: + path: / + port: 13133 + readinessProbe: + httpGet: + path: / + port: 13133 diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg/cluster-config.yaml b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg/cluster-config.yaml new file mode 100644 index 000000000..59f47d164 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg/cluster-config.yaml @@ -0,0 +1,39 @@ +apiVersion: eksctl.io/v1alpha5 +kind: ClusterConfig +metadata: + name: xray-eks-fargate + region: eu-west-1 + version: "1.21" +iam: + withOIDC: true + serviceAccounts: + - metadata: + name: adot-collector + namespace: adot-col + labels: {aws-usage: "application"} + attachPolicy: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: + - "logs:PutLogEvents" + - "logs:CreateLogGroup" + - "logs:CreateLogStream" + - "logs:DescribeLogStreams" + - "logs:DescribeLogGroups" + - "xray:PutTraceSegments" + - "xray:PutTelemetryRecord" + - "xray:GetSamplingRules" + - "xray:GetSamplingTargets" + - "xray:GetSamplingStatisticSummaries" + - "ssm:GetParameters" + Resource: "*" +fargateProfiles: + - name: defaultfp + selectors: + - namespace: example-app + - namespace: kube-system + - namespace: adot-col +cloudWatch: + clusterLogging: + enableTypes: ["*"] diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg/x-ray-sample-app.yaml b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg/x-ray-sample-app.yaml new file mode 100644 index 000000000..62c75ed74 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg/x-ray-sample-app.yaml @@ -0,0 +1,139 @@ +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: frontend +spec: + selector: + matchLabels: + app: frontend + replicas: 1 + template: + metadata: + labels: + app: frontend + spec: + containers: + - name: ho11y + image: "ACCOUNTID.dkr.ecr.REGION.amazonaws.com/ho11y:latest" + ports: + - containerPort: 8765 + env: + - name: DISABLE_OM + value: "on" + - name: HO11Y_LOG_DEST + value: "stdout" + - name: 
OTEL_RESOURCE_ATTRIB + value: "frontend" + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: "adot-collector.adot-col:4317" + - name: HO11Y_INJECT_FAILURE + value: "enabled" + - name: DOWNSTREAM0 + value: "http://downstream0" + - name: DOWNSTREAM1 + value: "http://downstream1" + imagePullPolicy: Always +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: downstream0 +spec: + selector: + matchLabels: + app: downstream0 + replicas: 1 + template: + metadata: + labels: + app: downstream0 + spec: + containers: + - name: ho11y + image: "ACCOUNTID.dkr.ecr.REGION.amazonaws.com/ho11y:latest" + ports: + - containerPort: 8765 + env: + - name: DISABLE_OM + value: "on" + - name: HO11Y_LOG_DEST + value: "stdout" + - name: OTEL_RESOURCE_ATTRIB + value: "downstream0" + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: "adot-collector.adot-col:4317" + - name: DOWNSTREAM0 + value: "https://otel.help/" + imagePullPolicy: Always +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: downstream1 +spec: + selector: + matchLabels: + app: downstream1 + replicas: 1 + template: + metadata: + labels: + app: downstream1 + spec: + containers: + - name: ho11y + image: "ACCOUNTID.dkr.ecr.REGION.amazonaws.com/ho11y:latest" + ports: + - containerPort: 8765 + env: + - name: DISABLE_OM + value: "on" + - name: HO11Y_LOG_DEST + value: "stdout" + - name: OTEL_RESOURCE_ATTRIB + value: "downstream1" + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: "adot-collector.adot-col:4317" + - name: DOWNSTREAM0 + value: "https://o11y.news/" + - name: DOWNSTREAM1 + value: "DUMMY:187kB:42ms" + - name: DOWNSTREAM2 + value: "DUMMY:13kB:2ms" + imagePullPolicy: Always +--- +apiVersion: v1 +kind: Service +metadata: + name: frontend +spec: + ports: + - port: 80 + targetPort: 8765 + selector: + app: frontend +--- +apiVersion: v1 +kind: Service +metadata: + name: downstream0 +spec: + ports: + - port: 80 + targetPort: 8765 + selector: + app: downstream0 +--- +apiVersion: v1 +kind: Service +metadata: + name: downstream1 +spec: + ports: + - port: 80 + targetPort: 8765 + selector: + app: downstream1 +--- + diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg/x-ray-sample-dashboard.json b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg/x-ray-sample-dashboard.json new file mode 100644 index 000000000..b815cff94 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/fargate-eks-xray-go-adot-amg/x-ray-sample-dashboard.json @@ -0,0 +1,379 @@ +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": "-- Grafana --", + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, + "description": "AWS o11y recipes X-Ray example dashboard", + "editable": true, + "gnetId": null, + "graphTooltip": 0, + "id": 18, + "links": [ + { + "asDropdown": false, + "icon": "external link", + "includeVars": false, + "keepTime": false, + "tags": [], + "targetBlank": false, + "title": "Recipe …", + "tooltip": "", + "type": "link", + "url": "https://aws-observability.github.io/aws-o11y-recipes/recipes/fargate-eks-xray-go-adot-amg/" + } + ], + "panels": [ + { + "collapsed": false, + "datasource": null, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 19, + "panels": [], + "title": "Traces", + "type": "row" + }, + { + "datasource": "AWS X-Ray eu-west-1", + "description": "From X-Ray", + "gridPos": { + "h": 16, + "w": 15, + "x": 0, + "y": 1 + }, + "id": 
12, + "pluginVersion": "7.5.5", + "targets": [ + { + "group": { + "FilterExpression": null, + "GroupName": "Default", + "InsightsConfiguration": { + "InsightsEnabled": true, + "NotificationsEnabled": true + } + }, + "query": "\n", + "queryType": "getServiceMap", + "refId": "A", + "region": "default" + } + ], + "title": "service map", + "type": "nodeGraph" + }, + { + "datasource": "AWS X-Ray eu-west-1", + "description": "From X-Ray", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Error Count" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Total Count" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 16, + "w": 9, + "x": 15, + "y": 1 + }, + "id": 13, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "single" + } + }, + "pluginVersion": "8.0.5", + "targets": [ + { + "columns": [ + "TotalCount" + ], + "group": { + "FilterExpression": null, + "GroupName": "Default", + "InsightsConfiguration": { + "InsightsEnabled": true, + "NotificationsEnabled": true + } + }, + "query": "service(\"frontend\")", + "queryType": "getTimeSeriesServiceStatistics", + "refId": "A", + "region": "default" + }, + { + "columns": [ + "ErrorStatistics.TotalCount" + ], + "group": { + "FilterExpression": null, + "GroupName": "Default", + "InsightsConfiguration": { + "InsightsEnabled": true, + "NotificationsEnabled": true + } + }, + "hide": false, + "query": "service(\"frontend\")", + "queryType": "getTimeSeriesServiceStatistics", + "refId": "B", + "region": "default" + } + ], + "title": "invocation counts", + "type": "timeseries" + }, + { + "datasource": "AWS X-Ray eu-west-1", + "description": "From X-Ray", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "displayMode": "auto", + "filterable": true + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Id" + }, + "properties": [ + { + "id": "custom.width", + "value": 302 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Response" + }, + "properties": [ + { + "id": "custom.width", + "value": 90 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Response Time" + }, + "properties": [ + { + "id": "custom.width", + "value": 147 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "URL" + }, + "properties": [ + { + "id": "custom.width", + "value": null + } + ] + 
}, + { + "matcher": { + "id": "byName", + "options": "Client IP" + }, + "properties": [ + { + "id": "custom.width", + "value": 140 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Annotations" + }, + "properties": [ + { + "id": "custom.width", + "value": 61 + } + ] + } + ] + }, + "gridPos": { + "h": 20, + "w": 24, + "x": 0, + "y": 17 + }, + "id": 21, + "options": { + "showHeader": true, + "sortBy": [ + { + "desc": false, + "displayName": "Client IP" + } + ] + }, + "pluginVersion": "8.0.5", + "targets": [ + { + "group": { + "FilterExpression": null, + "GroupName": "Default", + "InsightsConfiguration": { + "InsightsEnabled": true, + "NotificationsEnabled": true + } + }, + "query": "service(\"downstream0\") or service(\"downstream1\")", + "queryType": "getTraceSummaries", + "refId": "A", + "region": "default" + } + ], + "timeFrom": null, + "timeShift": null, + "title": "downstreams", + "type": "table" + } + ], + "refresh": false, + "schemaVersion": 30, + "style": "dark", + "tags": [], + "templating": { + "list": [] + }, + "time": { + "from": "now-15m", + "to": "now" + }, + "timepicker": {}, + "timezone": "", + "title": "X-Ray sample (ho11y)", + "uid": "X-M5Ssc7z", + "version": 5 +} diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/lambda-cw-metrics-go-amp.md b/docusaurus/observability-best-practices/docs/recipes/recipes/lambda-cw-metrics-go-amp.md new file mode 100644 index 000000000..3272364e6 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/lambda-cw-metrics-go-amp.md @@ -0,0 +1,156 @@ +# Exporting CloudWatch Metric Streams via Firehose and AWS Lambda to Amazon Managed Service for Prometheus + +In this recipe we show you how to instrument a [CloudWatch Metric Stream](https://console.aws.amazon.com/cloudwatch/home#metric-streams:streamsList) and use [Kinesis Data Firehose](https://aws.amazon.com/kinesis/data-firehose/) and [AWS Lambda](https://aws.amazon.com/lambda) to ingest metrics into [Amazon Managed Service for Prometheus (AMP)](https://aws.amazon.com/prometheus/). + +We will be setting up a stack using [AWS Cloud Development Kit (CDK)](https://aws.amazon.com/cdk/) to create a Firehose Delivery Stream, Lambda, and a S3 Bucket to demonstrate a complete scenario. + +:::note + This guide will take approximately 30 minutes to complete. +::: +## Infrastructure +In the following section we will be setting up the infrastructure for this recipe. + +CloudWatch Metric Streams allow forwarding of the streaming metric data to a +HTTP endpoint or [S3 bucket](https://aws.amazon.com/s3). + +### Prerequisites + +* The AWS CLI is [installed](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) and [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) in your environment. +* The [AWS CDK Typescript](https://docs.aws.amazon.com/cdk/latest/guide/work-with-cdk-typescript.html) is installed in your environment. +* Node.js and Go. +* The [repo](https://github.com/aws-observability/observability-best-practices/) has been cloned to your local machine. The code for this project is under `/sandbox/CWMetricStreamExporter`. + +### Create an AMP workspace + +Our demo application in this recipe will be running on top of AMP. 
+Create your AMP Workspace via the following command: + +``` +aws amp create-workspace --alias prometheus-demo-recipe +``` + +Ensure your workspace has been created with the following command: +``` +aws amp list-workspaces +``` + +:::info + For more details check out the [AMP Getting started](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-getting-started.html) guide. +::: +### Install dependencies + +From the root of the aws-o11y-recipes repository, change your directory to CWMetricStreamExporter via the command: + +``` +cd sandbox/CWMetricStreamExporter +``` + +This will now be considered the root of the repo, going forward. + +Change directory to `/cdk` via the following command: + +``` +cd cdk +``` + +Install the CDK dependencies via the following command: + +``` +npm install +``` + +Change directory back to the root of the repo, and then change directory +to `/lambda` using the following command: + +``` +cd lambda +``` + +Once in the `/lambda` folder, install the Go dependencies using: + +``` +go get +``` + +All the dependencies are now installed. + +### Modify config file + +In the root of the repo, open `config.yaml` and modify the AMP workspace URL +by replacing `{workspaceId}` with the newly created workspace ID, and the +region your AMP workspace is in. + +For example, modify the following with: + +``` +AMP: + remote_write_url: "https://aps-workspaces.us-east-2.amazonaws.com/workspaces/{workspaceId}/api/v1/remote_write" + region: us-east-2 +``` + +Change the names of the Firehose Delivery Stream and S3 Bucket to your liking. + +### Deploy stack + +Once the `config.yaml` has been modified with the AMP workspace ID, it is time +to deploy the stack to CloudFormation. To build the CDK and the Lambda code, +in the root of the repo run the following command: + +``` +npm run build +``` + +This build step ensures that the Go Lambda binary is built, and deploys the CDK +to CloudFormation. + +Accept the following IAM changes to deploy the stack: + +![Screen shot of the IAM Changes when deploying the CDK](../images/cdk-amp-iam-changes.png) + +Verify that the stack has been created by running the following command: + +``` +aws cloudformation list-stacks +``` + +A stack by the name `CDK Stack` should have been created. + +## Create CloudWatch stream + +Navigate to the CloudWatch console, for example +`https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#metric-streams:streamsList` +and click "Create metric stream". + +Select the metrics needed, either all metrics or only those from selected namespaces. + +Configure the Metric Stream by using an existing Firehose which was created by the CDK. +Change the output format to JSON instead of OpenTelemetry 0.7. +Modify the Metric Stream name to your liking, and click "Create metric stream": + +![Screen shot of the CloudWatch Metric Stream Configuration](../images/cloudwatch-metric-stream-configuration.png) + +To verify the Lambda function invocation, navigate to the [Lambda console](https://console.aws.amazon.com/lambda/home) +and click the function `KinesisMessageHandler`. Click the `Monitor` tab and `Logs` subtab, and under `Recent Invocations` there should be entries of the Lambda function being triggered. + +:::note + It may take up to 5 minutes for invocations to show in the Monitor tab. +::: +That is it! Congratulations, your metrics are now being streamed from CloudWatch to Amazon Managed Service for Prometheus. 
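+As an optional sanity check, you can query the workspace directly; the sketch below assumes you have the open source [awscurl](https://github.com/okigan/awscurl) tool installed and that you use the same region and workspace ID as in your `config.yaml`:
+
+```
+# Optional check (assumes awscurl is installed and AWS credentials are configured):
+# list the metric names that have been ingested into the AMP workspace.
+WORKSPACE_ID=<your workspace id>
+awscurl --service aps --region us-east-2 \
+  "https://aps-workspaces.us-east-2.amazonaws.com/workspaces/${WORKSPACE_ID}/api/v1/label/__name__/values"
+```
+
+If the stream is working you should see CloudWatch-derived metric names in the JSON response.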
+ +## Cleanup + +First, delete the CloudFormation stack: + +``` +cd cdk +cdk destroy +``` + +Remove the AMP workspace: + +``` +aws amp delete-workspace --workspace-id \ + `aws amp list-workspaces --alias prometheus-demo-recipe --query 'workspaces[0].workspaceId' --output text` +``` + +Last but not least, remove the CloudWatch Metric Stream by removing it from the console. diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/metrics-explorer-filter-by-tags.md b/docusaurus/observability-best-practices/docs/recipes/recipes/metrics-explorer-filter-by-tags.md new file mode 100644 index 000000000..fdaf0d719 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/metrics-explorer-filter-by-tags.md @@ -0,0 +1,61 @@ +# Using Amazon CloudWatch Metrics explorer to aggregate and visualize metrics filtered by resource tags + +In this recipe we show you how to use Metrics explorer to filter, aggregate, and visualize metrics by resource tags and resource properties - [Use metrics explorer to monitor resources by their tags and properties][metrics-explorer]. + +There are a number of ways to create visualizations with Metrics explorer; in this walkthrough we simply leverage the AWS Console. + +:::note + This guide will take approximately 5 minutes to complete. +::: +## Prerequisites + +* Access to an AWS account +* Access to Amazon CloudWatch Metrics explorer via the AWS Console +* Resource tags set for the relevant resources + + +## Metrics Explorer tag based queries and visualizations + +* Open the CloudWatch console + +* Under Metrics, click on the Explorer menu + +![Screen shot of metrics filtered by tag](../images/metrics-explorer-filter-by-tags/metrics-explorer-cw-menu.png) + + +* You can either choose one of the Generic templates or one from the Service based templates list; in this example we use the EC2 Instances by type template + +![Screen shot of metrics filtered by tag](../images/metrics-explorer-filter-by-tags/metrics-explorer-templates-ec2-by-type.png) + + +* Choose the metrics you would like to explore; remove obsolete ones, and add other metrics you would like to see + +![Screen shot of the EC2 metrics](../images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-metrics.png) + + +* Under From, choose the resource tag or resource property you are looking for; in the below example we show a number of CPU and Network related metrics for different EC2 instances with the Name: TeamX tag + +![Screen shot of the Name tag example](../images/metrics-explorer-filter-by-tags/metrics-explorer-teamx-name-tag.png) + + +* Please note that you can combine time series using an aggregation function under Aggregated by; in the below example TeamX metrics are aggregated by Availability Zone + +![Screen shot of the EC2 dashboard filter by tag Name](../images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-by-tag-name-dashboard.png) + + +Alternatively, you could aggregate TeamX and TeamY by the Team tag, or choose any other configuration that suits your needs + +![Screen shot of the EC2 dashboard filter by tag Team](../images/metrics-explorer-filter-by-tags/metrics-explorer-ec2-by-tag-team-dashboard.png) + + +## Dynamic visualizations +You can easily customize the resulting visualizations by using the From, Aggregated by and Split by options. Metrics explorer visualizations are dynamic, so any new tagged resource automatically appears in the explorer widget. 
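+If a resource you want to monitor is not tagged yet, you can also add a tag from the CLI; the sketch below uses a made-up instance ID and the illustrative `Team=TeamX` tag from this walkthrough:
+
+```
+# Illustrative only: tag an existing EC2 instance so it can be selected
+# in Metrics explorer via the "From" tag filter (replace the instance ID).
+aws ec2 create-tags \
+  --resources i-0123456789abcdef0 \
+  --tags Key=Team,Value=TeamX
+```
+
+Newly tagged resources show up in existing explorer widgets without any further configuration.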
+ +## Reference + +For more information on Metrics explorer please refer to the following article: +https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Metrics-Explorer.html + +[metrics-explorer]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Metrics-Explorer.html diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/monitoring-hybridenv-amg.md b/docusaurus/observability-best-practices/docs/recipes/recipes/monitoring-hybridenv-amg.md new file mode 100644 index 000000000..a344bedf9 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/monitoring-hybridenv-amg.md @@ -0,0 +1,98 @@ +# Monitoring hybrid environments using Amazon Managed Service for Grafana + +In this recipe we show you how to visualize metrics from an Azure Cloud environment in [Amazon Managed Service for Grafana](https://aws.amazon.com/grafana/) (AMG) and create alert notifications in AMG to be sent to [Amazon Simple Notification Service](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) and Slack. + + +As part of the implementation, we will create an AMG workspace, configure the Azure Monitor plugin as the data source for AMG and configure the Grafana dashboard. We will be creating two notification channels: one for Amazon SNS and one for Slack. We will also configure alerts in the AMG dashboard to be sent to the notification channels. + +:::note + This guide will take approximately 30 minutes to complete. +::: +## Infrastructure +In the following section we will be setting up the infrastructure for this recipe. + +### Prerequisites + +* The AWS CLI is [installed](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) and [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) in your environment. +* You need to enable [AWS-SSO](https://docs.aws.amazon.com/singlesignon/latest/userguide/step1.html). + +### Architecture + + +First, create an AMG workspace to visualize the metrics from Azure Monitor. Follow the steps in the [Getting Started with Amazon Managed Service for Grafana](https://aws.amazon.com/blogs/mt/amazon-managed-grafana-getting-started/) blog post. After you create the workspace, you can assign access to the Grafana workspace to an individual user or a user group. By default, the user has a user type of viewer. Change the user type based on the user role. + +:::note + You must assign an Admin role to at least one user in the workspace. +::: +In Figure 1, the user name is grafana-admin. The user type is Admin. On the Data sources tab, choose the required data source. Review the configuration, and then choose Create workspace. +![azure-monitor-grafana-demo](../images/azure-monitor-grafana.png) + + + +### Configure the data source and custom dashboard + +Now, under Data sources, configure the Azure Monitor plugin to start querying and visualizing the metrics from the Azure environment. Choose Data sources to add a data source. +![datasources](../images/datasource.png) + +In Add data source, search for Azure Monitor and then configure the parameters from the app registration console in the Azure environment. +![Add data source](../images/datasource-addition.png) + +To configure the Azure Monitor plugin, you need the directory (tenant) ID and the application (client) ID. For instructions, see the [article](https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal) about creating an Azure AD application and service principal. 
It explains how to register the app and grant access to Grafana to query the data. + +![Azure-Monitor-metrics](../images/azure-monitor-metrics.png) + +After the data source is configured, import a custom dashboard to analyze the Azure metrics. In the left pane, choose the + icon, and then choose Import. + +In Import via grafana.com, enter the dashboard ID, 10532. + +![Importing-dashboard](../images/import-dashboard.png) + +This will import the Azure Virtual Machine dashboard where you can start analyzing the Azure Monitor metrics. In my setup, I have a virtual machine running in the Azure environment. + +![Azure-Monitor-Dashbaord](../images/azure-dashboard.png) + + +### Configure the notification channels on AMG + +In this section, you’ll configure two notifications channels and then send alerts. + +Use the following command to create an SNS topic named grafana-notification and subscribe an email address. + +``` +aws sns create-topic --name grafana-notification +aws sns subscribe --topic-arn arn:aws:sns:::grafana-notification --protocol email --notification-endpoint + +``` +In the left pane, choose the bell icon to add a new notification channel. +Now configure the grafana-notification notification channel. On Edit notification channel, for Type, choose AWS SNS. For Topic, use the ARN of the SNS topic you just created. For Auth Provider, choose the workspace IAM role. + +![Notification Channels](../images/notification-channels.png) + +### Slack notification channel +To configure a Slack notification channel, create a Slack workspace or use an existing one. Enable Incoming Webhooks as described in [Sending messages using Incoming Webhooks](https://api.slack.com/messaging/webhooks). + +After you’ve configured the workspace, you should be able to get a webhook URL that will be used in the Grafana dashboard. + +![Slack notification Channel](../images/slack-notification.png) + + + + + +### Configure alerts in AMG + +You can configure Grafana alerts when the metric increases beyond the threshold. With AMG, you can configure how often the alert must be evaluated in the dashboard and send the notification. In this example, configure an alert for CPU utilization for an Azure virtual machine. When the utilization exceeds a threshold, configure AMG to send notifications to both channels. + +In the dashboard, choose CPU utilization from the dropdown, and then choose Edit. On the Alert tab of the graph panel, configure how often the alert rule should be evaluated and the conditions that must be met for the alert to change state and initiate its notifications. + +In the following configuration, an alert is created if the CPU utilization exceeds 50%. Notifications will be sent to the grafana-alert-notification and slack-alert-notification channels. + +![Azure VM Edit panel](../images/alert-config.png) + +Now, you can sign in to the Azure virtual machine and initiate stress testing using tools like stress. When the CPU utilization exceeds the threshold, you will receive notifications on both channels. + +Now configure alerts for CPU utilization with the right threshold to simulate an alert that is sent to the Slack channel. + +## Conclusion + +In the recipe, we showed you how to deploy the AMG workspace, configure notification channels, collect metrics from Azure Cloud, and configure alerts on the AMG dashboard. Because AMG is a fully managed, serverless solution, you can spend your time on the applications that transform your business and leave the heavy lifting of managing Grafana to AWS. 
diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/servicemesh-monitoring-ampamg.md b/docusaurus/observability-best-practices/docs/recipes/recipes/servicemesh-monitoring-ampamg.md new file mode 100644 index 000000000..e90214ea6 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/servicemesh-monitoring-ampamg.md @@ -0,0 +1,323 @@ +# Using Amazon Managed Service for Prometheus to monitor App Mesh environment configured on EKS + +In this recipe we show you how to ingest [App Mesh](https://docs.aws.amazon.com/app-mesh/) Envoy +metrics in an [Amazon Elastic Kubernetes Service](https://aws.amazon.com/eks/) (EKS) cluster +to [Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/) (AMP) +and create a custom dashboard on [Amazon Managed Grafana](https://aws.amazon.com/grafana/) +(AMG) to monitor the health and performance of microservices. + +As part of the implementation, we will create an AMP workspace, install the App Mesh +Controller for Kubernetes and inject the Envoy container into the pods. We will be +collecting the Envoy metrics using [Grafana Agent](https://github.com/grafana/agent) +configured in the EKS cluster and write them to AMP. Finally, we will be creating +an AMG workspace and configure the AMP as the datasource and create a custom dashboard. + +:::note + This guide will take approximately 45 minutes to complete. +::: +## Infrastructure +In the following section we will be setting up the infrastructure for this recipe. + +### Architecture + + +![Architecture](../images/monitoring-appmesh-environment.png) + +The Grafana agent is configured to scrape the Envoy metrics and ingest them to AMP through the AMP remote write endpoint + +:::info + For more information on Prometheus Remote Write Exporter check out + [Getting Started with Prometheus Remote Write Exporter for AMP](https://aws-otel.github.io/docs/getting-started/prometheus-remote-write-exporter). +::: + +### Prerequisites + +* The AWS CLI is [installed](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) and [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) in your environment. +* You need to install the [eksctl](https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html) command in your environment. +* You need to install [kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html) in your environment. +* You have [Docker](https://docs.docker.com/get-docker/) installed into your environment. +* You need AMP workspace configured in your AWS account. +* You need to install [Helm](https://www.eksworkshop.com/beginner/060_helm/helm_intro/install/index.html). +* You need to enable [AWS-SSO](https://docs.aws.amazon.com/singlesignon/latest/userguide/step1.html). + +### Setup an EKS cluster + +First, create an EKS cluster that will be enabled with App Mesh for running the sample application. +The `eksctl` CLI will be used to deploy the cluster using the [eks-cluster-config.yaml](./servicemesh-monitoring-ampamg/eks-cluster-config.yaml). +This template will create a new cluster with EKS. 
+ +Edit the template file and set your region to one of the available regions for AMP: + +* `us-east-1` +* `us-east-2` +* `us-west-2` +* `eu-central-1` +* `eu-west-1` + +Make sure to overwrite this region in your session, for example, in the Bash +shell: + +``` +export AWS_REGION=eu-west-1 +``` + +Create your cluster using the following command: + +``` +eksctl create cluster -f eks-cluster-config.yaml +``` +This creates an EKS cluster named `AMP-EKS-CLUSTER` and a service account +named `appmesh-controller` that will be used by the App Mesh controller for EKS. + +### Install App Mesh Controller + +Next, we will run the below commands to install the [App Mesh Controller](https://docs.aws.amazon.com/app-mesh/latest/userguide/getting-started-kubernetes.html) +and configure the Custom Resource Definitions (CRDs): + +``` +helm repo add eks https://aws.github.io/eks-charts +``` + +``` +helm upgrade -i appmesh-controller eks/appmesh-controller \ + --namespace appmesh-system \ + --set region=${AWS_REGION} \ + --set serviceAccount.create=false \ + --set serviceAccount.name=appmesh-controller +``` + +### Set up AMP +The AMP workspace is used to ingest the Prometheus metrics collected from Envoy. +A workspace is a logical Cortex server dedicated to a tenant. A workspace supports +fine-grained access control for authorizing its management such as update, list, +describe, and delete, and the ingestion and querying of metrics. + +Create a workspace using the AWS CLI: + +``` +aws amp create-workspace --alias AMP-APPMESH --region $AWS_REGION +``` + +Add the necessary Helm repositories: + +``` +helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && \ +helm repo add kube-state-metrics https://kubernetes.github.io/kube-state-metrics +``` + +For more details on AMP check out the [AMP Getting started](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-getting-started.html) guide. + +### Scraping & ingesting metrics + +AMP does not directly scrape operational metrics from containerized workloads in a Kubernetes cluster. +You must deploy and manage a Prometheus server or an OpenTelemetry agent such as the +[AWS Distro for OpenTelemetry Collector](https://github.com/aws-observability/aws-otel-collector) +or the Grafana Agent to perform this task. In this receipe, we walk you through the +process of configuring the Grafana Agent to scrape the Envoy metrics and analyze them using AMP and AMG. + +#### Configure Grafana Agent + +The Grafana Agent is a lightweight alternative to running a full Prometheus server. +It keeps the necessary parts for discovering and scraping Prometheus exporters and +sending metrics to a Prometheus-compatible backend. The Grafana Agent also includes +native support for AWS Signature Version 4 (Sigv4) for AWS Identity and Access Management (IAM) +authentication. + +We now walk you through the steps to configure an IAM role to send Prometheus metrics to AMP. +We install the Grafana Agent on the EKS cluster and forward metrics to AMP. + +#### Configure permissions +The Grafana Agent scrapes operational metrics from containerized workloads running in the +EKS cluster and sends them to AMP. Data sent to AMP must be signed with valid AWS credentials +using Sigv4 to authenticate and authorize each client request for the managed service. + +The Grafana Agent can be deployed to an EKS cluster to run under the identity of a Kubernetes service account. 
+With IAM roles for service accounts (IRSA), you can associate an IAM role with a Kubernetes service account +and thus provide IAM permissions to any pod that uses the service account. + +Prepare the IRSA setup as follows: + +``` +kubectl create namespace grafana-agent + +export WORKSPACE=$(aws amp list-workspaces | jq -r '.workspaces[] | select(.alias=="AMP-APPMESH").workspaceId') +export ROLE_ARN=$(aws iam get-role --role-name EKS-GrafanaAgent-AMP-ServiceAccount-Role --query Role.Arn --output text) +export NAMESPACE="grafana-agent" +export REMOTE_WRITE_URL="https://aps-workspaces.$AWS_REGION.amazonaws.com/workspaces/$WORKSPACE/api/v1/remote_write" +``` + +You can use the [gca-permissions.sh](./servicemesh-monitoring-ampamg/gca-permissions.sh) +shell script to automate the following steps (make sure to replace the placeholder variable +`YOUR_EKS_CLUSTER_NAME` with the name of your EKS cluster): + +* Creates an IAM role named `EKS-GrafanaAgent-AMP-ServiceAccount-Role` with an IAM policy that has permissions to remote-write into an AMP workspace. +* Creates a Kubernetes service account named `grafana-agent` under the `grafana-agent` namespace that is associated with the IAM role. +* Creates a trust relationship between the IAM role and the OIDC provider hosted in your Amazon EKS cluster. + +You need the `kubectl` and `eksctl` CLI tools to run the `gca-permissions.sh` script. +They must be configured to access your Amazon EKS cluster. + +Now create a manifest file, [grafana-agent.yaml](./servicemesh-monitoring-ampamg/grafana-agent.yaml), +with the scrape configuration to extract Envoy metrics and deploy the Grafana Agent. + +:::note + At the time of writing, this solution will not work for EKS on Fargate + due to the lack of support for daemon sets there. +::: +The example deploys a daemon set named `grafana-agent` and a deployment named +`grafana-agent-deployment`. The `grafana-agent` daemon set collects metrics +from pods on the cluster and the `grafana-agent-deployment` deployment collects +metrics from services that do not live on the cluster, such as the EKS control plane. + +``` +kubectl apply -f grafana-agent.yaml +``` +After the `grafana-agent` is deployed, it will collect the metrics and ingest +them into the specified AMP workspace. Now deploy a sample application on the +EKS cluster and start analyzing the metrics. + +## Sample application + +To install an application and inject an Envoy container, we use the App Mesh controller for Kubernetes. 
+ +First, install the base application by cloning the examples repo: + +``` +git clone https://github.com/aws/aws-app-mesh-examples.git +``` + +And now apply the resources to your cluster: + +``` +kubectl apply -f aws-app-mesh-examples/examples/apps/djapp/1_base_application +``` + +Check the pod status and make sure the pods are running: + +``` +$ kubectl -n prod get all + +NAME READY STATUS RESTARTS AGE +pod/dj-cb77484d7-gx9vk 1/1 Running 0 6m8s +pod/jazz-v1-6b6b6dd4fc-xxj9s 1/1 Running 0 6m8s +pod/metal-v1-584b9ccd88-kj7kf 1/1 Running 0 6m8s +``` + +Next, install the App Mesh controller and meshify the deployment: + +``` +kubectl apply -f aws-app-mesh-examples/examples/apps/djapp/2_meshed_application/ +kubectl rollout restart deployment -n prod dj jazz-v1 metal-v1 +``` + +Now we should see two containers running in each pod: + +``` +$ kubectl -n prod get all +NAME READY STATUS RESTARTS AGE +dj-7948b69dff-z6djf 2/2 Running 0 57s +jazz-v1-7cdc4fc4fc-wzc5d 2/2 Running 0 57s +metal-v1-7f499bb988-qtx7k 2/2 Running 0 57s +``` + +Generate traffic for five minutes; we will visualize it in AMG later: + +``` +dj_pod=`kubectl get pod -n prod --no-headers -l app=dj -o jsonpath='{.items[*].metadata.name}'` + +loop_counter=0 +while [ $loop_counter -le 300 ] ; do \ +kubectl exec -n prod -it $dj_pod -c dj \ +-- curl jazz.prod.svc.cluster.local:9080 ; echo ; loop_counter=$[$loop_counter+1] ; \ +done +``` + +### Create an AMG workspace + +To create an AMG workspace follow the steps in the [Getting Started with AMG](https://aws.amazon.com/blogs/mt/amazon-managed-grafana-getting-started/) blog post. +To grant users access to the dashboard, you must enable AWS SSO. After you create the workspace, you can assign access to the Grafana workspace to an individual user or a user group. +By default, the user has a user type of viewer. Change the user type based on the user role. Add the AMP workspace as the data source and then start creating the dashboard. + +In this example, the user name is `grafana-admin` and the user type is `Admin`. +Select the required data source. Review the configuration, and then choose `Create workspace`. + +![Creating AMP Workspace](../images/workspace-creation.png) + +### Configure AMG datasource +To configure AMP as a data source in AMG, in the `Data sources` section, choose +`Configure in Grafana`, which will launch a Grafana workspace in the browser. +You can also manually launch the Grafana workspace URL in the browser. + +![Configuring Datasource](../images/configuring-amp-datasource.png) + +As you can see from the screenshots, you can view Envoy metrics like downstream +latency, connections, response codes, and more. You can use the filters shown to +drill down to the Envoy metrics of a particular application. + +### Configure AMG dashboard + +After the data source is configured, import a custom dashboard to analyze the Envoy metrics. +For this we use a pre-defined dashboard, so choose `Import` (shown below), and +then enter the ID `11022`. This will import the Envoy Global dashboard so you can +start analyzing the Envoy metrics. + +![Custom Dashboard](../images/import-dashboard.png) + +### Configure alerts on AMG +You can configure Grafana alerts when a metric increases beyond the intended threshold. +With AMG, you can configure how often the alert must be evaluated in the dashboard and send the notification. +Before you create alert rules, you must create a notification channel. + +In this example, configure Amazon SNS as a notification channel. 
The SNS topic must be +prefixed with `grafana` for notifications to be successfully published to the topic +if you use the defaults, that is, the [service-managed permissions](https://docs.aws.amazon.com/grafana/latest/userguide/AMG-manage-permissions.html#AMG-service-managed-account). + +Use the following command to create an SNS topic named `grafana-notification`: + +``` +aws sns create-topic --name grafana-notification +``` + +And subscribe to it via an email address. Make sure you specify the region and account ID in the +below command: + +``` +aws sns subscribe \ + --topic-arn arn:aws:sns:::grafana-notification \ + --protocol email \ + --notification-endpoint + +``` + +Now, add a new notification channel from the Grafana dashboard. +Configure the new notification channel named grafana-notification. For Type, +use AWS SNS from the drop down. For Topic, use the ARN of the SNS topic you just created. +For Auth provider, choose AWS SDK Default. + +![Creating Notification Channel](../images/alert-configuration.png) + +Now configure an alert if downstream latency exceeds five milliseconds in a one-minute period. +In the dashboard, choose Downstream latency from the dropdown, and then choose Edit. +On the Alert tab of the graph panel, configure how often the alert rule should be evaluated +and the conditions that must be met for the alert to change state and initiate its notifications. + +In the following configuration, an alert is created if the downstream latency exceeds the +threshold and a notification will be sent through the configured grafana-alert-notification channel to the SNS topic. + +![Alert Configuration](../images/downstream-latency.png) + +## Cleanup + +1. Remove the resources and cluster: +``` +kubectl delete all --all +eksctl delete cluster --name AMP-EKS-CLUSTER +``` +2. Remove the AMP workspace: +``` +aws amp delete-workspace --workspace-id `aws amp list-workspaces --alias AMP-APPMESH --query 'workspaces[0].workspaceId' --output text` +``` +3. Remove the `EKS-GrafanaAgent-AMP-ServiceAccount-Role` IAM role: +``` +aws iam delete-role --role-name EKS-GrafanaAgent-AMP-ServiceAccount-Role +``` +4. Remove the AMG workspace by removing it from the console. 
diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/servicemesh-monitoring-ampamg/eks-cluster-config.yaml b/docusaurus/observability-best-practices/docs/recipes/recipes/servicemesh-monitoring-ampamg/eks-cluster-config.yaml new file mode 100644 index 000000000..d6f5766d1 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/servicemesh-monitoring-ampamg/eks-cluster-config.yaml @@ -0,0 +1,37 @@ +export AMP_EKS_CLUSTER=AMP-EKS-CLUSTER +export AMP_ACCOUNT_ID= +export AWS_REGION= + + +cat << EOF > eks-cluster-config.yaml +--- +apiVersion: eksctl.io/v1alpha5 +kind: ClusterConfig +metadata: + name: $AMP_EKS_CLUSTER + region: $AWS_REGION + version: '1.18' +iam: + withOIDC: true + serviceAccounts: + - metadata: + name: appmesh-controller + namespace: appmesh-system + labels: {aws-usage: "application"} + attachPolicyARNs: + - "arn:aws:iam::aws:policy/AWSAppMeshFullAccess" +managedNodeGroups: +- name: default-ng + minSize: 1 + maxSize: 3 + desiredCapacity: 2 + labels: {role: mngworker} + iam: + withAddonPolicies: + certManager: true + cloudWatch: true + appMesh: true +cloudWatch: + clusterLogging: + enableTypes: ["*"] +EOF diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/servicemesh-monitoring-ampamg/gca-permissions.sh b/docusaurus/observability-best-practices/docs/recipes/recipes/servicemesh-monitoring-ampamg/gca-permissions.sh new file mode 100755 index 000000000..692427140 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/servicemesh-monitoring-ampamg/gca-permissions.sh @@ -0,0 +1,108 @@ +##!/bin/bash +CLUSTER_NAME=YOUR_EKS_CLUSTER_NAME +AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text) +OIDC_PROVIDER=$(aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///") + +SERVICE_ACCOUNT_IAM_ROLE=EKS-GrafanaAgent-AMP-ServiceAccount-Role +SERVICE_ACCOUNT_IAM_ROLE_DESCRIPTION="IAM role to be used by a K8s service account with write access to AMP" +SERVICE_ACCOUNT_IAM_POLICY=AWSManagedPrometheusWriteAccessPolicy +SERVICE_ACCOUNT_IAM_POLICY_ARN=arn:aws:iam::$AWS_ACCOUNT_ID:policy/$SERVICE_ACCOUNT_IAM_POLICY +# +# Setup a trust policy designed for a specific combination of K8s service account and namespace to sign in from a Kubernetes cluster which hosts the OIDC Idp. +# If the IAM role already exists, then add this new trust policy to the existing trust policy +# +echo "Creating a new trust policy" +read -r -d '' NEW_TRUST_RELATIONSHIP < TrustPolicy.json +# +# Setup the permission policy grants write permissions for all AWS StealFire workspaces +# +read -r -d '' PERMISSION_POLICY < PermissionPolicy.json + +# +# Create an IAM permission policy to be associated with the role, if the policy does not already exist +# +SERVICE_ACCOUNT_IAM_POLICY_ID=$(aws iam get-policy --policy-arn $SERVICE_ACCOUNT_IAM_POLICY_ARN --query 'Policy.PolicyId' --output text) +if [ "$SERVICE_ACCOUNT_IAM_POLICY_ID" = "" ]; +then + echo "Creating a new permission policy $SERVICE_ACCOUNT_IAM_POLICY" + aws iam create-policy --policy-name $SERVICE_ACCOUNT_IAM_POLICY --policy-document file://PermissionPolicy.json +else + echo "Permission policy $SERVICE_ACCOUNT_IAM_POLICY already exists" +fi + +# +# If the IAM role already exists, then just update the trust policy. 
+# Otherwise create one using the trust policy and permission policy +# +SERVICE_ACCOUNT_IAM_ROLE_ARN=$(aws iam get-role --role-name $SERVICE_ACCOUNT_IAM_ROLE --query 'Role.Arn' --output text) +if [ "$SERVICE_ACCOUNT_IAM_ROLE_ARN" = "" ]; +then + echo "$SERVICE_ACCOUNT_IAM_ROLE role does not exist. Creating a new role with a trust and permission policy" + # + # Create an IAM role for Kubernetes service account + # + SERVICE_ACCOUNT_IAM_ROLE_ARN=$(aws iam create-role \ + --role-name $SERVICE_ACCOUNT_IAM_ROLE \ + --assume-role-policy-document file://TrustPolicy.json \ + --description "$SERVICE_ACCOUNT_IAM_ROLE_DESCRIPTION" \ + --query "Role.Arn" --output text) + # + # Attach the trust and permission policies to the role + # + aws iam attach-role-policy --role-name $SERVICE_ACCOUNT_IAM_ROLE --policy-arn $SERVICE_ACCOUNT_IAM_POLICY_ARN +else + echo "$SERVICE_ACCOUNT_IAM_ROLE_ARN role already exists. Updating the trust policy" + # + # Update the IAM role for Kubernetes service account with a with the new trust policy + # + aws iam update-assume-role-policy --role-name $SERVICE_ACCOUNT_IAM_ROLE --policy-document file://TrustPolicy.json +fi +echo $SERVICE_ACCOUNT_IAM_ROLE_ARN + +# EKS cluster hosts an OIDC provider with a public discovery endpoint. +# Associate this Idp with AWS IAM so that the latter can validate and accept the OIDC tokens issued by Kubernetes to service accounts. +# Doing this with eksctl is the easier and best approach. +# +eksctl utils associate-iam-oidc-provider --cluster $CLUSTER_NAME --approve diff --git a/docusaurus/observability-best-practices/docs/recipes/recipes/servicemesh-monitoring-ampamg/grafana-agent.yaml b/docusaurus/observability-best-practices/docs/recipes/recipes/servicemesh-monitoring-ampamg/grafana-agent.yaml new file mode 100644 index 000000000..76d96c2cf --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/recipes/servicemesh-monitoring-ampamg/grafana-agent.yaml @@ -0,0 +1,390 @@ +apiVersion: v1 +kind: ServiceAccount +metadata: + annotations: + eks.amazonaws.com/role-arn: ${ROLE_ARN} + name: grafana-agent + namespace: ${NAMESPACE} +--- +apiVersion: v1 +data: + agent.yml: | + prometheus: + configs: + - host_filter: true + name: agent + remote_write: + - sigv4: + enabled: true + region: ${REGION} + url: ${REMOTE_WRITE_URL} + scrape_configs: + - job_name: 'appmesh-envoy' + metrics_path: /stats/prometheus + kubernetes_sd_configs: + - role: pod + relabel_configs: + - source_labels: [__meta_kubernetes_pod_container_name] + action: keep + regex: '^envoy$' + - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port] + action: replace + regex: ([^:]+)(?::\d+)?;(\d+) + replacement: \${1}:9901 + target_label: __address__ + - action: labelmap + regex: __meta_kubernetes_pod_label_(.+) + - source_labels: [__meta_kubernetes_namespace] + action: replace + target_label: namespace + - source_labels: ['app'] + action: replace + target_label: service + - source_labels: [__meta_kubernetes_pod_name] + action: replace + target_label: kubernetes_pod_name + - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token + job_name: kubernetes-pods + kubernetes_sd_configs: + - role: pod + relabel_configs: + - action: drop + regex: "false" + source_labels: + - __meta_kubernetes_pod_annotation_prometheus_io_scrape + - action: keep + regex: .*-metrics + source_labels: + - __meta_kubernetes_pod_container_port_name + - action: replace + regex: (https?) 
+ replacement: \$1 + source_labels: + - __meta_kubernetes_pod_annotation_prometheus_io_scheme + target_label: __scheme__ + - action: replace + regex: (.+) + replacement: \$1 + source_labels: + - __meta_kubernetes_pod_annotation_prometheus_io_path + target_label: __metrics_path__ + - action: replace + regex: (.+?)(\:\d+)?;(\d+) + replacement: \$1:\$3 + source_labels: + - __address__ + - __meta_kubernetes_pod_annotation_prometheus_io_port + target_label: __address__ + - action: drop + regex: "" + source_labels: + - __meta_kubernetes_pod_label_name + - action: replace + replacement: \$1 + separator: / + source_labels: + - __meta_kubernetes_namespace + - __meta_kubernetes_pod_label_name + target_label: job + - action: replace + source_labels: + - __meta_kubernetes_namespace + target_label: namespace + - action: replace + source_labels: + - __meta_kubernetes_pod_name + target_label: pod + - action: replace + source_labels: + - __meta_kubernetes_pod_container_name + target_label: container + - action: replace + separator: ':' + source_labels: + - __meta_kubernetes_pod_name + - __meta_kubernetes_pod_container_name + - __meta_kubernetes_pod_container_port_name + target_label: instance + - action: labelmap + regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+) + replacement: __param_\$1 + - action: drop + regex: Succeeded|Failed + source_labels: + - __meta_kubernetes_pod_phase + tls_config: + ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt + insecure_skip_verify: false + - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token + job_name: ${NAMESPACE}/kube-state-metrics + kubernetes_sd_configs: + - namespaces: + names: + - ${NAMESPACE} + role: pod + relabel_configs: + - action: keep + regex: kube-state-metrics + source_labels: + - __meta_kubernetes_pod_label_name + - action: replace + separator: ':' + source_labels: + - __meta_kubernetes_pod_name + - __meta_kubernetes_pod_container_name + - __meta_kubernetes_pod_container_port_name + target_label: instance + tls_config: + ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt + insecure_skip_verify: false + - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token + job_name: ${NAMESPACE}/node-exporter + kubernetes_sd_configs: + - namespaces: + names: + - ${NAMESPACE} + role: pod + relabel_configs: + - action: keep + regex: node-exporter + source_labels: + - __meta_kubernetes_pod_label_name + - action: replace + source_labels: + - __meta_kubernetes_pod_node_name + target_label: instance + - action: replace + source_labels: + - __meta_kubernetes_namespace + target_label: namespace + tls_config: + ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt + insecure_skip_verify: false + - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token + job_name: kube-system/kubelet + kubernetes_sd_configs: + - role: node + relabel_configs: + - replacement: kubernetes.default.svc.cluster.local:443 + target_label: __address__ + - replacement: https + target_label: __scheme__ + - regex: (.+) + replacement: /api/v1/nodes/\${1}/proxy/metrics + source_labels: + - __meta_kubernetes_node_name + target_label: __metrics_path__ + tls_config: + ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt + insecure_skip_verify: false + - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token + job_name: kube-system/cadvisor + kubernetes_sd_configs: + - role: node + metric_relabel_configs: + - action: drop + regex: container_([a-z_]+); + source_labels: + - __name__ + - 
image + - action: drop + regex: container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s) + source_labels: + - __name__ + relabel_configs: + - replacement: kubernetes.default.svc.cluster.local:443 + target_label: __address__ + - regex: (.+) + replacement: /api/v1/nodes/\${1}/proxy/metrics/cadvisor + source_labels: + - __meta_kubernetes_node_name + target_label: __metrics_path__ + scheme: https + tls_config: + ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt + insecure_skip_verify: false + global: + scrape_interval: 15s + wal_directory: /var/lib/agent/data + server: + log_level: info +kind: ConfigMap +metadata: + name: grafana-agent + namespace: ${NAMESPACE} +--- +apiVersion: v1 +data: + agent.yml: | + prometheus: + configs: + - host_filter: false + name: agent + remote_write: + - sigv4: + enabled: true + region: ${REGION} + url: ${REMOTE_WRITE_URL} + scrape_configs: + - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token + job_name: default/kubernetes + kubernetes_sd_configs: + - role: endpoints + metric_relabel_configs: + - action: drop + regex: apiserver_admission_controller_admission_latencies_seconds_.* + source_labels: + - __name__ + - action: drop + regex: apiserver_admission_step_admission_latencies_seconds_.* + source_labels: + - __name__ + relabel_configs: + - action: keep + regex: apiserver + source_labels: + - __meta_kubernetes_service_label_component + scheme: https + tls_config: + ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt + insecure_skip_verify: false + server_name: kubernetes + global: + scrape_interval: 15s + wal_directory: /var/lib/agent/data + server: + log_level: info +kind: ConfigMap +metadata: + name: grafana-agent-deployment + namespace: ${NAMESPACE} +--- +apiVersion: rbac.authorization.k8s.io/v1beta1 +kind: ClusterRole +metadata: + name: grafana-agent +rules: +- apiGroups: + - "" + resources: + - nodes + - nodes/proxy + - services + - endpoints + - pods + verbs: + - get + - list + - watch +- nonResourceURLs: + - /metrics + verbs: + - get +--- +apiVersion: rbac.authorization.k8s.io/v1beta1 +kind: ClusterRoleBinding +metadata: + name: grafana-agent +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: grafana-agent +subjects: +- kind: ServiceAccount + name: grafana-agent + namespace: ${NAMESPACE} +--- +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: grafana-agent + namespace: ${NAMESPACE} +spec: + minReadySeconds: 10 + selector: + matchLabels: + name: grafana-agent + template: + metadata: + labels: + name: grafana-agent + spec: + containers: + - args: + - -config.file=/etc/agent/agent.yml + - -prometheus.wal-directory=/tmp/agent/data + command: + - /bin/agent + env: + - name: HOSTNAME + valueFrom: + fieldRef: + fieldPath: spec.nodeName + image: grafana/agent:v0.11.0 + imagePullPolicy: IfNotPresent + name: agent + ports: + - containerPort: 80 + name: http-metrics + securityContext: + privileged: true + runAsUser: 0 + volumeMounts: + - mountPath: /etc/agent + name: grafana-agent + serviceAccount: grafana-agent + tolerations: + - effect: NoSchedule + operator: Exists + volumes: + - configMap: + name: grafana-agent + name: grafana-agent + updateStrategy: + type: RollingUpdate +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: grafana-agent-deployment + namespace: ${NAMESPACE} +spec: + minReadySeconds: 10 + replicas: 1 + revisionHistoryLimit: 10 + selector: + matchLabels: + name: grafana-agent-deployment + template: + metadata: + labels: + name: 
grafana-agent-deployment + spec: + containers: + - args: + - -config.file=/etc/agent/agent.yml + - -prometheus.wal-directory=/tmp/agent/data + command: + - /bin/agent + env: + - name: HOSTNAME + valueFrom: + fieldRef: + fieldPath: spec.nodeName + image: grafana/agent:v0.11.0 + imagePullPolicy: IfNotPresent + name: agent + ports: + - containerPort: 80 + name: http-metrics + securityContext: + privileged: true + runAsUser: 0 + volumeMounts: + - mountPath: /etc/agent + name: grafana-agent-deployment + serviceAccount: grafana-agent + volumes: + - configMap: + name: grafana-agent-deployment + name: grafana-agent-deployment diff --git a/docusaurus/observability-best-practices/docs/recipes/telemetry.md b/docusaurus/observability-best-practices/docs/recipes/telemetry.md new file mode 100644 index 000000000..650fe8347 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/telemetry.md @@ -0,0 +1,43 @@ +# Telemetry + +Telemetry is all about how the signals are collected from various sources, +including your own app and infrastructure and routed to destinations where +they are consumed: + +![telemetry concept](images/telemetry.png) + +:::info + See the [Data types](../signals/logs) section for a detailed breakdown of the best practices for each type of telemetry. +::: +Let's further dive into the concepts introduced in above figure. + +## Sources + +We consider sources as something where signals come from. There are two types of sources: + +1. Things under your control, that is, the application source code, via instrumentation. +1. Everything else you may use, such as managed services, not under your (direct) control. + These types of sources are typically provided by AWS, exposing signals via an API. + +## Agents + +In order to transport signals from the sources to the destinations, you need +some sort of intermediary we call agent. These agents receive or pull signals +from the sources and, typically via configuration, determine where signals +shoud go, optionally supporting filtering and aggregation. + +:::note + "Agents? Routing? Shipping? Ingesting?" + There are many terms out there people use to refer to the process of + getting the signals from sources to destinations including routing, + shipping, aggregation, ingesting etc. and while they may mean slightly + different things, we will use them here interchangeably. Canonically, + we will refer to those intermediary transport components as agents. +::: + +## Destinations + +Where signals end up, for consumption. No matter if you want to store signals +for later consumption, if you want to dashboard them, set an alert if a certain +condition is true, or correlate signals. All of those components that serve +you as the end-user are destinations. diff --git a/docusaurus/observability-best-practices/docs/recipes/troubleshooting.md b/docusaurus/observability-best-practices/docs/recipes/troubleshooting.md new file mode 100644 index 000000000..3f028d4f3 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/troubleshooting.md @@ -0,0 +1,8 @@ +# Troubleshooting + +We include troubleshooting recipes for various situations and dimensions in this section. 
+ +- [Troubleshooting performance bottleneck in DynamoDB][ddb-troubleshooting] + + +[ddb-troubleshooting]: https://observability.workshop.aws/en/scaleup.html diff --git a/docusaurus/observability-best-practices/docs/recipes/workshops.md b/docusaurus/observability-best-practices/docs/recipes/workshops.md new file mode 100644 index 000000000..54253a4bb --- /dev/null +++ b/docusaurus/observability-best-practices/docs/recipes/workshops.md @@ -0,0 +1,9 @@ +# Workshops + +This section contains workshops to which you can return for samples +and demonstrations around o11y systems and tooling. + +- [One Observability Workshop](https://observability.workshop.aws/en/) +- [EKS Workshop](https://www.eksworkshop.com/) +- [ECS Workshop](https://www.ecsworkshop.com/) +- [App Runner Workshop](https://www.apprunnerworkshop.com/) diff --git a/docusaurus/observability-best-practices/docs/signals/alarms.md b/docusaurus/observability-best-practices/docs/signals/alarms.md new file mode 100644 index 000000000..8aaff4e05 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/signals/alarms.md @@ -0,0 +1,63 @@ +# Alarms + +An alarm refers to the state of a probe, monitor, or change in a value over or under a given threshold. A simple example would be an alarm that sends an email when a disk is full or a web site is down. More sophisticated alarms are entirely programmatic and used to drive complex interactions such as auto-scaling or creating of entire server clusters. + +Regardless of the use case though, an alarm indicates the current *state* of a metric. This state can be `OK`, `WARNING`, `ALERT`, or `NO DATA`, depending on the system in question. + +Alarms reflect this state for a period of time and are built on top of a timeseries. As such, they are derived *from* a time series. This graph below shows two alarms: one with a warning threshold, and another that is indicative of average values across this timeseries. As the volume of traffic in this shows, the alarms for the warning threshold should be in a breach state when it dips below the defined value. + +![Timeseries with two alarms](../images/cwalarm2.png) + +:::info + The purpose of an alarm can be either to trigger an action (either human or progammatic), or to be informational (that a threshold is breached). Alarms provide insight into performance of a metric. +::: +## Alert on things that are actionable + +Alarm fatigue is when people get so many alerts that they have learned to ignore them. This is not an indication of a well-monitored system! Rather this is an anti-pattern. + +:::info + Create alarms for things that are actionable, and you should always work from your [objectives](../guides/#monitor-what-matters) backwards. +::: + +For example, if you operate a web site that requires fast response times, create an alert to be delivered when your response times are exceeding a given threshold. And if you have identified that poor performance is tied to high CPU utilization then alert on this datapoint *proactively* before it becomes an issue. However, there may no need to alert on all CPU utilization *everywhere* in your environment if it does not *endanger your outcomes*. + +:::info + If an alarm does not need alert you, or trigger an automated process, then there is no need to have it alert you. You should remove the notifications from alarms that are superfluous. 
+::: + +## Beware of the "everything is OK alarm" + +Likewise, a common pattern is the "everything is OK" alarm, when operators are so used to getting constant alerts that they only notice when things suddenly go silent! This is a very dangerous mode to operate in, and a pattern that works against [operational excellence](../faq/#what-is-operational-excellence). + +:::warning + The "everything is OK alarm" usually requries a human to interpret it! This makes patterns like self-healing applications impossible.[^1] +::: +## Fight alarm fatigue with aggregation + +Observability is a *human* problem, not a technology problem. And as such, your alarm strategy should focus on reducing alarms rather than creating more. As you implement telemetry collection, it is natural to have more alerts from your environment. Be cautious though to only [alert on things that are actionable](../signals/alarms/#alert-on-things-that-are-actionable). If the condition that caused the alert is not actionable then there is no need to report on it. + +This is best shown by example: if you have five web servers that use a single database for their backend, what happens to your web servers if the database is down? The answer for many people is that they get *at least six* alerts - *five* for the web servers and *one* for the database! + +![Six alarms](../images/alarm3.png) + +But there are only two alerts that make sense to deliver: + +1. The web site is down, and +1. The database is the cause + +![Two alarms](../images/alarm4.png) + +:::info + Distilling your alerts into aggregates makes it easier for people to understand, and then easier to create runbooks and automation for. +::: +## Use your existing ITSM and support processes + +Regardless of your monitoring and observability platform, they must integrate into your current toolchain. + +:::info + Create trouble tickets and issues using a programmatic integration from your alerts into these tools, removing human effort and streamlining processes along the way. +::: +This allows you to derive important operatonal data such as [DORA metrics](https://en.wikipedia.org/wiki/DevOps). + + +[^1]: See https://aws.amazon.com/blogs/apn/building-self-healing-infrastructure-as-code-with-dynatrace-aws-lambda-and-aws-service-catalog/ for more about this pattern. \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/signals/anomalies.md b/docusaurus/observability-best-practices/docs/signals/anomalies.md new file mode 100644 index 000000000..9d1836cce --- /dev/null +++ b/docusaurus/observability-best-practices/docs/signals/anomalies.md @@ -0,0 +1,3 @@ +# Anomalies + +WIP \ No newline at end of file diff --git a/docusaurus/observability-best-practices/docs/signals/events.md b/docusaurus/observability-best-practices/docs/signals/events.md new file mode 100644 index 000000000..c6d9f7ea1 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/signals/events.md @@ -0,0 +1,67 @@ +# Events + +## What do we mean by events? +Many architectures are event driven these days. In event driven architectures, events are signals from different systems which we capture and pass onto other systems. An event is typically a change in state, or an update. + +For example, in an eCommerce system you may have an event when an item is added to the cart. This event could be captured and passed on to the shopping cart part of the system to update the number of items and cost of the cart, along with the item details. 
+ +:::info + For some customers an event may be a *milestone*, such as a the completion of a purchase. There is a case to be made for treating the aggregate moment of a workflow conclusion as an event, but for our purposes we do not consider a milestone itself to be an event. +::: +## Why are events useful? +There are two main ways in which events can be useful in your Observability solution. One is to visualize events in the context of other data, and the other is to enable you to take action based on an event. + +:::info + Events are intended to give valuable information, either to people or machines, about changes and actions in your workload. +::: + +## Visualizing events +There are many event signals which are not directly from your application, but may have an impact on your application performance, or provide additional insight into root cause. Dashboards are the most common mechanism for visualizing your events, though some analytics or business intelligence tools also work in this context. Even email or instant messaging applications can receive visualizations readily. + + +Consider a timechart of application performance, such as time to place an order on your web front end. The time chart lets you see there has been a step change in the response time a few days ago. It might be useful to know if there have been any recent deployments. Consider being able to see a timechart of recent deployments alongside, or superimposed on the same chart? + +![Visualizing events](images/visualizing_events.png) + +:::tip + Consider which events might be useful to you to understand the wider context. The events that are important to you might be code deployments, infrastructure change events, adding new data (such as publishing new items for sale, or bulk adding new users), or modifying or adding functionality (such as changing the way people add items to their cart). +::: + +:::info + Visualize events along with other important metric data so you can [correlate events](../signals/metrics/#correlate-with-operational-metric-data). +::: + +## Taking action on events +In the Observability world, a triggered alarm is a common event. This event would likely contain an identifier for the alarm, the alarm state (such as `IN ALARM`, or `OK`), and details of what triggered this. In many cases this alarm event will be detected and an email notification sent. This is an example of an action on an alarm. + +Alarm notification is critical in observability. This is how we let the right people know there is an issue. However, when action on events mature in your observability solution, it can automatically remediate the issue without human intervention. + + +### But what action to take? +We cannot automate action without first understanding what action will ease the detected issue. At the start of your Observability journey, this may often not be obvious. However, the more experience you have remediating issues, the more you can fine tune your alarms to catch areas where there is a known action. There may be built in actions in the alarm service you have, or you may need to capture the alarm event yourself and script the resolution. + +:::info + Auto-scaling systems, such as a [horizontal pod autoscaling](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) are just an implementation of this principal. Kubernetes simply abstracts this automation for you. +::: +Having access to data on alarm frequency and resolution will help you decide if there is a possibility for automation. 
Whilst wider scope alarms based on issue symptoms are great at capturing issues, you may find you need more specific criteria to link to auto remediation. + +As you do this, consider integrating this with your incident management/ticketing/ITSM tool. Many organizations track incidents, and associated resolutions and metrics such as Mean Time to Resolve (MTTR). If you do this, consider also capturing your *automated* resolutions in a similar manner. This lets you understand the type and proportion of issues which are automatically remediated, but also allows you to look for underlying patterns and issues. + +:::tip + Just because someone didn't have to manually fix an issue, doesn't mean you shouldn't be looking at the underlying cause. +::: +For example, consider a server restart every time it becomes unresponsive. The restart allows the system to continue functioning, but what is causing the unresponsiveness. How often this happens, and if there is a pattern (for example that matches with report generation, or high users, or system backups), will determine the priority and resources you put into understanding and fixing the root cause. +:::info + Consider delivery of *every* event related to your [key performance indicators](../signals/metrics/#know-your-key-performance-indicatorskpis-and-measure-them) into a message bus for consumption. And note that some observability solutions do this transparently without explicit configuration requirements. +::: +## Getting your events into your Observability platform +Once you have identified the events which are important to you, you'll need to consider how best to get them into your Observability platform. +Your platform may have a specific way to capture events, or you may have to bring them in as logs or metric data. + +:::note + One simple way to get the information in is to write the events to a log file and ingest them in the same way as you do your other log events. +::: + +Explore how your system will let you visualize these. Can you identify events which are related to your application? Can you combine data onto a single chart? Even if there is nothing specific, you should at least be able to create a timechart alongside your other data to visually correlate. Keep the time axis the same, and consider stacking these vertically for easy comparison. 
![Visualizing events as stacked charts](images/visualizing_events_stacked.png)
diff --git a/docusaurus/observability-best-practices/docs/signals/images/logs1.graffle b/docusaurus/observability-best-practices/docs/signals/images/logs1.graffle new file mode 100644 index 000000000..652c316aa Binary files /dev/null and b/docusaurus/observability-best-practices/docs/signals/images/logs1.graffle differ
diff --git a/docusaurus/observability-best-practices/docs/signals/images/logs1.png b/docusaurus/observability-best-practices/docs/signals/images/logs1.png new file mode 100644 index 000000000..575035d85 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/signals/images/logs1.png differ
diff --git a/docusaurus/observability-best-practices/docs/signals/images/logs2.png b/docusaurus/observability-best-practices/docs/signals/images/logs2.png new file mode 100644 index 000000000..17b167114 Binary files /dev/null and b/docusaurus/observability-best-practices/docs/signals/images/logs2.png differ
diff --git a/docusaurus/observability-best-practices/docs/signals/images/visualizing_events.png b/docusaurus/observability-best-practices/docs/signals/images/visualizing_events.png new file mode 100644 index 000000000..95c6f99ea Binary files /dev/null and b/docusaurus/observability-best-practices/docs/signals/images/visualizing_events.png differ
diff --git a/docusaurus/observability-best-practices/docs/signals/images/visualizing_events_stacked.png b/docusaurus/observability-best-practices/docs/signals/images/visualizing_events_stacked.png new file mode 100644 index 000000000..2e53e439d Binary files /dev/null and b/docusaurus/observability-best-practices/docs/signals/images/visualizing_events_stacked.png differ
diff --git a/docusaurus/observability-best-practices/docs/signals/logs.md b/docusaurus/observability-best-practices/docs/signals/logs.md new file mode 100644 index 000000000..633bf7448 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/signals/logs.md @@ -0,0 +1,137 @@ +# Logs

Logs are a series of messages that are sent by an application, or an appliance, and are represented by one or more lines of details about an event, or sometimes about the health of that application. Typically, logs are delivered to a file, though sometimes they are sent to a collector that performs analysis and aggregation. There are many full-featured log aggregators, frameworks, and products that aim to simplify the task of generating, ingesting, and managing log data at any volume – from megabytes per day to terabytes per hour.

Logs are emitted by a single application at a time and usually pertain to the scope of that *one application* - though developers are free to have logs be as complex and nuanced as they desire. For our purposes we consider logs to be a fundamentally different signal from [traces](../signals/traces), which are composed of events from more than one application or service, and with context about the connection between services such as response latency, service faults, request parameters etc.

Data in logs can also be aggregated over a period of time. For example, they may be statistical (e.g. the number of requests served over the previous minute). They can be structured, free-form, verbose, and in any written language.

The primary use cases for logging are describing:

* an event, including its status and duration, and other vital statistics
* errors or warnings related to that event (e.g.
stack traces, timeouts)
* application launches, start-up and shutdown messages

:::note
 Logs are intended to be *immutable*, and many log management systems include mechanisms to protect against, and detect, attempts to modify log data.
:::
Regardless of your requirements for logging, these are the best practices that we have identified.

## Structured logging is key to success

Many systems will emit logs in a semi-structured format. For example, an Apache web server may write logs like this, with each line pertaining to a single web request:

    192.168.2.20 - - [28/Jul/2006:10:27:10 -0300] "GET /cgi-bin/try/ HTTP/1.0" 200 3395
    127.0.0.1 - - [28/Jul/2006:10:22:04 -0300] "GET / HTTP/1.0" 200 2216

Whereas a Java stack trace may be a single event that spans multiple lines and is less structured:

    Exception in thread "main" java.lang.NullPointerException
        at com.example.myproject.Book.getTitle(Book.java:16)
        at com.example.myproject.Author.getBookTitles(Author.java:25)
        at com.example.myproject.Bootstrap.main(Bootstrap.java:14)

And a Python error log event may look like this:
```
Traceback (most recent call last):
  File "e.py", line 7, in <module>
    raise TypeError("Again !?!")
TypeError: Again !?!
```
Of these three examples, only the first one is easily parsed by both humans *and* a log aggregation system. Using structured logs makes it easy to process log data quickly and effectively, giving both humans and machines the data they need to immediately find what they are looking for.

The most commonly understood log format is JSON, wherein each component of an event is represented as a key/value pair. In JSON, the Python example above may be rewritten to look like this:
```
{
  "level": "ERROR",
  "file": "e.py",
  "line": 7,
  "error": "TypeError(\"Again !?!\")"
}
```
The use of structured logs makes your data transportable from one log system to another, simplifies development, and makes operational diagnosis faster (with fewer errors). Also, using JSON embeds the schema of the log message along with the actual data, which enables sophisticated log analysis systems to index your messages automatically.

## Use log levels appropriately

There are two types of logs: those that have a *level* and those that are a series of events. For those that have a level, the level is a critical component of a successful logging strategy. Log levels vary slightly from one framework to another, but generally they follow this structure:

| Level | Description |
| ----- | ----------- |
| `DEBUG` | Fine-grained informational events that are most useful to debug an application. These are usually of value to developers and are very verbose. |
| `INFO` | Informational messages that highlight the progress of the application at a coarse-grained level. |
| `WARN` | Potentially harmful situations that indicate a risk to an application. These can trigger an alarm in an application. |
| `ERROR` | Error events that might still allow the application to continue running. These are likely to trigger an alarm that requires attention. |
| `FATAL` | Very severe error events that will presumably cause an application to abort. |

:::info
 Logs that have no explicit level may implicitly be treated as `INFO`, though this behaviour may vary between applications.
:::
Other common log levels include `CRITICAL`, `ALL`, and `NONE`, depending on your needs, programming language, and framework, though they are not found in every application stack.
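As a rough illustration of both practices — structured output and explicit levels — here is a minimal sketch using only Python's standard library (the logger name and the exact set of fields are illustrative; most languages have equivalent logging frameworks):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON line with an explicit level."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "file": record.pathname,
            "line": record.lineno,
        })

handler = logging.StreamHandler(sys.stdout)   # stdout keeps the app decoupled from log routing
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("order-service")   # hypothetical application name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed")
logger.warning("payment retry limit approaching")
logger.error("payment failed: TypeError('Again !?!')")
```

This kind of output is straightforward for a log aggregation system to index, and the explicit level field is what makes the filtering practices described below possible.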
Log levels are crucial for informing your monitoring and observability solution about the health of your environment, and your log data should express this clearly using a logical value.

:::tip
 Logging too much data at `WARN` will fill your monitoring system with data that is of limited value, and you may then lose important data in the sheer volume of messages.
:::
![Logs flowchart](./images/logs1.png)

:::info
 Using a standardized log level strategy makes automation easier, and helps developers get to the root cause of issues quickly.
:::

:::warning
 Without a standard approach to log levels, [filtering your logs](#filter-logs-close-to-the-source) is a major challenge.
:::
## Filter logs close to the source

Reduce the volume of logs as close to the source as possible. There are many reasons to follow this best practice:

* Ingesting logs always costs time, money, and resources.
* Filtering sensitive data (e.g. personally identifiable data) from downstream systems reduces risk exposure from data leakage.
* Downstream systems may not have the same operational concerns as the sources of data. For example, `INFO` logs from an application may be of no interest to a monitoring and alerting system that watches for `CRITICAL` or `FATAL` messages.
* Log systems, and networks, need not be placed under undue stress and traffic.

:::info
 Filter your logs close to the source to keep your costs down, decrease the risk of data exposure, and focus each component on the [things that matter](../guides/#monitor-what-matters).
:::

:::tip
 Depending on your architecture, you may wish to use infrastructure as code (IaC) to deploy changes to your application *and* environment in one operation. This approach allows you to deploy your log filter patterns along with applications, giving them the same rigor and treatment.
:::
## Avoid double-ingestion antipatterns

A common pattern that administrators pursue is copying all of their logging data into a single system with the goal of querying all of their logs from a single location. There are some manual workflow advantages to doing so; however, this pattern introduces additional cost, complexity, points of failure, and operational overhead.

![Double log ingestion](./images/logs2.png)

:::info
 Where possible, use a combination of [log levels](#use-log-levels-appropriately) and [log filtering](#filter-logs-close-to-the-source) to avoid a wholesale propagation of log data from your environments.
:::

:::info
 Some organizations or workloads require [log shipping](https://en.wikipedia.org/wiki/Log_shipping) in order to meet regulatory requirements, store logs in a secure location, provide non-repudiation, or achieve other objectives. This is a common use case for re-ingesting log data. Note that a proper application of [log levels](#use-log-levels-appropriately) and [log filtering](#filter-logs-close-to-the-source) is still appropriate to reduce the volume of superfluous data entering these log archives.
:::
## Collect metric data from your logs

Your logs contain [metrics](../signals/metrics/) that are just waiting to be collected! Even ISV solutions or applications that you have not written yourself will emit valuable data into their logs, from which you can extract meaningful insights into overall workload health.
Common examples include:

* Slow query time from databases
* Uptime from web servers
* Transaction processing time
* Counts of `ERROR` or `WARNING` events over time
* Raw count of packages that are available for upgrade

:::tip
 This data is less useful when locked in a static log file. The best practice is to identify key metric data and then publish it into your metric system where it can be correlated with other signals.
:::
## Log to `stdout`

Where possible, applications should log to `stdout` rather than to a fixed location such as a file or socket. This enables log agents to collect and route your log events based on rules that make sense for your own observability solution. While not possible for all applications, this is the best practice for containerized workloads.

:::note
 While applications should be generic and simple in their logging practices, remaining loosely coupled from logging solutions, the transmission of log data does still require a [log collector](../tools/logs/) to send data from `stdout` to a file. The important concept is to avoid application and business logic being dependent on your logging infrastructure - in other words, you should work to separate your concerns.
:::

:::info
 Decoupling your application from your log management lets you adapt and evolve your solution without code changes, thereby minimizing the potential [blast radius](../faq/#what-is-a-blast-radius) of changes made to your environment.
:::
\ No newline at end of file
diff --git a/docusaurus/observability-best-practices/docs/signals/metrics.md b/docusaurus/observability-best-practices/docs/signals/metrics.md new file mode 100644 index 000000000..dcaab0595 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/signals/metrics.md @@ -0,0 +1,49 @@ +# Metrics

Metrics are a series of numerical values that are kept in order by the time at which they are created. They are used to track everything from the number of servers in your environment and their disk usage, to the number of requests they handle per second and the latency in completing those requests.

But metrics are not limited to infrastructure or application monitoring. Rather, they can be used for any kind of business or workload to track sales, call queues, and customer satisfaction. In fact, metrics are most useful when combining both operational data and business metrics, giving a well-rounded view of an observable system.

It might be worth looking into [the OpenTelemetry documentation page](https://opentelemetry.io/docs/concepts/signals/metrics/), which provides some additional context on metrics.

## Know your Key Performance Indicators(KPIs), and measure them!

The *most* important thing with metrics is to *measure the right things*. And what those are will be different for everyone. An e-commerce application may have sales per hour as a critical KPI, whereas a bakery would likely be more interested in the number of croissants made per day.

:::warning
 There is no singular, entirely complete, and comprehensive source for your business KPIs. You must understand your project or application well enough to know what your *output goals* are.
:::
Your first step is to name your high-level goals, and most likely those goals are not expressed as a single metric that comes from your infrastructure alone.
In the e-commerce example above, once you identify the *meta* goal, which is measuring *sales per hour*, you can then backtrack to detailed metrics such as time spent searching for a product before purchase, time taken to complete the checkout process, latency of product search results, and so on. This will guide you to be intentional about collecting the relevant information to observe the system.

:::info
 Having identified your KPIs, you can now *work backwards* to see what metrics in your workload impact them.
:::
## Correlate with operational metric data

If high CPU utilization on your web server causes slow response times, which in turn makes for dissatisfied customers and ultimately lower revenue, then measuring your CPU utilization has a direct impact on your business outcomes and should *absolutely* be measured!

Or conversely, if you have an application that performs batch processing on ephemeral cloud resources (such as an Amazon EC2 fleet, or similar in other cloud provider environments), then you may *want* CPU utilization to be as high as possible in order to complete the batch in the most cost-effective way.

In either case, you need to have your operational data (e.g. CPU utilization) in the same system as your business metrics so you can correlate the two.

:::info
 Store your business metrics and operational metrics in a system where you can correlate them together and draw conclusions based on observed impacts to both.
:::
## Know what good looks like!

Understanding what a healthy baseline is can be challenging. Many people have to stress test their workloads to understand what healthy metrics look like. However, depending on your needs you may be able to observe existing operational metrics to draw safe conclusions about healthy thresholds.

A healthy workload is one that balances meeting your KPI objectives with remaining resilient, available, and cost-effective.

:::info
 Your KPIs *must* have an identified healthy range so you can create [alarms](../signals/alarms/) when performance falls below, or above, what is required.
:::
## Use anomaly detection algorithms

The challenge with [knowing what good looks like](#know-what-good-looks-like) is that it may be impractical to know the healthy thresholds for *every* metric in your system. A Relational Database Management System (RDBMS) can emit dozens of performance metrics, and when coupled with a microservices architecture you can potentially have hundreds of metrics that can impact your KPIs.

Watching such a large number of datapoints and individually identifying their upper and lower thresholds may not always be practical for humans to do. But machine learning is *very* good at this sort of repetitive task. Leverage automation and machine learning wherever possible, as it can help identify issues that you would otherwise not even know about!

:::info
 Use machine learning algorithms and anomaly detection models to automatically calculate your workload's performance thresholds.
:::
\ No newline at end of file
diff --git a/docusaurus/observability-best-practices/docs/signals/traces.md b/docusaurus/observability-best-practices/docs/signals/traces.md new file mode 100644 index 000000000..db5d2ba0d --- /dev/null +++ b/docusaurus/observability-best-practices/docs/signals/traces.md @@ -0,0 +1,58 @@ +# Traces

Traces represent the entire journey of a request as it traverses the different components of an application.
Unlike logs or metrics, *traces* are composed of events from more than one application or service, and with context about the connection between services such as response latency, service faults, request parameters, and metadata.

:::tip
 There is conceptual similarity between [logs](../signals/logs/) and traces; however, a trace is intended to be considered in a cross-service context, whereas logs are typically limited to the execution of a single service or application.
:::

Today's developers are leaning towards building modular and distributed applications. Some call these [Service Oriented Architecture](https://en.wikipedia.org/wiki/Service-oriented_architecture), others will refer to them as [microservices](https://aws.amazon.com/microservices/). Regardless of the name, when something goes wrong in these loosely coupled applications, just looking at logs or events may not be sufficient to track down the root cause of an incident. Having full visibility into request flow is essential, and this is where traces add value. Through a series of causally related events that depict end-to-end request flow, traces help you gain that visibility.

Traces are an essential pillar of observability because they provide the basic information on the flow of a request as it enters and leaves the system.

:::tip
 Common use cases for traces include performance profiling, debugging production issues, and root cause analysis of failures.
:::
## Instrument all of your integration points

When all of your workload functionality and code is in one place, it is easy to look at the source code to see how a request is passed across different functions. At a system level you know which machine the app is running on, and if something goes wrong, you can find the root cause quickly. Imagine doing that in a microservices-based architecture where different components are loosely coupled and are running in a distributed environment. Logging into numerous systems to see their logs for each interconnected request would be impractical, if not impossible.

This is where observability can help. Instrumentation is a key step towards increasing that observability. In broader terms, instrumentation is measuring the events in your application using code.

A typical instrumentation approach would be to assign a unique trace identifier to each request entering the system and carry that trace ID as it passes through different components, while adding additional metadata.

:::info
 Every connection from one service to another should be instrumented to emit traces to a central collector. This approach helps you see into otherwise opaque aspects of your workload.
:::
:::info
 Instrumenting your application can be a largely automated process when using an auto-instrumentation agent or library.
:::

## Transaction time and status matter, so measure them!

A well-instrumented application can produce an end-to-end trace, which can be viewed as either a waterfall graph like this:

![WaterFall Trace](../images/waterfall-trace.png)

Or a service map:

![servicemap Trace](../images/service-map-trace.png)

It is important that you measure the transaction times and response codes of every interaction. This will help in calculating overall processing times and tracking them for compliance with your SLAs, SLOs, or business KPIs.
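Here is a rough sketch of what such instrumentation can look like using the OpenTelemetry Python SDK (the service name, span name, attribute, and the stubbed payment call are all illustrative, and exporter/collector configuration is omitted):

```python
from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("checkout-service")   # hypothetical service name

def charge_payment(order_total):
    """Stand-in for a call to a downstream payment service."""
    return 200  # pretend HTTP status code

def place_order(order_total):
    # One span per integration point records its duration and outcome
    with tracer.start_as_current_span("payment-service.charge") as span:
        try:
            status = charge_payment(order_total)
            span.set_attribute("http.response.status_code", status)
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(StatusCode.ERROR)
            raise

place_order(42.50)
```

In practice these spans would be exported through a collector (such as the AWS Distro for OpenTelemetry) to your tracing backend, where the recorded durations and status codes feed the SLA and KPI tracking described above.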
:::info
 Only by understanding and recording the response times and status codes of your interactions can you see the factors contributing to overall request patterns and workload health.
:::
## Metadata, annotations, and labels are your best friend

Traces are persisted and assigned a unique ID, with each trace broken down into *spans* or *segments* (depending on your tooling) that record each step within the request’s path. A span indicates the entities with which the trace interacts, and, like the parent trace, each span is assigned a unique ID and time stamp and can include additional data and metadata as well. This information is useful for debugging because it gives you the exact time and location a problem occurred.

This is best explained through a practical example. An e-commerce application may be divided into many domains: authentication, authorization, shipping, inventory, payment processing, fulfillment, product search, recommendations, and many more. Rather than search through traces from all of these interconnected domains though, labelling your trace with a customer ID allows you to search for only the interactions that are specific to this one person. This helps you to narrow your search instantly when diagnosing an operational issue.

:::info
 While the naming convention may vary between vendors, each trace can be augmented with metadata, labels, or annotations, and these are searchable across your entire workload. Adding them does require code on your part, but greatly increases the observability of your workload.
:::
:::warning
 Traces are not logs, so be frugal with what metadata you include in your traces. And trace data is not intended for forensics and auditing, even with a high sample rate.
:::
\ No newline at end of file
diff --git a/docusaurus/observability-best-practices/docs/tools/adot-traces.md b/docusaurus/observability-best-practices/docs/tools/adot-traces.md new file mode 100644 index 000000000..90f25e13d --- /dev/null +++ b/docusaurus/observability-best-practices/docs/tools/adot-traces.md @@ -0,0 +1,3 @@ +# Tracing with ADOT

todo
\ No newline at end of file
diff --git a/docusaurus/observability-best-practices/docs/tools/alarms.md b/docusaurus/observability-best-practices/docs/tools/alarms.md new file mode 100644 index 000000000..00d90e6d1 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/tools/alarms.md @@ -0,0 +1,46 @@ +# Alarms

Amazon CloudWatch alarms allow you to define thresholds on CloudWatch metrics and logs and receive notifications based on the rules configured in CloudWatch.

**Alarms on CloudWatch metrics:**

CloudWatch alarms allow you to define thresholds on CloudWatch metrics and receive notifications when the metrics fall outside the expected range. Each metric can trigger multiple alarms, and each alarm can have many actions associated with it. There are two different ways you can set up metric alarms based on CloudWatch metrics.

1. **Static threshold**: A static threshold represents a hard limit that the metric should not violate. You must define the range for the static threshold, such as the upper and lower limits, based on your understanding of behaviour during normal operations. If the metric value falls below or above the static threshold, you can configure CloudWatch to generate an alarm.

2.
**Anomaly detection**: Anomalies are rare items, events, or observations which deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behaviour. CloudWatch anomaly detection analyzes past metric data and creates a model of expected values. The expected values take into account the typical hourly, daily, and weekly patterns in the metric. You can enable anomaly detection for each metric as required; CloudWatch applies a machine-learning algorithm to define the upper and lower limits for each enabled metric and generates an alarm only when the metric falls outside the expected values.

:::tip
 Static thresholds are best used for metrics that you have a firm understanding of, such as identified performance breakpoints in your workload, or absolute limits on infrastructure components.
:::
:::info
 Use an anomaly detection model with your alarms when you do not have visibility into the performance of a particular metric over time, or when the metric value has not been observed under load-testing or anomalous traffic previously.
:::
![CloudWatch Alarm types](../images/cwalarm1.png)

You can follow the instructions below on how to set up static and anomaly detection based alarms in CloudWatch.

[Static threshold alarms](https://catalog.us-east-1.prod.workshops.aws/workshops/31676d37-bbe9-4992-9cd1-ceae13c5116c/en-US/alarms/mericalarm)

[CloudWatch anomaly Detection based alarms](https://catalog.us-east-1.prod.workshops.aws/workshops/31676d37-bbe9-4992-9cd1-ceae13c5116c/en-US/alarms/adalarm)

:::info
 To reduce alarm fatigue and the noise from the number of alarms generated, there are two advanced ways to configure alarms:

 1. **Composite alarms**: A composite alarm includes a rule expression that takes into account the alarm states of other alarms that have been created. The composite alarm goes into `ALARM` state only if all conditions of the rule are met. The alarms specified in a composite alarm's rule expression can include metric alarms and other composite alarms. Composite alarms help to [fight alarm fatigue with aggregation](../signals/alarms/#fight-alarm-fatigue-with-aggregation).

 2. **Metric math based alarms**: Metric math expressions can be used to build more meaningful KPIs and alarm on them. For example, you can combine multiple metrics into a single combined utilization metric and alarm on it.
:::

The instructions below guide you on how to set up composite alarms and metric math based alarms.

[Composite Alarms](https://catalog.us-east-1.prod.workshops.aws/workshops/31676d37-bbe9-4992-9cd1-ceae13c5116c/en-US/alarms/compositealarm)

[Metric Math alarms](https://aws.amazon.com/blogs/mt/create-a-metric-math-alarm-using-amazon-cloudwatch/)

**Alarms on CloudWatch Logs**

You can create alarms based on CloudWatch Logs using CloudWatch metric filters. Metric filters turn the log data into numerical CloudWatch metrics that you can graph or set an alarm on. Once you have set up the metrics, you can use either static or anomaly detection based alarms on the CloudWatch metrics generated from your CloudWatch Logs.

You can find an example of how to set up a [metric filter on CloudWatch logs](https://aws.amazon.com/blogs/mt/quantify-custom-application-metrics-with-amazon-cloudwatch-logs-and-metric-filters/).
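As a minimal sketch of that flow using boto3 (the log group name, filter pattern, metric namespace, threshold, and SNS topic ARN are all placeholders), a metric filter turns matching log events into a metric, and an alarm is then defined on that metric:

```python
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Turn ERROR log lines into a custom metric (log group and namespace are placeholders)
logs.put_metric_filter(
    logGroupName="/myapp/application",
    filterName="application-errors",
    filterPattern="ERROR",
    metricTransformations=[{
        "metricName": "ErrorCount",
        "metricNamespace": "MyApp",
        "metricValue": "1",
    }],
)

# Alarm when more than five errors occur within a 5-minute period (threshold is illustrative)
cloudwatch.put_metric_alarm(
    AlarmName="myapp-error-count",
    Namespace="MyApp",
    MetricName="ErrorCount",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=5,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],  # placeholder SNS topic
)
```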
diff --git a/docusaurus/observability-best-practices/docs/tools/alerting_and_incident_management.md b/docusaurus/observability-best-practices/docs/tools/alerting_and_incident_management.md new file mode 100644 index 000000000..3b16e1839 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/tools/alerting_and_incident_management.md @@ -0,0 +1 @@ +# Alerting and incident management
\ No newline at end of file
diff --git a/docusaurus/observability-best-practices/docs/tools/amp.md b/docusaurus/observability-best-practices/docs/tools/amp.md new file mode 100644 index 000000000..82420b0c2 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/tools/amp.md @@ -0,0 +1,11 @@ +# Amazon Managed Service for Prometheus

[Prometheus](https://prometheus.io/) is a popular open source monitoring tool that provides wide-ranging metrics capabilities and insights about resources such as compute nodes, as well as application-related performance data.

Prometheus uses a *pull* model to collect data, whereas CloudWatch uses a *push* model. Prometheus and CloudWatch are used for some overlapping use cases, though their operating models are very different and they are priced differently.

[Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/) is widely used in containerized applications hosted in Kubernetes and [Amazon ECS](https://aws.amazon.com/ecs/).

You can add Prometheus metric capabilities on your EC2 instance or ECS/EKS cluster using the [CloudWatch agent](../tools/cloudwatch_agent/) or [AWS Distro for OpenTelemetry](https://aws-otel.github.io/). The CloudWatch agent with Prometheus support discovers and collects Prometheus metrics to monitor, troubleshoot, and alarm on application performance degradation and failures faster. This also reduces the number of monitoring tools required to improve observability.

CloudWatch Container Insights monitoring for Prometheus automates the discovery of Prometheus metrics from containerized systems and workloads: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights-Prometheus.html
\ No newline at end of file
diff --git a/docusaurus/observability-best-practices/docs/tools/cloudwatch-dashboard.md b/docusaurus/observability-best-practices/docs/tools/cloudwatch-dashboard.md new file mode 100644 index 000000000..ef252be07 --- /dev/null +++ b/docusaurus/observability-best-practices/docs/tools/cloudwatch-dashboard.md @@ -0,0 +1,259 @@ +# CloudWatch Dashboard

## Introduction

Knowing the inventory of resources in your AWS accounts, along with their performance and health, is important for stable resource management. Amazon CloudWatch dashboards are customizable home pages in the CloudWatch console that can be used to monitor your resources in a single view, even if those resources are cross-account or spread across different regions.

[Amazon CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) enable customers to create reusable graphs and visualize cloud resources and applications in a unified view. Through CloudWatch dashboards customers can graph metrics and logs data side by side in a unified view to quickly get context and move from diagnosing the problem to understanding the root cause, reducing the mean time to recover or resolve (MTTR). For example, customers can visualize current utilization of key metrics like CPU utilization & memory and compare them to allocated capacity.
Customers can also correlate the log patterns of a specific metric and set alarms to alert on performance and operational issues. CloudWatch dashboards also help customers display the current status of alarms, quickly drawing attention to items that need action. Sharing of CloudWatch dashboards allows customers to easily share displayed dashboard information with teams and stakeholders who are internal or external to the organization.

## Widgets

#### Default Widgets

Widgets form the building blocks of CloudWatch dashboards and display important information & near real time details of resources and application metrics and logs in your AWS environment. Customers can customize dashboards to their desired experience by adding, removing, rearranging, or resizing widgets according to their requirements.

The types of graphs that you can add to your dashboard include **Line, Number, Gauge, Stacked area, Bar**, and **Pie**, which are of the **Graph** type. Other widgets like **Text, Alarm Status, Logs table**, and **Explorer** are also available for customers to choose from when adding metrics or logs data to build dashboards.

![Default Widgets](../images/cw_dashboards_widgets.png)

**Additional References:**

- AWS Observability Workshop on [Metric Number Widgets](https://catalog.workshops.aws/observability/en-US/aws-native/dashboards/metrics-number)
- AWS Observability Workshop on [Text Widgets](https://catalog.workshops.aws/observability/en-US/aws-native/dashboards/text-widget)
- AWS Observability Workshop on [Alarm Widgets](https://catalog.workshops.aws/observability/en-US/aws-native/dashboards/alarm-widgets)
- Documentation on [Creating and working with widgets on CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create-and-work-with-widgets.html)

#### Custom Widgets

Customers can also choose to [add custom widgets](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create-and-work-with-widgets.html) to CloudWatch dashboards to create custom visualizations, display information from multiple sources, or add custom controls like buttons to take actions directly in a CloudWatch dashboard. Custom widgets are completely serverless, powered by Lambda functions, enabling complete control over the content, layout, and interactions. A custom widget is an easy way to build a custom data view or tool on a dashboard without needing to learn a complicated web framework. If you can write code in Lambda and create HTML, then you can create a useful custom widget.

![Custom Widgets](../images/cw_dashboards_custom-widgets.png)

**Additional References:**

- AWS Observability Workshop on [custom widgets](https://catalog.workshops.aws/observability/en-US/aws-native/dashboards/custom-widgets)
- [CloudWatch Custom Widgets Samples](https://github.com/aws-samples/cloudwatch-custom-widgets-samples#what-are-custom-widgets) on GitHub
- Blog: [Using Amazon CloudWatch dashboards custom widgets](https://aws.amazon.com/blogs/mt/introducing-amazon-cloudwatch-dashboards-custom-widgets/)

## Automatic Dashboards

Automatic Dashboards are available in all AWS public regions and provide an aggregated view of the health and performance of all AWS resources under Amazon CloudWatch. They help customers quickly get started with monitoring, gain a resource-based view of metrics and alarms, and easily drill down to understand the root cause of performance issues.
Automatic Dashboards are pre-built with AWS service recommended [best practices](https://docs.aws.amazon.com/prescriptive-guidance/latest/implementing-logging-monitoring-cloudwatch/cloudwatch-dashboards-visualizations.html), remain resource aware, and dynamically update to reflect the latest state of important performance metrics. Automatic service dashboards display all the standard CloudWatch metrics for a service, graph all resources used for each service metric, and help customers quickly identify outlier resources across accounts, such as resources with unusually high or low utilization, which can help optimize costs.

![Automatic Dashboards](../images/automatic-dashboard.png)

**Additional References:**

- AWS Observability Workshop on [Automatic dashboards](https://catalog.workshops.aws/observability/en-US/aws-native/dashboards/autogen-dashboard)
- [Monitor AWS Resources Using Amazon CloudWatch Dashboards](https://www.youtube.com/watch?v=I7EFLChc07M) on YouTube

#### Container Insights in Automatic dashboards

[CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html) collects, aggregates, and summarizes metrics and logs from containerized applications and microservices. Container Insights is available for Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and Kubernetes platforms on Amazon EC2. Container Insights supports collecting metrics from clusters deployed on Fargate for both Amazon ECS and Amazon EKS. CloudWatch automatically collects metrics for many resources, such as CPU, memory, disk, and network & also provides diagnostic information, such as container restart failures, to help isolate issues and resolve them quickly.

CloudWatch creates aggregated metrics at the cluster, node, pod, task, and service level as CloudWatch metrics using [embedded metric format](https://aws-observability.github.io/observability-best-practices/guides/signal-collection/emf/), which are performance log events that use a structured JSON schema that enables high-cardinality data to be ingested and stored at scale. The metrics that Container Insights collects are available in [CloudWatch automatic dashboards](https://docs.aws.amazon.com/prescriptive-guidance/latest/implementing-logging-monitoring-cloudwatch/cloudwatch-dashboards-visualizations.html#use-automatic-dashboards), and also viewable in the Metrics section of the CloudWatch console.

![Container Insights](../images/Container_Insights_CW_Automatic_DB.png)

#### Lambda Insights in Automatic dashboards

[CloudWatch Lambda Insights](https://docs.aws.amazon.com/lambda/latest/dg/monitoring-insights.html) is a monitoring and troubleshooting solution for serverless applications such as AWS Lambda, which creates dynamic, [automatic dashboards](https://docs.aws.amazon.com/prescriptive-guidance/latest/implementing-logging-monitoring-cloudwatch/cloudwatch-dashboards-visualizations.html#use-automatic-dashboards) for Lambda functions. It also collects, aggregates, and summarizes system-level metrics, including CPU time, memory, disk, and network, as well as diagnostic information such as cold starts and Lambda worker shutdowns, to help isolate and quickly resolve issues with Lambda functions.
[Lambda Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Lambda-Insights.html) is a Lambda extension provided as a layer at the function level which, when enabled, uses the [embedded metric format](https://aws-observability.github.io/observability-best-practices/guides/signal-collection/emf/) to extract metrics from the log events and doesn’t require any agents.

![Lambda Insights](../images/Lambda_Insights_CW_Automatic_DB.png)

## Custom Dashboards

Customers can also create as many additional [custom dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create_dashboard.html) as they want, with different widgets, and customize them accordingly. Dashboards can be configured for a cross-region & cross-account view and can be added to a favorites list.

![Custom Dashboards](../images/CustomDashboard.png)

Customers can add automatic or custom dashboards to the [favorite list](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/add-dashboard-to-favorites.html) in the CloudWatch console so that it's quick and easy to access them from the navigation pane in the console page.

**Additional References:**

- AWS Observability Workshop on [CloudWatch dashboard](https://catalog.workshops.aws/observability/en-US/aws-native/dashboards/create)
- AWS Well-Architected Labs on Performance Efficiency for [monitoring with CloudWatch Dashboards](https://www.wellarchitectedlabs.com/performance-efficiency/100_labs/100_monitoring_windows_ec2_cloudwatch/)

#### Adding Contributor Insights to CloudWatch dashboards

CloudWatch provides [Contributor Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html) to analyze log data and create time series that display contributor data, where you can see metrics about the top-N contributors, the total number of unique contributors, and their usage. This helps you find top talkers and understand who or what is impacting system performance. For example, customers can find bad hosts, identify the heaviest network users, or find URLs that generate the most errors.

Contributor Insights reports can be added to any [new or existing Dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights-ViewReports.html) in the CloudWatch console.

![Contributor Insights](../images/Contributor_Insights_CW_DB.png)

#### Adding Application Insights to CloudWatch dashboards

[CloudWatch Application Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-application-insights.html) facilitates observability for applications hosted on AWS and their underlying AWS resources. The enhanced visibility into application health that it provides helps reduce the mean time to repair (MTTR) when troubleshooting application issues. Application Insights provides automated dashboards that show potential problems with monitored applications, which help customers quickly isolate ongoing issues with the applications and infrastructure.

The ‘Export to CloudWatch’ option inside Application Insights, shown below, adds a dashboard to the CloudWatch console which helps customers easily monitor their critical applications for insights.
![Application Insights](../images/Application_Insights_CW_DB.png)

#### Adding Service Map to CloudWatch dashboards

[CloudWatch ServiceLens](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ServiceLens.html) enhances the observability of services and applications by integrating traces, metrics, logs, alarms, and other resource health information in one place. ServiceLens integrates CloudWatch with AWS X-Ray to provide an end-to-end view of the application to help customers more efficiently pinpoint performance bottlenecks and identify impacted users. A [service map](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/servicelens_service_map.html) displays service endpoints and resources as nodes and highlights the traffic, latency, and errors for each node and its connections. Each displayed node provides detailed insights about the correlated metrics, logs, and traces associated with that part of the service.

The ‘Add to dashboard’ option inside the service map, shown below, adds the view to a new or existing dashboard in the CloudWatch console, which helps customers easily trace their application for insights.

![Service Map](../images/Service_Map_CW_DB.png)

#### Adding Metrics Explorer to CloudWatch dashboards

[Metrics explorer](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Metrics-Explorer.html) in CloudWatch is a tag-based tool that enables customers to filter, aggregate, and visualize metrics by tags and resource properties to enhance observability for AWS services. Metrics explorer gives a flexible and dynamic troubleshooting experience, so that customers can create multiple graphs at a time and use these graphs to build application health dashboards. Metrics explorer visualizations are dynamic, so if a matching resource is created after you create a metrics explorer widget and add it to a CloudWatch dashboard, the new resource automatically appears in the explorer widget.

The ‘[Add to dashboard](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/add_metrics_explorer_dashboard.html)’ option inside metrics explorer, shown below, adds the view to a new or existing dashboard in the CloudWatch console, which helps customers easily get more graph insights into their AWS services and resources.

![Metrics Explorer](../images/Metrics_Explorer_CW_DB.png)

## What to visualize using CloudWatch dashboards

Customers can create dashboards at the account and application level to monitor workloads and applications across regions and accounts. Customers can quickly get started with CloudWatch automatic dashboards, which are AWS service-level dashboards preconfigured with service-specific metrics. It is recommended to create application and workload-specific dashboards that focus on key metrics and resources that are relevant and critical to the application or workload in your production environment.

#### Visualizing metrics data

Metrics data can be added to CloudWatch dashboards through Graph widgets like **Line, Number, Gauge, Stacked area, Bar, Pie**, supported by statistics on metrics such as **Average, Minimum, Maximum, Sum, and SampleCount**. [Statistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Statistics-definitions.html) are metric data aggregations over specified periods of time.
![Metrics Data Visual](../images/graph_widget_metrics.png)

[Metric math](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html) enables you to query multiple CloudWatch metrics and use math expressions to create new time series based on these metrics. Customers can visualize the resulting time series on the CloudWatch console and add them to dashboards. Customers can also perform metric math programmatically using the [GetMetricData API](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricData.html) operation.

**Additional Reference:**

- [Monitoring your IoT fleet using CloudWatch](https://aws.amazon.com/blogs/iot/monitoring-your-iot-fleet-using-cloudwatch/)

#### Visualizing logs data

Customers can achieve [visualizations of logs data](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_Insights-Visualizing-Log-Data.html) in CloudWatch dashboards using bar charts, line charts, and stacked area charts to more efficiently identify patterns. CloudWatch Logs Insights generates visualizations for queries that use the `stats` function and one or more aggregation functions, which can produce bar charts. If the query uses the `bin()` function to [group the data](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_Insights-Visualizing-Log-Data.html#CWL_Insights-Visualizing-ByFields) by one field over time, then line charts and stacked area charts can be used for visualization.

[Time series data](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_Insights-Visualizing-Log-Data.html#CWL_Insights-Visualizing-TimeSeries) can be visualized when the query contains one or more aggregation functions and uses the `bin()` function to group the data by one field.

A sample query with `count()` as the stats function is shown below:

```
filter @message like /GET/
| parse @message '_ - - _ "GET _ HTTP/1.0" .*.*.*' as ip, timestamp, page, status, responseTime, bytes
| stats count() as request_count by status
```

For the above query, the results are shown below in CloudWatch Logs Insights.

![CloudWatch Logs Insights](../images/widget_logs_1.png)

Visualization of the query results as a pie chart is shown below.

![CloudWatch Logs Insights Visualization](../images/widget_logs_2.png)

**Additional Reference:**

- AWS Observability Workshop on [displaying log results](https://catalog.workshops.aws/observability/en-US/aws-native/logs/logsinsights/displayformats) in a CloudWatch dashboard.
- [Visualize AWS WAF logs with an Amazon CloudWatch dashboard](https://aws.amazon.com/blogs/security/visualize-aws-waf-logs-with-an-amazon-cloudwatch-dashboard/)

#### Visualizing alarms

A metric alarm in CloudWatch watches a single metric or the result of a math expression based on CloudWatch metrics. The alarm performs one or more actions based on the value of the metric or expression relative to a threshold over a time period. A single alarm can be added to a [CloudWatch dashboard](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/add_remove_alarm_dashboard.html) widget, which displays the graph of the alarm's metric along with the alarm status. Also, an alarm status widget can be added to a CloudWatch dashboard to display the status of multiple alarms in a grid. Only the alarm names and current status are displayed; graphs are not.

A sample metric alarm status captured in an alarm widget inside a CloudWatch dashboard is shown below.
![CloudWatch Alarms](../images/widget_alarms.png)

## Cross-account & Cross-region

Customers with multiple AWS accounts can set up [CloudWatch cross-account](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_crossaccount_dashboard.html) observability and then create rich cross-account dashboards in central monitoring accounts, through which they can seamlessly search, visualize, and analyze metrics, logs, and traces without account boundaries.

Customers can also create [cross-account cross-region](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_xaxr_dashboard.html) dashboards, which summarize CloudWatch data from multiple AWS accounts and multiple regions into a single dashboard. From this high-level dashboard customers can get a unified view of the entire application, and also drill down into more specific dashboards without having to sign in & out of accounts or switch between regions.

**Additional References:**

- [How to auto add new cross-account Amazon EC2 instances in a central Amazon CloudWatch dashboard](https://aws.amazon.com/blogs/mt/how-to-auto-add-new-cross-account-amazon-ec2-instances-in-a-central-amazon-cloudwatch-dashboard/)
- [Deploy Multi-Account Amazon CloudWatch Dashboards](https://aws.amazon.com/blogs/mt/deploy-multi-account-amazon-cloudwatch-dashboards/)
- [Create Cross Account & Cross Region CloudWatch Dashboards](https://www.youtube.com/watch?v=eIUZdaqColg) on YouTube

## Sharing dashboards

CloudWatch dashboards can be shared with people across teams, with stakeholders, and with people external to your organization who do not have direct access to your AWS account. These [shared dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-dashboard-sharing.html) can even be displayed on big screens in team areas or in monitoring or network operations centers (NOCs), or embedded in wikis or public webpages.

There are three ways to share dashboards easily and securely:

- a dashboard can be [shared publicly](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-dashboard-sharing.html#share-cloudwatch-dashboard-public) so that anyone with the link can view the dashboard.
- a dashboard can be [shared with specific email addresses](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-dashboard-sharing.html#share-cloudwatch-dashboard-email-addresses) of the people who can view the dashboard. Each of these users creates their own password that they enter to view the dashboard.
- dashboards can be shared within AWS accounts with access through a [single sign-on (SSO) provider](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-dashboard-sharing.html#share-cloudwatch-dashboards-setup-SSO).

**Things to note while sharing dashboards publicly**

Sharing CloudWatch dashboards publicly is not recommended if the dashboard contains any sensitive or confidential information. Whenever possible, it is recommended to make use of authentication through username/password or single sign-on (SSO) while sharing dashboards.

When dashboards are made publicly accessible, CloudWatch generates a link to a web page which hosts the dashboard. Anyone viewing the web page will also be able to see the contents of the publicly shared dashboard.
The web page uses temporary credentials, provided through the link, to call APIs that query the alarms and Contributor Insights rules in the dashboard you share, as well as all metrics and the names and tags of all EC2 instances in your account, even if they are not shown in the dashboard you share. We recommend that you consider whether it is appropriate to make this information publicly available.

Please note that when you enable public sharing of dashboards to the web page, the following Amazon Cognito resources will be created in your account: a Cognito user pool, a Cognito app client, a Cognito identity pool, and an IAM role.

**Things to note while sharing dashboards using credentials (Username and password protected dashboard)**

Sharing CloudWatch dashboards is not recommended if the dashboard contains any sensitive or confidential information that you would not wish the users you are sharing with to see.

When dashboards are enabled for sharing, CloudWatch generates a link to a web page which hosts the dashboard. The users that you specified will be granted the following permissions: CloudWatch read-only permissions to the alarms and Contributor Insights rules in the dashboard you share, and to all metrics and the names and tags of all EC2 instances in your account, even if they are not shown in the dashboard you share. We recommend that you consider whether it is appropriate to make this information available to the users with whom you are sharing.

Please note that when you enable sharing of dashboards for the users you specify for access to the web page, the following Amazon Cognito resources will be created in your account: a Cognito user pool, Cognito users, a Cognito app client, a Cognito identity pool, and an IAM role.

**Things to note while sharing dashboards using SSO Provider**

When CloudWatch dashboards are shared using Single Sign-On (SSO), users registered with the selected SSO provider will be granted permission to access all dashboards in that account. Also, when dashboard sharing is disabled with this method, all dashboards are automatically unshared.

**Additional References:**

- AWS Observability Workshop on [sharing dashboards](https://catalog.workshops.aws/observability/en-US/aws-native/dashboards/sharingdashboard)
- Blog: [Share your Amazon CloudWatch Dashboards with anyone using AWS Single Sign-On](https://aws.amazon.com/blogs/mt/share-your-amazon-cloudwatch-dashboards-with-anyone-using-aws-single-sign-on/)
- Blog: [Communicate monitoring information by sharing Amazon CloudWatch dashboards](https://aws.amazon.com/blogs/mt/communicate-monitoring-information-by-sharing-amazon-cloudwatch-dashboards/)

## Live data

CloudWatch dashboards can also display [live data](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-live-data.html) through metric widgets if the metrics from your workloads are constantly published. Customers can choose to enable live data for a whole dashboard, or for individual widgets on a dashboard.

If live data is turned **off**, only data points with an aggregation period of at least one minute in the past are shown. For example, when using 5-minute periods, the data point for 12:35 would be aggregated from 12:35 to 12:40, and displayed at 12:41.

If live data is turned **on**, the most recent data point is shown as soon as any data is published in the corresponding aggregation interval.
Each time you refresh the display, the most recent data point may change as new data within that aggregation period is published.

## Animated Dashboard

An [animated dashboard](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-animated-dashboard.html) replays CloudWatch metric data that was captured over time, which helps customers see trends, give presentations, or analyze issues after they occur. Animated widgets in the dashboard include line widgets, stacked area widgets, number widgets, and metrics explorer widgets. Pie graphs, bar charts, text widgets, and logs widgets are displayed in the dashboard but are not animated.

## API/CLI support for CloudWatch Dashboard

Apart from accessing CloudWatch dashboards through the AWS Management Console, customers can also access the service via the API, the AWS command-line interface (CLI), and the AWS SDKs. The CloudWatch dashboard APIs help with automation through the AWS CLI or integration with software and products, so that you can spend less time managing or administering resources and applications.

- [ListDashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_ListDashboards.html): Returns a list of the dashboards for your account.
- [GetDashboard](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetDashboard.html): Displays the details of the dashboard that you specify.
- [DeleteDashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_DeleteDashboards.html): Deletes all dashboards that you specify.
- [PutDashboard](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutDashboard.html): Creates a dashboard if it does not already exist, or updates an existing dashboard. If you update a dashboard, the entire contents are replaced with what you specify here.

CloudWatch API Reference for [Dashboard Body Structure and Syntax](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/CloudWatch-Dashboard-Body-Structure.html)

The AWS Command Line Interface (AWS CLI) is an open source tool that enables customers to interact with AWS services using commands in a command-line shell, implementing functionality equivalent to that provided by the browser-based AWS Management Console from the command prompt in a terminal program.

CLI Support:

- [list-dashboards](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/list-dashboards.html)
- [get-dashboard](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/get-dashboard.html)
- [delete-dashboards](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/delete-dashboards.html)
- [put-dashboard](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/put-dashboard.html)

**Additional Reference:** AWS Observability Workshop on [CloudWatch dashboards and AWS CLI](https://catalog.workshops.aws/observability/en-US/aws-native/dashboards/createcli)

## Automation of CloudWatch Dashboard

For automating the creation of CloudWatch dashboards, customers can use Infrastructure as Code (IaC) tools like CloudFormation or Terraform that help set up AWS resources, so that customers can spend less time managing those resources and more time focusing on applications that run in AWS.

[AWS CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-cloudwatch-dashboard.html) supports creating dashboards through templates. The AWS::CloudWatch::Dashboard resource specifies an Amazon CloudWatch dashboard.
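Whether you use an IaC template or the API directly, the dashboard is defined by the same JSON body structure referenced above. Here is a minimal sketch using the PutDashboard API via boto3 (the dashboard name, region, metric, and instance ID are placeholders):

```python
import json
import boto3

cloudwatch = boto3.client("cloudwatch")

# A single metric widget following the Dashboard Body Structure (all values are placeholders)
dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Web tier CPU",
                "region": "us-east-1",
                "stat": "Average",
                "period": 300,
                "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0"]],
            },
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="web-tier-overview",
    DashboardBody=json.dumps(dashboard_body),
)
```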
+
+[Terraform](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_dashboard) also has modules which support creating CloudWatch dashboards through IaC automation.
+
+Manually creating dashboards with the desired widgets is straightforward. However, it can require some effort to keep the widget sources up to date when the content is based on dynamic information, such as EC2 instances that are created or removed during scale-out and scale-in events in an Auto Scaling group. Please refer to the blog post if you wish to automatically [create and update your Amazon CloudWatch dashboards using Amazon EventBridge and AWS Lambda](https://aws.amazon.com/blogs/mt/update-your-amazon-cloudwatch-dashboards-automatically-using-amazon-eventbridge-and-aws-lambda/).
+
+**Additional Reference Blogs:**
+
+- [Automating Amazon CloudWatch dashboard creation for Amazon EBS volume KPIs](https://aws.amazon.com/blogs/storage/automating-amazon-cloudwatch-dashboard-creation-for-amazon-ebs-volume-kpis/)
+- [Automate creation of Amazon CloudWatch alarms and dashboards with AWS Systems Manager and Ansible](https://aws.amazon.com/blogs/mt/automate-creation-of-amazon-cloudwatch-alarms-and-dashboards-with-aws-systems-manager-and-ansible/)
+- [Deploying an automated Amazon CloudWatch dashboard for AWS Outposts using AWS CDK](https://aws.amazon.com/blogs/compute/deploying-an-automated-amazon-cloudwatch-dashboard-for-aws-outposts-using-aws-cdk/)
+
+**Product FAQs** on [CloudWatch dashboard](https://aws.amazon.com/cloudwatch/faqs/#Dashboards)
diff --git a/docusaurus/observability-best-practices/docs/tools/cloudwatch_agent.md b/docusaurus/observability-best-practices/docs/tools/cloudwatch_agent.md
new file mode 100644
index 000000000..1842f7fc8
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/tools/cloudwatch_agent.md
@@ -0,0 +1,59 @@
+# CloudWatch Agent
+
+
+## Deploying the CloudWatch agent
+
+The CloudWatch agent can be deployed as a single installation, using a distributed configuration file, layering multiple configuration files, or entirely through automation. Which approach is appropriate for you depends on your needs.[^1]
+
+:::info
+ Deployments to both Windows and Linux hosts can store and retrieve their configurations from [Systems Manager Parameter Store](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-EC2-Instance-fleet.html). Managing the CloudWatch agent configuration through this automated mechanism is a best practice.
+:::
+
+:::tip
+ Alternatively, the configuration files for the CloudWatch agent can be deployed through the automation tool of your choice ([Ansible](https://www.ansible.com), [Puppet](https://puppet.com), etc.). The use of Systems Manager Parameter Store is not required, though it does simplify management.
+:::
+## Deployment outside of AWS
+
+The use of the CloudWatch agent is *not* limited to within AWS, and is supported both on-premises and in other cloud environments. There are, however, two additional considerations when using the CloudWatch agent outside of AWS:
+
+1. Setting up IAM credentials[^2] to allow the agent to make the required API calls. Even in EC2 there is no unauthenticated access to the CloudWatch APIs[^5].
+1. Ensuring the agent has connectivity to CloudWatch, CloudWatch Logs, and other AWS endpoints[^3] using a route that meets your requirements.
This can be either through the Internet, using [AWS Direct Connect](https://aws.amazon.com/directconnect/), or through a [private endpoint](https://docs.aws.amazon.com/vpc/latest/privatelink/concepts.html) (typically called a *VPC endpoint*).
+
+:::info
+ Transport between your environment(s) and CloudWatch needs to match your governance and security requirements. Broadly speaking, using private endpoints for workloads outside of AWS meets the needs of customers in even the most strictly regulated industries. However, the majority of customers will be served well through our public endpoints.
+:::
+## Use of private endpoints
+
+In order to push metrics and logs, the CloudWatch agent must have connectivity to the *CloudWatch* and *CloudWatch Logs* endpoints. There are several ways to achieve this based on where the agent is installed.
+
+### From a VPC
+
+a. You can make use of *VPC endpoints* (for CloudWatch and CloudWatch Logs) in order to establish a fully private and secure connection between your VPC and CloudWatch for the agent running on EC2. With this approach, agent traffic never traverses the internet.
+
+b. Another alternative is to have a public [NAT gateway](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html) through which private subnets can connect to the internet, but cannot receive unsolicited inbound connections from the internet.
+
+:::note
+ Please note that with this approach, agent traffic will be logically routed via the internet.
+:::
+c. If you don't have a requirement to establish private or secure connectivity beyond the existing TLS and [Sigv4](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html) mechanisms, the easiest option is to use an [Internet Gateway](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Internet_Gateway.html) to provide connectivity to our endpoints.
+
+### From on-premises or other cloud environments
+
+a. Agents running outside of AWS can establish connectivity to CloudWatch public endpoints over the internet (via their own network setup) or over a Direct Connect [Public VIF](https://docs.aws.amazon.com/directconnect/latest/UserGuide/WorkingWithVirtualInterfaces.html).
+
+b. If you require that agent traffic not route through the internet, you can leverage [VPC interface endpoints](https://docs.aws.amazon.com/vpc/latest/userguide/vpce-interface.html), powered by AWS PrivateLink, to extend the private connectivity all the way to your on-premises network using a Direct Connect Private VIF or VPN. Your traffic is not exposed to the internet, which reduces your threat vectors.
+
+:::success
+ You can add [ephemeral AWS access tokens](https://aws.amazon.com/premiumsupport/knowledge-center/cloudwatch-on-premises-temp-credentials/) for use by the CloudWatch agent by using credentials obtained from the [AWS Systems Manager agent](https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent.html).
+:::
+
+[^1]: See [Getting started with open source Amazon CloudWatch Agent](https://aws.amazon.com/blogs/opensource/getting-started-with-open-source-amazon-cloudwatch-agent/) for a blog that gives guidance for CloudWatch agent use and deployment.
+
+
+[^2]: [Guidance on setting credentials for agents running on-premises and in other cloud environments](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-commandline-fleet.html#install-CloudWatch-Agent-iam_user-first)
+
+[^3]: [How to verify connectivity to the CloudWatch endpoints](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-commandline-fleet.html#install-CloudWatch-Agent-internet-access-first-cmd)
+
+[^4]: [A blog for on-premises, private connectivity](https://aws.amazon.com/blogs/networking-and-content-delivery/hybrid-networking-using-vpc-endpoints-aws-privatelink-and-amazon-cloudwatch-for-financial-services/)
+
+[^5]: Use of all AWS APIs related to observability is typically accomplished by an [instance profile](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html) - a mechanism to grant temporary access credentials to instances and containers running in AWS.
diff --git a/docusaurus/observability-best-practices/docs/tools/collector-arch.md b/docusaurus/observability-best-practices/docs/tools/collector-arch.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docusaurus/observability-best-practices/docs/tools/dashboards.md b/docusaurus/observability-best-practices/docs/tools/dashboards.md
new file mode 100644
index 000000000..8dfb44010
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/tools/dashboards.md
@@ -0,0 +1,161 @@
+# Dashboards
+
+Dashboards are an important part of your observability solution. They enable you to produce a curated visualization of your data. They enable you to see a history of your data, and see it alongside other related data. They also allow you to provide context. They help you understand the bigger picture.
+
+Often people gather their data and create alarms, and then stop. However, alarms only show a point in time, and usually for a single metric or a small set of data. Dashboards help you see the behaviour over time.
+
+![Sample dashboard](../images/dashboard1.png)
+
+## A practical example: consider an alarm for high CPU
+You know the machine is running with higher than desired CPU. Do you need to act, and how quickly? What might help you decide?
+
+* What does normal CPU look like for this instance/application?
+* Is this a spike, or a trend of increasing CPU?
+* Is it impacting performance? If not, how long before it does?
+* Is this a regular occurrence? And does it usually recover on its own?
+
+### See the history of the data
+
+Now consider a dashboard with a historical time chart of the CPU. Even with only this single metric, you can see if this is a spike or an upward trend. You can also see how quickly it is trending upwards, and so make some decisions on the priority for action.
+
+### See the impact on the workflow
+
+But what does this machine do? How important is this in our overall context? Imagine we now add a visualization of the workflow performance, be it response time, throughput, errors, or some other measure. Now we can see if the high CPU is having an impact on the workflow or users this instance is supporting.
+
+### See the history of the alarm
+
+Consider adding a visualization which shows how often the alarm has triggered in the last month, and combining that with looking further back to see if this is a regular occurrence.
Knowing the pattern of recurrence can help you understand the underlying issue, and make longer term decisions on how to stop the alarm recurring altogether.
+
+### Add context
+
+Finally, add some context to the dashboard. Include a brief description of the reason this dashboard exists, the workflow it relates to, what to do when there is an issue, links to documentation, and who to contact.
+
+:::info
+ Now we have a *story*, which helps the dashboard user to see what is happening, understand the impact, and make appropriate data-driven decisions on what action to take and how urgent it is.
+:::
+### Don't try to visualize everything all at once
+
+We often talk about alarm fatigue. Too many alarms, without identifiable actions and priorities, can overload your team and lead to inefficiencies. Alarms should be for things which are important to you, and actionable.
+
+Dashboards are more flexible here. They don't demand your attention in the same way, so you have more freedom to visualize things that you may not be certain are important yet, or that support your exploration. Still, don't overdo it! Everything can suffer from too much of a good thing.
+
+Dashboards should provide a picture of something that is important to you. In the same way as deciding what data to ingest, you need to think about what matters to you for dashboards.
+For your dashboards, think about:
+
+* Who will be viewing this?
+  * What is their background and knowledge?
+  * How much context do they need?
+* What questions are they trying to answer?
+* What actions will they be taking as a result of seeing this data?
+
+:::tip
+ Sometimes it can be hard to know what your dashboard story should be, and how much to include. So where could you start to design your dashboard? Let's look at two ways: *KPI driven* or *incident driven*.
+:::
+
+#### Design your dashboard: KPI driven
+
+One way to understand this is to work back from your KPIs. This is usually a very user-driven approach.
+For [layout](#layout), typically we are working top down, getting to more detail as we move further down a dashboard, or navigate to lower level dashboards.
+
+First, **understand your KPIs** and what they mean. This will help you decide how you want to visualize them.
+Many KPIs are shown as a single number. For example, what percentage of customers are successfully completing a specific workflow, and in what time? But over what time period? You may well meet your KPI if you average over a week, but still have smaller periods of time within this that breach your standards. Are these breaches important to you? Do they impact your customer experience? If so, you may consider different periods and time charts to see your KPIs. And maybe not everyone needs to see the detail, so perhaps you move the breakdown of KPIs to a separate dashboard, for a separate audience.
+
+Next, **what contributes to those KPIs?** What workflows need to be running in order for those actions to happen? Can you measure these?
+
+Identify the main components and add visualizations of their performance. When a KPI breaches, you should be able to quickly look and see where in the workflow the main impact is.
+
+And you can keep going down - what impacts the performance of those workflows? Remember your audience as you decide the level of depth.
+
+Consider the example of an e-commerce system with a KPI for the number of orders placed.
+
+For an order to be placed, users must be able to perform the following actions: search for products, add them to their cart, add their delivery details, and pay for the order.
+For each of these workflows, you might consider checking that key components are functioning. For example, by using RUM or Synthetics to get data on action success and see if the user is being impacted by an issue. You might consider measurements of throughput, latency, and failed-action percentages to see if the performance of each action is as expected. You might consider measurements of the underlying infrastructure to see what might be impacting performance.
+
+However, don't put all of your information on the same dashboard. Again, consider your user audience.
+
+:::info
+ Create layers of dashboards that allow drilldown and provide the right context for the right users.
+:::
+#### Design your dashboard: Incident driven
+
+For many people, incident resolution is a key driver for observability. You have been alerted to an issue, by a user or by an observability alarm, and you need to quickly find a fix and potentially a root cause of the issue.
+
+:::info
+ Start by looking at your recent incidents. Are there common patterns? Which were the most impactful for your company? Which ones repeat?
+:::
+In this case, we're designing a dashboard for those trying to understand the severity, identify the root cause, and fix the incident.
+
+Think back to the specific incident.
+
+* How did you verify the incident was as reported?
+  * What did you check? Endpoints? Errors?
+* How did you understand the impact, and therefore the priority, of the issue?
+* What did you look at to find the cause of the issue?
+
+Application Performance Monitoring (APM) can help here, with [Synthetics](../tools/synthetics/) for regular baselining and testing of endpoints and workflows, and [RUM](../tools/rum/) for the actual customer experience. You can use this data to quickly visualize which workflows are impacted, and by how much.
+
+Visualizations which show the error count over time, and the top *N* errors, can help you to focus on the right area, and show you specific details of errors. This is where we are often using log data, and dynamic visualizations of error codes and reasons.
+
+It can be very useful here to have some kind of filtering or drilldown, to get to the specifics as quickly as possible. Think about ways to implement this without too much overhead. For example, having a single dashboard which you can filter to get closer to the details.
+
+### Layout
+
+The layout of your dashboard is also important.
+
+:::info
+ Typically, the most significant visualizations for your user should sit at the top left, or otherwise be aligned with the natural *beginning* of page navigation.
+:::
+
+You can use layout to help tell the story. For example, you may use a top-down layout, where the further down you scroll, the more details you see. Or perhaps a left-right display would be useful, with higher level services on the left and their dependencies as you move to the right.
+
+### Create dynamic content
+
+Many of your workloads will be designed to grow or shrink as demand dictates, and your dashboards need to take this into account. For example, you may have your instances in an Auto Scaling group, and when you hit a certain load, additional instances are added.
+
+:::info
+ A dashboard showing data from specific instances, specified by some kind of ID, will not allow the data from those new instances to be seen.
Add metadata to your resources and data, so you can create your visualizations to capture all instances with a specific metadata value. This way they will reflect the actual state.
+:::
+Another example of dynamic visualizations might be the ability to find the top 10 errors occurring now, and how they have behaved over recent history. You want to be able to see a table, or a chart, without knowledge of which errors might occur.
+
+### Think about symptoms first over causes
+
+When you observe symptoms, you are considering the impact this has on your users and systems. Many underlying causes might give the same symptoms. This enables you to capture more issues, including unknown issues. As you understand causes, your lower level dashboards may be more specific to these to help you quickly diagnose and fix issues.
+
+:::tip
+ Don't capture the specific JavaScript error that impacted the users last week. Capture the *impact* on the workflow it disrupted, and then show the top count of JavaScript errors over recent history, or which have dramatically increased in recent history.
+:::
+### Use top/bottom N
+
+Most of the time there is no need to visualize *all* of your operational metrics at the same time. A large fleet of EC2 instances is a good example of this: there is no need or value in having the disk IOPS or CPU utilization for an entire farm of hundreds of servers displayed simultaneously. This creates an anti-pattern where you can spend more time trying to dig through your metrics than seeing the best (or worst) performing resources.
+
+:::info
+ Use your dashboards to show the top ten or 20 of any given metric, and then focus on the [symptoms](#think-about-symptoms-first-over-causes) this reveals.
+:::
+[CloudWatch metrics](../tools/metrics/) allows you to search for the top N for any time series. For example, this query will return the 10 busiest EC2 instances by CPU utilization:
+
+```
+SORT(SEARCH('{AWS/EC2,InstanceId} MetricName="CPUUtilization"', 'Average', 300), SUM, DESC, 10)
+```
+
+Use this approach, or a similar one with [CloudWatch Metric Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/query_with_cloudwatch-metrics-insights.html), to identify the top or bottom performing metrics in your dashboards.
+
+### Show KPIs with thresholds visually
+
+Your KPIs should have a warning or error threshold, and dashboards can show this using a horizontal annotation. This will appear as a high water mark on a widget. Showing this visually can give human operators a forewarning if business outcomes or infrastructure are in jeopardy.
+
+![Image of a horizontal annotation](../images/horizontal-annotation.png)
+
+:::info
+ Horizontal annotations are a critical part of a well-developed dashboard.
+:::
+### The importance of context
+
+People can easily misinterpret data. Their background and current context will colour how they view the data.
+
+So make sure you include *text* within your dashboard. What is this data for, and who is it for? What does it mean? Link to documentation on the application, who supports it, and the troubleshooting docs. You can also use text displays to divide your dashboard display. Use them on the left to set left-right context. Use them as full horizontal displays to divide your dashboard vertically.
+
+:::info
+ Having links to IT support, operations on-call, or business owners can give teams a fast path to contact people who can help support when issues occur.
+:::
+:::tip
+ Hyperlinks to ticketing systems are also a very useful addition to dashboards.
+:::
\ No newline at end of file
diff --git a/docusaurus/observability-best-practices/docs/tools/emf.md b/docusaurus/observability-best-practices/docs/tools/emf.md
new file mode 100644
index 000000000..ec9536d19
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/tools/emf.md
@@ -0,0 +1,32 @@
+# Embedded Metric Format
+
+The CloudWatch embedded metric format (EMF) is a JSON specification used to instruct CloudWatch Logs to automatically extract metric values embedded in structured log events. You can use CloudWatch to graph and create alarms on the extracted metric values. With EMF, you can push metric-related data as CloudWatch Logs events, which are then discovered as metrics in CloudWatch.
+
+Below is a sample of the EMF format and its JSON schema:
+```
+ {
+   "_aws": {
+     "Timestamp": 1574109732004,
+     "CloudWatchMetrics": [
+       {
+         "Namespace": "lambda-function-metrics",
+         "Dimensions": [
+           [
+             "functionVersion"
+           ]
+         ],
+         "Metrics": [
+           {
+             "Name": "time",
+             "Unit": "Milliseconds"
+           }
+         ]
+       }
+     ]
+   },
+   "functionVersion": "$LATEST",
+   "time": 100,
+   "requestId": "989ffbf8-9ace-4817-a57c-e4dd734019ee"
+ }
+```
+Thus, with the help of EMF, you can send high-cardinality metrics without needing to make manual `PutMetricData` API calls.
\ No newline at end of file
diff --git a/docusaurus/observability-best-practices/docs/tools/internet_monitor.md b/docusaurus/observability-best-practices/docs/tools/internet_monitor.md
new file mode 100644
index 000000000..850aeccf4
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/tools/internet_monitor.md
@@ -0,0 +1,116 @@
+# Internet Monitor
+
+:::warning
+ As of this writing, [Internet Monitor](https://aws.amazon.com/blogs/aws/cloudwatch-internet-monitor-end-to-end-visibility-into-internet-performance-for-your-applications/) is available in **preview** in the CloudWatch console. The scope of features for general availability may change from what you experience today.
+:::
+[Collecting telemetry from all tiers of your workload](../guides/#collect-telemetry-from-all-tiers-of-your-workload) is a best practice, and one that can be a challenge. But what are the tiers of your workload? For some it may be web, application, and database servers. Other people might view their workload as front end and back end. And those operating web applications can use [Real User Monitoring](../tools/rum) (RUM) to observe the health of these apps as experienced by end users.
+
+But what about the traffic between the client and the datacenter or cloud services provider? And what about applications that are not served as web pages and therefore cannot use RUM?
+
+![Network telemetry from Internet-traversing applications](../images/internet_monitor.png)
+
+Internet Monitor works at the networking level and evaluates the health of observed traffic, correlated against [AWS' existing knowledge](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-IM-inside-internet-monitor.html) of known Internet issues. In short, if there is an Internet Service Provider (ISP) that has a performance or availability issue **and** if your application has traffic that uses this ISP for client/server communication, then Internet Monitor can proactively inform you about this impact to your workload. Additionally, it can make recommendations to you based on your selected hosting region and use of [CloudFront](https://aws.amazon.com/cloudfront/) as a Content Delivery Network[^1].
+
+:::tip
+ Internet Monitor only evaluates traffic from networks that your workloads traverse.
For example, if an ISP in another country is impacted, but your users do not use that carrier, then you will not have visibility into that issue.
+:::
+
+## Create monitors for applications that traverse the Internet
+
+The way that Internet Monitor operates is by watching for traffic that comes either into your CloudFront distributions or to your VPCs from impacted ISPs. This allows you to make decisions about application behaviour, routing, or user notification that help offset business issues that arise as a result of network problems that are outside of your control.
+
+![Intersection of your workload and ISP issues](../images/internet_monitor_2.png)
+
+:::info
+ Only create monitors that watch traffic which traverses the Internet. Private traffic, such as between two hosts in a private network ([RFC1918](https://www.arin.net/reference/research/statistics/address_filters/)), cannot be monitored using Internet Monitor.
+:::
+:::info
+ Prioritize traffic from mobile applications where applicable. Customers roaming between providers, or in remote geographical locations, may have different or unexpected experiences that you should be aware of.
+:::
+## Enable actions through EventBridge and CloudWatch
+
+Observed issues will be published through [EventBridge](https://aws.amazon.com/eventbridge/) using a [schema](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-IM-EventBridge-integration.html) that contains the source identified as `aws.internetmonitor`. EventBridge can be used to automatically create issues in your ticket management system, page your support teams, or even trigger automation that can alter your workload to mitigate some scenarios.
+
+```json
+{
+  "source": ["aws.internetmonitor"]
+}
+```
+
+Likewise, extensive details of traffic are available in [CloudWatch Logs](../tools/logs) for observed cities, countries, metros, and subdivisions. This allows you to create highly targeted actions which can notify impacted customers proactively about issues local to them. Here is an example of a country-level observation about a single provider:
+
+```json
+{
+  "version": 1,
+  "timestamp": 1669659900,
+  "clientLocation": {
+    "latitude": 0,
+    "longitude": 0,
+    "country": "United States",
+    "subdivision": "",
+    "metro": "",
+    "city": "",
+    "countryCode": "US",
+    "subdivisionCode": "",
+    "asn": 00000,
+    "networkName": "MY-AWESOME-ASN"
+  },
+  "serviceLocation": "us-east-1",
+  "percentageOfTotalTraffic": 0.36,
+  "bytesIn": 23,
+  "bytesOut": 0,
+  "clientConnectionCount": 0,
+  "internetHealth": {
+    "availability": {
+      "experienceScore": 100,
+      "percentageOfTotalTrafficImpacted": 0,
+      "percentageOfClientLocationImpacted": 0
+    },
+    "performance": {
+      "experienceScore": 100,
+      "percentageOfTotalTrafficImpacted": 0,
+      "percentageOfClientLocationImpacted": 0,
+      "roundTripTime": {
+        "p50": 71,
+        "p90": 72,
+        "p95": 73
+      }
+    }
+  },
+  "trafficInsights": {
+    "timeToFirstByte": {
+      "currentExperience": {
+        "serviceName": "VPC",
+        "serviceLocation": "us-east-1",
+        "value": 48
+      },
+      "ec2": {
+        "serviceName": "EC2",
+        "serviceLocation": "us-east-1",
+        "value": 48
+      }
+    }
+  }
+}
+```
+
+:::info
+ Values such as `percentageOfTotalTraffic` can reveal powerful insights about where your customers access your workloads from and can be used for advanced analytics.
+:::
+
+:::warning
+ Note that log groups created by Internet Monitor will have a default retention period set to *never expire*.
AWS does not delete your data without your consent, so be sure to set a retention period that makes sense for your needs.
+:::
+:::info
+ Each monitor will create at least 10 discrete CloudWatch metrics. These should be used for creating [alarms](../tools/alarms) just as you would with any other operational metric.
+:::
+## Utilize traffic optimization suggestions
+
+Internet Monitor features traffic optimization recommendations that can advise you on where to best place your workloads so as to provide the best customer experience. For those workloads that are global, or have global customers, this feature is particularly valuable.
+
+![Internet Monitor console](../images/internet_monitor_3.png)
+
+:::info
+ Pay close attention to the current, predicted, and lowest time-to-first-byte (TTFB) values in the traffic optimization suggestions view, as these can indicate potentially poor end-user experiences that are otherwise difficult to observe.
+:::
+[^1]: See [https://aws.amazon.com/blogs/aws/cloudwatch-internet-monitor-end-to-end-visibility-into-internet-performance-for-your-applications/](https://aws.amazon.com/blogs/aws/cloudwatch-internet-monitor-end-to-end-visibility-into-internet-performance-for-your-applications/) for our launch blog about this new feature.
diff --git a/docusaurus/observability-best-practices/docs/tools/logs/dataprotection/data-protection-policies.md b/docusaurus/observability-best-practices/docs/tools/logs/dataprotection/data-protection-policies.md
new file mode 100644
index 000000000..3bf58408e
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/tools/logs/dataprotection/data-protection-policies.md
@@ -0,0 +1,124 @@
+# CloudWatch Logs Data Protection Policies for SLG/EDU
+
+While logging data is beneficial in general, masking sensitive data is especially useful for organizations that must comply with strict regulations such as the Health Insurance Portability and Accountability Act (HIPAA), General Data Protection Regulation (GDPR), Payment Card Industry Data Security Standard (PCI-DSS), and Federal Risk and Authorization Management Program (FedRAMP).
+
+[Data Protection policies](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch-logs-data-protection-policies.html) in CloudWatch Logs enable customers to define and apply data protection policies that scan log data in transit for sensitive data and mask the sensitive data that is detected.
+
+These policies leverage pattern matching and machine learning models to detect sensitive data, and help you audit and mask the sensitive data that appears in events ingested by CloudWatch log groups in your account.
+
+The techniques and criteria used to select sensitive data are referred to as [matching data identifiers](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch-logs-data-protection-policies.html). Using these managed data identifiers, CloudWatch Logs can detect:
+
+- Credentials such as private keys or AWS secret access keys
+- Device identifiers such as IP addresses or MAC addresses
+- Financial information such as bank account numbers, credit card numbers, or credit card verification codes
+- Protected Health Information (PHI) such as health insurance card numbers (EHIC) or personal health numbers
+- Personally Identifiable Information (PII) such as driver's licenses, social security numbers, or taxpayer identification numbers
+
+:::note
+ Sensitive data is detected and masked when it is ingested into the log group.
When you set a data protection policy, log events ingested to the log group before that time are not masked.
+:::
+Let us expand on some of the data types mentioned above and see some examples:
+
+## Data Types
+
+### Credentials
+
+Credentials are sensitive data types which are used to verify who you are and whether you have permission to access the resources that you are requesting. AWS uses credentials such as private keys and secret access keys to authenticate and authorize your requests.
+
+Using CloudWatch Logs Data Protection policies, sensitive data that matches the data identifiers you have selected is masked. (We will see a masked example at the end of the section.)
+
+![The CloudWatch Logs Data Protection for Credentials1](../../../images/cwl-dp-credentials.png)
+
+![The CloudWatch Logs Data Protection for Credentials2](../../../images/cwl-dp-cred-sensitive.png)
+
+:::tip
+ Data classification best practices start with clearly defined data classification tiers and requirements, which meet your organizational, legal, and compliance standards.
+
+ As a best practice, use tags on AWS resources based on the data classification framework to implement compliance in accordance with your organization's data governance policies.
+:::
+
+:::tip
+ To avoid sensitive data in your log events, the best practice is to exclude it in your code in the first place and log only necessary information.
+:::
+
+### Financial Information
+
+As defined by the Payment Card Industry Data Security Standard (PCI DSS), bank account and routing numbers, debit and credit card numbers, and credit card magnetic stripe data are considered sensitive financial information.
+
+Once you set a data protection policy, CloudWatch Logs scans for the data identifiers that you specify, regardless of the geographic location where the log group resides.
+
+![The CloudWatch Logs Data Protection for Financial](../../../images/cwl-dp-fin-info.png)
+
+:::info
+ Check the full list of [financial data types and data identifiers](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/protect-sensitive-log-data-types-financial.html)
+:::
+
+### Protected Health Information (PHI)
+
+PHI includes a very wide set of personally identifiable health and health-related data, including insurance and billing information, diagnosis data, clinical care data like medical records and data sets, and lab results such as images and test results.
+
+CloudWatch Logs scans for and detects health information in the chosen log group and masks that data.
+
+![The CloudWatch Logs Data Protection for PHI](../../../images/cwl-dp-phi.png)
+
+:::info
+ Check the full list of [PHI data types and data identifiers](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/protect-sensitive-log-data-types-health.html)
+:::
+
+### Personally Identifiable Information (PII)
+
+PII is a textual reference to personal data that could be used to identify an individual. PII examples include addresses, bank account numbers, and phone numbers.
+
+![The CloudWatch Logs Data Protection for PII](../../../images/cwl-dp-pii.png)
+
+:::info
+ Check the full list of [PII data types and data identifiers](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/protect-sensitive-log-data-types-pii.html)
+:::
+
+## Masked Logs
+
+Now if you go to your log group where you set your data protection policy, you will see that data protection is `On` and the console also displays a count of sensitive data.
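+Before looking at the console views, note that the protection you configure is expressed as a JSON policy document. The following is a minimal sketch only, using the managed `EmailAddress` data identifier purely as a stand-in for whichever identifiers you select; substitute the name, description, and identifiers for your own environment:
+
+```json
+{
+  "Name": "data-protection-policy",
+  "Description": "Example policy for illustration only",
+  "Version": "2021-06-01",
+  "Statement": [
+    {
+      "Sid": "audit-policy",
+      "DataIdentifier": ["arn:aws:dataprotection::aws:data-identifier/EmailAddress"],
+      "Operation": {
+        "Audit": {
+          "FindingsDestination": {}
+        }
+      }
+    },
+    {
+      "Sid": "redact-policy",
+      "DataIdentifier": ["arn:aws:dataprotection::aws:data-identifier/EmailAddress"],
+      "Operation": {
+        "Deidentify": {
+          "MaskConfig": {}
+        }
+      }
+    }
+  ]
+}
+```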
+
+![The CloudWatch Logs Data Protection log group view](../../../images/cwl-dp-loggroup.png)
+
+Now, clicking on `View in Log Insights` will take you to the Logs Insights console. Running the query below to check the log events in a log stream will give you a list of all the logs.
+
+```
+fields @timestamp, @message
+| sort @timestamp desc
+| limit 20
+```
+
+Once you expand a log event, you will see the masked results as shown below:
+
+![The CloudWatch Logs Data Protection masked results](../../../images/cwl-dp-masked.png)
+
+:::important
+ When you create a data protection policy, sensitive data that matches the data identifiers you've selected is masked by default. Only users who have the `logs:Unmask` IAM permission can view unmasked data.
+:::
+
+:::tip
+ Use [AWS Identity and Access Management (IAM)](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/auth-and-access-control-cw.html) to administer and restrict access to sensitive data in CloudWatch.
+:::
+
+:::tip
+ Regular monitoring and auditing of your cloud environment are equally important in safeguarding sensitive data. This becomes especially critical when applications generate a large volume of data, so it is recommended not to log an excessive amount of data. Read the AWS Prescriptive Guidance for [Logging Best Practices](https://docs.aws.amazon.com/prescriptive-guidance/latest/logging-monitoring-for-application-owners/logging-best-practices.html)
+:::
+
+:::tip
+ Log group data is always encrypted in CloudWatch Logs. Alternatively, you can also use [AWS Key Management Service](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/encrypt-log-data-kms.html) to encrypt your log data.
+:::
+
+:::tip
+ For resiliency and scalability, set up CloudWatch alarms and automate remediation using Amazon EventBridge and AWS Systems Manager.
+:::
+
+[^1]: Check our AWS blog [Protect Sensitive Data with Amazon CloudWatch Logs](https://aws.amazon.com/blogs/aws/protect-sensitive-data-with-amazon-cloudwatch-logs/) to get started.
+
diff --git a/docusaurus/observability-best-practices/docs/tools/logs/index.md b/docusaurus/observability-best-practices/docs/tools/logs/index.md
new file mode 100644
index 000000000..74c21b4ba
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/tools/logs/index.md
@@ -0,0 +1,164 @@
+# Logging
+
+The selection of logging tools is tied to your requirements for data transmission, filtering, retention, capture, and integration with the applications that generate your data. When using Amazon Web Services for observability (regardless of whether you host [on-premises](../../faq#what-is-a-cloud-first-approach) or in another cloud environment), you can leverage the [CloudWatch agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) or another tool such as [Fluentd](https://www.fluentd.org/) to emit logging data for analysis.
+
+Here we will expand on the best practices for implementing the CloudWatch agent for logging, and the use of CloudWatch Logs within the AWS console or APIs.
+
+:::info
+ The CloudWatch agent can also be used for delivery of [metric data](../../signals/metrics/) to CloudWatch. See the [metrics](../../tools/metrics/) page for implementation details. It can also be used to collect [traces](../../signals/traces.md) from OpenTelemetry or X-Ray client SDKs, and send them to [AWS X-Ray](../../tools/xray.md).
+:::
+## Collecting logs with the CloudWatch agent
+
+### Forwarding
+
+When taking a [cloud first approach](../../faq#what-is-a-cloud-first-approach) to observability, as a rule, if you need to log into a machine to get its logs, then you have an anti-pattern. Your workloads should emit their logging data outside of their confines in near real time to a log analysis system, and latency between that transmission and the original event represents a potential loss of point-in-time information should a disaster befall your workload.
+
+As an architect you will have to determine what your acceptable loss for logging data is and adjust the CloudWatch agent's [`force_flush_interval`](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html#CloudWatch-Agent-Configuration-File-Logssection) to accommodate this.
+
+The `force_flush_interval` instructs the agent to send logging data to the data plane at a regular cadence, unless the buffer size is reached, in which case it will send all buffered logs immediately.
+
+:::tip
+ Edge devices may have very different requirements from low-latency, in-AWS workloads, and may need to have much longer `force_flush_interval` settings. For example, an IoT device on a low-bandwidth Internet connection may only need to flush logs every 15 minutes.
+:::
+:::info
+ Containerized or stateless workloads may be especially sensitive to log flush requirements. Consider a stateless Kubernetes application or EC2 fleet that can be scaled-in at any moment. Loss of logs may take place when these resources are suddenly terminated, leaving no way to extract logs from them in the future. The standard `force_flush_interval` is usually appropriate for these scenarios, but can be lowered if required.
+:::
+### Log groups
+
+Within CloudWatch Logs, each collection of logs that logically applies to an application should be delivered to a single [log group](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CloudWatchLogsConcepts.html). Within that log group you want to have *commonality* among the source systems that create the log streams within it.
+
+Consider a LAMP stack: the logs from Apache, MySQL, your PHP application, and the hosting Linux operating system would each belong to a separate log group.
+
+This grouping is vital, as it allows you to manage each group with a single retention period, encryption key, and set of metric filters, subscription filters, and Contributor Insights rules.
+
+:::info
+ There is no limitation on the number of log streams in a log group, and you can search through the entire complement of logs for your application in a single CloudWatch Logs Insights query. Having a separate log stream for each pod in a Kubernetes service, or for every EC2 instance in your fleet, is a standard pattern.
+:::
+:::info
+ The default retention period for a log group is *indefinite*. The best practice is to set the retention period at the time of creating the log group.
+
+ While you can set this in the CloudWatch console at any time, the best practice is to do so either in tandem with the log group creation using infrastructure as code (CloudFormation, Cloud Development Kit, etc.) or using the `retention_in_days` setting inside of the CloudWatch agent configuration.
+
+ Either approach lets you set the log retention period proactively, aligned with your project's data retention requirements.
+:::
+
+:::info
+ Log group data is always encrypted in CloudWatch Logs.
By default, CloudWatch Logs uses `server-side` encryption for the log data at rest. As an alternative, you can use AWS Key Management Service for this encryption. [Encryption using AWS KMS](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/encrypt-log-data-kms.html) is enabled at the log group level, by associating a KMS key with a log group, either when you create the log group or after it exists. This can be configured using infrastructure as code (CloudFormation, Cloud Development Kit, etc.).
+
+ Using AWS Key Management Service to manage keys for CloudWatch Logs requires additional configuration and granting permissions to the keys for your users.[^1]
+:::
+### Log formatting
+
+CloudWatch Logs has the capability to automatically discover log fields and index JSON data upon ingestion. This feature facilitates ad hoc queries and filtering, enhancing the usability of log data. However, it's important to note that automatic indexing is only applicable to structured data. Unstructured logging data won't be automatically indexed but can still be delivered to CloudWatch Logs.
+
+Unstructured logs can still be searched or queried using a regular expression with the `parse` command.
+
+:::info
+ The two best practices for log formats when using CloudWatch Logs:
+
+ 1. Use a structured log formatter such as [Log4j](https://logging.apache.org/log4j/2.x/), [`python-json-logger`](https://pypi.org/project/python-json-logger/), or your framework's native JSON emitter.
+ 2. Send a single line of logging per event to your log destination.
+
+ Note that when sending multiple lines of JSON logging, each line will be interpreted as a single event.
+:::
+### Handling `stdout`
+
+As discussed in our [log signals](../../signals/logs/#log-to-stdout) page, the best practice is to decouple logging systems from their generating applications. However, sending data from `stdout` to a file is a common pattern on many (if not most) platforms. Container orchestration systems such as Kubernetes or [Amazon Elastic Container Service](https://aws.amazon.com/ecs/) manage this delivery of `stdout` to a log file automatically, allowing each log to be collected by a collector. The CloudWatch agent then reads this file in real time and forwards the data to a log group on your behalf.
+
+:::info
+ Use the pattern of simplified application logging to `stdout`, with collection by an agent, as much as possible.
+:::
+### Filtering logs
+
+There are many reasons to filter your logs, such as preventing the persistent storage of personal data, or only capturing data that is of a particular log level. In any event, the best practice is to perform this filtering as close to the originating system as possible. In the case of CloudWatch, this will mean *before* data is delivered into CloudWatch Logs for analysis. The CloudWatch agent can perform this filtering for you.
+
+:::info
+ Use the [`filters`](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html#CloudWatch-Agent-Configuration-File-Logssection) feature to `include` log levels that you want and `exclude` patterns that are known not to be desirable, e.g. credit card numbers, phone numbers, etc.
+:::
+:::tip
+ Filtering out certain forms of known data that can potentially leak into your logs can be time-consuming and error-prone. However, for workloads that handle specific types of known undesirable data (e.g.
credit card numbers, Social Security numbers), having a filter for these records can prevent a potentially damaging compliance issue in the future. For example, dropping all records that contain a Social Security number can be as simple as this configuration:
+
+ ```
+ "filters": [
+     {
+        "type": "exclude",
+        "expression": "\\b(?!000|666|9\\d{2})([0-8]\\d{2}|7([0-6]\\d))([-]?|\\s{1})(?!00)\\d\\d\\2(?!0000)\\d{4}\\b"
+     }
+ ]
+ ```
+:::
+
+### Multi-line logging
+
+The best practice for all logging is to use [structured logging](../../signals/logs/#structured-logging-is-key-to-success) with a single line emitted for every discrete log event. However, there are many legacy and ISV-supported applications that do not have this option. For these workloads, CloudWatch Logs will interpret each line as a unique event unless they are emitted using a multi-line-aware protocol. The CloudWatch agent can perform this with the [`multi_line_start_pattern`](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html#CloudWatch-Agent-Configuration-File-Logssection) directive.
+
+:::info
+ Use the `multi_line_start_pattern` directive to ease the burden of ingesting multi-line logging into CloudWatch Logs.
+:::
+### Configuring logging class
+
+CloudWatch Logs offers two [classes](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CloudWatch_Logs_Log_Classes.html) of log groups:
+
+- The CloudWatch Logs Standard log class is a full-featured option for logs that require real-time monitoring or logs that you access frequently.
+
+- The CloudWatch Logs Infrequent Access log class is a new log class that you can use to cost-effectively consolidate your logs. This log class offers a subset of CloudWatch Logs capabilities including managed ingestion, storage, cross-account log analytics, and encryption with a lower ingestion price per GB. The Infrequent Access log class is ideal for ad-hoc querying and after-the-fact forensic analysis on infrequently accessed logs.
+
+:::info
+ Use the `log_group_class` directive to specify which log group class to use for the new log group. Valid values are **STANDARD** and **INFREQUENT_ACCESS**. If you omit this field, the default of **STANDARD** is used by the agent.
+:::
+## Search with CloudWatch Logs
+
+### Manage costs with query scoping
+
+With data delivered into CloudWatch Logs, you can now search through it as required. Be aware that CloudWatch Logs charges per gigabyte of data scanned. There are strategies for keeping your query scope under control, which will result in reduced data scanned.
+
+:::info
+ When searching your logs, ensure that your time and date range is appropriate. CloudWatch Logs allows you to set relative or absolute time ranges for scans. *If you are only looking for entries from the day before, then there is no need to include scans of logs from today!*
+:::
+
+:::info
+ You can search multiple log groups in a single query, but doing so will cause more data to be scanned. When you have identified the log group(s) you need to target, reduce your query scope to match.
+:::
+
+:::tip
+ You can see how much data each query actually scans directly from the CloudWatch console. This approach can help you create queries that are efficient.
+
+ ![Preview of the CloudWatch Logs console](../../images/cwl1.png)
+:::
+
+### Share successful queries with others
+
+While the [CloudWatch Logs query syntax](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html) is not complex, writing certain queries from scratch can still be time-consuming. Sharing well-written queries with other users within the same AWS account can streamline the investigation of application logs. This can be achieved directly from the [AWS Management Console](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_Insights-Saving-Queries.html) or programmatically using [CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-logs-querydefinition.html) or [AWS CDK](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_logs.CfnQueryDefinition.html). Doing so reduces the amount of rework required for others who need to analyze log data.
+
+:::info
+ Save queries that are often repeated into CloudWatch Logs so they can be prepopulated for your users.
+
+ ![The CloudWatch Logs query editor page](../../images/cwl2.png)
+:::
+
+### Pattern analysis
+
+CloudWatch Logs Insights uses machine learning algorithms to find patterns when you query your logs. A pattern is a shared text structure that recurs among your log fields. Patterns are useful for analyzing large log sets because a large number of log events can often be compressed into a few patterns.[^2]
+
+:::info
+ Use the `pattern` command to automatically cluster your log data into patterns.
+
+ ![The CloudWatch Logs query pattern example](../../images/pattern_analysis.png)
+:::
+
+### Compare (diff) with previous time ranges
+
+CloudWatch Logs Insights enables comparison of log event changes over time, aiding in error detection and trend identification. Comparison queries reveal patterns, facilitating quick trend analysis, with the ability to examine sample raw log events for deeper investigation. Queries are analyzed against two time periods: the selected period and an equal-length comparison period.[^3]
+
+:::info
+ Compare changes in your log events over time using the `diff` command.
+
+ ![The CloudWatch Logs query difference example](../../images/diff-query.png)
+:::
+
+[^1]: See [How to search through your AWS Systems Manager Session Manager console logs – Part 1](https://aws.amazon.com/blogs/mt/how-to-search-through-your-aws-systems-manager-session-manager-console-logs-part-1/) for a practical example of CloudWatch Logs log group encryption with access privileges.
+
+[^2]: See [CloudWatch Logs Insights Pattern Analysis](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_AnalyzeLogData_Patterns.html) for more detailed insights.
+
+[^3]: See [CloudWatch Logs Insights Compare (diff) with previous ranges](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_AnalyzeLogData_Compare.html) for more information.
+
diff --git a/docusaurus/observability-best-practices/docs/tools/logs/logs-insights-examples.md b/docusaurus/observability-best-practices/docs/tools/logs/logs-insights-examples.md
new file mode 100644
index 000000000..b495f9d23
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/tools/logs/logs-insights-examples.md
@@ -0,0 +1,187 @@
+# CloudWatch Logs Insights Example Queries
+
+[CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) provides a powerful platform for analyzing and querying CloudWatch log data.
It allows you to interactively search through your log data using a SQL-like query language with a few simple but powerful commands.
+
+CloudWatch Logs Insights provides out-of-the-box example queries for the following categories:
+
+- Lambda
+- VPC Flow Logs
+- CloudTrail
+- Common Queries
+- Route 53
+- AWS AppSync
+- NAT Gateway
+
+In this section of the best practices guide we provide some example queries for other types of logs that are not currently included in the out-of-the-box examples. This list will evolve and change over time, and you can submit your own examples for review by opening an [issue](https://github.com/aws-observability/observability-best-practices/issues) on GitHub.
+
+## API Gateway
+
+### Last 20 Messages containing an HTTP Method Type
+
+```
+filter @message like /$METHOD/
+| fields @timestamp, @message
+| sort @timestamp desc
+| limit 20
+```
+
+This query will return the last 20 log messages containing a specific HTTP method, sorted in descending timestamp order. Substitute **$METHOD** for the method you are querying for. Here is an example of how to use this query:
+
+```
+filter @message like /POST/
+| fields @timestamp, @message
+| sort @timestamp desc
+| limit 20
+```
+
+:::tip
+ You can change the `limit` value in order to return a different number of messages.
+:::
+
+### Top 20 Talkers Sorted by IP
+
+```
+fields @timestamp, @message
+| stats count() by ip
+| sort ip asc
+| limit 20
+```
+
+This query will return the top 20 talkers sorted by IP. This can be useful for detecting malicious activity against your API.
+
+As a next step you could then add an additional filter for method type. For example, this query would show the top talkers by IP, but only for the "PUT" method call:
+
+```
+fields @timestamp, @message
+| filter @message like /PUT/
+| stats count() by ip
+| sort ip asc
+| limit 20
+```
+
+## CloudTrail Logs
+
+### API throttling errors grouped by error category
+
+```
+stats count(errorCode) as eventCount by eventSource, eventName, awsRegion, userAgent, errorCode
+| filter errorCode = 'ThrottlingException'
+| sort eventCount desc
+```
+
+This query allows you to see API throttling errors grouped by category and displayed in descending order.
+
+:::tip
+ In order to use this query you would first need to ensure you are [sending CloudTrail logs to CloudWatch.](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/send-cloudtrail-events-to-cloudwatch-logs.html)
+:::
+
+### Root account activity in line graph
+
+```
+fields @timestamp, @message, userIdentity.type
+| filter userIdentity.type='Root'
+| stats count() as RootActivity by bin(5m)
+```
+
+With this query you can visualize root account activity in a line graph. This query aggregates the root activity over time, counting the occurrences of root activity within each 5-minute interval.
+:::tip
+ [Visualize log data in graphs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_Insights-Visualizing-Log-Data.html)
+:::
+
+## VPC Flow Logs
+
+### Filtering flow logs for a selected source IP address with action as REJECT
+
+```
+fields @timestamp, @message, @logStream, @log | filter srcAddr like '$SOURCEIP' and action = 'REJECT'
+| sort @timestamp desc
+| limit 20
+```
+
+This query will return the last 20 log messages containing a 'REJECT' from the $SOURCEIP. This can be used to detect whether traffic is explicitly rejected, or whether the issue is some type of client-side network configuration problem.
+
+:::tip
+ Ensure that you substitute the value of the IP address you are interested in for '$SOURCEIP'.
+:::
+
+```
+fields @timestamp, @message, @logStream, @log | filter srcAddr like '10.0.0.5' and action = 'REJECT'
+| sort @timestamp desc
+| limit 20
+```
+
+### Grouping network traffic by Availability Zones
+
+```
+stats sum(bytes / 1048576) as Traffic_MB by azId as AZ_ID
+| sort Traffic_MB desc
+```
+
+This query retrieves network traffic data grouped by Availability Zone (AZ). It calculates the total traffic in megabytes (MB) by summing the bytes and converting them to MB. The results are then sorted in descending order based on the traffic volume in each AZ.
+
+### Grouping network traffic by flow direction
+
+```
+stats sum(bytes / 1048576) as Traffic_MB by flowDirection as Flow_Direction
+| sort Traffic_MB desc
+```
+
+This query is designed to analyze network traffic grouped by flow direction (ingress or egress).
+
+### Top 10 data transfers by source and destination IP addresses
+
+```
+stats sum(bytes / 1048576) as Data_Transferred_MB by srcAddr as Source_IP, dstAddr as Destination_IP
+| sort Data_Transferred_MB desc
+| limit 10
+```
+
+This query retrieves the top 10 data transfers by source and destination IP addresses, allowing you to identify the most significant data transfers between specific source and destination IP addresses.
+
+## Amazon SNS Logs
+
+### Count of SMS message failures by reason
+
+```
+filter status = "FAILURE"
+| stats count(*) by delivery.providerResponse as FailureReason
+| sort delivery.providerResponse desc
+```
+
+The query above lists the count of delivery failures sorted by reason in descending order. This query can be used to find the reasons for delivery failure.
+
+### SMS message failures due to an invalid phone number
+
+```
+fields notification.messageId as MessageId, delivery.destination as PhoneNumber
+| filter status = "FAILURE" and delivery.providerResponse = "Invalid phone number"
+| limit 100
+```
+
+This query returns the messages that failed to deliver due to an invalid phone number. This can be used to identify phone numbers that need to be corrected.
+
+### Message failure statistics by SMS type
+
+```
+fields delivery.smsType
+| filter status = "FAILURE"
+| stats count(notification.messageId), avg(delivery.dwellTimeMs), sum(delivery.priceInUSD) by delivery.smsType
+```
+
+This query returns the count, average dwell time, and spend for each SMS type (Transactional or Promotional). This query can be used to establish thresholds to trigger corrective actions. The query can be modified to filter on a specific SMS type, if only that SMS type warrants corrective action.
+
+### SNS failure notifications statistics
+
+```
+fields @MessageID
+| filter status = "FAILURE"
+| stats count(delivery.deliveryId) as FailedDeliveryCount, avg(delivery.dwellTimeMs) as AvgDwellTime, max(delivery.dwellTimeMs) as MaxDwellTime by notification.messageId as MessageID
+| limit 100
+```
+
+This query returns the count, average dwell time, and maximum dwell time for each failed message. This query can be used to establish thresholds to trigger corrective actions.
+
diff --git a/docusaurus/observability-best-practices/docs/tools/metrics.md b/docusaurus/observability-best-practices/docs/tools/metrics.md
new file mode 100644
index 000000000..50bf35577
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/tools/metrics.md
@@ -0,0 +1,90 @@
+# Metrics
+
+Metrics are data about the performance of your system.
Having all the metrics related to your systems and resources available in a centralized place gives you the ability to compare metrics, analyze performance, and make better strategic decisions, such as scaling resources up or in. Metrics are also important for knowing the health of your resources so that you can take proactive measures.
+
+Metric data is foundational and is used to drive [alarms](../signals/alarms/), anomaly detection, [events](../signals/events/), [dashboards](../tools/dashboards), and more.
+
+## Vended metrics
+
+[CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) collect data about the performance of your systems. By default, most AWS services provide free metrics for their resources. This includes [Amazon EC2](https://aws.amazon.com/ec2/) instances, [Amazon RDS](https://aws.amazon.com/rds/), [Amazon S3](https://aws.amazon.com/s3/?p=pm&c=s3&z=4) buckets, and many more.
+
+We refer to these metrics as *vended metrics*. There is no charge for the collection of vended metrics in your AWS account.
+
+:::info
+ For a complete list of AWS services that emit metrics to CloudWatch, see [this page](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/aws-services-cloudwatch-metrics.html).
+:::
+
+## Querying metrics
+
+You can use the [metric math](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html) feature in CloudWatch to query multiple metrics and use math expressions to analyze the metrics at a finer granularity. For example, you can write a metric math expression to find the Lambda error rate with a query such as:
+
+    Errors/Requests
+
+Below you see an example of how this can appear in the CloudWatch console:
+
+![Metric math example](../images/metrics1.png)
+
+:::info
+ Use metric math to get the most value from your data and derive values from the performance of separate data sources.
+:::
+
+CloudWatch also supports conditional statements. For example, to return a value of `1` for each timeseries where latency is over a specific threshold, and `0` for all other data points, a query would resemble this:
+
+    IF(latency>threshold, 1, 0)
+
+In the CloudWatch console we can use this logic to create boolean values, which in turn can trigger [CloudWatch alarms](../tools/alarms) or other actions. This enables automatic actions from derived datapoints. An example from the CloudWatch console is below:
+
+![Alarm creation from a derived value](../images/metrics2.png)
+
+:::info
+ Use conditional statements to trigger alarms and notifications when performance exceeds thresholds for derived values.
+:::
+
+You can also use the `SEARCH` function to show the top `n` for any metric. When visualizing the best or worst performing metrics across a large number of timeseries (e.g. thousands of servers), this approach allows you to see only the data that matters most. Here is an example of a search returning the top two CPU-consuming EC2 instances, averaged over the last five minutes:
+
+```
+SLICE(SORT(SEARCH('{AWS/EC2,InstanceId} MetricName="CPUUtilization"', 'Average', 300), MAX, DESC), 0, 2)
+```
+
+And a view of the same in the CloudWatch console:
+
+![Search query in CloudWatch metrics](../images/metrics3.png)
+
+:::info
+ Use the `SEARCH` approach to rapidly display the best or worst performing resources in your environment, and then display these in [dashboards](../tools/dashboards).
+:::
+
+## Collecting metrics
+
+If you would like additional metrics, such as memory or disk space utilization for your EC2 instances, you can use the [CloudWatch agent](../tools/cloudwatch_agent/) to push this data to CloudWatch on your behalf. If you have custom application data that needs to be visualized graphically and stored as CloudWatch metrics, you can use the [`PutMetricData` API](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricData.html) to publish custom metrics to CloudWatch.
+
+:::info
+ Use one of the [AWS SDKs](https://aws.amazon.com/developer/tools/) to push metric data to CloudWatch rather than the bare API.
+:::
+
+`PutMetricData` API calls are charged per request, so the best practice is to use the API efficiently. The Values and Counts method of this API enables you to publish up to 150 values per metric in a single `PutMetricData` request, and it supports retrieving percentile statistics on this data. Instead of making a separate API call for each datapoint, group your datapoints together and push them to CloudWatch in a single `PutMetricData` call. This approach benefits you in two ways:
+
+1. Lower CloudWatch costs
+1. Less risk of `PutMetricData` API throttling
+
+:::info
+ When using `PutMetricData`, the best practice is to batch your data into single `PUT` operations whenever possible.
+:::
+:::info
+ If large volumes of metrics are emitted into CloudWatch, consider using [Embedded Metric Format](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Manual.html) as an alternative approach. Note that Embedded Metric Format does not use `PutMetricData` (and is not charged for it), though it does incur billing from the use of [CloudWatch Logs](../tools/logs/).
+:::
+
+## Anomaly detection
+
+CloudWatch has an [anomaly detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) feature that augments your observability strategy by learning what *normal* is based on recorded metrics. The use of anomaly detection is a [best practice](../signals/metrics/#use-anomaly-detection-algorithms) for any metric signal collection system.
+
+Anomaly detection builds a model over a two-week period of time.
+
+:::warning
+ Anomaly detection only builds its model from the time of creation forward. It does not project backwards in time to find previous outliers.
+:::
+
+:::warning
+ Anomaly detection does not know what *good* is for a metric, only what *normal* is based on standard deviation.
+:::
+
+:::info
+ The best practice is to train your anomaly detection models to analyze only the times of day when normal behavior is expected. You can define time periods to exclude from training (such as nights, weekends, or holidays).
+:::
+
+An example of an anomaly detection band can be seen here, with the band in gray:
+
+![Anomaly detection band](../images/metrics4.png)
+
+Setting exclusion windows for anomaly detection can be done with the CloudWatch console, [CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-cloudwatch-anomalydetector-configuration.html), or one of the AWS SDKs.
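+
+As an illustration only, here is a minimal sketch using the AWS SDK for Python (boto3) that creates an anomaly detector for an EC2 instance's `CPUUtilization` metric and excludes a holiday window from model training. The region, instance ID, and dates below are placeholder values, not recommendations:
+
+```python
+import datetime
+
+import boto3
+
+cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region
+
+# Create (or update) an anomaly detector for a single metric, excluding a
+# holiday period so unusual (but expected) traffic does not skew the model.
+cloudwatch.put_anomaly_detector(
+    SingleMetricAnomalyDetector={
+        "Namespace": "AWS/EC2",
+        "MetricName": "CPUUtilization",
+        "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder instance
+        "Stat": "Average",
+    },
+    Configuration={
+        "ExcludedTimeRanges": [
+            {
+                "StartTime": datetime.datetime(2024, 12, 24),
+                "EndTime": datetime.datetime(2024, 12, 27),
+            }
+        ],
+        "MetricTimezone": "UTC",
+    },
+)
+```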
diff --git a/docusaurus/observability-best-practices/docs/tools/observability_accelerator.md b/docusaurus/observability-best-practices/docs/tools/observability_accelerator.md
new file mode 100644
index 000000000..72481262e
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/tools/observability_accelerator.md
@@ -0,0 +1,10 @@
+# AWS Observability Accelerator
+
+The AWS Observability Accelerator is a set of opinionated modules to help you set up observability for your AWS environments with AWS native services and AWS-managed observability services such as Amazon Managed Service for Prometheus, Amazon Managed Grafana, AWS Distro for OpenTelemetry (ADOT), and Amazon CloudWatch.
+
+We provide curated metrics, logs, and traces collection, CloudWatch dashboards, alerting rules, and Grafana dashboards for your EKS infrastructure, Java/JMX and NGINX-based workloads, and your custom applications.
+
+The AWS Observability Accelerator provides shared artifacts (documentation, dashboards, and alerting rules) for the [Terraform](https://github.com/aws-observability/terraform-aws-observability-accelerator) and [CDK](https://github.com/aws-observability/cdk-aws-observability-accelerator) projects.
+
+Check out the project documentation for the [Terraform](https://aws-observability.github.io/terraform-aws-observability-accelerator/) and [CDK](https://aws-observability.github.io/cdk-aws-observability-accelerator/) projects for more information.
+
diff --git a/docusaurus/observability-best-practices/docs/tools/rum.md b/docusaurus/observability-best-practices/docs/tools/rum.md
new file mode 100644
index 000000000..3a22123cc
--- /dev/null
+++ b/docusaurus/observability-best-practices/docs/tools/rum.md
@@ -0,0 +1,111 @@
+# Real User Monitoring
+
+With CloudWatch RUM, you can perform real user monitoring to collect and view client-side data about your web application performance from actual user sessions in near real time. The data that you can visualize and analyze includes page load times, client-side errors, and user behavior. When you view this data, you can see it all aggregated together, and also see breakdowns by the browsers and devices that your customers use.
+
+![RUM application monitor dashboard showing device breakdown](../images/rum2.png)
+
+## Web client
+
+The CloudWatch RUM web client is developed and built using Node.js version 16 or higher. The code is [publicly available](https://github.com/aws-observability/aws-rum-web) on GitHub. You can use the client with [Angular](https://github.com/aws-observability/aws-rum-web/blob/main/docs/cdn_angular.md) and [React](https://github.com/aws-observability/aws-rum-web/blob/main/docs/cdn_react.md) applications.
+
+CloudWatch RUM is designed to create no perceptible impact on your application’s load time, performance, and unload time.
+
+:::note
+ End user data that you collect for CloudWatch RUM is retained for 30 days and then automatically deleted. If you want to keep the RUM events for a longer time, you can choose to have the app monitor send copies of the events to CloudWatch Logs in your account.
+:::
+:::tip
+ If avoiding potential interruption by ad blockers is a concern for your web application, then you may wish to host the web client on your own content delivery network, or even inside your own web site. Our [documentation on GitHub](https://github.com/aws-observability/aws-rum-web/blob/main/docs/cdn_installation.md) provides guidance on hosting the web client from your own origin domain.
+:::
+
+## Authorize Your Application
+
+To use CloudWatch RUM, your application must have authorization through one of three options:
+
+1. Use authentication from an existing identity provider that you have already set up.
+1. Use an existing Amazon Cognito identity pool.
+1. Let CloudWatch RUM create a new Amazon Cognito identity pool for the application.
+
+:::info
+ Letting CloudWatch RUM create a new Amazon Cognito identity pool for the application requires the least effort to set up. It's the default option.
+:::
+:::tip
+ CloudWatch RUM can be configured to separate unauthenticated users from authenticated users. See [this blog post](https://aws.amazon.com/blogs/mt/how-to-isolate-signed-in-users-from-guest-users-within-amazon-cloudwatch-rum/) for details.
+:::
+
+## Data Protection & Privacy
+
+The CloudWatch RUM client can use cookies to help collect end user data. This is useful for the user journey feature, but is not required. See [our detailed documentation for privacy-related information](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM-privacy.html).[^1]
+
+:::tip
+ While the collection of web application telemetry using RUM is safe and does not expose personally identifiable information (PII) to you through the console or CloudWatch Logs, be mindful that you can collect [custom attributes](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM-custom-metadata.html) through the web client. Be careful not to expose sensitive data using this mechanism.
+:::
+
+## Client Code Snippet
+
+While the code snippet for the CloudWatch RUM web client is automatically generated, you can also manually modify the code snippet to configure the client to your requirements.
+
+:::info
+ Use a cookie consent mechanism to dynamically enable cookie creation in single-page applications. See [this blog post](https://aws.amazon.com/blogs/mt/how-and-when-to-enable-session-cookies-with-amazon-cloudwatch-rum/) for more information.
+:::
+
+### Disable URL Collection
+
+Prevent the collection of resource URLs that might contain personal information.
+
+:::info
+ If your application uses URLs that contain personally identifiable information (PII), we strongly recommend that you disable the collection of resource URLs by setting `recordResourceUrl: false` in the code snippet configuration, before inserting it into your application.
+:::
+
+### Enable Active Tracing
+
+Enable end-to-end tracing by setting `addXRayTraceIdHeader: true` in the web client. This causes the CloudWatch RUM web client to add an X-Ray trace header to HTTP requests.
+
+If you enable this optional setting, XMLHttpRequest and fetch requests made during user sessions sampled by the app monitor are traced. You can then see traces and segments from these user sessions in the RUM dashboard, the CloudWatch ServiceLens console, and the X-Ray console.
+
+Select the checkbox to enable active tracing when you set up your application monitor in the AWS console, and the setting will be automatically enabled in your code snippet.
+
+![Active tracing setup for RUM application monitor](../images/rum1.png)
+
+### Inserting the Snippet
+
+Insert the code snippet that you copied or downloaded in the previous section inside the `<head>` element of your application. Insert it before the `<body>` element or any other `