Datadog Log Management #5

Open · 2 tasks
CumpsD opened this issue Nov 10, 2020 · 0 comments
Labels: monitoring (Everything related to monitoring.), research (Topic being actively researched.)

CumpsD (Contributor) commented Nov 10, 2020

Logging

Context

We use Datadog for our monitoring requirements, one of which is log management. Our philosophy during development has been to get as much information as possible into Datadog and deal with it there. This means we forward our CloudWatch Fargate logs to Datadog, as well as our Lambda logs and all available AWS service logs (API Gateway, S3, ...).
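One common way to wire this up is to subscribe each CloudWatch log group to the Datadog Forwarder Lambda. Below is a minimal sketch using boto3, assuming a forwarder is already deployed; the ARN, region, and log group names are placeholders, not our actual values (the forwarder also needs an invoke permission for CloudWatch Logs, omitted here):

```python
import boto3

logs = boto3.client("logs", region_name="eu-west-1")

# Placeholder ARN of an already-deployed Datadog Forwarder Lambda.
FORWARDER_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:datadog-forwarder"

# Subscribe each log group we care about to the forwarder; an empty
# filter pattern sends every log event along.
for log_group in ["/ecs/example-fargate-service", "/aws/lambda/example-function"]:
    logs.put_subscription_filter(
        logGroupName=log_group,
        filterName="datadog-forwarder",
        filterPattern="",
        destinationArn=FORWARDER_ARN,
    )
```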

Since the logging UI of Datadog has a really good search, it has served us very well during development to troubleshoot issues. Over time, we started using more of Datadog's features to keep the logging under control. Most of this configuration is concentrated around Pipelines and Indexes.

Pipelines are used to preprocess incoming logs, reshaping or remapping them: for example, remapping the WARN or ERROR status of known functional errors (which are not technical errors) to an OK state, and mapping fields to standard field names so the Datadog UI is better enriched. This is pretty cheap at $0.10 per GB of ingested logs.
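As an illustration of the kind of pipeline we mean, here is a hedged sketch that creates one through the Logs Pipelines API: a category processor tags known functional errors, and a status remapper then derives the log status from that tag. The service name, attribute names, and error codes are made up for the example, and the exact processor fields are from memory and worth checking against the API reference:

```python
import os
import requests

headers = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}

pipeline = {
    "name": "normalize-functional-errors",
    "is_enabled": True,
    "filter": {"query": "service:example-api"},  # hypothetical service name
    "processors": [
        {
            # Tag known functional errors (expected business outcomes, not failures).
            "type": "category-processor",
            "name": "flag known functional errors",
            "target": "computed_status",
            "categories": [
                {"name": "ok", "filter": {"query": "@errorCode:(DuplicateRequest OR ValidationFailed)"}},
            ],
        },
        {
            # Use the computed attribute as the log status instead of WARN/ERROR.
            "type": "status-remapper",
            "name": "remap status from computed_status",
            "sources": ["computed_status"],
        },
    ],
}

resp = requests.post(
    "https://api.datadoghq.com/api/v1/logs/config/pipelines",
    headers=headers,
    json=pipeline,
)
resp.raise_for_status()
```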

Indexes, on the other hand, are what is made available to search; they determine which logs are kept for 15 days. This is the more expensive part at $1.70 per million log events.

To keep indexes from storing every available log, we use exclusion filters to get rid of logs that are of no interest to us. It is an ongoing task to examine the search UI and determine whether there is more to be excluded.
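For reference, a hedged sketch of what such an exclusion filter looks like when updating an index through the Logs Indexes API; the index name, query, and field names are illustrative and should be verified against the API reference:

```python
import os
import requests

headers = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}

index = {
    "filter": {"query": "*"},
    "exclusion_filters": [
        {
            # Drop healthy 2xx access logs from the index; they are still
            # ingested, just no longer stored and searchable for 15 days.
            "name": "drop-2xx-access-logs",
            "is_enabled": True,
            "filter": {
                "query": "source:nginx @http.status_code:[200 TO 299]",
                "sample_rate": 1.0,  # exclude all matching events
            },
        }
    ],
}

# "main" is the default index name; ours may differ.
resp = requests.put(
    "https://api.datadoghq.com/api/v1/logs/config/indexes/main",
    headers=headers,
    json=index,
)
resp.raise_for_status()
```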

We follow this approach because logging is, by definition, unpredictable. It is of no use to exclude everything and then include only what interests us, because we don't always know in advance what that is; doing so would cause us to miss log events that are interesting but that we hadn't thought about.

Problem

Over the last month, usage of our product has taken off and the number of requests has risen rapidly. Because we log all incoming requests, this has caused an influx of log events. Some of our clients have made 12 million requests in just a few days, producing 90 million log events and making our Datadog bill rise sharply.
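To put a rough number on it: at the index price above, those 90 million events alone come to about 90 × $1.70 ≈ $153 of indexing cost, on top of the $0.10/GB ingestion, and the amount grows linearly with request volume.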

As an emergency measure to save costs, indexing has been disabled. The next step is to define more exclusion rules that filter far more log events out of the index, keeping costs under control.

Additionally, we will use Log Archives on S3 to store logs in case they are needed for troubleshooting.
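A hedged sketch of creating such an archive through the v2 Logs Archives API; the bucket, path, AWS account, and IAM role are placeholders, and the request shape should be double-checked against the API reference:

```python
import os
import requests

headers = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}

archive = {
    "data": {
        "type": "archives",
        "attributes": {
            "name": "all-logs-to-s3",
            "query": "*",  # archive everything that is ingested
            "destination": {
                "type": "s3",
                "bucket": "example-log-archive-bucket",    # placeholder bucket
                "path": "/datadog",
                "integration": {
                    "account_id": "123456789012",          # placeholder AWS account
                    "role_name": "DatadogLogArchiveRole",  # placeholder IAM role
                },
            },
        },
    }
}

resp = requests.post(
    "https://api.datadoghq.com/api/v2/logs/config/archives",
    headers=headers,
    json=archive,
)
resp.raise_for_status()
```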

Progress

  • Define Exclusion Rules
  • Set up Log Archives
CumpsD added the research (Topic being actively researched.) and monitoring (Everything related to monitoring.) labels on Nov 10, 2020