Move invalid metrics_name to a dedicated place and re-implement a small parser to collect them (#24)

Signed-off-by: Augustin Husson <[email protected]>
Nexucis authored Nov 15, 2024
1 parent 29cf95b commit 401bb1b
Showing 15 changed files with 5,884 additions and 99 deletions.
43 changes: 41 additions & 2 deletions README.md
@@ -6,7 +6,12 @@ Metrics Usage

This tool analyzes static files - like dashboards and Prometheus alert rules - to track where and how Prometheus metrics are used.

It’s especially helpful for identifying whether metrics are actively used. Unused metrics should ideally not be scraped by Prometheus to avoid unnecessary load.
It’s especially helpful for identifying whether metrics are actively used.
Prometheus should ideally not scrape unused metrics to avoid an unnecessary load.

## API exposed

### Metrics

The tool provides an API endpoint, `/api/v1/metrics`, which returns the usage data for each collected metric as shown below:

@@ -77,7 +82,41 @@ You can used the following query parameter to filter the list returned:
* **metric_name**: when used, it will trigger a fuzzy search on the metric_name based on the pattern provided.
* **used**: when used, will return only the metric used or not (depending if you set this boolean to true or to false). Leave it empty if you want both.
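
As an illustration of the two filters above, the following minimal sketch builds the request URL for `/api/v1/metrics`. The base address `http://localhost:8080` and the helper name `buildMetricsURL` are assumptions made for the example; they are not part of this tool.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// buildMetricsURL is a hypothetical helper. It returns the /api/v1/metrics URL
// with the optional filters set. A nil `used` leaves that filter empty, which
// per the README returns both used and unused metrics.
func buildMetricsURL(baseURL, metricName string, used *bool) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	u.Path = "/api/v1/metrics"
	q := url.Values{}
	if metricName != "" {
		q.Set("metric_name", metricName)
	}
	if used != nil {
		q.Set("used", strconv.FormatBool(*used))
	}
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// Ask only for unused metrics whose name fuzzily matches "node_cpu".
	unused := false
	target, err := buildMetricsURL("http://localhost:8080", "node_cpu", &unused)
	if err != nil {
		panic(err)
	}
	fmt.Println(target)
}
```

The resulting URL can be passed to any HTTP client, for example `http.Get`.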

## How to use it
### Invalid Metrics

The API endpoint `/api/v1/invalid_metrics` is exposing the usage for metrics that contains variable or regexp.

```json
{
  "node_cpu_utilization_${instance}": {
    "usage": {
      "alertRules": [
        {
          "prom_link": "https://prometheus.demo.do.prometheus.io",
          "group_name": "ansible managed alert rules",
          "name": "NodeCPUUtilizationHigh",
          "expression": "instance:node_cpu_utilisation:rate5m * 100 > ignoring (severity) node_cpu_utilization_percent_threshold{severity=\"critical\"}"
        }
      ]
    }
  },
  "node_disk_discard_time_.+": {
    "usage": {
      "dashboards": [
        "https://demo.perses.dev/api/v1/projects/perses/dashboards/nodeexporterfull"
      ]
    }
  }
}
```
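
A client could decode a response of this shape roughly as follows. This is a sketch only: the struct types are hypothetical and mirror just the fields visible in the sample above, not the tool's actual `v1.Metric` and `v1.MetricUsage` definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// alertRule, usage and invalidMetric are hypothetical types that cover only
// the fields shown in the sample response.
type alertRule struct {
	PromLink   string `json:"prom_link"`
	GroupName  string `json:"group_name"`
	Name       string `json:"name"`
	Expression string `json:"expression"`
}

type usage struct {
	AlertRules []alertRule `json:"alertRules,omitempty"`
	Dashboards []string    `json:"dashboards,omitempty"`
}

type invalidMetric struct {
	Usage usage `json:"usage"`
}

// decodeInvalidMetrics parses a response body keyed by the invalid metric name.
func decodeInvalidMetrics(body []byte) (map[string]invalidMetric, error) {
	result := make(map[string]invalidMetric)
	if err := json.Unmarshal(body, &result); err != nil {
		return nil, err
	}
	return result, nil
}

func main() {
	body := []byte(`{"node_disk_discard_time_.+": {"usage": {"dashboards": ["https://demo.perses.dev/api/v1/projects/perses/dashboards/nodeexporterfull"]}}}`)
	metrics, err := decodeInvalidMetrics(body)
	if err != nil {
		panic(err)
	}
	for name, m := range metrics {
		fmt.Printf("%s: %d dashboard(s), %d alert rule(s)\n", name, len(m.Usage.Dashboards), len(m.Usage.AlertRules))
	}
}
```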

### Pending Usage

The API endpoint `/api/v1/pending_usages` is exposing usage associated to metrics that has not yet been associated to the metrics available on the endpoint `/api/v1/metrics`.

It's even possible usage is never associated as the metric doesn't exist anymore.
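
The idea of pending usage can be sketched with a small in-memory model. Everything here is an assumption made for illustration: the `store` type, its methods, and the simplified `usage` struct are hypothetical and do not reproduce the tool's database code.

```go
package main

import "fmt"

// usage is a simplified, hypothetical stand-in for the tool's usage type.
type usage struct {
	Dashboards []string
}

// store keeps known metrics and a buffer of usage that is still pending.
type store struct {
	metrics map[string]*usage // metrics already collected
	pending map[string]*usage // usage waiting for its metric to be collected
}

func newStore() *store {
	return &store{metrics: map[string]*usage{}, pending: map[string]*usage{}}
}

// addUsage attaches usage to a known metric, or parks it as pending.
func (s *store) addUsage(name string, u *usage) {
	target := s.pending
	if _, known := s.metrics[name]; known {
		target = s.metrics
	}
	if existing, ok := target[name]; ok {
		existing.Dashboards = append(existing.Dashboards, u.Dashboards...)
		return
	}
	target[name] = u
}

// addMetric registers a collected metric and moves any pending usage onto it.
func (s *store) addMetric(name string) {
	if _, known := s.metrics[name]; known {
		return
	}
	m := &usage{}
	if p, ok := s.pending[name]; ok {
		m = p
		delete(s.pending, name)
	}
	s.metrics[name] = m
}

func main() {
	s := newStore()
	s.addUsage("up", &usage{Dashboards: []string{"dashboard-a"}})
	s.addUsage("removed_metric", &usage{Dashboards: []string{"dashboard-b"}})
	s.addMetric("up")
	fmt.Println("pending:", len(s.pending), "known:", len(s.metrics))
}
```

In this model, usage for a metric that is never collected (for example because it no longer exists) simply stays in `pending`.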

## Different way to deploy it

### Central instance

87 changes: 66 additions & 21 deletions database/database.go
@@ -28,23 +28,29 @@ import (
type Database interface {
GetMetric(name string) *v1.Metric
ListMetrics() map[string]*v1.Metric
ListInvalidMetrics() map[string]*v1.Metric
ListPendingUsage() map[string]*v1.MetricUsage
EnqueueMetricList(metrics []string)
EnqueueInvalidMetricsUsage(usages map[string]*v1.MetricUsage)
EnqueueUsage(usages map[string]*v1.MetricUsage)
EnqueueLabels(labels map[string][]string)
}

func New(cfg config.Database) Database {
d := &db{
metrics: make(map[string]*v1.Metric),
usage: make(map[string]*v1.MetricUsage),
usageQueue: make(chan map[string]*v1.MetricUsage, 250),
labelsQueue: make(chan map[string][]string, 250),
metricsQueue: make(chan []string, 10),
path: cfg.Path,
metrics: make(map[string]*v1.Metric),
invalidMetrics: make(map[string]*v1.Metric),
usage: make(map[string]*v1.MetricUsage),
usageQueue: make(chan map[string]*v1.MetricUsage, 250),
invalidMetricsUsageQueue: make(chan map[string]*v1.MetricUsage, 250),
labelsQueue: make(chan map[string][]string, 250),
metricsQueue: make(chan []string, 10),
path: cfg.Path,
}

go d.watchUsageQueue()
go d.watchMetricsQueue()
go d.watchInvalidMetricsUsageQueue()
go d.watchLabelsQueue()
if !*cfg.InMemory {
if err := d.readMetricsInJSONFile(); err != nil {
@@ -60,6 +66,8 @@ type db struct {
// metrics is the list of metric name (as a key) associated to their usage based on the different collector activated.
// This struct is our "database".
metrics map[string]*v1.Metric
// invalidMetrics is the list of metric name that likely contains a variable or a regexp and as such cannot be a valid metric name.
invalidMetrics map[string]*v1.Metric
// usage is a buffer in case the metric name has not yet been collected
usage map[string]*v1.MetricUsage
// metricsQueue is the channel that should be used to send and receive the list of metric name to keep in memory.
@@ -73,6 +81,10 @@ type db struct {
// There will be no other way to write in it.
// Doing that allows us to accept more HTTP requests to write data and to delay the actual writing.
usageQueue chan map[string]*v1.MetricUsage
// invalidMetricsUsageQueue is the way to send the usage per metric that is not valid to write in the database.
// There will be no other way to write in it.
// Doing that allows us to accept more HTTP requests to write data and to delay the actual writing.
invalidMetricsUsageQueue chan map[string]*v1.MetricUsage
// path is the path to the JSON file where metrics is flushed periodically
// It is empty if the database is purely in memory.
path string
@@ -83,37 +95,54 @@ type db struct {
// 1. Then let's flush the data into a file periodically (or once the queue is empty (if it happens))
// 2. Read the file directly when a read query is coming
// Like that we have two different ways to read and write the data.
mutex sync.Mutex
metricsMutex sync.Mutex
invalidMetricsUsageMutex sync.Mutex
}

func (d *db) GetMetric(name string) *v1.Metric {
d.mutex.Lock()
defer d.mutex.Unlock()
d.metricsMutex.Lock()
defer d.metricsMutex.Unlock()
return d.metrics[name]
}

func (d *db) ListMetrics() map[string]*v1.Metric {
d.mutex.Lock()
defer d.mutex.Unlock()
d.metricsMutex.Lock()
defer d.metricsMutex.Unlock()
return d.metrics
}

func (d *db) ListInvalidMetrics() map[string]*v1.Metric {
d.invalidMetricsUsageMutex.Lock()
defer d.invalidMetricsUsageMutex.Unlock()
return d.invalidMetrics
}

func (d *db) EnqueueMetricList(metrics []string) {
d.metricsQueue <- metrics
}

func (d *db) ListPendingUsage() map[string]*v1.MetricUsage {
d.metricsMutex.Lock()
defer d.metricsMutex.Unlock()
return d.usage
}

func (d *db) EnqueueUsage(usages map[string]*v1.MetricUsage) {
d.usageQueue <- usages
}

func (d *db) EnqueueInvalidMetricsUsage(usages map[string]*v1.MetricUsage) {
d.invalidMetricsUsageQueue <- usages
}

func (d *db) EnqueueLabels(labels map[string][]string) {
d.labelsQueue <- labels
}

func (d *db) watchMetricsQueue() {
for _metrics := range d.metricsQueue {
d.mutex.Lock()
for _, metricName := range _metrics {
for metricsName := range d.metricsQueue {
d.metricsMutex.Lock()
for _, metricName := range metricsName {
if _, ok := d.metrics[metricName]; !ok {
// As this queue only serves the purpose of storing missing metrics, we are only looking for the one not already present in the database.
d.metrics[metricName] = &v1.Metric{}
@@ -125,13 +154,29 @@ func (d *db) watchMetricsQueue() {
}
}
}
d.mutex.Unlock()
d.metricsMutex.Unlock()
}
}

func (d *db) watchInvalidMetricsUsageQueue() {
for data := range d.invalidMetricsUsageQueue {
d.invalidMetricsUsageMutex.Lock()
for metricName, usage := range data {
if _, ok := d.invalidMetrics[metricName]; !ok {
d.invalidMetrics[metricName] = &v1.Metric{
Usage: usage,
}
} else {
d.invalidMetrics[metricName].Usage = mergeUsage(d.invalidMetrics[metricName].Usage, usage)
}
}
d.invalidMetricsUsageMutex.Unlock()
}
}
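
The queue-plus-mutex pattern used by these watchers can be sketched in isolation as below. This is a simplified illustration, not the project's code: the `registry` type, the `shutdown` and `snapshot` methods, and the plain `append` used in place of `mergeUsage` are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical store written to only through its queue.
type registry struct {
	mu    sync.Mutex
	items map[string][]string
	queue chan map[string][]string
	done  chan struct{}
}

func newRegistry() *registry {
	r := &registry{
		items: make(map[string][]string),
		queue: make(chan map[string][]string, 250),
		done:  make(chan struct{}),
	}
	go r.watch()
	return r
}

// enqueue hands a batch to the watcher; it blocks only when the buffer is full.
func (r *registry) enqueue(batch map[string][]string) {
	r.queue <- batch
}

// watch drains the queue and inserts or merges each entry under the lock.
func (r *registry) watch() {
	defer close(r.done)
	for batch := range r.queue {
		r.mu.Lock()
		for name, refs := range batch {
			r.items[name] = append(r.items[name], refs...)
		}
		r.mu.Unlock()
	}
}

// shutdown stops accepting batches and waits until the queue is drained.
func (r *registry) shutdown() {
	close(r.queue)
	<-r.done
}

// snapshot returns a copy so callers can iterate without holding the lock.
func (r *registry) snapshot() map[string][]string {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make(map[string][]string, len(r.items))
	for name, refs := range r.items {
		out[name] = append([]string(nil), refs...)
	}
	return out
}

func main() {
	r := newRegistry()
	r.enqueue(map[string][]string{"node_disk_discard_time_.+": {"dashboard-a"}})
	r.enqueue(map[string][]string{"node_disk_discard_time_.+": {"dashboard-b"}})
	r.shutdown()
	fmt.Println(r.snapshot())
}
```

Returning a copy from `snapshot` is one possible design choice for read paths; it avoids callers iterating over a map that the watcher may still be writing to.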

func (d *db) watchUsageQueue() {
for data := range d.usageQueue {
d.mutex.Lock()
d.metricsMutex.Lock()
for metricName, usage := range data {
if _, ok := d.metrics[metricName]; !ok {
logrus.Debugf("metric_name %q is used but it's not found by the metric collector", metricName)
@@ -148,13 +193,13 @@ func (d *db) watchUsageQueue() {
d.metrics[metricName].Usage = mergeUsage(d.metrics[metricName].Usage, usage)
}
}
d.mutex.Unlock()
d.metricsMutex.Unlock()
}
}

func (d *db) watchLabelsQueue() {
for data := range d.labelsQueue {
d.mutex.Lock()
d.metricsMutex.Lock()
for metricName, labels := range data {
if _, ok := d.metrics[metricName]; !ok {
// In this case, we should add the metric, because it means the metrics has been found from another source.
@@ -165,7 +210,7 @@ func (d *db) watchLabelsQueue() {
d.metrics[metricName].Labels = utils.Merge(d.metrics[metricName].Labels, labels)
}
}
d.mutex.Unlock()
d.metricsMutex.Unlock()
}
}

@@ -180,8 +225,8 @@ func (d *db) flush(period time.Duration) {
}

func (d *db) writeMetricsInJSONFile() error {
d.mutex.Lock()
defer d.mutex.Unlock()
d.metricsMutex.Lock()
defer d.metricsMutex.Unlock()
data, err := json.Marshal(d.metrics)
if err != nil {
return err
18 changes: 8 additions & 10 deletions go.mod
@@ -4,14 +4,14 @@ go 1.23.1

require (
github.com/go-openapi/strfmt v0.23.0
github.com/grafana/grafana-openapi-client-go v0.0.0-20241101140420-bc381928ae6e
github.com/grafana/grafana-openapi-client-go v0.0.0-20241113095943-9cb2bbfeb8a3
github.com/labstack/echo/v4 v4.12.0
github.com/lithammer/fuzzysearch v1.1.8
github.com/perses/common v0.26.0
github.com/perses/perses v0.49.0
github.com/prometheus/client_golang v1.20.5
github.com/prometheus/common v0.60.1
github.com/prometheus/prometheus v0.55.1
github.com/prometheus/prometheus v0.300.0
github.com/sirupsen/logrus v1.9.3
github.com/stretchr/testify v1.9.0
golang.org/x/oauth2 v0.24.0
@@ -45,8 +45,6 @@ require (
github.com/go-git/go-billy/v5 v5.5.0 // indirect
github.com/go-git/go-git/v5 v5.12.0 // indirect
github.com/go-jose/go-jose/v4 v4.0.4 // indirect
github.com/go-kit/log v0.2.1 // indirect
github.com/go-logfmt/logfmt v0.6.0 // indirect
github.com/go-logr/logr v1.4.2 // indirect
github.com/go-logr/stdr v1.2.2 // indirect
github.com/go-openapi/analysis v0.23.0 // indirect
@@ -75,7 +73,7 @@ require (
github.com/jpillora/backoff v1.0.0 // indirect
github.com/json-iterator/go v1.1.12 // indirect
github.com/kevinburke/ssh_config v1.2.0 // indirect
github.com/klauspost/compress v1.17.9 // indirect
github.com/klauspost/compress v1.17.10 // indirect
github.com/labstack/gommon v0.4.2 // indirect
github.com/lucasb-eyer/go-colorful v1.2.0 // indirect
github.com/mailru/easyjson v0.7.7 // indirect
@@ -111,13 +109,13 @@ require (
github.com/zitadel/schema v1.3.0 // indirect
gitlab.com/digitalxero/go-conventional-commit v1.0.7 // indirect
go.mongodb.org/mongo-driver v1.14.0 // indirect
go.opentelemetry.io/otel v1.29.0 // indirect
go.opentelemetry.io/otel/metric v1.29.0 // indirect
go.opentelemetry.io/otel/sdk v1.29.0 // indirect
go.opentelemetry.io/otel/trace v1.29.0 // indirect
go.opentelemetry.io/otel v1.31.0 // indirect
go.opentelemetry.io/otel/metric v1.31.0 // indirect
go.opentelemetry.io/otel/sdk v1.30.0 // indirect
go.opentelemetry.io/otel/trace v1.31.0 // indirect
go.uber.org/atomic v1.11.0 // indirect
golang.org/x/crypto v0.28.0 // indirect
golang.org/x/net v0.29.0 // indirect
golang.org/x/net v0.30.0 // indirect
golang.org/x/sync v0.8.0 // indirect
golang.org/x/sys v0.26.0 // indirect
golang.org/x/text v0.19.0 // indirect