Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default metrics filters for new metrics agent #834

Merged
Binary file modified .DS_Store
Binary file not shown.
4 changes: 2 additions & 2 deletions apis/telemetry/v1alpha1/telemetry_types.go
Original file line number Diff line number Diff line change
Expand Up @@ -130,8 +130,8 @@ type McacTelemetrySpec struct {
}

type CassandraAgentSpec struct {
Endpoint Endpoint `json:"endpoint,omitempty"`
Filters []promapi.RelabelConfig `json:"filters,omitempty"`
Endpoint *Endpoint `json:"endpoint,omitempty"`
Relabels []promapi.RelabelConfig `json:"relabels,omitempty"`
}

type Endpoint struct {
Expand Down
10 changes: 7 additions & 3 deletions apis/telemetry/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

12 changes: 6 additions & 6 deletions config/crd/bases/k8ssandra.io_k8ssandraclusters.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -13740,7 +13740,7 @@ spec:
port:
type: string
type: object
filters:
relabels:
items:
description: 'RelabelConfig allows dynamic
rewriting of the label set, being
Expand Down Expand Up @@ -14259,7 +14259,7 @@ spec:
port:
type: string
type: object
filters:
relabels:
items:
description: 'RelabelConfig allows dynamic
rewriting of the label set, being applied
Expand Down Expand Up @@ -16841,7 +16841,7 @@ spec:
port:
type: string
type: object
filters:
relabels:
items:
description: 'RelabelConfig allows dynamic rewriting
of the label set, being applied to samples before
Expand Down Expand Up @@ -25518,7 +25518,7 @@ spec:
port:
type: string
type: object
filters:
relabels:
items:
description: 'RelabelConfig allows dynamic rewriting
of the label set, being applied to samples before
Expand Down Expand Up @@ -28195,7 +28195,7 @@ spec:
port:
type: string
type: object
filters:
relabels:
items:
description: 'RelabelConfig allows dynamic rewriting
of the label set, being applied to samples before
Expand Down Expand Up @@ -29864,7 +29864,7 @@ spec:
port:
type: string
type: object
filters:
relabels:
items:
description: 'RelabelConfig allows dynamic rewriting
of the label set, being applied to samples before
Expand Down
2 changes: 1 addition & 1 deletion config/crd/bases/reaper.k8ssandra.io_reapers.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2087,7 +2087,7 @@ spec:
port:
type: string
type: object
filters:
relabels:
items:
description: 'RelabelConfig allows dynamic rewriting of
the label set, being applied to samples before ingestion.
Expand Down
4 changes: 2 additions & 2 deletions config/crd/bases/stargate.k8ssandra.io_stargates.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2775,7 +2775,7 @@ spec:
port:
type: string
type: object
filters:
relabels:
items:
description: 'RelabelConfig allows dynamic rewriting
of the label set, being applied to samples before
Expand Down Expand Up @@ -3242,7 +3242,7 @@ spec:
port:
type: string
type: object
filters:
relabels:
items:
description: 'RelabelConfig allows dynamic rewriting of
the label set, being applied to samples before ingestion.
Expand Down
2 changes: 1 addition & 1 deletion docs/content/en/components/metrics-collector/_index.md
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,7 @@ Ingress or port forwarding can be used to expose access to the Prometheus and Gr

2. How can I filter out metrics I don't care about?

Please read the [metric-collector.yaml](https://github.com/datastax/metric-collector-for-apache-cassandra/blob/master/config/metric-collector.yaml) section in the MCAC GitHub repo on how to add filtering rules.
Please read the [metrics-collector.yaml](https://github.com/datastax/metric-collector-for-apache-cassandra/blob/master/config/metrics-collector.yaml) section in the MCAC GitHub repo on how to add filtering rules.

3. What is the datalog? And what is it for?

Expand Down
1 change: 0 additions & 1 deletion docs/content/en/tasks/monitor/_index.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,4 +3,3 @@ title: "Monitor K8ssandra"
linkTitle: "Monitor"
weight: 6
description: "Access tools to monitor your Apache Cassandra® cluster running in Kubernetes."
---
89 changes: 86 additions & 3 deletions pkg/telemetry/cassandra_agent/cassandra_agent_config.go
Original file line number Diff line number Diff line change
Expand Up @@ -5,12 +5,15 @@ import (
"path/filepath"
"time"

"github.com/adutra/goalesce"

cassdcapi "github.com/k8ssandra/cass-operator/apis/cassandra/v1beta1"
k8ssandraapi "github.com/k8ssandra/k8ssandra-operator/apis/k8ssandra/v1alpha1"
telemetryapi "github.com/k8ssandra/k8ssandra-operator/apis/telemetry/v1alpha1"
"github.com/k8ssandra/k8ssandra-operator/pkg/labels"
"github.com/k8ssandra/k8ssandra-operator/pkg/reconciliation"
"github.com/k8ssandra/k8ssandra-operator/pkg/result"
promapi "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/types"
Expand All @@ -21,10 +24,89 @@ import (
var (
agentConfigLocation = "/opt/management-api/configs/metrics-collector.yaml"
defaultAgentConfig = telemetryapi.CassandraAgentSpec{
Endpoint: telemetryapi.Endpoint{
Endpoint: &telemetryapi.Endpoint{
Port: "9000",
Address: "127.0.0.1",
},
Relabels: []promapi.RelabelConfig{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adejanovski can you please take a look at this to confirm that I've captured the intention behind the original filters?

Some metric names present quite differently (e.g. org.apache.cassandra.metrics.table.read doesn't appear directly anymore, now it seems to just be read_latency, similar with org.apache.cassandra.metrics.Table.LiveSSTableCount which now appears to be ... live_ss_table_count).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior looks different than what was done by the original filters.
Instead of dropping by default all org_apache_cassandra_metrics_table.* metrics, you're dropping everything that has a table label, and then allow specific metrics by changing the should_drop label value.

Based on my testing, none of the metrics with a table label survives these filters, despite the following rules.

With the default filters on, I can only see 3 lines when looking for for org_apache_cassandra_metrics_table_live_ss_table_count:

# HELP org_apache_cassandra_metrics_table_live_ss_table_count_all 
# TYPE org_apache_cassandra_metrics_table_live_ss_table_count_all gauge
org_apache_cassandra_metrics_table_live_ss_table_count_all{host="67140d82-7ae5-45a0-b3c8-cac99530fde2",instance="172.24.0.4",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-1",node_name="k8ssandra-0-worker2",} 12.0

But without the filters, I see 45 such lines. Here's a sample:

# HELP org_apache_cassandra_metrics_table_live_ss_table_count 
# TYPE org_apache_cassandra_metrics_table_live_ss_table_count gauge
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="built_views",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="view_builds_in_progress",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_traces",table="sessions",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_schema",table="aggregates",} 2.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="available_ranges",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="size_estimates",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_schema",table="dropped_columns",} 2.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_auth",table="role_members",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="transferred_ranges_v2",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="peers",} 2.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_schema",table="tables",} 2.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_auth",table="roles",} 3.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="peer_events_v2",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="peer_events",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="local",} 1.0

The above metrics should survive the filters, given the name of the metrics, but they don't 🤔
So, too many metrics are being dropped.

Switching the first relabeling to:

- regex: org_apache_cassandra_metrics_table.*
  replacement: "true"
  sourceLabels:
  - __name__
  targetLabel: should_drop

generates even weirder behaviors, since I now lose all the live_ss_table_count metrics, but end up with metrics that have both should_drop=true and should_drop=false 😆

org_apache_cassandra_metrics_table_all_memtables_live_data_size_all{host="af660e57-38c4-4863-af8a-31c08a4314dc",instance="172.24.0.4",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker2",should_drop="true",should_drop="false",} 125755.0

@burmanm, there seems to be something wrong with the relabelings here.
Could you tell us what you think?

{
SourceLabels: []string{"table"},
Regex: ".+",
TargetLabel: "should_drop",
Replacement: "true",
},
{
SourceLabels: []string{"__name__"},
Regex: "org_apache_cassandra_metrics_table_live_ss_table_count",
TargetLabel: "should_drop",
Replacement: "false",
},
{
SourceLabels: []string{"__name__"},
Regex: "org_apache_cassandra_metrics_table_live_disk_space_used",
TargetLabel: "should_drop",
Replacement: "false",
},
{
SourceLabels: []string{"__name__"},
Regex: "org_apache_cassandra_metrics_table_memtable.*",
TargetLabel: "should_drop",
Replacement: "false",
},
{
SourceLabels: []string{"__name__"},
Regex: "org_apache_cassandra_metrics_table_all_memtables.*",
TargetLabel: "should_drop",
Replacement: "false",
},
{
SourceLabels: []string{"__name__"},
Regex: "org_apache_cassandra_metrics_table_compaction_bytes_written",
TargetLabel: "should_drop",
Replacement: "false",
},
{
SourceLabels: []string{"__name__"},
Regex: "org_apache_cassandra_metrics_table_pending_compactions",
TargetLabel: "should_drop",
Replacement: "false",
},
{
SourceLabels: []string{"__name__"},
Regex: "org_apache_cassandra_metrics_table_read_.*",
TargetLabel: "should_drop",
Replacement: "false",
},
{
SourceLabels: []string{"__name__"},
Regex: "org_apache_cassandra_metrics_table_write_.*",
TargetLabel: "should_drop",
Replacement: "false",
},
{
SourceLabels: []string{"__name__"},
Regex: "org_apache_cassandra_metrics_table_range.*",
TargetLabel: "should_drop",
Replacement: "false",
},
{
SourceLabels: []string{"__name__"},
Regex: "org_apache_cassandra_metrics_table_coordinator_.*",
TargetLabel: "should_drop",
Replacement: "false",
},
{
SourceLabels: []string{"__name__"},
Regex: "org_apache_cassandra_metrics_table_dropped_mutations",
TargetLabel: "should_drop",
Replacement: "false",
},
{
SourceLabels: []string{"should_drop"},
Regex: "true",
Action: "drop",
},
},
}
)

Expand All @@ -42,7 +124,8 @@ func (c Configurator) GetTelemetryAgentConfigMap() (*corev1.ConfigMap, error) {
var yamlData []byte
var err error
if c.TelemetrySpec.Cassandra != nil {
yamlData, err = yaml.Marshal(&c.TelemetrySpec.Cassandra)
mergedSpec := goalesce.MustDeepMerge(&defaultAgentConfig, c.TelemetrySpec.Cassandra)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adejanovski and I have a slight problem with this. If the fields are merged, what's the ordering here? Can user override everything that's set in the defaults? How are they merged, before, after, what about duplicates?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirm that my expectation here is that any custom filter from users would override all the default filters. They shouldn't get merged with it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And that's exactly how this works. MustDeepMerge uses an atomic merge for slices by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I wasn't sure about this because Goalesce is very powerful and has a lot of different merge techniques.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, in this case I'd say the "Merge" is a bit of a confusing (sorry, insanely confusing) name for a method (and perhaps this is why this would require a comment to say we don't ever want to merge these despite calling a function to merge). It's not merging anything. It either returns a copy of defaultAgentConfig (if c.TelemetrySpec.Cassandra is nil) or a copy of c.TelemetrySpec.Cassandra (if it isn't nil), it never merges values from those two.

Now finding out the intended behavior requires to go through 4-5 jumps in files while reading which method to jump to next in the goalesce. And even then next one has no idea what's the intention - was it a bug that it didn't merge when one user asks "why isn't it merging". The fact that the behavior depends of the type (map or slice have different behaviors, so if for whatever reason someone changes the slices to be maps for some preprocessing requirements then the entire behavior will change) makes it even more difficult to remember these.

yamlData, err = yaml.Marshal(&mergedSpec)
if err != nil {
return &corev1.ConfigMap{}, err
}
Expand Down Expand Up @@ -79,7 +162,7 @@ func (c Configurator) ReconcileTelemetryAgentConfig(dc *cassdcapi.CassandraDatac
recRes := reconciliation.ReconcileObject(c.Ctx, c.RemoteClient, c.RequeueDelay, *desiredCm)
switch {
case recRes.IsError():
fallthrough
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@burmanm why did you change this to fallthrough? This will give you a Done result AFAIK, which is wrong if the ConfigMap reconciliation has failed.

return recRes
case recRes.IsRequeue():
return recRes
}
Expand Down
Loading