Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default metrics filters for new metrics agent #834

Merged

Conversation

Miles-Garnsey
Copy link
Member

What this PR does:

This PR hopefully replicates the previous metrics filters that applied to MCAC over to the new metrics agent's config file which we've built and mounted via this PR.

I haven't been able to track down docs on how the metrics relabelling/filtering works for the new agent, so I'm flying slightly blind here (if anyone knows where docs are please feel free to chime in).

Having said that, if it works similarly to Prometheus then I think this will be pretty close to what we need.

Which issue(s) this PR fixes:
Fixes #816

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

@codecov-commenter
Copy link

codecov-commenter commented Jan 30, 2023

Codecov Report

Merging #834 (45f435c) into main (98f7cc8) will decrease coverage by 0.23%.
The diff coverage is 11.11%.

❗ Current head 45f435c differs from pull request most recent head 00b97dc. Consider uploading reports for the commit 00b97dc to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #834      +/-   ##
==========================================
- Coverage   57.16%   56.94%   -0.23%     
==========================================
  Files          95       95              
  Lines        9257     9242      -15     
==========================================
- Hits         5292     5263      -29     
- Misses       3507     3523      +16     
+ Partials      458      456       -2     
Impacted Files Coverage Δ
...elemetry/cassandra_agent/cassandra_agent_config.go 21.87% <11.11%> (-40.16%) ⬇️
controllers/k8ssandra/stargate.go 63.77% <0.00%> (-2.37%) ⬇️
controllers/replication/secret_controller.go 65.43% <0.00%> (+1.00%) ⬆️
controllers/k8ssandra/schemas.go 75.92% <0.00%> (+1.38%) ⬆️
controllers/k8ssandra/reaper.go 62.42% <0.00%> (+1.81%) ⬆️

@Miles-Garnsey Miles-Garnsey force-pushed the feature/default-metrics-filters branch 3 times, most recently from 0aa528a to a3c1f0b Compare February 1, 2023 02:29
@Miles-Garnsey Miles-Garnsey force-pushed the feature/default-metrics-filters branch 2 times, most recently from 0f1b70d to dd8226a Compare February 14, 2023 00:29
@Miles-Garnsey Miles-Garnsey marked this pull request as ready for review February 14, 2023 00:32
@Miles-Garnsey Miles-Garnsey requested a review from a team as a code owner February 14, 2023 00:32
@Miles-Garnsey Miles-Garnsey force-pushed the feature/default-metrics-filters branch from 632c746 to a90c121 Compare February 14, 2023 00:33
@Miles-Garnsey
Copy link
Member Author

Miles-Garnsey commented Feb 15, 2023

I'm a bit concerned that we introduce a breaking change here. Apparently in the course of delivering the metrics agent work, the config field filters was changed to relabels.

As a result, I have updated the CR's API so that filters is now relabels but I think that API was in the prior release. We should check this and discuss prior to approving this PR.

I'm also shifting the Endpoint field in the spec over to use a pointer, since it is optional from the POV of the manifest. I'm less concerned about that change being breaking, but just taking note as it is a change to a public API.

},
{
SourceLabels: []string{"__origname__"},
Regex: "org\\.apache\\.cassandra\\.metrics\\.table.*",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric name does not exists in Cassandra.

},
{
SourceLabels: []string{"__origname__"},
Regex: "org\\.apache\\.cassandra\\.metrics\\.table\\.live_ss_table_count",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric name does not exists in Cassandra.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am building to the original MCAC filters where that metric name is filtered. Perhaps this metrics differs in name between Cassandra versions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

},
{
SourceLabels: []string{"__origname__"},
Regex: "org\\.apache\\.cassandra\\.metrics\\.table\\.live_disk_space_used",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric name does not exists in Cassandra.

},
{
SourceLabels: []string{"__origname__"},
Regex: "org\\.apache\\.cassandra\\.metrics\\.table\\.read",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric name does not exists in Cassandra.

},
{
SourceLabels: []string{"__origname__"},
Regex: "org\\.apache\\.cassandra\\.metrics\\.table\\.write",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric name does not exists in Cassandra.

},
{
SourceLabels: []string{"__origname__"},
Regex: "org\\.apache\\.cassandra\\.metrics\\.table\\.range",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric name does not exists in Cassandra.

},
{
SourceLabels: []string{"__origname__"},
Regex: "org\\.apache\\.cassandra\\.metrics\\.table\\.coordinator",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric name does not exists in Cassandra.

},
{
SourceLabels: []string{"__origname__"},
Regex: "org\\.apache\\.cassandra\\.metrics\\.table\\.dropped_mutations",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric name does not exists in Cassandra.

cm, err := Cfg.GetTelemetryAgentConfigMap()
println(cm.Data)
assert.NoError(t, err)
assert.Equal(t, expectedCm.Data["metric-collector.yaml"], cm.Data["metric-collector.yaml"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The correct filename is /configs/metrics-collector.yaml, notice the s

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That isn't what is specified here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's MCAC?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops... I'm glad I mentioned this since I need to change some documentation back as I think it refers to MCAC, not the new collector.

Thanks for clearing this up for me anyway, didn't realise this had also changed from the original PR.

@@ -43,7 +131,8 @@ func (c Configurator) GetTelemetryAgentConfigMap() (*corev1.ConfigMap, error) {
var yamlData []byte
var err error
if c.TelemetrySpec.Cassandra != nil {
yamlData, err = yaml.Marshal(&c.TelemetrySpec.Cassandra)
mergedSpec := goalesce.MustDeepMerge(&defaultAgentConfig, c.TelemetrySpec.Cassandra)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adejanovski and I have a slight problem with this. If the fields are merged, what's the ordering here? Can user override everything that's set in the defaults? How are they merged, before, after, what about duplicates?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirm that my expectation here is that any custom filter from users would override all the default filters. They shouldn't get merged with it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And that's exactly how this works. MustDeepMerge uses an atomic merge for slices by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I wasn't sure about this because Goalesce is very powerful and has a lot of different merge techniques.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, in this case I'd say the "Merge" is a bit of a confusing (sorry, insanely confusing) name for a method (and perhaps this is why this would require a comment to say we don't ever want to merge these despite calling a function to merge). It's not merging anything. It either returns a copy of defaultAgentConfig (if c.TelemetrySpec.Cassandra is nil) or a copy of c.TelemetrySpec.Cassandra (if it isn't nil), it never merges values from those two.

Now finding out the intended behavior requires to go through 4-5 jumps in files while reading which method to jump to next in the goalesce. And even then next one has no idea what's the intention - was it a bug that it didn't merge when one user asks "why isn't it merging". The fact that the behavior depends of the type (map or slice have different behaviors, so if for whatever reason someone changes the slices to be maps for some preprocessing requirements then the entire behavior will change) makes it even more difficult to remember these.

Port: "9000",
Address: "127.0.0.1",
},
Relabels: []promapi.RelabelConfig{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adejanovski can you please take a look at this to confirm that I've captured the intention behind the original filters?

Some metric names present quite differently (e.g. org.apache.cassandra.metrics.table.read doesn't appear directly anymore, now it seems to just be read_latency, similar with org.apache.cassandra.metrics.Table.LiveSSTableCount which now appears to be ... live_ss_table_count).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior looks different than what was done by the original filters.
Instead of dropping by default all org_apache_cassandra_metrics_table.* metrics, you're dropping everything that has a table label, and then allow specific metrics by changing the should_drop label value.

Based on my testing, none of the metrics with a table label survives these filters, despite the following rules.

With the default filters on, I can only see 3 lines when looking for for org_apache_cassandra_metrics_table_live_ss_table_count:

# HELP org_apache_cassandra_metrics_table_live_ss_table_count_all 
# TYPE org_apache_cassandra_metrics_table_live_ss_table_count_all gauge
org_apache_cassandra_metrics_table_live_ss_table_count_all{host="67140d82-7ae5-45a0-b3c8-cac99530fde2",instance="172.24.0.4",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-1",node_name="k8ssandra-0-worker2",} 12.0

But without the filters, I see 45 such lines. Here's a sample:

# HELP org_apache_cassandra_metrics_table_live_ss_table_count 
# TYPE org_apache_cassandra_metrics_table_live_ss_table_count gauge
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="built_views",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="view_builds_in_progress",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_traces",table="sessions",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_schema",table="aggregates",} 2.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="available_ranges",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="size_estimates",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_schema",table="dropped_columns",} 2.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_auth",table="role_members",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="transferred_ranges_v2",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="peers",} 2.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_schema",table="tables",} 2.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_auth",table="roles",} 3.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="peer_events_v2",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="peer_events",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="local",} 1.0

The above metrics should survive the filters, given the name of the metrics, but they don't 🤔
So, too many metrics are being dropped.

Switching the first relabeling to:

- regex: org_apache_cassandra_metrics_table.*
  replacement: "true"
  sourceLabels:
  - __name__
  targetLabel: should_drop

generates even weirder behaviors, since I now lose all the live_ss_table_count metrics, but end up with metrics that have both should_drop=true and should_drop=false 😆

org_apache_cassandra_metrics_table_all_memtables_live_data_size_all{host="af660e57-38c4-4863-af8a-31c08a4314dc",instance="172.24.0.4",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker2",should_drop="true",should_drop="false",} 125755.0

@burmanm, there seems to be something wrong with the relabelings here.
Could you tell us what you think?

@Miles-Garnsey
Copy link
Member Author

I've done some manual testing and this is ready for merge, subject to my interpretation of the original metrics filters (which I've commented on above) being right.

@Miles-Garnsey Miles-Garnsey force-pushed the feature/default-metrics-filters branch from 22fbc85 to ff4650c Compare February 20, 2023 01:04
---
Copy link
Member Author

@Miles-Garnsey Miles-Garnsey Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like these docs were removed in a previous PR, I'm not sure if that was a good idea since some folks still need to use MCAC given that older patch versions aren't supported by the new metrics agent.

Just getting thoughts here, not insisting it be left in.

@@ -79,7 +162,7 @@ func (c Configurator) ReconcileTelemetryAgentConfig(dc *cassdcapi.CassandraDatac
recRes := reconciliation.ReconcileObject(c.Ctx, c.RemoteClient, c.RequeueDelay, *desiredCm)
switch {
case recRes.IsError():
fallthrough
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@burmanm why did you change this to fallthrough? This will give you a Done result AFAIK, which is wrong if the ConfigMap reconciliation has failed.

@Miles-Garnsey
Copy link
Member Author

I have a few questions in my comments RE docs, and some changes that I didn't previously catch to the configmap reconciliation.

Beyond that, I believe that this now works.

---
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This content was intentionally moved to make this page a ToC, and spread the monitoring tasks over multiple pages: https://docs-staging.k8ssandra.io/tasks/monitor/

This doc is here now and has been updated to reflect the latest changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I'll delete it again, thanks!

Port: "9000",
Address: "127.0.0.1",
},
Relabels: []promapi.RelabelConfig{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior looks different than what was done by the original filters.
Instead of dropping by default all org_apache_cassandra_metrics_table.* metrics, you're dropping everything that has a table label, and then allow specific metrics by changing the should_drop label value.

Based on my testing, none of the metrics with a table label survives these filters, despite the following rules.

With the default filters on, I can only see 3 lines when looking for for org_apache_cassandra_metrics_table_live_ss_table_count:

# HELP org_apache_cassandra_metrics_table_live_ss_table_count_all 
# TYPE org_apache_cassandra_metrics_table_live_ss_table_count_all gauge
org_apache_cassandra_metrics_table_live_ss_table_count_all{host="67140d82-7ae5-45a0-b3c8-cac99530fde2",instance="172.24.0.4",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-1",node_name="k8ssandra-0-worker2",} 12.0

But without the filters, I see 45 such lines. Here's a sample:

# HELP org_apache_cassandra_metrics_table_live_ss_table_count 
# TYPE org_apache_cassandra_metrics_table_live_ss_table_count gauge
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="built_views",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="view_builds_in_progress",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_traces",table="sessions",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_schema",table="aggregates",} 2.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="available_ranges",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="size_estimates",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_schema",table="dropped_columns",} 2.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_auth",table="role_members",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="transferred_ranges_v2",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="peers",} 2.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_schema",table="tables",} 2.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system_auth",table="roles",} 3.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="peer_events_v2",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="peer_events",} 0.0
org_apache_cassandra_metrics_table_live_ss_table_count{host="1f8db44a-2fb4-4a65-8f60-82fdb75e3b07",instance="172.24.0.5",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker3",keyspace="system",table="local",} 1.0

The above metrics should survive the filters, given the name of the metrics, but they don't 🤔
So, too many metrics are being dropped.

Switching the first relabeling to:

- regex: org_apache_cassandra_metrics_table.*
  replacement: "true"
  sourceLabels:
  - __name__
  targetLabel: should_drop

generates even weirder behaviors, since I now lose all the live_ss_table_count metrics, but end up with metrics that have both should_drop=true and should_drop=false 😆

org_apache_cassandra_metrics_table_all_memtables_live_data_size_all{host="af660e57-38c4-4863-af8a-31c08a4314dc",instance="172.24.0.4",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker2",should_drop="true",should_drop="false",} 125755.0

@burmanm, there seems to be something wrong with the relabelings here.
Could you tell us what you think?

@Miles-Garnsey
Copy link
Member Author

Instead of dropping by default all org_apache_cassandra_metrics_table.* metrics, you're dropping everything that has a table label, and then allow specific metrics by changing the should_drop label value.

The former part of the solution was @burmanm's suggestion, the latter part is required due to the differences in the way allow/deny rules work in MCAC (additive) vs the new metrics agent (subtractive).

w.r.t. some metrics not appearing, that's exactly what I was hoping you could confirm @adejanovski. I do see table related metrics appearing, however, having just re-tested I've realised that you're right - the ones I can see don't have the table label set. They appear to be keyspace related metrics which are just named similarly to table related metrics (e.g. org_apache_cassandra_metrics_table_read_latency_all_bucket).

So it appears that something is indeed wrong.

@Miles-Garnsey Miles-Garnsey force-pushed the feature/default-metrics-filters branch from 9f59727 to 00b97dc Compare February 22, 2023 04:43
@sonarqubecloud
Copy link

SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 4 Code Smells

No Coverage information No Coverage information
6.4% 6.4% Duplication

@Miles-Garnsey
Copy link
Member Author

Coming back to this now that this PR has been merged.

The previous iteration of management api's new metrics agent had some issues where a capitalised metric name sometimes appeared, which meant that our default rules in this PR did not recognise the metric name via regex and failed to mark it with should_drop=false. This led to metrics being expectedly dropped.

I have manually tested and found that:

  1. The metrics we do not want dropped (e.g. org_apache_cassandra_metrics_table_live_ss_table_count for table="views") are not dropped: org_apache_cassandra_metrics_table_live_ss_table_count{host="9fb6377b-62ba-4c7d-a869-ff0c926036ed",instance="10.244.4.4",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker4",keyspace="system_schema",table="views",should_drop="false",} 1.0.
  2. Metrics we do want dropped (e.g. CompressionRatio at table level) do not appear in the metrics (see below grep to confirm):
$ curl localhost:9000/metrics | grep org_apache_cassandra_metrics_table_compression
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP org_apache_cassandra_metrics_table_compression_ratio
# TYPE org_apache_cassandra_metrics_table_compression_ratio gauge
org_apache_cassandra_metrics_table_compression_ratio{host="9fb6377b-62ba-4c7d-a869-ff0c926036ed",instance="10.244.4.4",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker4",} 0.24513263925651432
# HELP org_apache_cassandra_metrics_table_compression_metadata_off_heap_memory_used
# TYPE org_apache_cassandra_metrics_table_compression_metadata_off_heap_memory_used gauge
org_apache_cassandra_metrics_table_compression_metadata_off_heap_memory_used{host="9fb6377b-62ba-4c7d-a869-ff0c926036ed",instance="10.244.4.4",cluster="test",datacenter="dc1",rack="default",pod_name="test-dc1-default-sts-0",node_name="k8ssandra-0-worker4",} 112.0
100 6998k  100 6998k    0     0  29.4M      0 --:--:-- --:--:-- --:--:-- 29.5M

These tests are run with serverImage: docker.io/datastax/dse-mgmtapi-6_8:6.8.33-ubi7-v0.1.59.

I believe that makes this PR ready for merge.

@Miles-Garnsey Miles-Garnsey merged commit a7cbbbc into k8ssandra:main Mar 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Port MCAC default filters to the new metrics endpoint
4 participants