Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New telemetry agent config #813

Merged
merged 35 commits into from
Feb 7, 2023

Conversation

Miles-Garnsey
Copy link
Member

@Miles-Garnsey Miles-Garnsey commented Jan 17, 2023

What this PR does:

Adds configuration functionality for the new telemetry agent.

Which issue(s) this PR fixes:
Fixes #815

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

@Miles-Garnsey Miles-Garnsey force-pushed the feature/metrics-config branch from 43425c8 to 77a122a Compare January 18, 2023 00:56
@Miles-Garnsey
Copy link
Member Author

Miles-Garnsey commented Jan 20, 2023

I had a few questions on this one:

  1. It looks like the new metrics config doesn’t conform to the Prometheus RelabelConfig API as expected. It uses source_labels instead of sourceLabels. @burmanm has confirmed that we need to update this so it is camelcased. PR here.
  2. I wrote some code to unwind the PodTemplateSpec configuration when Spec.Cassandra.Telemetry.Cassandra is nil, but I think that (as the new agent is always enabled) this code has no purpose. Is there any case in which we don't run the new agent or might want to remove it's ConfigMaps?
  3. The types in this part of the codebase have become problematic, and so is the original ticket. At some point, folks have added MCAC and Vector fields into the TelemetrySpec. This does not make sense because Mcac only applies when the TelemetrySpec is under the Spec.Cassandra field (but TelemetrySpec is used for Reaper and Stargate too). I'm not sure about Vector, but it probably needs more thought too.

I propose that:

  1. I fix the field name on source_labels (changing to sourceLabels).
  2. I remove the RemoveStsVolumes function, and always add the configmap in reconciliation.
  3. We split the types out so that Mcac and Cassandra fields (and Vector?) become embedded structs under a new types CassandraTelemetrySpec.

@adejanovski
Copy link
Contributor

I fix the field name on source_labels (changing to sourceLabels).

Sounds good

I remove the RemoveStsVolumes function, and always add the configmap in reconciliation.

Sure, I don't see a reason to remove the volume.

We split the types out so that Mcac and Cassandra fields (and Vector?) become embedded structs under a new types CassandraTelemetrySpec.

Vector applies to Reaper and Stargate as well.
I'm not keen on breaking stuff, knowing that we're probably not going to write a converting webhook.
So the Cassandra struct can move if you feel like it makes more sense, but the MCAC one should stick around, knowing that we'll fully remove it in the near future (We're not gonna keep MCAC for very long, just the time to transition).

@Miles-Garnsey
Copy link
Member Author

Miles-Garnsey commented Jan 20, 2023

I'm not keen on breaking stuff, knowing that we're probably not going to write a converting webhook.

This doesn't break anything, the yaml stays the same for Spec.Cassandra.Telemetry. The only breakage will occur if users have been adding the Mcac field to their telemetry specs for Reaper/Stargate, in which case that configuration is actually unsupported/broken anyway.

@adejanovski
Copy link
Contributor

I'm not keen on breaking stuff, knowing that we're probably not going to write a converting webhook.

This doesn't break anything, the yaml stays the same. The only breakage will occur if users have been adding the Mcac field to their telemetry specs for Reaper/Stargate, in which case that configuration is actually unsupported/broken anyway.

ok then, if there's no breakage then you can go nuts.

return nil
}

func (c Configurator) AddStsVolumes(dc *cassdcapi.CassandraDatacenter) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you reuse AddOrUpdateVolume() which is in pkg/cassandra/datacenter.go ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a close look at this and wasn't quite sure that it was appropriate. I'm open minded here, but it seems that the function you're referring to works on the datacenterConfig type, whereas we currently work directly on the cassDC and hook into the reconciler at that level.

If you want me to investigate a change of this nature in more detail, I'm happy to do so, but it is a fairly deep re-write.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The advantage is that using that function does overwrite any existing Volume which might already exist. But also, we don't usually provide guarantees in our controllers that a manually modified object will be self-healed back to the desired state. This is in line with in-built k8s controllers like the STS controller.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that the function you're referring to works on the datacenterConfig type

Right, and it actually makes no sense, we could have it work on a PodTemplateSpec.
I think we have an in flight PR that will actually change the function to get it to work on the PodTemplateSpec, which will allow to use it for Deployments as well.
Not a blocker for me, let's refactor it in the future.

@Miles-Garnsey
Copy link
Member Author

I don't quite know what's happening here, but for some reason using sourceLabels doesn't work. The Prometheus APIs appear to be serialising that field into sourcelabels instead. That's despite the fact that the json tag is definitely sourceLabels for that field (see here).

@burmanm I might need to update your metrics agent again I'm sorry :(

@Miles-Garnsey Miles-Garnsey force-pushed the feature/metrics-config branch 3 times, most recently from e910a8a to ecf5348 Compare January 27, 2023 07:08
@Miles-Garnsey
Copy link
Member Author

Can I draw any reviewer's attention to this block, which is the default location for the configuration file, and the default configuration itself.

Is this the way we want it set up?

var (
	agentConfigLocation = "/config/metric-collector.yaml"
	defaultAgentConfig  = telemetryapi.CassandraAgentSpec{
		Endpoint: telemetryapi.Endpoint{
			Port:    "9001",
			Address: "127.0.0.1",
		},
		Filters: []promapi.RelabelConfig{
			{
				SourceLabels: []string{"__tag1__", "__tag2__"},
				Separator:    ";",
				Regex:        "(.*);(b.*)",
				Action:       "drop",
			},
			{
				SourceLabels: []string{"__tag1__", "__tag2__"},
				Separator:    ",",
				Regex:        "^(a|b|c),.*",
				Action:       "drop",
			},
		},
	}
)

@Miles-Garnsey Miles-Garnsey marked this pull request as ready for review January 30, 2023 00:45
@Miles-Garnsey Miles-Garnsey requested a review from a team as a code owner January 30, 2023 00:45
@codecov-commenter
Copy link

codecov-commenter commented Jan 30, 2023

Codecov Report

Merging #813 (9e49ffd) into main (5789b38) will increase coverage by 0.35%.
The diff coverage is 79.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #813      +/-   ##
==========================================
+ Coverage   56.30%   56.65%   +0.35%     
==========================================
  Files          96       98       +2     
  Lines        9547     9671     +124     
==========================================
+ Hits         5375     5479     +104     
- Misses       3685     3703      +18     
- Partials      487      489       +2     
Impacted Files Coverage Δ
...ollers/k8ssandra/cassandra_telemetry_reconciler.go 65.00% <ø> (ø)
pkg/telemetry/test_objects.go 0.00% <0.00%> (ø)
...elemetry/cassandra_agent/cassandra_agent_config.go 82.82% <82.82%> (ø)
controllers/k8ssandra/datacenters.go 73.61% <85.00%> (+0.74%) ⬆️
controllers/config/clientconfig_controller.go 81.39% <0.00%> (-0.78%) ⬇️
controllers/k8ssandra/schemas.go 75.70% <0.00%> (+1.40%) ⬆️
controllers/k8ssandra/reaper.go 62.42% <0.00%> (+1.81%) ⬆️

@Miles-Garnsey
Copy link
Member Author

This is ready to go as far as I can see. The only areas where we may want changes are:

  • Using cassandra.AddOrUpdateVolume in the reconciliation logic. This does change the locations where we'd hook our reconciliation logic into the reconcilers quite substantially though and involves a bit of rework. If there are benefits or this is the way it is done for the rest of the reconciliation I'm happy to make the change.
  • I've modified an interface here to enable smoother integration of reconciliation logic further away from the reconcilers themselves (passing Result types around in pkg modules seems to lead to better modularity). But the alternative would have been to use a case switch statement on the returned result type, which would resulted in less overall change to the existing Result interfaces. I'm open to doing it this way instead if preferred.

@Miles-Garnsey Miles-Garnsey force-pushed the feature/metrics-config branch 3 times, most recently from 210bde0 to 38b4167 Compare January 30, 2023 06:33
Copy link
Contributor

@adejanovski adejanovski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may need to rebase again to get the fix on the integration tests.
The PR looks good, there are just a few comments to remove and a couple minor comments.

}
cmObjectKey := types.NamespacedName{Name: c.Kluster.Name + "-metrics-agent-config",
Namespace: c.Kluster.Namespace}
labels.SetManagedBy(desiredCm, cmObjectKey)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's no need to set the object as "Managed by" if it's "part of" the cluster.
Managed by is used to monitor externally created objects so that reconcile triggers when they are modified.
Here the object is definitely created by the operator. Right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adejanovski yes the object is definitely created by the operator. Once the tests are passing I'll try removing the "Managed by" labels, and see if the tests keep passing.

return nil
}

func (c Configurator) AddStsVolumes(dc *cassdcapi.CassandraDatacenter) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that the function you're referring to works on the datacenterConfig type

Right, and it actually makes no sense, we could have it work on a PodTemplateSpec.
I think we have an in flight PR that will actually change the function to get it to work on the PodTemplateSpec, which will allow to use it for Deployments as well.
Not a blocker for me, let's refactor it in the future.

@Miles-Garnsey
Copy link
Member Author

ClusterScoped/MultiDcMultiCluster is failing consistently, I can't test locally because make multi-create errors out.

@adejanovski do you want me to just risk merging this and see what happens? Or should I try to figure out some way to debug manually?

@Miles-Garnsey Miles-Garnsey force-pushed the feature/metrics-config branch 5 times, most recently from 056d92f to 51dc62c Compare February 1, 2023 04:51
Copy link
Contributor

@adejanovski adejanovski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Miles-Garnsey, please merge the PR after correcting the default path for the metrics config file.

pkg/telemetry/cassandra_agent/cassandra_agent_config.go Outdated Show resolved Hide resolved
@Miles-Garnsey Miles-Garnsey force-pushed the feature/metrics-config branch from 81ccfe5 to 427a4df Compare February 6, 2023 01:28
@Miles-Garnsey
Copy link
Member Author

(From offline) We hit an issue where older versions of Cassandra caused test failures because jackson XML's methods changed between versions, and different Cassandra 3.11.x versions used different Jackson versions.

To move things forward, I have updated all tests to use Cassandra 3.11.14, which should be compatible with the new metrics agent.

I will leave it to the rest of the team (ping @burmanm and @adejanovski) to decide whether(a) we want to continue supporting older Cassandra versions with the new metrics agent, or whether (b) we will make the call that older patch versions of Cassandra cannot benefit from the new agent. In general, the changes in this PR will be compatible with either whatever scenario (since we are just adding a file to a directory in /opt) but scenario (b) requires that the metrics agent doesn't try to start on older Cassandra / management-api images.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Feb 7, 2023

SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 17 Code Smells

No Coverage information No Coverage information
8.0% 8.0% Duplication

@Miles-Garnsey Miles-Garnsey merged commit 6201418 into k8ssandra:main Feb 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a Cassandra struct in the Telemetry spec for the new metrics endpoint
3 participants