Separating Metrics Reporting Responsibilities Between CNCF Project and TAG Environmental Sustainability Initiative #14

incertum · 2024-02-07T23:39:52Z

incertum
Feb 7, 2024
Maintainer

Since Falco is the first project to onboard to the TAG Environmental Sustainability Green Reviews Initiative, there is an opportunity to discuss the metrics reporting responsibilities to lay the foundation for organic growth of the initiative.

Proposing the following based on previous discussions:

Project: (If applicable) reports custom internal metrics to a Green Reviews-hosted Prometheus. The project assists in creating a meaningful Grafana dashboard.

Green Review:

Kepler energy metrics for all project namespaces are reported to Prometheus and visualized through a Grafana dashboard accessible by the projects.
Traditional SRE Metrics: Even if the project is capable of supplying similar metrics, the Green Review team uniformly collects traditional SRE metrics across all namespaces. This complements the Kepler energy metrics and makes the impact on resource utilization easier to understand. The Green Review team manages this deployment. One option could be cAdvisor as a DaemonSet feeding into Prometheus, which then feeds into a Grafana dashboard accessible by the projects.

References:

https://github.com/cncf-tags/green-reviews-tooling/
Kepler project: https://github.com/sustainable-computing-io/kepler
SRE Metrics: Traditional metrics related to CPU usage and memory usage, as outlined on Falco's page https://falco.org/docs/metrics/performance/, or check out https://github.com/google/cadvisor, plus more universally useful metrics (TBD)

CC @AntonioDiTuri @nikimanoledaki @rossf7

nikimanoledaki · 2024-02-08T08:49:21Z

nikimanoledaki
Feb 8, 2024

Project: (If applicable) reports custom internal metrics to a Green Reviews-hosted Prometheus. The project assists in creating a meaningful Grafana dashboard.

+1 for exporting the kernel event rate and other useful metrics about Falco to Prometheus.

Specifically, it would be great if Falco Maintainers could create a panel to visualise the kernel event rate. This could be achieved with a Prometheus query that applies the PromQL rate function to the kernel event metric exposed to Prometheus, e.g.: rate(falco_kernel_event{container="falco"}[1m])

Note: container="falco" refers to the container name.
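
For intuition, what a rate over a counter computes can be sketched in a few lines of Python. This is a simplified illustration and not how Prometheus implements `rate()`: it handles counter resets but skips the extrapolation to the window boundaries that Prometheus performs. The sample values and the 15-second scrape interval are made up.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Approximate per-second increase of a counter over the given samples.

    Simplified relative to PromQL's rate(): counter resets are handled,
    but no extrapolation to the window boundaries is done.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")

    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # Counter reset (e.g. process restart): counting restarted from 0.
            increase += curr

    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return increase / elapsed


# Hypothetical kernel event counter scraped every 15 seconds for one minute.
window = [(0.0, 1000.0), (15.0, 2500.0), (30.0, 4000.0), (45.0, 5500.0), (60.0, 7000.0)]
events_per_second = simple_rate(window)
```

A dashboard panel would plot this per-second value over time, which is why the query wraps the raw counter in a rate rather than graphing the ever-growing total.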

One option could be cAdvisor as a DaemonSet feeding into Prometheus

I would like to propose a different approach to consider instead of cAdvisor for surfacing conventional SRE metrics: let's first evaluate the metrics available from kube-state-metrics.

One benefit is that this is a Kubernetes project, so it's consistent with the WG's goal of creating an architectural reference with tools from the CNCF ecosystem.

I propose that we use the metrics surfaced by ksm that are exported to Prometheus and then use these metrics + PromQL to create meaningful Prometheus queries that are then visualised in the panels of our Grafana dashboard.

0 replies

AntonioDiTuri · 2024-02-14T12:31:58Z

AntonioDiTuri
Feb 14, 2024

Could you give us a list of the SRE metrics you would like to be monitored for the Falco project?
Are these metrics sufficient?

CPU usage: Typically measured as a percentage of one CPU, it can be compared with the number of available CPUs on the host. Falco's hot path is single-threaded, so it should not be able to exceed the capacity of one full CPU.
Memory RSS: Resident Set Size is the portion of memory held in RAM by a process.
Memory VSZ: Virtual Memory Size is the total memory allocated to a process, including both RAM and swap space.
container_memory_working_set_bytes in Kubernetes settings: This is almost equivalent to the cgroups container memory_used metric natively exposed in Falco metrics.
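
To make the "percentage of one CPU" definition concrete, here is a minimal Python sketch of the arithmetic, assuming CPU time is available as a cumulative seconds counter sampled at two points in time. The function names and the numbers are illustrative only and are not part of Falco or cAdvisor.

```python
def cpu_percent_of_one_core(cpu_seconds_start: float,
                            cpu_seconds_end: float,
                            wall_seconds: float) -> float:
    """CPU usage as a percentage of one CPU over a wall-clock interval."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return 100.0 * (cpu_seconds_end - cpu_seconds_start) / wall_seconds


def share_of_host(cpu_percent: float, num_cpus: int) -> float:
    """Fraction (0..1) of the host's total CPU capacity that was used."""
    if num_cpus <= 0:
        raise ValueError("num_cpus must be positive")
    return cpu_percent / (100.0 * num_cpus)


# Illustrative numbers: 3 CPU-seconds consumed during a 60-second window
# on a host with 8 CPUs.
percent = cpu_percent_of_one_core(120.0, 123.0, 60.0)
host_share = share_of_host(percent, 8)
```

With these numbers the process uses 5% of one CPU, which is well under 1% of the 8-CPU host. A single-threaded hot path tops out at 100% by this measure regardless of how many CPUs the host has.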

On this comment you also provided some other metrics: #11 (comment)

The mapping is not straightforward to me. I guess:
falco.cpu_usage_perc -> CPU Usage

But I am not sure about the others. Could you help me figure this out?

2 replies

incertum Feb 14, 2024
Maintainer Author

CPU and memory (e.g. container_memory_working_set_bytes) usages should be a good start, see https://github.com/google/cadvisor.

Falco's internal metrics could perhaps be treated separately, a bonus so to say.
falco.cpu_usage_perc -> CPU Usage: This is correct. It is the CPU usage of the Falco binary itself, just as if you ran ps on the machine manually. We do not need it from Falco and should instead rely on a formal external framework as suggested above (e.g. https://github.com/google/cadvisor should provide the exact equivalent, and the best part is that it would work for any project deployed via containers). WDYT?

incertum Feb 14, 2024
Maintainer Author

SRE metrics you would like to be monitored for the Falco project

I would propose defining a standard for the entire initiative, i.e. for each project to be onboarded beyond Falco.

mickael-carl · 2024-02-19T22:25:22Z

mickael-carl
Feb 19, 2024

Hi there! So it looks like this repo already has a full kube-prometheus stack deployed (judging from the HelmRelease here).

The interesting thing is that kube-prometheus includes:

kube-state-metrics
cadvisor
scraping configuration for both
dashboards for compute usage across a cluster, namespace and workloads

It'd be worth clarifying what the intent is here.
Do we want to measure compute (and other resources) usage (on top of what Kepler reports)? If so, then we can use what's already there.
The main issue I see right now is that it's very hard for anyone to contribute meaningfully without some form of access to:

cluster (there is a note to ping the TAG leads, great!)
Prometheus (ingress is disabled right now, so it's not accessible)
Grafana (ingress is enabled, how to access is not documented though)

One more point worth mentioning is that the current configuration is a bit opaque. It's a great start in that all pieces are deployed already, but it's coming at the cost of granularity: by that I mean that we take an entire HelmChart or repo and deploy that, without really having a sense of what gets deployed. It might be worth considering rendering charts and manifests and committing the output, so as to understand what resources are actually there.

Hopefully that helps move this issue forward! 🙂 I'd be happy to contribute a couple of PRs if necessary!

6 replies

rossf7 Feb 21, 2024

Hi @mickael-carl and @incertum,

So it looks like this repo already has a full kube-prometheus stack deployed

Yes, the full stack is deployed and can be used as the source for non Falco specific metrics.

Do we want to measure compute (and other resources) usage (on top of what Kepler reports)?

Our highest priority is the energy metrics from Kepler, which @raymundovr is working on, but we do see the need to have some widely used metrics like CPU and memory to correlate with the energy metrics.

We've already identified container_memory_working_set_bytes. It would help us a lot if you could identify a small set (like 4 or 5) of metrics you would like to see?

Access is certainly a challenge and something we continue to work on. However, we're a fairly small team and also doing prep for KubeCon EU.

There are now docs for accessing Grafana and we can certainly add an ingress for Prometheus. The access is still WIP as we don't have a domain set up yet.

https://github.com/cncf-tags/green-reviews-tooling/tree/main/docs/infrastructure#monitoring

mickael-carl Feb 21, 2024

That's helpful thanks!

We've already identified container_memory_working_set_bytes. It would help us a lot if you could identify a small set (like 4 or 5) of metrics you would like to see?

That's what I'm trying to say, that's basically already there, see this dashboard.

Do we need more data than what's already available in that Grafana instance?

rossf7 Feb 21, 2024

Hi @mickael-carl,
thank you! That dashboard is indeed a good fit for what we need.

It would still be useful to have a proposal with the metric names for the PromQL queries. We can also take a look, but we have a lot of plates spinning pre-KubeCon, so that may delay things.

On access I took a look at enabling the ingress for Prometheus but we can't have both it and Grafana available. Once we have DNS records that should be doable. I'm working on that for cncf-tags/green-reviews-tooling#31

incertum Feb 22, 2024
Maintainer Author

@mickael-carl and I synced internally and Mickael will post the list tomorrow his time (also thanks @mickael-carl for all of your help).

Once we have DNS records that should be doable.

@rossf7 thanks a bunch for sharing the current blockers, much appreciated.

incertum Feb 27, 2024
Maintainer Author

@mickael-carl will get back next week or later. Meanwhile posting some raw notes:

Let's start with 3 SRE Metrics that shall be reported for each project in addition to the Kepler energy metrics:

rate(container_cpu_usage_seconds_total[5m])
container_memory_rss
container_memory_working_set_bytes

I suspect the Grafana dashboard will allow interactively selecting the namespace and pod (and possibly container, but right now the Falco pod has just one container anyway), etc.

CPU usage -> rate(container_cpu_usage_seconds_total[5m]) - There seem to be many references online to help decide on the final query.
container_memory_rss -> cAdvisor doesn't follow the naming convention here and omits the unit suffix, but the unit is still bytes (not kB, as the Linux kernel handles it internally), type: gauge.
container_memory_working_set_bytes -> pretty clear, we already agreed on this one, type: gauge.

Last I spoke to @mickael-carl the default dashboards best integrate with bytes wrt memory metrics, but we can also discuss if we want to add conversions to MB into the queries. Either way works for us.
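
As a rough sketch of how these three queries could be parameterised per project, the Python below builds the PromQL strings with label selectors and an optional memory unit conversion. It assumes the cAdvisor series carry `namespace` and `pod` labels (as is typical when scraped via the kubelet, but worth verifying on the actual cluster); the `falco` namespace, the pod name, the helper names, and the choice of MiB (1024 * 1024 bytes) for the conversion are all illustrative assumptions.

```python
from typing import Dict, Optional

BYTES_PER_MIB = 1024 * 1024  # assumed divisor for the optional conversion


def selector(namespace: str, pod: Optional[str] = None) -> str:
    """Build a PromQL label selector such as {namespace="falco"}."""
    labels = {"namespace": namespace}
    if pod:
        labels["pod"] = pod
    inner = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    return "{" + inner + "}"


def sre_queries(namespace: str,
                pod: Optional[str] = None,
                memory_in_mib: bool = False,
                window: str = "5m") -> Dict[str, str]:
    """Return the three proposed SRE metric queries as PromQL strings."""
    sel = selector(namespace, pod)
    mem_suffix = f" / {BYTES_PER_MIB}" if memory_in_mib else ""
    return {
        "cpu_usage_cores": f"rate(container_cpu_usage_seconds_total{sel}[{window}])",
        "memory_rss": f"container_memory_rss{sel}{mem_suffix}",
        "memory_working_set": f"container_memory_working_set_bytes{sel}{mem_suffix}",
    }


# Hypothetical usage for a project deployed in a "falco" namespace.
for name, query in sre_queries("falco", memory_in_mib=True).items():
    print(f"{name}: {query}")
```

In Grafana the namespace and pod would more likely come from dashboard variables than from generated strings, so this is mainly a way to pin down the exact query shapes under discussion.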

@raymundovr @rossf7 is this enough information to unblock SRE metrics reporting by the Green Review WG?

incertum · 2024-05-09T17:18:08Z

incertum
May 9, 2024
Maintainer Author

@rossf7 could we mark this discussion as concluded? See my last comment #14 (reply in thread).

0 replies

rossf7 · 2024-05-10T11:01:16Z

rossf7
May 10, 2024

could we mark this discussion as concluded?

Hi @incertum, yes sure. I've added a note so we include this when writing the proposal for collecting metrics.
cncf-tags/green-reviews-tooling#83 (comment)

1 reply

incertum May 10, 2024
Maintainer Author

Thanks, we can always re-open this discussion.
