Remove timestamp from the metrics #43

Open
prabinsh opened this issue Mar 30, 2021 · 13 comments

@prabinsh

The timestamp in the metrics is 2 hours behind the system time.

# HELP collectd_collectd_cache_size write_prometheus plugin: 'collectd' Type: 'cache_size', Dstype: 'gauge', Dsname: 'value'
# TYPE collectd_collectd_cache_size gauge
collectd_collectd_cache_size{collectd="cache",instance="10.0.1.1",cluster="CassCluster",dc="DAL",rack="rack1"} 11969 1617137796120

Here's the system time and the time the timestamp translates to:

$ date -d @1617137796
Tue Mar 30 13:56:36 GMT+7 2021
$ date
Tue Mar 30 15:39:22 GMT+7 2021
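
(Note that the trailing value in the exposition format is a Unix timestamp in milliseconds, so it has to be divided by 1000 before being passed to date. A quick sketch using the sample above:)

$ # strip the milliseconds from the metric's trailing timestamp, then convert to UTC
$ date -u -d @$((1617137796120 / 1000))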

The time reported in the metric is 2 hours behind, and I can't figure out how to disable the timestamp in the metrics.

This causes the following error when scraping in Prometheus:

msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=51908
@MattFellows

I've also got a similar issue...
Our k8ssandra nodes ran out of disk space. We fixed that, but ever since, we've had no Grafana metrics from k8ssandra. I've restarted, deleted, and recreated every pod and ServiceMonitor, and removed countless directories/caches, but it keeps happening. The timestamps are out by about 5 minutes immediately after deleting mcac_data and restarting, then get older and older until they are about 4 hours old, then start moving forwards again...

Any advice or help about what to grab for diagnosis would be great, but this really feels like a bug of some sort, induced by an unexpected state...

@kenjaix

kenjaix commented Aug 16, 2021

Same here.

I'm getting the following metrics with timestamps from 2 months ago:

collectd_tcpconns_tcp_connections{tcpconns="9999-local",type="SYN_SENT",instance="172.17.47.22",cluster="V2",dc="F1",rack="D1"} 0 1624932662653
collectd_uptime{instance="172.17.47.22",cluster="V2",dc="F1",rack="D1"} 10100889 1624932662650
collectd_vmem_vmpage_action_total{vmem="dirtied",instance="172.17.47.22",cluster="V2",dc="F1",rack="D1"} 20647291651 1624932662647

1624932662650
GMT: Tuesday, June 29, 2021 2:11:02.650 AM
Relative: 2 months ago

This causes Prometheus to drop those metrics. I'm not sure why MCAC doesn't update the timestamp.

Please advise.

@tah-mas

tah-mas commented Aug 20, 2021

We have the same issue. It was working fine, but after leaving it for a couple of days, MCAC is reporting the wrong time, causing Prometheus to fail:
level=warn ts=2021-08-20T15:29:17.688Z caller=scrape.go:1375 component="scrape manager" scrape_pool=k8ssandra/k8ssandra-prometheus-k8ssandra/0 target=http://xxxxx:9103/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=205

Please fix.

@adejanovski
Collaborator

I'm unable to reproduce the issue on GKE. I've let the cluster run for a few days and Prometheus isn't complaining about metrics that are too old.
Could you compare the clocks in the Prometheus container and in the Cassandra containers to see if there's any drift? Same question for the clocks on all the K8s worker nodes, to check that they're in sync.
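
For reference, a quick way to run that comparison (the namespace, pod, and container names below are placeholders; substitute whatever your deployment uses):

$ # clock inside the Prometheus container
$ kubectl exec -n monitoring prometheus-k8ssandra-0 -c prometheus -- date -u
$ # clock inside the Cassandra container running the MCAC agent
$ kubectl exec -n k8ssandra my-cluster-dc1-default-sts-0 -c cassandra -- date -u
$ # clock on the worker node you're logged into, for comparison
$ date -u

If these all agree, the drift is coming from the metrics pipeline rather than from the system clocks.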

@tah-mas

tah-mas commented Sep 27, 2021

Hi @adejanovski, there's no drift: both the Prometheus and Cassandra containers report the same time (UTC). I did notice that with the fix for 'out-of-order timestamps' (#969), I had no problems with the timestamps as long as I had a smaller number of tables (~100) in the DB. After our production upgrade, I now have 326 tables spread across keyspaces and the problem has reappeared. Our dev env also has a similar number of tables, so it appears this happens when you have a large number of tables in your DB, but that is just an observation...

@adejanovski
Collaborator

Hi @tah-mas,

That's an interesting observation. Each table comes with a large set of metrics, which could mean they take too long to process and end up being ingested only once they're outside the accepted timestamp range.
The solution would be to filter out some metrics so that we reduce the overall volume. I'm not even sure table-specific metrics are used in the current set of dashboards.
I'll investigate to see how easily this could be achieved.

@tah-mas

tah-mas commented Sep 28, 2021

Thank you @adejanovski! Much appreciated

@eriksw

eriksw commented Dec 31, 2021

@adejanovski Any ETA on making the default config usable?

We just switched from the instaclustr exporter to MCAC and are winding up with no metrics/blank dashboards from our main cluster due to this issue, despite it working fine on a smaller cluster with fewer tables.

@adejanovski
Collaborator

Hi @eriksw,

We actually merged the changes a while ago to let you filter metrics more easily. Check this commit for some examples.
Let me know how this works for you.
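
For reference, the rules take a policy/pattern/scope form; a minimal sketch that simply drops all table-level metrics (the patterns mirror the ones in the rule set posted in the next comment) looks like this:

filtering_rules:
  - policy: deny
    pattern: org.apache.cassandra.metrics.Table
    scope: global
  - policy: deny
    pattern: org.apache.cassandra.metrics.table
    scope: global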

@eriksw

eriksw commented Jan 3, 2022

@adejanovski Glad to see some rules documented here! I had looked around and found https://github.com/k8ssandra/k8ssandra/pull/1149/files and derived the following rule set:

filtering_rules:
  - policy: deny
    pattern: org.apache.cassandra.metrics.Table
    scope: global
  - policy: deny
    pattern: org.apache.cassandra.metrics.table
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.live_ss_table_count
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.Table.LiveSSTableCount
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.live_disk_space_used
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.LiveDiskSpaceUsed
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.Table.Pending
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.Table.Memtable
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.Table.Compaction
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.read
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.write
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.range
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.coordinator
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.dropped_mutations
    scope: global

The bad news: with those rules, our main cluster still ran into wildly out-of-date metric timestamps and all the other issues from #39.

Has MCAC ever been used in actual production on a cluster with >300 tables on 60 nodes? If so, how?

@ducnm0711

ducnm0711 commented Feb 21, 2022

Hi everyone

The rate of Prometheus out-of-order-sample warnings did indeed decrease with the setup above.

Increasing metric_sampling_interval_in_seconds to 120 also helps a bit.

I went from getting a scrape warning every minute to one every 3-4 minutes.

I'm testing MCAC on a 3-node cluster with 100+ tables.
Prometheus/ServiceMonitor are deployed in the k8s cluster.
Cassandra runs on VM instances.
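
For reference, the two settings mentioned above side by side (a sketch; both are assumed to live in the MCAC collector config file, whose exact name and location depend on how MCAC is deployed):

# assumed layout of the MCAC collector config; adjust to your deployment
metric_sampling_interval_in_seconds: 120
filtering_rules:
  - policy: deny
    pattern: org.apache.cassandra.metrics.Table
    scope: global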

@raskar7

raskar7 commented May 6, 2022

Hi everyone
We're having the same issue.
The MCAC exporter metrics are timestamped 2 hours in the past compared to our current time in France (UTC+2). All servers are NTP-synced.
So I think the exporter gets the time from Cassandra rather than from the system.
If there's no way to configure it, would the simplest workaround be to change the Prometheus server's timezone to match UTC?
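
(Side note: Prometheus stores and evaluates everything in UTC internally, so changing the server's timezone is unlikely to affect ingestion. A scrape-side option that does exist is honor_timestamps, which tells Prometheus to discard exporter-supplied timestamps and use the scrape time instead. A sketch against a plain scrape config, reusing the target format from the logs above:)

scrape_configs:
  - job_name: 'mcac'
    honor_timestamps: false   # ignore the timestamps MCAC exposes; use the scrape time
    static_configs:
      - targets: ['xxxxx:9103']

(With the prometheus-operator, the equivalent field on a ServiceMonitor endpoint is honorTimestamps: false.)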

@jsanda
Contributor

jsanda commented Aug 7, 2022

@Miles-Garnsey can you investigate this? Could this be related to #73?

Miles-Garnsey self-assigned this Aug 10, 2022