
Possible concurrency issue w/ multiple prometheus scrapers #38

Open
briend opened this issue Jun 26, 2023 · 0 comments
briend commented Jun 26, 2023

I noticed random errors in our ceph rgw bucket usage exporter, which show up as occasional gaps in our metrics or Target Down alerts. We scrape every 2 minutes and run HA Prometheus (two instances), so it is possible that both instances scrape the metrics endpoint at the same time. Since a scrape of our environment takes about 5-6 seconds, an overlap is fairly likely to happen.

Below you can see the first scrape crash with a KeyError, even though mybucket definitely exists. It seems the second Prometheus instance hit GET /admin/usage/ at the same time the first instance crashed. This doesn't happen every time, and we do usually get metrics for mybucket, which makes me strongly suspect it is a concurrency issue. The error is also not limited to mybucket; it can happen for any bucket, it seems.

DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/usage/?format=json&show-summary=False HTTP/1.1" 200 8955764
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/bucket/?format=json&stats=True HTTP/1.1" 200 143992
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/user/?format=json&list HTTP/1.1" 200 322
Traceback (most recent call last):
  File "/usr/lib/python3.8/wsgiref/handlers.py", line 137, in run
    self.result = application(self.environ, self.start_response)
  File "/usr/local/lib/python3.8/dist-packages/prometheus_client/exposition.py", line 123, in prometheus_app
    status, header, output = _bake_output(registry, accept_header, params)
  File "/usr/local/lib/python3.8/dist-packages/prometheus_client/exposition.py", line 105, in _bake_output
    output = encoder(registry)
  File "/usr/local/lib/python3.8/dist-packages/prometheus_client/openmetrics/exposition.py", line 14, in generate_latest
    for metric in registry.collect():
  File "/usr/local/lib/python3.8/dist-packages/prometheus_client/registry.py", line 83, in collect
    for metric in collector.collect():
  File "./radosgw_usage_exporter.py", line 71, in collect
    self._get_usage(entry)
  File "./radosgw_usage_exporter.py", line 254, in _get_usage
    if category_name not in self.usage_dict[bucket_owner][bucket_name].keys():
KeyError: 'mybucket'
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/usage/?format=json&show-summary=False HTTP/1.1" 200 8955764
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/bucket/?format=json&stats=True HTTP/1.1" 200 143992
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/user/?format=json&list HTTP/1.1" 200 322
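Here is a minimal sketch of the interleaving I suspect (hypothetical code, not the exporter's actual implementation; the `Collector`, `overlap`, and owner/bucket names are made up for the demo). If `collect()` reinitializes a shared `self.usage_dict` at the start of each scrape, a second scrape that starts mid-flight replaces the dict out from under the first one, and the first scrape's later lookup raises exactly this KeyError:

```python
import threading
from collections import defaultdict

class Collector:
    def __init__(self):
        # shared state on the collector object, like the exporter's self.usage_dict
        self.usage_dict = defaultdict(dict)

    def collect(self, before_read=None):
        self.usage_dict = defaultdict(dict)        # re-init at the start of every scrape
        self.usage_dict["owner"]["mybucket"] = {}  # stand-in for filling from the admin API
        if before_read:
            before_read()                          # demo hook: let a second scrape overlap here
        # mirrors: if category_name not in self.usage_dict[bucket_owner][bucket_name].keys()
        return "somecategory" in self.usage_dict["owner"]["mybucket"]

c = Collector()
reinit_done = threading.Event()
resume = threading.Event()

def second_scrape():
    c.usage_dict = defaultdict(dict)  # the overlapping scrape's re-init wipes the dict
    reinit_done.set()
    resume.wait()                     # ...and it hasn't repopulated 'mybucket' yet

t = threading.Thread(target=second_scrape)

def overlap():
    t.start()
    reinit_done.wait()  # force the unlucky interleaving instead of hoping for it

error = None
try:
    c.collect(before_read=overlap)
except KeyError as e:
    error = e
resume.set()
t.join()
print("KeyError:", error)  # KeyError: 'mybucket', matching the traceback above
```

The events just make the race deterministic for the demo; in production the same interleaving happens by chance when two scrapers overlap during the 5-6 second collection window.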

I thought maybe self.usage_dict shouldn't be shared state on the collector instance, so one scrape can't reinitialize it while another scrape is still iterating over it? I don't know if that's the only thing that might break under concurrent requests, however.

self.usage_dict = defaultdict(dict)
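For what it's worth, a sketch of that idea (hypothetical, not the exporter's actual code; the `Collector` class and names are invented for illustration): build the usage data in a local variable inside collect(), so an overlapping scrape works on its own dict and can't clobber anyone else's:

```python
from collections import defaultdict

class Collector:
    def collect(self, overlap=None):
        usage = defaultdict(dict)        # local to this call: each scrape gets its own dict
        usage["owner"]["mybucket"] = {}  # stand-in for filling from the admin API
        if overlap:
            overlap()                    # a concurrent scrape here can't touch our local dict
        return "mybucket" in usage["owner"]

c = Collector()
# even with a second scrape starting mid-flight, each works on its own data
result = c.collect(overlap=lambda: c.collect())
print(result)  # True
```

If the exporter mutates other shared attributes during collect(), another option would be to serialize scrapes with a threading.Lock around the collection logic.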
