
Possible concurrency issue w/ multiple prometheus scrapers #38

Open
briend opened this issue Jun 26, 2023 · 0 comments
briend commented Jun 26, 2023

I noticed random errors in our ceph rgw bucket usage exporter, which show up as occasional gaps in our metrics or Target Down alerts. We scrape every 2 minutes and run HA Prometheus (two instances), so it is possible that both instances scrape the metrics endpoint at the same time. Since a scrape of our environment takes about 5-6 seconds, an overlap is fairly likely to happen.

Below you can see the first scrape crash with a KeyError, even though mybucket definitely exists. It seems the second Prometheus instance hit GET /admin/usage/ at the same time the first instance crashed. This doesn't happen every time, and we do usually get metrics for mybucket, which makes me strongly suspect it is a concurrency issue. The error is also not limited to mybucket; it can happen for any bucket, it seems.

DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/usage/?format=json&show-summary=False HTTP/1.1" 200 8955764
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/bucket/?format=json&stats=True HTTP/1.1" 200 143992
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/user/?format=json&list HTTP/1.1" 200 322
Traceback (most recent call last):
  File "/usr/lib/python3.8/wsgiref/handlers.py", line 137, in run
    self.result = application(self.environ, self.start_response)
  File "/usr/local/lib/python3.8/dist-packages/prometheus_client/exposition.py", line 123, in prometheus_app
    status, header, output = _bake_output(registry, accept_header, params)
  File "/usr/local/lib/python3.8/dist-packages/prometheus_client/exposition.py", line 105, in _bake_output
    output = encoder(registry)
  File "/usr/local/lib/python3.8/dist-packages/prometheus_client/openmetrics/exposition.py", line 14, in generate_latest
    for metric in registry.collect():
  File "/usr/local/lib/python3.8/dist-packages/prometheus_client/registry.py", line 83, in collect
    for metric in collector.collect():
  File "./radosgw_usage_exporter.py", line 71, in collect
    self._get_usage(entry)
  File "./radosgw_usage_exporter.py", line 254, in _get_usage
    if category_name not in self.usage_dict[bucket_owner][bucket_name].keys():
KeyError: 'mybucket'
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/usage/?format=json&show-summary=False HTTP/1.1" 200 8955764
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/bucket/?format=json&stats=True HTTP/1.1" 200 143992
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/user/?format=json&list HTTP/1.1" 200 322
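Here is a minimal sketch of the interleaving I suspect (hypothetical code, not the exporter's actual implementation; the `Collector`, `overlap`, and owner/bucket names are made up for the demo). If `collect()` reinitializes a shared `self.usage_dict` at the start of each scrape, a second scrape that starts mid-flight replaces the dict out from under the first one, and the first scrape's later lookup raises exactly this KeyError:

```python
import threading
from collections import defaultdict

class Collector:
    def __init__(self):
        # shared state on the collector object, like the exporter's self.usage_dict
        self.usage_dict = defaultdict(dict)

    def collect(self, before_read=None):
        self.usage_dict = defaultdict(dict)        # re-init at the start of every scrape
        self.usage_dict["owner"]["mybucket"] = {}  # stand-in for filling from the admin API
        if before_read:
            before_read()                          # demo hook: let a second scrape overlap here
        # mirrors: if category_name not in self.usage_dict[bucket_owner][bucket_name].keys()
        return "somecategory" in self.usage_dict["owner"]["mybucket"]

c = Collector()
reinit_done = threading.Event()
resume = threading.Event()

def second_scrape():
    c.usage_dict = defaultdict(dict)  # the overlapping scrape's re-init wipes the dict
    reinit_done.set()
    resume.wait()                     # ...and it hasn't repopulated 'mybucket' yet

t = threading.Thread(target=second_scrape)

def overlap():
    t.start()
    reinit_done.wait()  # force the unlucky interleaving instead of hoping for it

error = None
try:
    c.collect(before_read=overlap)
except KeyError as e:
    error = e
resume.set()
t.join()
print("KeyError:", error)  # KeyError: 'mybucket', matching the traceback above
```

The events just make the race deterministic for the demo; in production the same interleaving happens by chance when two scrapers overlap during the 5-6 second collection window.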

I thought maybe self.usage_dict shouldn't be shared state on the collector instance, so one scrape can't reinitialize it while another scrape is still iterating over it? I don't know if that's the only thing that might break under concurrent requests, however.

self.usage_dict = defaultdict(dict)
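For what it's worth, a sketch of that idea (hypothetical, not the exporter's actual code; the `Collector` class and names are invented for illustration): build the usage data in a local variable inside collect(), so an overlapping scrape works on its own dict and can't clobber anyone else's:

```python
from collections import defaultdict

class Collector:
    def collect(self, overlap=None):
        usage = defaultdict(dict)        # local to this call: each scrape gets its own dict
        usage["owner"]["mybucket"] = {}  # stand-in for filling from the admin API
        if overlap:
            overlap()                    # a concurrent scrape here can't touch our local dict
        return "mybucket" in usage["owner"]

c = Collector()
# even with a second scrape starting mid-flight, each works on its own data
result = c.collect(overlap=lambda: c.collect())
print(result)  # True
```

If the exporter mutates other shared attributes during collect(), another option would be to serialize scrapes with a threading.Lock around the collection logic.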
