I noticed random errors in our Ceph RGW bucket usage exporter, which show up as occasional gaps in our metrics or as Target Down alerts. We scrape every 2 minutes and run HA Prometheus (two Prometheus instances), so it is possible for both instances to hit the metrics endpoint at the same time. Since a scrape of our environment takes about 5-6 seconds, overlapping scrapes are fairly likely.
Below you can see the first scrape crash with a KeyError, even though mybucket definitely exists. It appears that the second Prometheus instance hit GET /admin/usage/ at the same moment the first one crashed. This doesn't happen on every scrape, and we usually do get metrics for mybucket, which makes me strongly suspect a concurrency issue. The error is also not limited to mybucket; it can apparently happen for any bucket.
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/usage/?format=json&show-summary=False HTTP/1.1" 200 8955764
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/bucket/?format=json&stats=True HTTP/1.1" 200 143992
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/user/?format=json&list HTTP/1.1" 200 322
Traceback (most recent call last):
  File "/usr/lib/python3.8/wsgiref/handlers.py", line 137, in run
    self.result = application(self.environ, self.start_response)
  File "/usr/local/lib/python3.8/dist-packages/prometheus_client/exposition.py", line 123, in prometheus_app
    status, header, output = _bake_output(registry, accept_header, params)
  File "/usr/local/lib/python3.8/dist-packages/prometheus_client/exposition.py", line 105, in _bake_output
    output = encoder(registry)
  File "/usr/local/lib/python3.8/dist-packages/prometheus_client/openmetrics/exposition.py", line 14, in generate_latest
    for metric in registry.collect():
  File "/usr/local/lib/python3.8/dist-packages/prometheus_client/registry.py", line 83, in collect
    for metric in collector.collect():
  File "./radosgw_usage_exporter.py", line 71, in collect
    self._get_usage(entry)
  File "./radosgw_usage_exporter.py", line 254, in _get_usage
    if category_name not in self.usage_dict[bucket_owner][bucket_name].keys():
KeyError: 'mybucket'
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/usage/?format=json&show-summary=False HTTP/1.1" 200 8955764
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/bucket/?format=json&stats=True HTTP/1.1" 200 143992
DEBUG:urllib3.connectionpool:http://s3.server.lan:80 "GET /admin/user/?format=json&list HTTP/1.1" 200 322
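For what it's worth, here is a minimal standalone sketch of what I think is happening. This is toy code, not the exporter's actual implementation: the class, bucket names, and sleeps are hypothetical stand-ins for the real collector and the 5-6 seconds of admin API calls. Two overlapping scrapes racing on a shared usage_dict can produce exactly this kind of KeyError:

```python
import threading
import time


class UsageCollector:
    """Toy stand-in for the exporter's collector (hypothetical names)."""

    def __init__(self):
        # One dict shared by every scrape -- the suspected problem.
        self.usage_dict = {}

    def collect(self):
        # Each scrape starts by wiping the shared dict ...
        self.usage_dict = {}
        for bucket in ("bucket-a", "bucket-b", "mybucket"):
            self.usage_dict.setdefault("owner", {})[bucket] = {}
            time.sleep(0.005)  # stands in for the slow admin API calls
        for bucket in ("bucket-a", "bucket-b", "mybucket"):
            # ... so a scrape that arrived in the meantime may have reset
            # the dict and only partly refilled it, and this lookup blows up.
            if "rgw.main" not in self.usage_dict["owner"][bucket]:
                self.usage_dict["owner"][bucket]["rgw.main"] = 0
            time.sleep(0.005)


collector = UsageCollector()
first = threading.Thread(target=collector.collect)
second = threading.Thread(target=collector.collect)
first.start()
time.sleep(0.02)   # the second Prometheus instance arrives mid-scrape
second.start()
first.join()
second.join()
# The first thread usually dies with KeyError: 'mybucket' (or 'owner'),
# much like the traceback above.
```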
I thought maybe self.usage_dict shouldn't be a class variable, so that it can't be reinitialized prematurely? I don't know whether that is the only thing that can break under concurrent requests, though. For reference: radosgw_usage_exporter/radosgw_usage_exporter.py, line 61 in 2fd70fc.
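To illustrate the kind of change I have in mind (again toy code, not a patch against the real exporter): if collect() keeps the dict in a local variable and only hands the finished result to the metric-building code, overlapping scrapes can no longer reinitialize each other's state. Alternatively, a threading.Lock around the body of collect() would serialize scrapes if the shared attribute has to stay.

```python
import threading
import time


class FixedUsageCollector:
    """Same toy as above, but the dict is local to collect()."""

    def collect(self):
        usage_dict = {}  # local: a concurrent scrape can't reinitialize it
        for bucket in ("bucket-a", "bucket-b", "mybucket"):
            usage_dict.setdefault("owner", {})[bucket] = {}
            time.sleep(0.005)
        for bucket in ("bucket-a", "bucket-b", "mybucket"):
            if "rgw.main" not in usage_dict["owner"][bucket]:
                usage_dict["owner"][bucket]["rgw.main"] = 0
            time.sleep(0.005)
        return usage_dict  # hand the finished dict to the metric-building code


collector = FixedUsageCollector()
first = threading.Thread(target=collector.collect)
second = threading.Thread(target=collector.collect)
first.start()
time.sleep(0.02)
second.start()
first.join()
second.join()
print("no KeyError: each scrape built its own dict")
```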