Describe your environment
OS: Rocky Linux 8.10 (Green Obsidian)
Python version: 3.9.19
SDK version: 1.27.0
API version: 1.27.0
Gevent: 23.9.1
What happened?
Service Overview:
The service is complex: it uses the _thread module to create new threads and the subprocess module to manage multiple subprocesses (especially SSH clients). It also calls gevent.monkey.patch_all() at the very beginning to enable cooperative multitasking. Note: the service constantly spawns subprocesses via the subprocess module to run logic that is not related to Python or gevent.
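For illustration, a minimal sketch of this kind of setup might look like the following (the exporter and the spawned command are placeholders, not our actual service code):

# Minimal sketch only; ConsoleSpanExporter and the spawned command are
# illustrative placeholders, not what the real service uses.
import gevent.monkey
gevent.monkey.patch_all()  # done at the very start of the service

import subprocess

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
# Constructing BatchSpanProcessor starts a daemon worker thread and
# registers an os.register_at_fork(after_in_child=...) hook.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# The service then constantly forks/execs external commands (SSH clients);
# "true" is a harmless stand-in here.
subprocess.run(["true"])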
Errors Encountered:
1. Error During BatchSpanProcessor Initialization:
When initializing BatchSpanProcessor, the following error is observed:
Traceback (most recent call last):
File "src/gevent/_abstract_linkable.py", line 287, in gevent._gevent_c_abstract_linkable.AbstractLinkable._notify_links
File "src/gevent/_abstract_linkable.py", line 333, in gevent._gevent_c_abstract_linkable.AbstractLinkable._notify_links
AssertionError: (None, <callback at 0x7f522d2748c0 args=([],)>)
2. Error After the Service Runs for Some Time (within 30 seconds):
After the service runs for a while, the following error appears multiple times with different daemon thread IDs:
Traceback (most recent call last):
File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
File "/usr/lib64/python3.9/threading.py", line 937, in _bootstrap
self._bootstrap_inner()
File "/usr/lib64/python3.9/threading.py", line 972, in _bootstrap_inner
del _limbo[self]
KeyError: <Thread(OtelBatchSpanProcessor, stopped daemon 140119617635328)>
2024-11-15T22:20:58Z <Greenlet at 0x7f7024090400: <bound method Thread._bootstrap of <Thread(OtelBatchSpanProcessor, stopped daemon 140119617635328)>>> failed with KeyError
These errors affect the other threads running in the service and eventually cause the service to enter an unhealthy state, with some hosts becoming unreachable.
Steps to Reproduce
It is not easy to reproduce, but we will give it a try. The os.register_at_fork() hook registered by BatchSpanProcessor affects every forked subprocess, even when the fork only exec()s something else entirely and never keeps running Python or gevent code.
Expected Result
No errors or exceptions in the service log.
Actual Result
The errors shown under "What happened?" above.
Additional context
Investigation Process:
As mentioned above, the problem appears when we initialize BatchSpanProcessor in the service's main process. We found that os.register_at_fork(after_in_child=self._at_fork_reinit) may be the cause.
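For context, our understanding is that this hook recreates the processor's internals and starts a fresh worker thread in every forked child; a simplified paraphrase follows (class and attribute names are illustrative, not a verbatim copy of the SDK):

# Simplified, hypothetical paraphrase of the SDK's fork handling; names are
# illustrative, not copied from opentelemetry-sdk.
import os
import threading

class ForkAwareBatchProcessor:
    def __init__(self, worker):
        self._worker = worker
        self.condition = threading.Condition(threading.Lock())
        self.queue = []
        self._start_worker()
        if hasattr(os, "register_at_fork"):
            os.register_at_fork(after_in_child=self._at_fork_reinit)

    def _start_worker(self):
        # This is the "OtelBatchSpanProcessor" daemon thread seen in the
        # tracebacks above.
        self.worker_thread = threading.Thread(
            name="OtelBatchSpanProcessor", target=self._worker, daemon=True
        )
        self.worker_thread.start()

    def _at_fork_reinit(self):
        # Runs in the child right after fork(), i.e. also in short-lived
        # children that are about to exec() a non-Python command.
        self.condition = threading.Condition(threading.Lock())
        self.queue.clear()
        self._start_worker()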
We added logging in a test function, test_fork():
# Requires `import os` and `import inspect`; registered later with
# os.register_at_fork(after_in_child=self.test_fork).
def test_fork(self):
    log.INFO("Otel in fork: PID=%s" % os.getpid())
    stack = inspect.stack()
    log.INFO("Call stack:\n" + "\n".join(
        "{%s}:{%s} in {%s}" % (frame.filename, frame.lineno, frame.function)
        for frame in stack))
We registered this function with os.register_at_fork(after_in_child=self.test_fork) to track forking behavior, and discovered that the BatchSpanProcessor gets reinitialized many times in the SSH subprocesses. We observed the following log output:
2024-11-15 22:35:59,363Z INFO server.py:278 Otel in fork: PID=2402193
2024-11-15 22:35:59,641Z INFO server.py:280 Call stack:
{<HOME>/.venvs/bin/lib/python3.9/site-packages/acropolis/server/server.py}:{279} in {test_fork}
{<gevent-23.9.1>/gevent/os.py}:{426} in {fork_and_watch}
{<gevent-23.9.1>/gevent/subprocess.py}:{1624} in {_execute_child}
{<gevent-23.9.1>/gevent/subprocess.py}:{807} in {__init__}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/base/command.py}:{217} in {timed_command}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/ssh_client.py}:{227} in {__timed_command}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/ssh_client.py}:{484} in {execute}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/ssh_client.py}:{730} in {wrapper}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/hypervisor/kvm_ssh_client.py}:{111} in {execute}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/hypervisor/ssh_utils.py}:{42} in {ssh_execute}
{<HOME>/.venvs/bin/lib/python3.9/site-packages/acropolis/hypervisor/kvm/libvirt_connection.py}:{689} in {_get_uefi_nvram_size}
{<HOME>/.venvs/bin/lib/python3.9/site-packages/acropolis/hypervisor/kvm/libvirt_connection.py}:{1308} in {_connect}
{<HOME>/.venvs/bin/lib/python3.9/site-packages/util/hypervisor/connection.py}:{614} in {_reconnect_loop}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/misc/decorators.py}:{47} in {wrapper}
This means the BatchSpanProcessor gets reinitialized many times in these SSH subprocesses.
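A small standalone script (hypothetical, not taken from our service) appears to reproduce the same mechanism: with gevent monkey patching, subprocess children are created via fork()+exec() through gevent's fork_and_watch(), so every after_in_child hook fires in each child, even though the child immediately execs a non-Python command:

# Hypothetical standalone demo, not from the actual service. Under gevent's
# monkey patching, subprocess.Popen forks in-process, so the registered
# after_in_child hook runs in every short-lived child before exec().
import gevent.monkey
gevent.monkey.patch_all()

import os
import subprocess

def after_in_child():
    print("after_in_child ran in child PID %s" % os.getpid())

os.register_at_fork(after_in_child=after_in_child)

for _ in range(3):
    # Stand-in for the frequent SSH client calls; each spawn should trigger
    # the hook (on unpatched CPython, subprocess may take a code path that
    # skips these hooks entirely).
    subprocess.run(["true"])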
However, when we checked the active threads by dumping all the greenlets in this service, only the BatchSpanProcessor running in the main process can be seen; the others cannot.
We also tried adding a time.sleep(5), for testing purposes, before our OpenTelemetry BatchSpanProcessor initialization (our tracing initialization happens at the end of service startup); with that change the error logs no longer appear every time, only intermittently.
Any idea on how to solve this problem? Please help. Thanks!
More context: we are currently migrating from SimpleSpanProcessor to BatchSpanProcessor. SimpleSpanProcessor works well, but it has performance issues that we have confirmed in testing.
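For reference, the migration itself is essentially a one-line change in the tracer provider setup (the exporter below is a placeholder, not the one we actually use):

# Illustrative setup only; ConsoleSpanExporter is a placeholder exporter.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
    SimpleSpanProcessor,
)

provider = TracerProvider()
exporter = ConsoleSpanExporter()

# Old setup: exports each span synchronously when it ends (works, but slow).
# provider.add_span_processor(SimpleSpanProcessor(exporter))

# New setup: queues spans and exports them from a background daemon thread,
# the thread involved in the errors above.
provider.add_span_processor(BatchSpanProcessor(exporter))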
Would you like to implement a fix?
None
xrmx changed the title from "the new thread initialized in os.register_at_fork() of BatchSpanProcessor will cause issue to service which creates subprocess frequently" to "BatchSpanProcessor: issue to service which creates subprocess frequently when monkeypatched by gevent" on Nov 19, 2024