Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BatchSpanProcessor: issue to service which creates subprocess frequently when monkeypatched by gevent #4298

Open
scxingnutanix opened this issue Nov 19, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@scxingnutanix
Copy link

scxingnutanix commented Nov 19, 2024

Describe your environment

OS: Rocky Linux 8.10 (Green Obsidian)
Python version: 3.9.19
SDK version: 1.27.0
API version: 1.27.0
Gevent: 23.9.1

What happened?

Service Overview:

The service is complex, utilizing the _thread module to create new threads and the subprocess module to manage multiple subprocesses (especially for SSH clients). Additionally, the service uses gevent.monkey.patch_all() at the very beginning to enable cooperative multitasking. Note: the service frequently create subprocess with module subprocess all the time to do some None python or gevent related logic.

Errors Encountered:

1. Error During BatchSpanProcessor Initialization:

When initializing BatchSpanProcessor, the following error is observed:

Traceback (most recent call last):
  File "src/gevent/_abstract_linkable.py", line 287, in gevent._gevent_c_abstract_linkable.AbstractLinkable._notify_links
  File "src/gevent/_abstract_linkable.py", line 333, in gevent._gevent_c_abstract_linkable.AbstractLinkable._notify_links
AssertionError: (None, <callback at 0x7f522d2748c0 args=([],)>)

2. Error After Service Runs for Some Time(within 30 secs):

After the service runs for a while, the following error appears multiple times with different daemon thread IDs:

Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
  File "/usr/lib64/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib64/python3.9/threading.py", line 972, in _bootstrap_inner
    del _limbo[self]
KeyError: <Thread(OtelBatchSpanProcessor, stopped daemon 140119617635328)>
2024-11-15T22:20:58Z <Greenlet at 0x7f7024090400: <bound method Thread._bootstrap of <Thread(OtelBatchSpanProcessor, stopped daemon 140119617635328)>>> failed with KeyError

These errors affect the other threads running in the service and eventually cause the service to enter an unhealthy state, with some hosts becoming unreachable.

Steps to Reproduce

It is not easy to reproduce. But will give a try.

Since os.register_at_fork() in BatchSpanProcessor will affect all the subprocess even if this fork is to exec something else entirely not to keep running python or gevent.

Expected Result

No error or exception from service log.

Actual Result

See error in What happened.

Additional context

##Investigation Process:
As we mentioned above, the problem we find is when we initialize BatchSpanProcessor in service main process. We found that os.register_at_fork(after_in_child=self._at_fork_reinit) may be the issue.
By adding a log in a function test_fork()

def test_fork(self):
    log.INFO("Otel in fork: PID=%s" %os.getpid())
    stack = inspect.stack()
    log.INFO("Call stack:\n" + "\n".join(["{%s}:{%s} in {%s}" %(frame.filename, frame.lineno, frame.function)for frame in stack])) 

By adding a log in a function (test_fork) to track forking behavior, we discovered that the BatchSpanProcessor gets reinitialized many times in the SSH subprocesses. We registered this function using os.register_at_fork(after_in_child=self.test_fork) and noticed the following log output:

2024-11-15 22:35:59,363Z INFO server.py:278 Otel in fork: PID=2402193
2024-11-15 22:35:59,641Z INFO server.py:280 Call stack:
{<HOME>/.venvs/bin/lib/python3.9/site-packages/acropolis/server/server.py}:{279} in {test_fork}
{<gevent-23.9.1>/gevent/os.py}:{426} in {fork_and_watch}
{<gevent-23.9.1>/gevent/subprocess.py}:{1624} in {_execute_child}
{<gevent-23.9.1>/gevent/subprocess.py}:{807} in {__init__}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/base/command.py}:{217} in {timed_command}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/ssh_client.py}:{227} in {__timed_command}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/ssh_client.py}:{484} in {execute}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/ssh_client.py}:{730} in {wrapper}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/hypervisor/kvm_ssh_client.py}:{111} in {execute}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/hypervisor/ssh_utils.py}:{42} in {ssh_execute}
{<HOME>/.venvs/bin/lib/python3.9/site-packages/acropolis/hypervisor/kvm/libvirt_connection.py}:{689} in {_get_uefi_nvram_size}
{<HOME>/.venvs/bin/lib/python3.9/site-packages/acropolis/hypervisor/kvm/libvirt_connection.py}:{1308} in {_connect}
{<HOME>/.venvs/bin/lib/python3.9/site-packages/util/hypervisor/connection.py}:{614} in {_reconnect_loop}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/misc/decorators.py}:{47} in {wrapper}

Which means BatchSpanprocessor gets reinit many times in these ssh subprocess.
But checked the active threads by dumping all the greenlets in this service, only the batchSpanProcessor running in the main processor can be seen and others can not be seen.

We also tried adding a time.sleep(5) for test purpose before our opentelemetry BatchSpanProcessor initialization(our tracing initialization is at the end of service run), we can not see the error logs always but intermittently.

Any idea on how to solve this problem? Please help. Thanks!

More context: we are migrating SimpleSpanProcessor to BatchSpanProcessor now, and SimpleSpanProcessor can work well but have performance issue that we have tested.

Would you like to implement a fix?

None

@scxingnutanix scxingnutanix added the bug Something isn't working label Nov 19, 2024
@xrmx xrmx changed the title the new thread initialized in os.register_at_fork() of BatchSpanProcessor will cause issue to service which creates subprocess frequently BatchSpanProcessor: issue to service which creates subprocess frequently when monkeypatched by gevent Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant