Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

APCu stuck on FUTEX_WAIT #410

Open
duritong opened this issue Dec 1, 2020 · 15 comments
Open

APCu stuck on FUTEX_WAIT #410

duritong opened this issue Dec 1, 2020 · 15 comments

Comments

@duritong
Copy link

duritong commented Dec 1, 2020

Since a while we have one particular Drupal website that at some point (we think it's related to a cronjob flushing the caches) suddenly & immediately consuming all php-fpm processes. This then goes on and eventually fills the complete apache scoreboard, ending up the whole apache being blocked with:

[Tue Dec 01 09:36:19.610741 2020] [mpm_event:error] [pid 17036:tid 139654491891840] AH00485: scoreboard is full, not at MaxRequestWorkers

If you look at the processes, you see all processes being idle:

example+ 12202     1 12202  0    1 Nov30 ?        00:00:00 /usr/bin/scl enable php74 -- /opt/remi/php74/root/usr/sbin/php-fpm --fpm-config=/etc/opt/remi/php74/php-fpm.d/example.org.conf
example+ 12203 12202 12203  0    1 Nov30 ?        00:00:00 /bin/bash /var/tmp/sclV6IUYG
example+ 12206 12203 12206  0    1 Nov30 ?        00:00:01 php-fpm: master process (/etc/opt/remi/php74/php-fpm.d/example.org.conf)
example+ 30790 12206 30790  0    1 01:02 ?        00:00:00 php-fpm: pool example.org-0
example+ 30816 12206 30816  0    1 01:02 ?        00:00:00 php-fpm: pool example.org-0
example+ 31026 12206 31026  0    1 01:04 ?        00:00:00 php-fpm: pool example.org-0
example+ 31053 12206 31053  0    1 01:04 ?        00:00:00 php-fpm: pool example.org-0
example+ 31600 12206 31600  0    1 01:05 ?        00:00:00 php-fpm: pool example.org-0
example+ 31766 12206 31766  0    1 01:07 ?        00:00:00 php-fpm: pool example.org-0
example+ 31772 12206 31772  0    1 01:07 ?        00:00:00 php-fpm: pool example.org-0
example+ 31953 12206 31953  0    1 01:09 ?        00:00:00 php-fpm: pool example.org-0
example+ 32682 12206 32682  0    1 01:11 ?        00:00:00 php-fpm: pool example.org-0
example+ 32759 12206 32759  0    1 01:12 ?        00:00:00 php-fpm: pool example.org-0

When we stop the systemd service managing the fpm processes, the httpd recovers and also the site itself recovers and works fine until the lock is retriggered.

All processes are waiting on FUTEX:

# strace -p 31026
strace: Process 31026 attached
futex(0x7f33727ab084, FUTEX_WAIT, 0, NULL^Cstrace: Process 31026 detached
 <detached ...>

If we get the backtrace of one of these processes, we see the following:

(gdb) backtrace
#0  0x00007f3395ed139e in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x00007f337c400220 in apc_lock_wlock () from /opt/remi/php74/root/usr/lib64/php/modules/apcu.so
#2  0x00007f337c40386f in apc_cache_store () from /opt/remi/php74/root/usr/lib64/php/modules/apcu.so
#3  0x00007f337c400890 in apc_store_helper () from /opt/remi/php74/root/usr/lib64/php/modules/apcu.so
#4  0x0000557ebf6e422a in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1732
#5  execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53865
#6  0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#7  0x0000557ebf6e298d in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1602
#8  execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53861
#9  0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#10 0x0000557ebf6e298d in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1602
#11 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53861
#12 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#13 0x0000557ebf6e298d in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1602
#14 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53861
#15 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#16 0x0000557ebf6e298d in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1602
#17 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53861
#18 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#19 0x0000557ebf6e3b6d in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1714
#20 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53865
#21 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#22 0x0000557ebf6e3b6d in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1714
#23 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53865
#24 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#25 0x0000557ebf6e3b6d in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1714
#26 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53865
#27 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#28 0x0000557ebf6e3b6d in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1714
#29 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53865
#30 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#31 0x0000557ebf6e3b6d in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1714
#32 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53865
#33 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#34 0x0000557ebf6e3b6d in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1714
#35 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53865
#36 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#37 0x0000557ebf6e3b6d in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1714
#38 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53865
#39 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#40 0x0000557ebf6e3b6d in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1714
#41 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53865
#42 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#43 0x0000557ebf6e3b6d in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1714
#44 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53865
#45 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#46 0x0000557ebf6e3b6d in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1714
#47 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53865
#48 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#49 0x0000557ebf6e3b6d in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1714
#50 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53865
#51 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#52 0x0000557ebf6e3b6d in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1714
#53 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53865
#54 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#55 0x0000557ebf6e3b6d in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:1714
#56 execute_ex () at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:53865
#57 0x00007f3378d73b4d in sp_execute_ex () from /opt/remi/php74/root/usr/lib64/php/modules/snuffleupagus.so
#58 0x0000557ebf6e4d81 in zend_execute (op_array=0x7f3393e77000, return_value=<optimized out>) at /usr/src/debug/php-7.4.13/Zend/zend_vm_execute.h:57957
#59 0x0000557ebf65de33 in zend_execute_scripts () at /usr/src/debug/php-7.4.13/Zend/zend.c:1677
#60 0x0000557ebf5fc610 in php_execute_script () at /usr/src/debug/php-7.4.13/main/main.c:2621
#61 0x0000557ebf468d90 in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/php-7.4.13/sapi/fpm/fpm/fpm_main.c:1942

We had the issue with php 7.2 and 7.4, although both versions are using apcu 5.1.19 from @remicollet repositories.

This is:

# cat /etc/redhat-release 
CentOS Linux release 7.9.2009 (Core)

The system is up2date and all patches haves been applied (also from @remicollet SCL repos).

# rpm -qi php74-php-pecl-apcu                                                                                                                                                                                                                   
Name        : php74-php-pecl-apcu
Version     : 5.1.19
Release     : 1.el7.remi

php-fpm config for that particular vhost:

# cat /etc/opt/remi/php74/php-fpm.d/example.org.conf 
[global]
pid = /run/fpm-example.org/pid
error_log = /var/www/vhosts/example.org/logs/fpm-error.log
daemonize = no

[example.org-0]
listen = /run/fpm-example.org-socket/0.socket
pm = ondemand
pm.max_children = 10
pm.process_idle_timeout = 30s
slowlog = /var/www/vhosts/example.org/logs/fpm-slow-0.log

env[TMP] = /var/www/vhosts/example.org/tmp/tmp
env[TMPDIR] = /var/www/vhosts/example.org/tmp/tmp
env[TEMP] = /var/www/vhosts/example.org/tmp/tmp


php_admin_value[apc.mmap_file_mask] = /var/www/vhosts/example.org/tmp/apc.XXXXXX
php_admin_flag[engine] = on
php_admin_value[error_log] = /var/www/vhosts/example.org/logs/php_error_log
php_admin_value[max_execution_time] = 207
php_admin_value[open_basedir] = /opt/remi/php74/root/usr/share/php/:/opt/remi/php74/root/usr/share/pear/:/var/www/vhosts/example.org/www:/var/www/vhosts/example.org/data:/var/www/vhosts/example.org/tmp
php_admin_value[post_max_size] = 99M
php_admin_value[session.save_path] = /var/www/vhosts/example.org/tmp/sessions
php_admin_value[sp.configuration_file] = /etc/opt/remi/php74/php.d/snuffleupagus-*.rules,/etc/opt/remi/php74/snuffleupagus.d/base.rules,/etc/opt/remi/php74/snuffleupagus.d/example.org.rules
php_admin_value[upload_max_filesize] = 99M
php_admin_value[upload_tmp_dir] = /var/www/vhosts/example.org/tmp/uploads

The fpm process managed through systemd unit and started socket-based activation.

For now we downgraded apcu down to 5.1.18 to see whether it is an issue with the changes in 5.1.19.

There is nothing in the php-fpm slow log and we also don't see anything particular exciting in the fpm-log itself except that max children exceeded.

Let me know if we can provide any additional infos.

@nikic
Copy link
Collaborator

nikic commented Dec 6, 2020

That's a dead lock... The first suspect would be whether an opcache restart might have taken place at the same time this occurred. Opcache may kill processes that fail to restart in time, and if a killed process holds an rwlock, this will result in a deadlock.

@duritong
Copy link
Author

duritong commented Dec 7, 2020

So 5.1.18 was still triggering the issue. Thus for now we disabled the drupal cron flush cache and see whether this is really what triggers it.

What probably happens around the same time, is that logrotation happens, which will reload httpd. BUT since fpm processes are running as fpm, they are independent from httpd and thus should not be affected.

How would such a opcache restart look like?

@nikic
Copy link
Collaborator

nikic commented Dec 7, 2020

How would such a opcache restart look like?

It would result in "Attempting to kill locker" warnings in the opcache log.

@duritong
Copy link
Author

duritong commented Dec 7, 2020

I have not really been able to find such a log entry in one of the logs. Would I need to set a specific opcache log to be sure to not loose it, as otherwise it just seems to go to stdout aka. engine log.

@nikic
Copy link
Collaborator

nikic commented Dec 7, 2020

@duritong I think it's necessary to set opcache.log_verbosity_level=2 (or higher) for these to show up.

@gaohuia
Copy link

gaohuia commented Feb 4, 2021

Hi, have this issue been resolved ? @duritong We met an issue which is quite similar with you. the apcu_inc was stuck

@nikic
Copy link
Collaborator

nikic commented Feb 4, 2021

I have tried to reproduce a deadlock under various conditions, but the only case where I was successful is if a process is terminated using SIGKILL.

I have landed php/php-src@8b7aaad on the opcache side to try terminating processes using SIGTERM first. However, I still do not know if this is really the cause for the issue you're seeing or not.

@nikic
Copy link
Collaborator

nikic commented Mar 4, 2021

If you are making use of apcu_entry(), then this issue may be resolved in apcu 5.1.20. The opcache restart issue should also be addressed in PHP 7.4.16 / 8.0.3 (although there is only one attempt to SIGTERM before SIGKILL).

@chrisddwrt
Copy link

hi,
we updated to 5.1.20 and used php 7.4.16 (from plesk) 2 days ago. It looks good so far.

Since php 7.3 isn't in support anymore, this problem will not be fixed in php 7.3.x ?

Thanks in advance, and thanks for the update !

@duritong
Copy link
Author

We did a new try with php 7.4.16 and acpu 5.1.20 unfortunately, we had another lockup after some days.

I'll try to get a run with a more verbose opcache log. But it seems still tricky to reproduce the issue.

@chrisddwrt
Copy link

Hi,
php 7.4.16 + 5.1.20. Runs well so far. We only see still processes that are hanging (but only 1-2 per week).

But these processes do not lead to problems in the customer store anymore.

I still have 2 processes "hanging arround", let me know if i can help with more information on this.

strace -p 55317
strace: Process 55317 attached
futex(0x7f79ce493084, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 3, NULL, FUTEX_BITSET_MATCH_ANY

@nikic
Copy link
Collaborator

nikic commented Mar 27, 2021

@chrisddwrt Would it be possible to get a backtrace from gdb, to find out which lock is hanging? Something like gdb -p 55317 and then run backtrace in the console.

@chrisddwrt
Copy link

@nikic

here we go:
backtrace.txt

@towardsyoung
Copy link

same problem with php 7.2.13+apcu5.1.16, did this issue resolved?

@LionsAd
Copy link
Contributor

LionsAd commented Apr 21, 2022

Be sure that you are not running an older libc version. It has been quite a while, but some libc versions had problems with deadlocks in FUTEX WAIT where a writer was just sitting in front of a lock and not taking it. In fact the reader writer lock had been rewritten three times IIRC in libc till they got it right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants