test point pscheduler failures #1466
Comments
Further, the available memory and CPU utilisation (very high load) on the affected systems, set against the service outage periods, give some interesting clues. Both are testpoint installations; the toolkit installations seem fine. ps-small (a 1G small node in Szymon's PMP mesh): And for Imperial (a full node):
Attaching logs obtained today: postgresql-15-main.log.2024-08-27.txt. Services started failing around 10:09. I restarted pgsql around 13:30, then all services around 13:50.
I haven't attached syslog logs because I am always wary of attaching syslog on a public website. We can share them some other way if Mark and team need them.
I intended to attach statistics collected over 20h about use of resources, including
Problem: the statistics file is 473 MB, which is too large for GitHub. Is there a place where I can upload it? The data seems to show a trend where the user pscheduler keeps requesting resources and never releases them.
This testpoint was a member of the PMP dashboard. A month ago it was reinstalled and never re-subscribed to PMP, yet the PMP giant node keeps contacting it. I wonder if this can lead to a non-RAII situation. How could an archiver unsubscribe a node that is dead? Does that make sense?
This is being caused by twin memory leaks, one in the API and one in the runner. The one in the API has been identified and there's a candidate fix for it. I'm working on the runner.
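For anyone wanting to check the per-user trend described above without sharing a large file, here is a minimal sketch. It assumes the samples are plain-text output of `ps -eo user,rss` (an assumption; the thread does not say how the statistics were collected), and the function name `rss_by_user` is hypothetical.

```python
from collections import defaultdict


def rss_by_user(ps_output):
    """Sum resident memory (KiB) per user from `ps -eo user,rss` text.

    Assumes two whitespace-separated columns per line; the header line
    and any malformed lines are skipped.
    """
    totals = defaultdict(int)
    for line in ps_output.splitlines():
        parts = line.split()
        # Skip the "USER RSS" header and anything that is not "name number".
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        totals[parts[0]] += int(parts[1])
    return dict(totals)
```

Running this over successive samples and comparing the total for the pscheduler user would show whether its memory only ever grows.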
* Closing branch
* Properly cache thread-local database connection object. #1466
* Check cursor closed attribute correctly. #1466
* Don't let leaky API processes live longer than 30 minutes. #1466
* Remove closed branch file from earlier mistake
---------
Co-authored-by: Andy Lake <[email protected]>
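To illustrate the "cache thread-local database connection object" item in the commit message above, here is a minimal sketch of the general pattern. It is not the pScheduler code: it uses the standard-library `sqlite3` module as a stand-in for the real database driver, and the names `get_connection` and `close_connection` are hypothetical.

```python
import sqlite3
import threading

# One attribute namespace per thread; each thread sees only its own "conn".
_local = threading.local()


def get_connection(database=":memory:"):
    """Return this thread's cached connection, opening one only if needed."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(database)
        _local.conn = conn
    return conn


def close_connection():
    """Close and forget this thread's connection, if it has one."""
    conn = getattr(_local, "conn", None)
    if conn is not None:
        conn.close()
        _local.conn = None
```

The point of the pattern is that repeated calls in the same thread reuse one connection instead of opening a new one each time, which is one way a leak of connection objects can be avoided.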
I have seen pscheduler-related services fail several times on three different perfSONAR testpoint hosts.
I am attaching troubleshoot results for two of those hosts.
failures.txt
Also attached are logs for one of those servers.
systemctl.log
Regards,
Raul