
test point pscheduler failures #1466

Open
rhclopes opened this issue Aug 19, 2024 · 6 comments

@rhclopes

I have seen pscheduler-related services failing several times on three different perfSONAR testpoint hosts.

I am attaching troubleshooting results for two of those hosts.
failures.txt

Also attached are logs for one of those servers.

systemctl.log

Regards,
Raul

@timchown

Further, the available memory and CPU utilisation (very high load) on the affected systems during the service outage periods are quite interesting.

These are both testpoint installations. The toolkit installations seem fine.

ps-small (a 1G small node in Szymon's PMP mesh):
See https://ps-mesh.perf.ja.net/grafana/d/eb96563b-0d93-4910-be3c-5be331a00339/perfsonar-host-metrics?orgId=1&var-host=ps-small-slough.perf.ja.net&var-node_name=ps-london-bw.perf.ja.net&var-node_ip=All&from=now-7d&to=now

And for Imperial (a full node):
https://ps-mesh.perf.ja.net/grafana/d/eb96563b-0d93-4910-be3c-5be331a00339/perfsonar-host-metrics?orgId=1&var-host=lt2ps00-bw.grid.hep.ph.ic.ac.uk&var-node_name=lt2ps00.grid.hep.ph.ic.ac.uk&var-node_ip=All&from=now-7d&to=now

@rhclopes
Author

Attaching logs obtained today.

postgresql-15-main.log.2024-08-27.txt
pscheduler.log-Aug27.txt

Services started failing around 10:09.

I restarted pgsql around 13:30, then all services around 13:50.

@rhclopes
Author

rhclopes commented Aug 27, 2024

I haven't attached the syslog logs because I am always wary of posting syslog on a public website. We can share them some other way if Mark and team need them.

@rhclopes
Author

I intended to attach statistics collected over 20 hours on resource usage, including:

  • number of open files;
  • list of open files;
  • number of processes owned by pscheduler, postgresql, perfsonar;
  • running processes, reverse-sorted by memory.

Problem: the statistics file is 473 MB, which is too much for GitHub. Is there a place where I can upload it?

The data seem to show a trend where the pscheduler user is acquiring resources and never releasing them (a sketch of how such counts might be sampled follows below):

  • 48K files open yesterday at 13:00, 100K+ today around 8:00.
  • 40 processes owned by pscheduler yesterday, 349 now.
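
For context, here is a minimal sketch of how such per-user counts could be sampled on a schedule. It is an assumed approach (the usernames, the interval, and the use of lsof/pgrep/ps are all assumptions), not the script that produced the 473 MB file:

```python
#!/usr/bin/env python3
# Illustrative sampler: periodically record per-user open-file counts,
# process counts, and the top memory consumers, one compact line per user.
import subprocess
import time
from datetime import datetime

USERS = ["pscheduler", "postgres", "perfsonar"]   # assumed account names
INTERVAL = 300                                    # seconds between samples

def count_open_files(user):
    """Count open files for all processes owned by a user."""
    out = subprocess.run(["lsof", "-u", user], capture_output=True, text=True)
    # Every line after the header is one open file.
    return max(len(out.stdout.splitlines()) - 1, 0)

def count_processes(user):
    """Count processes owned by a user."""
    out = subprocess.run(["pgrep", "-c", "-u", user],
                         capture_output=True, text=True)
    return int(out.stdout.strip() or 0)

def top_memory(n=10):
    """Return the top-n processes sorted by resident memory, descending."""
    out = subprocess.run(["ps", "-eo", "user,pid,rss,comm", "--sort=-rss"],
                         capture_output=True, text=True)
    return out.stdout.splitlines()[:n + 1]   # header plus n rows

if __name__ == "__main__":
    while True:
        stamp = datetime.now().isoformat(timespec="seconds")
        for user in USERS:
            print(f"{stamp} {user}: files={count_open_files(user)} "
                  f"procs={count_processes(user)}")
        print("\n".join(top_memory()))
        time.sleep(INTERVAL)
```

Sampling counts like this every few minutes keeps the output small enough to attach while still showing whether the numbers grow monotonically.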

This testpoint was a member of the PMP dashboard. A month ago it was reinstalled and never re-subscribed to PMP. Yet the PMP giant node keeps contacting this node. I wonder if this can lead to a non-RAII situation.

How could an archiver unsubscribe a node that is dead? Does that make sense?

@rhclopes
Author

nohup.out-last1000.gz

@mfeit-internet2
Member

This is being caused by twin memory leaks, one in the API and one in the runner. The one in the API has been identified and there's a candidate fix for it. I'm working on the runner.

mfeit-internet2 added a commit that referenced this issue Sep 4, 2024
* Closing branch

* Properly cache thread-local database connection object.  #1466

* Check cursor closed attribute correctly.  #1466

* Don't let leaky API processes live longer than 30 minutes.  #1466

* Remove closed branch file from earlier mistake

---------

Co-authored-by: Andy Lake <[email protected]>
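
For readers not familiar with the pattern named in the first fix above ("Properly cache thread-local database connection object"), here is a minimal illustrative sketch of per-thread connection caching. It assumes psycopg2 and a made-up DSN, and is not the actual pscheduler code:

```python
# Illustrative only: cache one database connection per thread so repeated
# calls reuse it instead of opening (and leaking) a new connection each time.
import threading
import psycopg2

_local = threading.local()

def db_connection(dsn="dbname=pscheduler"):      # assumed DSN
    """Return this thread's cached connection, creating it once per thread."""
    conn = getattr(_local, "connection", None)
    if conn is None or conn.closed:
        conn = psycopg2.connect(dsn)
        _local.connection = conn
    return conn

def run_query(sql, args=()):
    """Execute a query on the thread-local connection and fetch all rows."""
    conn = db_connection()
    with conn.cursor() as cur:
        cur.execute(sql, args)
        rows = cur.fetchall()
    conn.commit()
    return rows
```

Reusing one connection per thread avoids accumulating connections and their memory across repeated API calls, which matches the leak pattern described above; the 30-minute cap on API process lifetime acts as a backstop if anything still leaks.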