What happens to my ComputeUnit? #156
Comments
PS.: I actually don't want to list the CU description here -- again, my hunch is that it is incorrect. But I don't want to find this out by guessing -- that is not an option when using BigJob as Troy backend, I need to be able to identify and report an error... |
Andre, when you say it is in the pending state forever, is this the BigJob output to stdout, or what is obtained from calling CU.get_state()? There is not a lot of robust error checking in this area (if any), but without the CUD I really don't know what is causing the problem, so I can't really fix it. In the ideal scenario, the CU would enter the Failed state so you get some feedback. Also, out of curiosity: if you do not bind this directly, i.e. if you use the ComputeDataService, is the result the same? I haven't really done any digging in the code on this yet. Further, if you kill your running job and check the agent file, do you get the information about the CU that you are looking for (either at agent level or at subjob level)? I know this is not the solution you're looking for, but I am trying to see if there is any info in the output. |
Hi Melissa, I am calling cu.get_state(), and see
which only seems to confirm that a status check is dispatched. The CUD is
But again, the purpose of the ticket is not really to debug the problem -- I want to know what I can do programmatically in Troy to find out if a Unit is still alive or not... I did not try to submit via the ComputeDataService, so I am not sure if that would look different (we don't use that one in Troy). I now saw an error in the agent log (not sure why I did not see that before?). The complete stderr log is
Not sure what the 'busy python' message means -- but it does indeed seem to dislike the Description. Not sure why -- see above, 'Executable' is defined. I tried to change the agent code to print the description at this point -- but since the agent is pulled from pypi, it seems to have no effect, or at least the print seems to be ignored no matter where and how I install. But again, I don't actually want to fix this -- I need to find out how to handle errors like this... I feel like I am too deep down the rabbit hole again anyways... :/ |
Welp, while I understand your point, I think that is a point for @drelu to comment on, because I am not 100% sure how you can query it other than via get_state(). The error, on the other hand, is that the dictionary keys are lowercase, and you're trying to use an uppercase key. PS: is dtype something you added? Old-school Dictionary Style:
New-school Variable Style:
|
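A key-case mismatch like the one described above can be caught on the client side before submission. The following is only a minimal sketch, not part of BigJob: the helper name `find_miscased_keys` and the key list `EXPECTED_KEYS` are hypothetical, and `"arguments"` is an assumed key used purely for illustration. It operates on a plain dict and does not call any BigJob API.

```python
def find_miscased_keys(description, expected_keys):
    """Return {given_key: expected_key} for keys that differ only by case."""
    # Map the lowercased form of each expected key to its canonical spelling.
    canonical = {key.lower(): key for key in expected_keys}
    miscased = {}
    for key in description:
        expected = canonical.get(key.lower())
        if expected is not None and expected != key:
            miscased[key] = expected
    return miscased


# Hypothetical key set; the real list of accepted keys is an assumption here.
EXPECTED_KEYS = ["executable", "arguments"]

# Plain-dict description with an uppercase key, as in the discussion above.
cud = {"Executable": "/bin/date", "arguments": []}

problems = find_miscased_keys(cud, EXPECTED_KEYS)
for given, expected in problems.items():
    print("key %r should probably be spelled %r" % (given, expected))
```

Running such a check before submitting turns a silently ignored key into an explicit message, without needing to patch the agent.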
Ah, so bigjob is using That notwithstanding, let's wait for AL to comment... Thanks! A. |
AndreM: What is the status of this? You marked this as documentation issue (meaning it belongs to me), but afaik, we were waiting for AndreL to comment. |
Well, I fixed the CU description, so things work -- however, I think I still don't understand how errors are to be handled. Not sure if I care anymore at this stage though ;) So, feel free to close the ticket. Thanks, Andre. |
I am submitting a ComputeUnit to a BJ pilot, i.e. via direct submission. It is likely that my ComputeUnitDescription is incorrect / incomplete, but then I would have expected an error. I get, however, a valid Unit instance, and it never enters 'FAILED' state -- in fact, it remains 'PENDING' forever. I don't see any traces in the bigjob log, nor in the agent logs -- the agent working dir remains empty (local agent via ssh). After printing the jd dict, I see no trace of the CU whatsoever.
How can I find out what happens to the CU, w/o using a debugger or sifting through redis? What is the correct way to get submission errors / runtime errors for CUs?
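One client-side workaround for a unit that never leaves 'PENDING' is to poll with a deadline and treat a timeout as an error. The sketch below is an assumption-laden illustration, not BigJob's own error handling: it relies only on the unit exposing `get_state()` as used in this thread, the helper `wait_for_unit` and the class `StuckUnit` are hypothetical, and the set of terminal state names (in particular `"DONE"`) is assumed.

```python
import time

# Assumed terminal state names; only 'FAILED' and 'PENDING' appear in this thread.
TERMINAL_STATES = {"DONE", "FAILED"}


def wait_for_unit(unit, timeout=60.0, poll_interval=1.0):
    """Poll unit.get_state() until a terminal state is seen or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        state = str(unit.get_state()).upper()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "unit still in state %r after %.1f s" % (state, timeout)
            )
        time.sleep(poll_interval)


class StuckUnit(object):
    """Hypothetical stand-in for a unit that never leaves the pending state."""

    def get_state(self):
        return "Pending"


try:
    wait_for_unit(StuckUnit(), timeout=0.05, poll_interval=0.01)
except TimeoutError as exc:
    print("giving up: %s" % exc)
```

This does not explain why the unit is stuck, but it gives the caller a definite failure to report instead of waiting forever.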