issue with lsf-drmaa #14

ink1 · 2014-06-13T12:04:59Z

Hello,
Could you think of a reason why drmaa-python and lsf-drmaa
https://github.com/PlatformLSF/lsf-drmaa
would segfault on RH 6.4 but not on opensuse 12.1?
Trying some of the examples, I can see a job gets submitted (and run) but Python segfaults.
Thanks

dan-blanchard · 2014-06-13T13:38:40Z

Nothing comes to mind, but you could always try using faulthandler to see if you can get more debugging information.
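
For reference, a minimal sketch of switching faulthandler on before any DRMAA call is made. The `faulthandler` module is part of the standard library from Python 3.3 onward (on Python 2 it is a separate package); the helper name `enable_crash_traces` and the log path are made up for illustration.

```python
import faulthandler
import os
import tempfile


def enable_crash_traces(log_path):
    """Dump the Python traceback to log_path on a fatal signal such as SIGSEGV."""
    # The file must stay open for as long as faulthandler is enabled,
    # so it is returned to the caller instead of being closed here.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical log location; any writable path works.
crash_log = enable_crash_traces(os.path.join(tempfile.gettempdir(), "drmaa_crash.log"))
# ... import drmaa and run the failing example after this point ...
```

On Python 3, running the script with `python -X faulthandler` has a similar effect without code changes, with the traceback going to stderr.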

ink1 · 2014-06-13T13:49:54Z

Thanks, will try. I could only trace it to the c call in drmaa/helpers.py.

ink1 · 2014-06-13T16:00:29Z

I'm afraid I can't get any further with faulthandler than I already got with python tracing

login1 examples> ./test.sh 
Call to main on line 25 of example4my.py from line 48 of example4my.py
Call to __init__ on line 233 of /drmaa-0.7.6/drmaa/session.py from line 29 of example4my.py
Call to initialize on line 237 of /drmaa-0.7.6/drmaa/session.py from line 30 of example4my.py
Call to py_drmaa_init on line 70 of /drmaa-0.7.6/drmaa/wrappers.py from line 257 of /drmaa-0.7.6/drmaa/session.py
Call to error_check on line 145 of /drmaa-0.7.6/drmaa/errors.py from line 73 of /drmaa-0.7.6/drmaa/wrappers.py
Creating job template
Call to createJobTemplate on line 274 of /drmaa-0.7.6/drmaa/session.py from line 32 of example4my.py
Call to __init__ on line 156 of /drmaa-0.7.6/drmaa/session.py from line 284 of /drmaa-0.7.6/drmaa/session.py
Call to c on line 294 of /drmaa-0.7.6/drmaa/helpers.py from line 163 of /drmaa-0.7.6/drmaa/session.py
Call to error_check on line 145 of /drmaa-0.7.6/drmaa/errors.py from line 299 of /drmaa-0.7.6/drmaa/helpers.py
Call to __set__ on line 147 of /drmaa-0.7.6/drmaa/helpers.py from line 33 of example4my.py
Call to c on line 294 of /drmaa-0.7.6/drmaa/helpers.py from line 154 of /drmaa-0.7.6/drmaa/helpers.py
Call to error_check on line 145 of /drmaa-0.7.6/drmaa/errors.py from line 299 of /drmaa-0.7.6/drmaa/helpers.py
Call to __set__ on line 181 of /drmaa-0.7.6/drmaa/helpers.py from line 34 of example4my.py
Call to string_vector on line 302 of /drmaa-0.7.6/drmaa/helpers.py from line 183 of /drmaa-0.7.6/drmaa/helpers.py
Call to c on line 294 of /drmaa-0.7.6/drmaa/helpers.py from line 183 of /drmaa-0.7.6/drmaa/helpers.py
Call to error_check on line 145 of /drmaa-0.7.6/drmaa/errors.py from line 299 of /drmaa-0.7.6/drmaa/helpers.py
Call to __set__ on line 147 of /drmaa-0.7.6/drmaa/helpers.py from line 35 of example4my.py
Call to to_drmaa on line 70 of /drmaa-0.7.6/drmaa/helpers.py from line 149 of /drmaa-0.7.6/drmaa/helpers.py
Call to c on line 294 of /drmaa-0.7.6/drmaa/helpers.py from line 154 of /drmaa-0.7.6/drmaa/helpers.py
Call to error_check on line 145 of /drmaa-0.7.6/drmaa/errors.py from line 299 of /drmaa-0.7.6/drmaa/helpers.py
Call to runJob on line 301 of /drmaa-0.7.6/drmaa/session.py from line 37 of example4my.py
Call to create_string_buffer on line 52 of /usr/lib64/python2.6/ctypes/__init__.py from line 313 of /drmaa-0.7.6/drmaa/session.py
Call to c on line 294 of /drmaa-0.7.6/drmaa/helpers.py from line 314 of /drmaa-0.7.6/drmaa/session.py
Job <124235> is submitted to queue <short>.
./test.sh: line 7: 29578 Segmentation fault      python example4my.py
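
A call trace in this format can be produced with a `sys.settrace` hook. The following is only a sketch of that general technique, not necessarily the script used here; `make_call_tracer` and `traced_run` are hypothetical helper names.

```python
import sys


def make_call_tracer(log):
    """Return a sys.settrace hook that records every Python-level call in log."""
    def trace_calls(frame, event, arg):
        if event != "call":
            return None
        caller = frame.f_back
        if caller is not None:
            log.append("Call to %s on line %s of %s from line %s of %s" % (
                frame.f_code.co_name, frame.f_lineno, frame.f_code.co_filename,
                caller.f_lineno, caller.f_code.co_filename))
        return None  # no per-line tracing inside the called function
    return trace_calls


def traced_run(func):
    """Run func() with call tracing enabled and return the recorded lines."""
    log = []
    previous = sys.gettrace()
    sys.settrace(make_call_tracer(log))
    try:
        func()
    finally:
        sys.settrace(previous)
    return log
```

Such a hook only sees Python frames. Once execution crosses into a C library through ctypes, nothing further is recorded, which is consistent with the trace stopping at the last call to `c`.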

With faulthandler

login1 examples> ./test.sh 
Creating job template
Job <124254> is submitted to queue <short>.
Fatal Python error: Segmentation fault

Current thread 0x00002aca8d42dbe0:
  File "/drmaa-0.7.6/drmaa/helpers.py", line 299 in c
  File "/drmaa-0.7.6/drmaa/session.py", line 314 in runJob
  File "example4my.py", line 38 in main
  File "example4my.py", line 51 in <module>
./test.sh: line 14: 24698 Segmentation fault      python2.7 example4my.py

It is clear that runJob calls "c", which crashes when returning from a call into the API. It does not look like an API problem to me, since the c function is called a number of times before that without crashing.
On the other hand, I can successfully use the DRMAA API from the C example provided with lsf-drmaa and from a separate Java application. Puzzling.

dan-blanchard · 2014-06-13T17:23:17Z

The c function is just a little helper method for calling any of the C DRMAA functions, so the segfault is definitely when it's interacting with the C library.

I've never used LSF before (or even really heard much about it), but I think the problem is that the LSF DRMAA implementation seems to be expecting the job ID to be an integer, whereas we're passing it as a string. I'm basing this on the fact that the PyLSF library uses a long in its runJobRequest struct (https://github.com/gmccance/pylsf/blob/master/pylsf/pylsf.pyx#L1401).

I just double-checked the DRMAA specifications (http://www.ogf.org/documents/GFD.130.pdf), and they definitely say that job IDs should be strings, so this seems to be a mistake in the LSF DRMAA interface.

I would file an issue with the LSF people if you can, or try using PyLSF (https://github.com/gmccance/pylsf/) if you need Python bindings that work with LSF right away.

Although, did you say that this is working on openSUSE?

ink1 · 2014-06-14T23:24:33Z

Dan, thank you for looking into this.
PyLSF is rather dated, while we are running LSF 9.1.2. IBM/PlatformLSF have
recently renewed their support for DRMAA, which is validated for 9.1.2. They
also released a Python API, but I'm trying to make Galaxy work and it needs
Python DRMAA.
I would not claim thorough testing but, yes, Python DRMAA seems to be
working with LSF 9.1.2 on openSUSE 12.1 (Python 2.7).

LSF DRMAA specifies job id as a char array. For example,
https://github.com/PlatformLSF/lsf-drmaa/blob/master/sample/sub.c#L17
sets job_id length to

#define MAX_LEN_JOBID 100

This should work similarly to your

    jid = create_string_buffer(128)
    c(drmaa_run_job, jid, sizeof(jid), jobTemplate)

in runJob function
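
To illustrate the calling convention being compared here, a small ctypes-only sketch follows (no DRMAA library involved). The job ID is a plain byte buffer that the callee fills in; `fake_run_job` is a hypothetical stand-in for `drmaa_run_job`, written in Python purely for illustration.

```python
from ctypes import create_string_buffer, sizeof


def fake_run_job(job_id_buf, job_id_len):
    """Hypothetical stand-in for the C call: write a job ID into the caller's buffer."""
    job_id = b"124235"
    if len(job_id) + 1 > job_id_len:  # +1 for the terminating NUL byte
        raise ValueError("job ID buffer too small")
    job_id_buf.value = job_id


jid = create_string_buffer(128)   # 128 zero-initialised bytes, like char jid[128] in C
fake_run_job(jid, sizeof(jid))    # sizeof(jid) is the capacity handed to the callee
print(jid.value.decode())
```

As long as the callee honours the length argument, passing a 128-byte buffer rather than a 100-byte one should not matter by itself.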

For reference, LSF drmaa.h is here
https://github.com/PlatformLSF/lsf-drmaa/blob/master/drmaa_utils/drmaa_utils/drmaa.h


dan-blanchard · 2014-06-16T13:28:43Z

I see that Galaxy has version 0.6 of our library set as what they require. Is that what you're using? I'm curious whether you see the same issues with 0.7.6.

I've also just submitted a PR to that project to update their version to 0.7.6 (or at least, I tried to but bitbucket is being very slow with the forking at the moment).

If LSF limits the job ID length to 100, that might explain why you get a segfault (since we pass a string buffer of 128 characters). Although, that wouldn't explain why it works on openSUSE and not on Red Hat...

Maybe try modifying your locally installed copy of DRMAA Python to change the buffer there to be 100 characters and see if that works. If it does, I'll modify DRMAA Python to allow you to set an environment variable that controls how long the buffer can be.
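
A rough sketch of what such an override could look like. The variable name `DRMAA_JOBID_BUFFER` and the helper `jobid_buffer_size` are hypothetical and do not correspond to an existing setting in DRMAA Python.

```python
import os
from ctypes import create_string_buffer

DEFAULT_JOBID_BUFFER = 128  # the size used in runJob, per this thread


def jobid_buffer_size(environ=os.environ):
    """Return the job ID buffer size, optionally overridden via the environment."""
    raw = environ.get("DRMAA_JOBID_BUFFER")  # hypothetical variable name
    if raw is None:
        return DEFAULT_JOBID_BUFFER
    size = int(raw)
    if size < 2:
        raise ValueError("DRMAA_JOBID_BUFFER must be at least 2")
    return size


# Simulate the variable being set to 100 without touching the real environment.
jid = create_string_buffer(jobid_buffer_size({"DRMAA_JOBID_BUFFER": "100"}))
```

Passing the environment as a parameter keeps the lookup easy to test; real code would simply call `jobid_buffer_size()`.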

ink1 · 2014-06-16T14:44:12Z

Yes, once I identified where the problem is with Galaxy I switched to
debugging 0.7.6.

Indeed, LSF drmaa.h defines
#define DRMAA_ERROR_STRING_BUFFER 4096
#define DRMAA_JOBNAME_BUFFER 128
#define DRMAA_SIGNAL_BUFFER 32

whereas drmaa/const.py sets
ERROR_STRING_BUFFER = 1024
JOBNAME_BUFFER = 1024
SIGNAL_BUFFER = 32

so I've already tried changing the latter to
ERROR_STRING_BUFFER = 4096
JOBNAME_BUFFER = 128
SIGNAL_BUFFER = 32
but no luck.

I've also tried reducing string buffer in runJob in drmaa/session.py from
128 to 100
jid = create_string_buffer(128)
c(drmaa_run_job, jid, sizeof(jid), jobTemplate)
but that also did not help.
I have rebuilt the LSF DRMAA library with --enable-debug and, combined with
Python tracing, observed the following:

< cut >
Call to c on line 294 of /drmaa-0.7.6/drmaa/helpers.py from line 314 of /drmaa-0.7.6/drmaa/session.py
t #261c [     0.00] -> drmaa_run_job(jt=0xc44e00)
t #261d [     0.01] -> fsd_job_set_get(job_id=124287)
t #261d [     0.01] <- fsd_job_set_get(job_id=124287) =NULL
t #261d [     0.01] -> fsd_job_set_get(job_id=124289)
t #261d [     0.01] <- fsd_job_set_get(job_id=124289) =NULL
t #261d [     0.01] -> fsd_job_set_get(job_id=124290)
t #261d [     0.01] <- fsd_job_set_get(job_id=124290) =NULL
t #261d [     0.01] -> fsd_job_set_get(job_id=124291)
t #261d [     0.01] <- fsd_job_set_get(job_id=124291) =NULL
t #261d [     0.01] <- lsfdrmaa_session_update_all_jobs_status
d #261c [     0.01]  |   command: "exec" "hostname"
d #261c [     0.01]  |   numProcessors: 0
d #261c [     0.01]  |   maxNumProcessors: 0
d #261c [     0.01]  |   errFile: /dev/null
d #261c [     0.01]  |   jsdlDoc: (null)
Job <124292> is submitted to queue <short>.
d #261c [     0.02]  * lsb_submit( 0xe030e0, 0x7fff5dd05390 ) = 124292[0]
t #261c [     0.02] -> fsd_job_new(124292)
t #261c [     0.02] <- fsd_job_new=0xe061e0: ref_cnt=1 [lock 124292]
t #261c [     0.02] -> fsd_job_set_add(job=0xe061e0, job_id=124292)
t #261c [     0.02] <- fsd_job_set_add: job->ref_cnt=2
t #261c [     0.02] -> fsd_job_release(0xe061e0={job_id=124292, ref_cnt=2}) [unlock 124292]
t #261c [     0.02] <- fsd_job_release
./test.sh: line 16:  9756 Segmentation fault      python2.7 example4my.py

Running the sample C code from LSF DRMAA produces the following:

< cut >
Job <124296> is submitted to queue <normal>.
d #4357 [     0.02]  * lsb_submit( 0x1076c00, 0x7fffda923240 ) = 124296[0]
t #4357 [     0.02] -> fsd_job_new(124296)
t #4357 [     0.02] <- fsd_job_new=0x1079b80: ref_cnt=1 [lock 124296]
t #4357 [     0.02] -> fsd_job_set_add(job=0x1079b80, job_id=124296)
t #4357 [     0.02] <- fsd_job_set_add: job->ref_cnt=2
t #4357 [     0.02] -> fsd_job_release(0x1079b80={job_id=124296, ref_cnt=2}) [unlock 124296]
t #4357 [     0.02] <- fsd_job_release
t #4357 [     0.02] <- drmaa_run_job =0: job_id=124296
< cut >

So in the Python run, the line after fsd_job_release should have been the
return from drmaa_run_job, as it is in the C sample output. This means it is
likely that the segfault happens on return to Python.

ink1 · 2014-06-16T16:25:05Z

Tested Python DRMAA on another cluster - SLES 11 SP1, python 2.6.
Job submission works even without the changes to the constants. Still not clear what exactly makes the crucial difference.

jakirkham · 2018-03-12T14:28:59Z

Have been using a recent copy of lsf-drmaa and drmaa-python without issues. So maybe this was fixed at some point in one of them?

zihhuafang · 2020-02-04T15:20:54Z

I am having an issue with string formatting when using lsf-drmaa (1.1.1) and drmaa-python (0.7.9).
I ran the following test script to see what might be causing the issue:

#!/usr/bin/env python

import drmaa

def main():
    """ Query the system. """
    with drmaa.Session() as s:
        print('A DRMAA object was created')
        print('Supported contact strings: %s' % s.contact)
        print('Supported DRM systems: %s' % s.drmsInfo)
        print('Supported DRMAA implementations: %s' % s.drmaaImplementation)
        print('Version %s' % s.version)

    print('Exiting')

if __name__ == '__main__':
    main()

I got the following error message:

A DRMAA object was created
Supported contact strings:
Supported DRM systems: IBM Spectrum LSF 10.1
Supported DRMAA implementations: FedStage DRMAA for LSF 1.1.1
Traceback (most recent call last):
  File "./test.py", line 17, in <module>
    main()
  File "./test.py", line 12, in main
    print('Version %s' % s.version)
TypeError: not all arguments converted during string formatting

Does anyone have an idea how to fix it?
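
One plausible explanation, assuming `s.version` is a tuple-like (major, minor) value: the `%` operator treats a tuple on its right-hand side as the full argument list, so a two-element tuple does not fit a single `%s`. The sketch below reproduces this with a stand-in namedtuple rather than a real DRMAA session.

```python
from collections import namedtuple

# Stand-in for whatever s.version returns; assumed to be a (major, minor) tuple.
Version = namedtuple("Version", "major minor")
version = Version(1, 0)

try:
    print('Version %s' % version)      # tuple is unpacked: two values, one %s
except TypeError as exc:
    print('fails: %s' % exc)

print('Version %s' % (version,))       # wrap in a 1-tuple so it is one argument
print('Version %s.%s' % version)       # or format both fields explicitly
print('Version {}'.format(version))    # or avoid the % operator altogether
```

If the assumption holds, any of the last three forms applied to the `s.version` line in the test script should avoid the TypeError.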
