Failed .cerise/ environment setup on remote HPC resource #81

Open
marcvdijk opened this issue Dec 4, 2018 · 8 comments

@marcvdijk
Member

I'm experiencing incomplete setup of the remote .cerise/ environments on the GT and LISA HPC resources.

For GT, the .cerise/ directory and all of the files in it were copied to the remote resource, but the setup procedure failed at the miniconda stage. Miniconda was downloaded and installed, but the cerise virtual env was not created. After I ran the respective install script manually, the environment was created successfully and MD jobs could be launched by lie_md.

For LISA, only the .cerise/api and .cerise/jobs directories were created, without any files in them. The system hung in this state indefinitely. I tried a few times, always with the same result.

@LourensVeen
Member

Some server logs would be really useful for this. If you still have the container, you can get them using docker cp container-name:/var/log/cerise/cerise_backend.log . to copy them to the current directory.
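
Once the log has been copied out, the interesting part is usually the CRITICAL entry and its traceback. A minimal sketch of filtering for those, assuming the `[timestamp] [LEVEL] message [logger]` line shape seen in the logs in this thread (the function name and regular expression are illustrative and not part of Cerise):

```python
import re

# Assumed line shape, based on the log excerpts in this thread:
# [2018-12-04 12:43:30.014] [INFO] Starting up [root]
LINE_RE = re.compile(r"^\[(?P<time>[^\]]+)\] \[(?P<level>[A-Z]+)\] (?P<msg>.*)$")


def critical_entries(log_text):
    """Return CRITICAL log entries, including continuation lines such as tracebacks."""
    entries = []
    current = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            # A new timestamped line ends any entry being collected.
            current = None
            if match.group("level") == "CRITICAL":
                current = [line]
                entries.append(current)
        elif current is not None:
            # Lines without a timestamp belong to the previous entry.
            current.append(line)
    return ["\n".join(entry) for entry in entries]
```

With the file in the current directory, something like `critical_entries(open("cerise_backend.log").read())` would return the traceback blocks only.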

@marcvdijk
Member Author

This is the cerise_backend log from the last run using the cerise-mdstudio-lisa specialization:

[2018-12-04 12:43:30.014] [INFO] Starting up [root]
[2018-12-04 12:43:30.022] [DEBUG] protocol: sftp, location: lisa.surfsara.nl, credential: <cerulean.credential.PasswordCredential object at 0x7f4f2c2434a8> [cerise.config]
[2018-12-04 12:43:30.369] [INFO] Connecting to lisa.surfsara.nl on port 22 [cerulean.ssh_terminal]
[2018-12-04 12:43:30.369] [DEBUG] Authenticating using a password [cerulean.ssh_terminal]
[2018-12-04 12:43:30.369] [DEBUG] starting thread (client mode): 0x2c2431d0 [paramiko.transport]
[2018-12-04 12:43:30.370] [DEBUG] Local version/idstring: SSH-2.0-paramiko_2.4.2 [paramiko.transport]
[2018-12-04 12:43:30.375] [DEBUG] Remote version/idstring: SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u4 [paramiko.transport]
[2018-12-04 12:43:30.376] [INFO] Connected (version 2.0, client OpenSSH_7.4p1) [paramiko.transport]
[2018-12-04 12:43:30.377] [DEBUG] kex algos:['curve25519-sha256', '[email protected]', 'ecdh-sha2-nistp256', 'ecdh-sha2-nistp384', 'ecdh-sha2-nistp521', 'diffie-hellman-group-exchange-sha256', 'diffie-hellman-group16-sha512', 'diffie-hellman-group18-sha512', 'diffie-hellman-group14-sha256', 'diffie-hellman-group14-sha1'] server key:['ssh-rsa', 'rsa-sha2-512', 'rsa-sha2-256', 'ssh-ed25519'] client encrypt:['[email protected]', 'aes128-ctr', 'aes192-ctr', 'aes256-ctr', '[email protected]', '[email protected]'] server encrypt:['[email protected]', 'aes128-ctr', 'aes192-ctr', 'aes256-ctr', '[email protected]', '[email protected]'] client mac:['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 'hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1'] server mac:['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 'hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1'] client compress:['none', '[email protected]'] server compress:['none', '[email protected]'] client lang:[''] server lang:[''] kex follows?False [paramiko.transport]
[2018-12-04 12:43:30.377] [DEBUG] Kex agreed: ecdh-sha2-nistp256 [paramiko.transport]
[2018-12-04 12:43:30.377] [DEBUG] HostKey agreed: ssh-ed25519 [paramiko.transport]
[2018-12-04 12:43:30.378] [DEBUG] Cipher agreed: aes128-ctr [paramiko.transport]
[2018-12-04 12:43:30.378] [DEBUG] MAC agreed: hmac-sha2-256 [paramiko.transport]
[2018-12-04 12:43:30.378] [DEBUG] Compression agreed: none [paramiko.transport]
[2018-12-04 12:43:30.425] [DEBUG] kex engine KexNistp256 specified hash_algo <built-in function openssl_sha256> [paramiko.transport]
[2018-12-04 12:43:30.425] [DEBUG] Switch to new keys ... [paramiko.transport]
[2018-12-04 12:43:30.425] [DEBUG] Attempting password auth... [paramiko.transport]
[2018-12-04 12:43:30.429] [DEBUG] userauth is OK [paramiko.transport]
[2018-12-04 12:43:30.562] [INFO] Auth banner: b'                              SURFsara\n        \n                        Welcome to SURFsara\n\n   This is a private computer facility.   Access for any reason must be\n   specifically authorized by the owner.  Unless you are so authorized,\n   your continued  access and any other use may  expose you to criminal\n   and/or civil proceedings.\n\n   Information:          http://www.surfsara.nl\n\n' [paramiko.transport]
[2018-12-04 12:43:30.562] [INFO] Authentication (password) successful! [paramiko.transport]
[2018-12-04 12:43:30.562] [INFO] Connection (re)established [cerulean.ssh_terminal]
[2018-12-04 12:43:30.562] [INFO] Connecting to SFTP server [cerulean.sftp_file_system]
[2018-12-04 12:43:30.562] [DEBUG] [chan 0] Max packet in: 32768 bytes [paramiko.transport]
[2018-12-04 12:43:30.813] [DEBUG] Received global request "[email protected]" [paramiko.transport]
[2018-12-04 12:43:30.813] [DEBUG] Rejecting "[email protected]" global request from server. [paramiko.transport]
[2018-12-04 12:43:30.815] [DEBUG] [chan 0] Max packet out: 32768 bytes [paramiko.transport]
[2018-12-04 12:43:30.815] [DEBUG] Secsh channel 0 opened. [paramiko.transport]
[2018-12-04 12:43:30.819] [DEBUG] [chan 0] Sesch channel 0 request ok [paramiko.transport]
[2018-12-04 12:43:31.051] [INFO] [chan 0] Opened sftp connection (server version 3) [paramiko.transport.sftp]
[2018-12-04 12:43:31.051] [INFO] Connected to SFTP server [cerulean.sftp_file_system]
[2018-12-04 12:43:31.056] [INFO] Connecting to lisa.surfsara.nl on port 22 [cerulean.ssh_terminal]
[2018-12-04 12:43:31.056] [DEBUG] Authenticating using a password [cerulean.ssh_terminal]
[2018-12-04 12:43:31.056] [DEBUG] starting thread (client mode): 0x309ffb00 [paramiko.transport]
[2018-12-04 12:43:31.056] [DEBUG] Local version/idstring: SSH-2.0-paramiko_2.4.2 [paramiko.transport]
[2018-12-04 12:43:31.062] [DEBUG] Remote version/idstring: SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u4 [paramiko.transport]
[2018-12-04 12:43:31.062] [INFO] Connected (version 2.0, client OpenSSH_7.4p1) [paramiko.transport]
[2018-12-04 12:43:31.064] [DEBUG] kex algos:['curve25519-sha256', '[email protected]', 'ecdh-sha2-nistp256', 'ecdh-sha2-nistp384', 'ecdh-sha2-nistp521', 'diffie-hellman-group-exchange-sha256', 'diffie-hellman-group16-sha512', 'diffie-hellman-group18-sha512', 'diffie-hellman-group14-sha256', 'diffie-hellman-group14-sha1'] server key:['ssh-rsa', 'rsa-sha2-512', 'rsa-sha2-256', 'ssh-ed25519'] client encrypt:['[email protected]', 'aes128-ctr', 'aes192-ctr', 'aes256-ctr', '[email protected]', '[email protected]'] server encrypt:['[email protected]', 'aes128-ctr', 'aes192-ctr', 'aes256-ctr', '[email protected]', '[email protected]'] client mac:['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 'hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1'] server mac:['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 'hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1'] client compress:['none', '[email protected]'] server compress:['none', '[email protected]'] client lang:[''] server lang:[''] kex follows?False [paramiko.transport]
[2018-12-04 12:43:31.064] [DEBUG] Kex agreed: ecdh-sha2-nistp256 [paramiko.transport]
[2018-12-04 12:43:31.064] [DEBUG] HostKey agreed: ssh-ed25519 [paramiko.transport]
[2018-12-04 12:43:31.065] [DEBUG] Cipher agreed: aes128-ctr [paramiko.transport]
[2018-12-04 12:43:31.065] [DEBUG] MAC agreed: hmac-sha2-256 [paramiko.transport]
[2018-12-04 12:43:31.065] [DEBUG] Compression agreed: none [paramiko.transport]
[2018-12-04 12:43:31.071] [DEBUG] kex engine KexNistp256 specified hash_algo <built-in function openssl_sha256> [paramiko.transport]
[2018-12-04 12:43:31.072] [DEBUG] Switch to new keys ... [paramiko.transport]
[2018-12-04 12:43:31.072] [DEBUG] Attempting password auth... [paramiko.transport]
[2018-12-04 12:43:31.075] [DEBUG] userauth is OK [paramiko.transport]
[2018-12-04 12:43:31.209] [INFO] Auth banner: b'                              SURFsara\n        \n                        Welcome to SURFsara\n\n   This is a private computer facility.   Access for any reason must be\n   specifically authorized by the owner.  Unless you are so authorized,\n   your continued  access and any other use may  expose you to criminal\n   and/or civil proceedings.\n\n   Information:          http://www.surfsara.nl\n\n' [paramiko.transport]
[2018-12-04 12:43:31.209] [INFO] Authentication (password) successful! [paramiko.transport]
[2018-12-04 12:43:31.209] [INFO] Connection (re)established [cerulean.ssh_terminal]
[2018-12-04 12:43:31.210] [DEBUG] [chan 0] stat(b'/') [paramiko.transport.sftp]
[2018-12-04 12:43:31.213] [DEBUG] [chan 0] stat(b'/home') [paramiko.transport.sftp]
[2018-12-04 12:43:31.217] [DEBUG] [chan 0] stat(b'/home/mvandijk') [paramiko.transport.sftp]
[2018-12-04 12:43:31.221] [DEBUG] [chan 0] stat(b'/home/mvandijk/.cerise') [paramiko.transport.sftp]
[2018-12-04 12:43:31.224] [DEBUG] [chan 0] mkdir(b'/home/mvandijk/.cerise', 511) [paramiko.transport.sftp]
[2018-12-04 12:43:31.227] [DEBUG] [chan 0] chmod(b'/home/mvandijk/.cerise', 511) [paramiko.transport.sftp]
[2018-12-04 12:43:31.231] [DEBUG] [chan 0] stat(b'/home/mvandijk/.cerise/api') [paramiko.transport.sftp]
[2018-12-04 12:43:31.234] [DEBUG] [chan 0] mkdir(b'/home/mvandijk/.cerise/api', 511) [paramiko.transport.sftp]
[2018-12-04 12:43:31.237] [DEBUG] [chan 0] chmod(b'/home/mvandijk/.cerise/api', 488) [paramiko.transport.sftp]
[2018-12-04 12:43:31.242] [DEBUG] Received global request "[email protected]" [paramiko.transport]
[2018-12-04 12:43:31.242] [DEBUG] Rejecting "[email protected]" global request from server. [paramiko.transport]
[2018-12-04 12:43:31.243] [DEBUG] Scanning file for requirements: /home/cerise/cerise/../api/cerise/steps/cerise/test/hostname.cwl [cerise.back_end.job_planner]
[2018-12-04 12:43:31.248] [DEBUG] Step cerise/test/hostname.cwl requires 2 cores [cerise.back_end.job_planner]
[2018-12-04 12:43:31.248] [DEBUG] Scanning file for requirements: /home/cerise/cerise/../api/cerise/steps/cerise/test/wc.cwl [cerise.back_end.job_planner]
[2018-12-04 12:43:31.254] [DEBUG] Step cerise/test/wc.cwl requires 0 cores [cerise.back_end.job_planner]
[2018-12-04 12:43:31.254] [DEBUG] Scanning file for requirements: /home/cerise/cerise/../api/cerise/steps/cerise/test/echo.cwl [cerise.back_end.job_planner]
[2018-12-04 12:43:31.259] [DEBUG] Step cerise/test/echo.cwl requires 0 cores [cerise.back_end.job_planner]
[2018-12-04 12:43:31.260] [DEBUG] Scanning file for requirements: /home/cerise/cerise/../api/cerise/steps/cerise/test/sleep.cwl [cerise.back_end.job_planner]
[2018-12-04 12:43:31.263] [DEBUG] Step cerise/test/sleep.cwl requires 0 cores [cerise.back_end.job_planner]
[2018-12-04 12:43:31.264] [DEBUG] basedir: /home/mvandijk/.cerise [cerise.back_end.remote_job_files]
[2018-12-04 12:43:31.264] [DEBUG] [chan 0] stat(b'/') [paramiko.transport.sftp]
[2018-12-04 12:43:31.267] [DEBUG] [chan 0] stat(b'/home') [paramiko.transport.sftp]
[2018-12-04 12:43:31.270] [DEBUG] [chan 0] stat(b'/home/mvandijk') [paramiko.transport.sftp]
[2018-12-04 12:43:31.273] [DEBUG] [chan 0] stat(b'/home/mvandijk/.cerise') [paramiko.transport.sftp]
[2018-12-04 12:43:31.276] [DEBUG] [chan 0] stat(b'/') [paramiko.transport.sftp]
[2018-12-04 12:43:31.278] [DEBUG] [chan 0] stat(b'/home') [paramiko.transport.sftp]
[2018-12-04 12:43:31.281] [DEBUG] [chan 0] stat(b'/home/mvandijk') [paramiko.transport.sftp]
[2018-12-04 12:43:31.284] [DEBUG] [chan 0] stat(b'/home/mvandijk/.cerise') [paramiko.transport.sftp]
[2018-12-04 12:43:31.287] [DEBUG] [chan 0] stat(b'/home/mvandijk/.cerise/jobs') [paramiko.transport.sftp]
[2018-12-04 12:43:31.290] [DEBUG] [chan 0] mkdir(b'/home/mvandijk/.cerise/jobs', 511) [paramiko.transport.sftp]
[2018-12-04 12:43:31.294] [DEBUG] [chan 0] chmod(b'/home/mvandijk/.cerise/jobs', 511) [paramiko.transport.sftp]
[2018-12-04 12:43:31.302] [INFO] Connecting to lisa.surfsara.nl on port 22 [cerulean.ssh_terminal]
[2018-12-04 12:43:31.302] [DEBUG] Authenticating using a password [cerulean.ssh_terminal]
[2018-12-04 12:43:31.302] [DEBUG] starting thread (client mode): 0x2a5754e0 [paramiko.transport]
[2018-12-04 12:43:31.303] [DEBUG] Local version/idstring: SSH-2.0-paramiko_2.4.2 [paramiko.transport]
[2018-12-04 12:43:31.308] [DEBUG] Remote version/idstring: SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u4 [paramiko.transport]
[2018-12-04 12:43:31.308] [INFO] Connected (version 2.0, client OpenSSH_7.4p1) [paramiko.transport]
[2018-12-04 12:43:31.310] [DEBUG] kex algos:['curve25519-sha256', '[email protected]', 'ecdh-sha2-nistp256', 'ecdh-sha2-nistp384', 'ecdh-sha2-nistp521', 'diffie-hellman-group-exchange-sha256', 'diffie-hellman-group16-sha512', 'diffie-hellman-group18-sha512', 'diffie-hellman-group14-sha256', 'diffie-hellman-group14-sha1'] server key:['ssh-rsa', 'rsa-sha2-512', 'rsa-sha2-256', 'ssh-ed25519'] client encrypt:['[email protected]', 'aes128-ctr', 'aes192-ctr', 'aes256-ctr', '[email protected]', '[email protected]'] server encrypt:['[email protected]', 'aes128-ctr', 'aes192-ctr', 'aes256-ctr', '[email protected]', 'aes256-gcm@openssh.com'] client mac:['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 'hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1'] server mac:['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 'hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1'] client compress:['none', '[email protected]'] server compress:['none', '[email protected]'] client lang:[''] server lang:[''] kex follows?False [paramiko.transport]
[2018-12-04 12:43:31.310] [DEBUG] Kex agreed: ecdh-sha2-nistp256 [paramiko.transport]
[2018-12-04 12:43:31.310] [DEBUG] HostKey agreed: ssh-ed25519 [paramiko.transport]
[2018-12-04 12:43:31.310] [DEBUG] Cipher agreed: aes128-ctr [paramiko.transport]
[2018-12-04 12:43:31.310] [DEBUG] MAC agreed: hmac-sha2-256 [paramiko.transport]
[2018-12-04 12:43:31.310] [DEBUG] Compression agreed: none [paramiko.transport]
[2018-12-04 12:43:31.315] [DEBUG] kex engine KexNistp256 specified hash_algo <built-in function openssl_sha256> [paramiko.transport]
[2018-12-04 12:43:31.315] [DEBUG] Switch to new keys ... [paramiko.transport]
[2018-12-04 12:43:31.316] [DEBUG] Attempting password auth... [paramiko.transport]
[2018-12-04 12:43:31.318] [DEBUG] userauth is OK [paramiko.transport]
[2018-12-04 12:43:31.459] [INFO] Auth banner: b'                              SURFsara\n        \n                        Welcome to SURFsara\n\n   This is a private computer facility.   Access for any reason must be\n   specifically authorized by the owner.  Unless you are so authorized,\n   your continued  access and any other use may  expose you to criminal\n   and/or civil proceedings.\n\n   Information:          http://www.surfsara.nl\n\n' [paramiko.transport]
[2018-12-04 12:43:31.459] [INFO] Authentication (password) successful! [paramiko.transport]
[2018-12-04 12:43:31.459] [INFO] Connection (re)established [cerulean.ssh_terminal]
[2018-12-04 12:43:31.460] [DEBUG] Executing sbatch --version [cerulean.ssh_terminal]
[2018-12-04 12:43:31.460] [DEBUG] [chan 0] Max packet in: 32768 bytes [paramiko.transport]
[2018-12-04 12:43:31.492] [DEBUG] Received global request "[email protected]" [paramiko.transport]
[2018-12-04 12:43:31.492] [DEBUG] Rejecting "[email protected]" global request from server. [paramiko.transport]
[2018-12-04 12:43:31.494] [DEBUG] [chan 0] Max packet out: 32768 bytes [paramiko.transport]
[2018-12-04 12:43:31.494] [DEBUG] Secsh channel 0 opened. [paramiko.transport]
[2018-12-04 12:43:31.495] [DEBUG] Opened session [cerulean.ssh_terminal]
[2018-12-04 12:43:31.498] [DEBUG] [chan 0] Sesch channel 0 request ok [paramiko.transport]
[2018-12-04 12:43:31.498] [DEBUG] exec_command done [cerulean.ssh_terminal]
[2018-12-04 12:43:31.498] [DEBUG] stdin sent [cerulean.ssh_terminal]
[2018-12-04 12:43:31.859] [DEBUG] [chan 0] EOF received (0) [paramiko.transport]
[2018-12-04 12:43:31.859] [DEBUG] got output True slurm-wlm 18.08.3
 True  [cerulean.ssh_terminal]
[2018-12-04 12:43:31.860] [DEBUG] received exit status 0 [cerulean.ssh_terminal]
[2018-12-04 12:43:31.860] [DEBUG] [chan 0] EOF sent (0) [paramiko.transport]
[2018-12-04 12:43:31.860] [DEBUG] Command executed successfully [cerulean.ssh_terminal]
[2018-12-04 12:43:31.860] [DEBUG] sbatch --version exit code: 0 [cerulean.slurm_scheduler]
[2018-12-04 12:43:31.860] [DEBUG] sbatch --version output: slurm-wlm 18.08.3
 [cerulean.slurm_scheduler]
[2018-12-04 12:43:31.860] [DEBUG] sbatch --version error:  [cerulean.slurm_scheduler]
[2018-12-04 12:43:31.861] [DEBUG] Slots per node set to 4 [cerise.back_end.job_runner]
[2018-12-04 12:43:31.861] [DEBUG] [chan 0] stat(b'/home/mvandijk/.cerise/api/cerise/version') [paramiko.transport.sftp]
[2018-12-04 12:43:31.866] [CRITICAL] Traceback (most recent call last):
  File "cerise/run_back_end.py", line 42, in <module>
    manager = ExecutionManager(config, apidir)
  File "cerise/../cerise/back_end/execution_manager.py", line 78, in __init__
    self._update_available = self._remote_api.update_available()
  File "cerise/../cerise/back_end/remote_api.py", line 57, in update_available
    return self._updatable_projects() != []
  File "cerise/../cerise/back_end/remote_api.py", line 140, in _updatable_projects
    project_name))
RuntimeError: Project "steps" in local API definition is missing a "version" file.
 [root]
[2018-12-04 12:43:31.866] [INFO] Shutting down [root]

@LourensVeen
Member

Uh oh, this is not good. You're using the latest develop Cerise, and the specialisation hasn't been updated to that yet. The new version has a version file to facilitate the CWL API update mechanism.

It looks like the problem is in the cerise-mdstudio-lisa Dockerfile: it says FROM mdstudio/cerise:develop at the top, which should be FROM mdstudio/cerise:0.1.0. The GT and DAS5 versions are correct.
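
A quick way to catch this kind of mismatch across several specialisations is to check which base image tag each Dockerfile names. The following is only a rough sketch: the helper names are made up for illustration, and the parsing is deliberately simple (it ignores `FROM` flags such as `--platform` and registries with a port number).

```python
def base_image(dockerfile_text):
    """Return (image, tag) from the first FROM instruction, or None if absent."""
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if line.upper().startswith("FROM "):
            reference = line.split()[1]
            image, _, tag = reference.partition(":")
            # Docker falls back to the "latest" tag when none is given.
            return image, tag or "latest"
    return None


def uses_moving_tag(dockerfile_text, moving_tags=("develop", "latest")):
    """True if the base image tag is one that changes over time."""
    found = base_image(dockerfile_text)
    return found is not None and found[1] in moving_tags
```

For the two lines quoted above, `uses_moving_tag("FROM mdstudio/cerise:develop")` is true, while the pinned `FROM mdstudio/cerise:0.1.0` is not flagged.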

@marcvdijk
Member Author

I see, that makes sense. I will update the version.

@marcvdijk
Member Author

I updated the Dockerfile and reran a workflow with the newly built Docker image.
It got a bit further, but I'm now running into the following error in the cerise_backend log:

[2018-12-04 14:05:18.362] [DEBUG] Staging API install script to /home/mvandijk/.cerise/api/install.sh from /home/cerise/cerise/../api/install.sh [cerise.back_end.xenon_remote_files]
[2018-12-04 14:05:21.858] [CRITICAL] Traceback (most recent call last):
  File "cerise/run_back_end.py", line 48, in <module>
    manager = ExecutionManager(config, apidir, xenon_)
  File "cerise/../cerise/back_end/execution_manager.py", line 64, in __init__
    api_files_path, api_install_script_path)
  File "cerise/../cerise/back_end/xenon_job_runner.py", line 30, in __init__
    self._sched = config.get_scheduler()
  File "cerise/../cerise/config.py", line 224, in get_scheduler
    scheme, location, credential, properties)
jpype._jexception.nl.esciencecenter.xenon.XenonExceptionPyRaisable: nl.esciencecenter.xenon.XenonException: slurm adaptor: Got invalid key/value pair in output: Cgroup Support Configuration:
 [root]
[2018-12-04 14:05:21.859] [INFO] Shutting down [root]

The API is indeed not built, and there is no conda installation.

@LourensVeen
Member

That looks like it could be that Lisa has the very latest version of Slurm, and that Xenon 1 doesn't support it. So I guess it'll have to wait for the new Cerise with Cerulean...
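
The error message suggests that a line without a key/value separator (`Cgroup Support Configuration:`) reached a parser that expects `key = value` pairs. The sketch below only illustrates that failure mode; it is not Xenon's code (Xenon is written in Java), and the sample text, keys, and function names are invented for the example.

```python
def parse_strict(text):
    """Parse 'key = value' lines; raise on any non-empty line without '='."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if "=" not in line:
            raise ValueError("Got invalid key/value pair in output: " + line.strip())
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


def parse_tolerant(text):
    """Like parse_strict, but skip lines such as section headers that carry no '='."""
    result = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


# Hypothetical output: only the header line is taken from the error above.
SAMPLE = """\
SLURM_VERSION = 18.08.3

Cgroup Support Configuration:
ExampleOption = no
"""
```

Here `parse_strict(SAMPLE)` raises on the header line, while `parse_tolerant(SAMPLE)` returns both pairs. Whether skipping unknown lines is the right fix depends on how the real output is structured.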

@LourensVeen
Member

Lisa is running Slurm 18.08, so indeed pretty new. Cerulean isn't tested with it yet either, but I'll add it.
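
For reference, the version can be read from the `sbatch --version` output that appears in the first log (`slurm-wlm 18.08.3`). A small sketch of turning that string into a comparable tuple, assuming the output always contains a dotted version number (the function name is illustrative, not Cerulean's API):

```python
import re


def slurm_version(sbatch_output):
    """Extract the first dotted version number as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", sbatch_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups() if part is not None)
```

Tuples compare element by element, so a check such as `slurm_version(output) >= (18, 8)` can be used to flag versions that have not been tested yet.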

@LourensVeen
Member

The current Cerulean works fine with 18.08, and this will be backed up by tests in the next version.
