For a list of open issues and known problems, see: https://github.com/radical-cybertools/radical.saga/issues
- remove MacOS support
- maintenance release
- Provide a target for Sphinx :py:mod: role.
- set version requirement for RCT stack
- sync with RU
- fix for setuptools upgrade
- maintenance
- pypi fix
- maintenance
- maintenance
- fix RTD
- RU dep version bump
- fixed request for GPUs on Amarel (SLURM)
- support `partition[:qos]` as `queue` instead of `partition/queue` for Perlmutter@NERSC
- fixed SLURM generation for Rivanna
- adding `exclusive` arg to srun
- cleaned intermediate fix (PBSPro)
- fixed final state (PBSPro)
- fixed monitoring update interval (PBSPro)
- fixed option for Perlmutter
- fixed option(s) for PBSPro tool(s)
- fix a race condition
- fix CI (removed codecov)
- fix `core-spec` with consideration of `threads-per-core` (Slurm)
- support for Frontier
- added set of PPNs per queue
- add some job attributes to the shell adaptor but ignore them
- added core specialization option (SLURM)
- enforce use of `threads-per-core` only if `smt` is in `system_architecture`
- fixed value for `processes_per_host`
- let shell_wrapper.py close file more promptly.
- updated and cleaned PBSPro adapter
- bump python test env to 3.7
- enhanced SMT control (same approach as for LSF adaptor)
- SAGA.Exception to not overwrite representation
- sync with ru
- add namespace docs as they are not pulled via inheritance
- allow to trap `SIGTERM` before `SIGKILL` is sent
- add resource `ascent` into LSF adaptor
- add `threads-per-core` parameter (Slurm)
- convince traverse to provide GPUs
- enforce a predefined directory for the batch-script
- fix SMT setup (in sync with RADICAL-Pilot SMT setup)
- keep sbatch script within job's working directory
- move `host_is_local` to RU
- spock needs -N
- Add notifications to slurm
- another attempt to fix traverse support
- support node features (constraints) in SLURM
- clean temporary setup files
- updated slurm script for `crusher`
- adding SDSC expanse to slurm adaptor
- consistent use of radical_base
- expanse support
- fix PBS host lookup
- fix directory creation (recursive flag is set)
- add longhorn GRES exception
- typo
- option `--attrs`, refresh Cobalt script generation
- added test for Cobalt
- daemonize timer thread
- fix slurm workdir
- remove flake8 and pylint actions
- linting, GH actions
- fix state update notifications on `CANCELED`
- bridges2 support
- address race condition on cancellation
- re-enable prompt escape
- fix #787
- Comet takes --gres for gpu
- extended list of valid alloc_flags options for Lassen
- ip-isolate flag for distributed pytorch on Lassen
- updated slurm script record for Comet
- added QoS option for special queue "tmp3" at SuperMUC-NG
- fixed the case when job is routed to a different queue, thus SAGA shouldn't fail the job in this case (Cobalt)
- rtx switch for ppn
- better fix ssh timeout
- fix ssh timeout on idle (but alive) connections
- added job description attribute `SystemArchitecture`
- removed `summitdev` from LSFJob, fix cobalt/theta settings
- updates for LSF and SLURM job modules
- fix a python3 transition remnant
- Access to K80 and P100 gpus on Bridges
- added exception if number of nodes is not set, but memory is allocated
- get `parse_qs` from `urllib.parse` (module `cgi` is not used in SAGA)
- replace PIL with Pillow
- set min python version to 3.6
- merge #782: Lassen support
- support for Frontera's RTX queue
- pr/768: Removing some warnings
- pr/769: fix ppn for frontera
- pr/771: fixed handling of Longhorn's sbatch
- pr/776: clean up logs
- pr/778: traverse support; also ensure backward compatibility for cgi module
- add example for cobalt
- doc fixes (issue #765)
- First commit fixing the docs
- apply exception chaining and some exception / logging cleanup
- longhorn ppn fix
- mira -> theta
- resolve name conflict with threading module
- fix ppn for frontera
- documentation fixing (issue #765)
- fix MANIFEST.in
- ensure '-n' is used for Stampede2
- python 3 and later have implicit namespaces
- add noop job adaptor for tests
- slurm data staging
- small fixes and cleanup for slurm and docs (tutorial prep)
- transition to Python3
- fix issue #744
- config fixes
- disable logger during gc
- fix attribute callback
- fix generated slurm script (walltime format)
- fix slurm for rhea
- rename async as it is a reserved word in p3
- testing, flaking, linting and travis fixes
- tiger's Slurm needs `--chdir` now (thanks Lucas!)
- improved support GPUs in lsf (tiger)
- frontera support
- support GPUs in lsf
- fix state notifications in lsf
- default lsf is summit-enabled
- torque: use checkjob on failing qstat
- better cray detection
- code improvements
- support srun as (unscheduled or scheduled) launch method
- support tiger @ princeton
- remove support for some legacy machines
- radicalization! rename saga-python to radical.saga
- fix logger levels
- fix issue #661
- Adding missing flags for Summit. add test case
- fix job.cancel in some corner cases
- document missing path normalization
- add support for summit
- get hostname from env variable
- clean out unsupported job description attributes
- LSF SMT level now defaults to 1 (summit)
- convert to new config file format (json)
- get unit tests back in working order
- gpu support for bridges
- linted, flaked, pepped, cleaned... a bit...
- make commtransparent flag optional
- remove topology restriction on BW
- sync torque and pbspro, remove deprecated pbs
- make job adaptors uniformly emit EPOCH timestamps
- fixes in Slurm and Torque adaptor
- fix version check in slurm adaptor
- fix version check in face of git errors
- fix version check for Stampede (#rp-1754) - thanks Ioannis!
- Add Cheyenne support - thanks Vivek!
- partial support for heterogeneous clusters (slurm)
- (origin/pr/637) more thorough handling of $PROMPT_COMMAND
- Fix #680 - Corrected Error on LSF: module object has no attribute Event
- Fix/issue 1514
- Fix/issue 662
- Fix/issue 663
- add gpu "support" to torque
- correctly interpret candidate_hosts on pbspro
- fix torque job name handling
- make sure the default session also inherits the uid (#671)
- remove obsolete comment, add local GPUs
- sync pbspro and torque adaptor.
- catch up with RU log, rep and prof settings
- slurm uses '-N' on wrangler now.
- trigger BW syntax based on version string
- accept SID settings from upper layers (RP) (#654)
- add travis tests, badge
- cheyenne fix to pbspro
- get titan back to work (#664)
- fix task container get_state reval
- accept SID settings from upper layers (RP) (#654)
- fix task container get_state reval
- avoid use of `condor_history` if so configured
- cleaner recovery for `condor_hist` when it's used
- be more resilient against `sacct` errors
- one more `PROMPT_COMMAND` setting
- hotfix for RP #1415
- Fix several debug messages
- Fix/621 experiment aimes (#622)
- Fixed run_job to run in service.py
- Properly support SGE job name. (#624)
- Anselm support in pbspro v.13 (#634)
- fix a condor script syntax error
- attempt to prevent job eviction
- be resilient against lingering NFS locks (hi titan)
- clean bulk job info for condor
- clean up state management in condor adaptor
- slim down condor log calls, simplify status updates, ensure output transfer
- container_cancel needs to accept timeout parameter (#633)
- don't barf on failing condor_history
- don't limit status check length
- iteration on slurm mpi support (#623)
- fix logic error in directive evaluation
- fix parsing of condor_history for multi-file staging
- fixes on src/tgt ordering in osg staging
- follow the rename of ru.Task to ru.Future
- follow the rename of ru.Thread to ru.Future
- implement bulk cancel for condor
- iteration on bulk submission, data staging
- iteration on run_job implementation
- improve condor scaling / performance
- make sure we always have a valid transfer directive in condor
- make sure we do not fail on a missing exit code
- more clarity on file staging semantics, some cleanup
- better handling of $PROMPT_COMMAND
- resilience against condor_history errors
- update on torque to avoid triggering a check
- hotfix to support CandidateHosts for LoadLeveler
- Add srm adaptor - Thanks Mark!
- Add cobalt adaptor (blue gene/q) - Thanks Manuel!
- Add special case for Rhea
- Deal with timeouts.
- Don't want our jobs to restart after eviction (OSG)
- Make pty shell url configurable.
- Remove some more PBSPro remains.
- address #585
- fix #590
- check that prev_info exists before populating curr_info with its info in _job_get_info
- clean up slurm adaptor to get it fit for the split branch in RP
- fix state interpretation for pbspro
- make torque fit for rp split branch
- remove some debug logs
- some consistency fixes
- update slurm example in context of #611
- use `-o Port=%d` notation for ssh based channels
- backport ft and bulk ops for torque from osg_optimization
- added basework_dir parameter to sge and proxy adaptors
- added new PBS versions to pbsnodes CPU count check
- changed the regular expression to find the job-id on LSF adaptor. Fixes #568
- re-enable fallback methods for slurm job containers, jd.name support for slurm
- fix for issue #586, removing invalid dirs from shell wrapper script file path
- fix parsing of file staging directives
- make shell job adaptor workdir configurable
- avoid double close for shell job service
- raise error on missing tools
- removed duplicate get_name function in job class
- add missing container method to get job states
- merged pull request #583
- enforce version in radical stack
- fix #555
- still hating it...
- I hate python deployment
- hotfix: remove some debug code which causes trouble in multiuser envs
- Feature/job name shell (#541)
- implementation of job names for shell job adaptor and on API level
- add job.name for condor
- address #552
- allow for non-interactive local pty shells
- initialdir support for Condor
- change HOSTFILE settings
- fix missing var setting in aws example
- fix port opening directive, make sure port is opened just once
- fix regression in job cancellation
- fix rel path on open
- fix staging calls
- fix string formatting error
- fix ssh context to handle passwords containing spaces
- cleanup of the shell spawner, getting in sync with RP version
- implement dir.exists
- make sure we have a job description on reconnected jobs
- make task passing to async calls optional
- aws security group tweaking
- remove invalid obj redirection
- code simplification
- simplify local mkdir in shell file adaptor
- sync with updated benchmark tools
- use shared shell connection for FS ops
- Hotfix release to avoid security warnings on Stampede
- Hotfix release to address a tmp file race condition on file staging
- Added job monitor state update fix to PBSPro adapter.
- Add session property to base class. Fix #480.
- add traceback property to exception
- support gsissh for condor job submission
- pass span parameter to LSF.
- support SLURM reservation.
- file staging for shell adaptor
- Fix #477, set session for shell job service
- Fixed job state monitor to correctly identify state changes.
- Fixed string formatting error.
- Fixes #501. Thanks Javi!
- fix session documentation. Thanks Jeremy!
- Fix to allow use of environment vars in ssh context key/cert property file paths.
- Implement bulk submit, states and wait for condor.
- Improve file staging directives handling.
- Logging goes into working directory.
- make ssh connection timeout configurable (defaults now to 10 seconds)
- Passing ssh_timeout param to ssh ConnectTimeout option
- Updated fix to #494 to take account of empty key/cert parameter.
- deprecate PBS adaptor
- support dynamic adaptor loading
- fix #477, set session for shell job service (thanks Mehdi!)
- set session on file and directory instances, #480
- fix 0.38 after botched merge
- support for anaconda client install
- scattered bug fixes
- update of GO adaptor with recent GlobusOnline evolution
- scattered fixes in GO adaptor
Note that RADICAL_SAGA_VERBOSE should now be used instead of SAGA_VERBOSE (but the latter will be supported for a while). PTY layer debug levels can separately be enabled via RADICAL_SAGA_PTY_VERBOSE.
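The precedence described in this note can be sketched as follows. This is only an illustration, not the lookup code radical.saga actually uses: the helper name `resolve_verbosity` and the `"ERROR"` default are assumptions made for the example.

```python
import os


def resolve_verbosity(env=None, default="ERROR"):
    """Return the log level from the new variable, else from the legacy one.

    Hypothetical helper: the name and the default level are assumptions.
    """
    env = os.environ if env is None else env
    # The new name takes precedence; the legacy name is only a fallback.
    for name in ("RADICAL_SAGA_VERBOSE", "SAGA_VERBOSE"):
        value = env.get(name)
        if value:
            return value.upper()
    return default
```

Calling `resolve_verbosity()` without arguments reads the process environment; passing a plain dict makes the precedence easy to check in isolation.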
- fix cray qsub arguments
- fix interpretation of relative / absolute URLs in some cases
- fix #449 - thanks jcohen02!
- keep up with logger changes in util
- properly detect failed jobs.
- follow changes on resource configurations (BW)
- remove dead code.
- fix state mapping in some queue adaptors
- clean torque/pbs separation
- Add contexts to session at start of context list
- add tc.get_task(id)
- Add PROCESSES_PER_HOST to all job adaptors.
- sync an sdist naming fix
- Add dedicated PBS Pro adaptor.
- Get ppn from processes_per_host.
- implement candidate_hosts for slurm adaptor
- processes_per_host for SLURM.
- apply setup/git fix
- fix exception type exception
- convert to locking "with"/by context manager.
- only create parents for the dir part of a target.
- reentrant lock to guard concurrent cache writing.
- fix object._id format
- set 'ssh_share_mode=no' on CentOS
- resilience against missing git in setup.py
- Prompt pattern for RSA SecureID (BW, Titan, etc.)
- cleaner version string
- treat empty strings as unset values in PBS job description
- fix slurm script generation
- implement discussed changes to job stdio inspection
- finish implementation of #202
- pre- and post-exec for shell job adaptor
- implement pre/post_exec, support in slurm, cleanup slurm submission
- clean up stdin/stdout/stderr/log handling for jobs
- sync setup.py with recent changes in RP
- Fix rounding up cores for blacklight
- ignore SIGHUP in the shell monitor -- fixes #415
- install sdist, export sdist location
- fix rounding-up cores for blacklight
- Major iteration on Globus Online adaptor, including tests and examples
- move sources into src/
- short_version -> version
- long_version -> version_detail
- use DebugHelper in test suite
- implement #413
- Check status of task to be able to detect failure.
- fix to make async tasks work for methods which do not provide metrics
- Fix ssh key logic.
- do not pick up pem certs by default
- Add "gres" (Generic Resource) query parameter.
- Add note about queue query parameter.
- fix syntax error in PBS variable spec
- Fix, cleanup and document qstat parsing.
- Document job states.
- stop job monitoring on continuous errors (pbs)
- Support "bigflash" nodes selection on Gordon.
- make workdir for shell wrapper a parameter
- force copy shells to be non-posix (sftp)
- make sure that non-posix shells are excused from prompt triggering
- hotfix for sftp problems on some client/server version combinations which lead to data inconsistencies
- make ssh share mode configurable
- Re-enable explicit_exec and add more explanation.
- Make setting job_type conditional on >1 cores.
- more variety in PBS "constants" (ha!)
- fix #401
- make sure the target dir for leased shells exists on CREATE_PARENTS in dir ctor
- fix #400
- export PPN information to torque and pbs jobs
- merge and fix Danila's patch
- re-enable test for PBSPro_10
- added test config for archer
- re-enable test for PBSPro_10, as discussed with Ole.
- added test config for archer
- add missing error check on mkdir
- fix logical error on dir state recovery
- LoadLeveler support for BG/Q machines.
- Hotfix release fixing incompatible sftp flag "-2"
- scattered bug fixes related to connection caching
- configurable switch between scp and sftp
- tweak timeouts on ssh channels
- disable irods adaptor
- LeaseManager for connection sharing.
- Improved file transfer performance
- Small improvements in PBS (esp. Cray) and LSF adaptors.
- Closed tickets:
- scattered fixes and perf improvement
- Improved prompt-detection and small bug fixes.
- Closed tickets:
- Several adaptor updates
- Addressed SSH caching and prompt-detection issues
- Closed tickets:
- Fixed issues with the shell adaptor Directory.list() method:
- Fixed TTY wrapper issues
- Migration to new GitHub repository
- Documentation now on ReadTheDocs: http://radical.saga.readthedocs.org/en/latest/
- Integrated Mark's work on the Condor adaptor
- Closed tickets:
- Bugfix release.
- Closed tickets:
- Bugfix release + shell cleanup
- Changed versioning scheme from major.minor.patch to major.minor due to Python's messed up installers
- Fixed job script cleanup: https://github.com/radical-cybertools/radical.saga/issues?milestone=15&state=closed
- Some improvements to sftp file adaptor
- Closed tickets:
- Emergency release to fix missing VERSION file
- Migration to radical.utils
- Numerous SFTP file adaptor improvements
- Closed tickets:
- Added Platform LSF adaptor
- Closed tickets:
- Added iRODS replica adaptor
- Closed tickets:
- Bugfix release
- Closed tickets:
- Better support for Amazon EC2
- Fixed working directory handling for PBS
- Closed tickets:
- Hotfix release: bug in Url.str and SFTP copy
- Hotfix release: critical bug in wait() signature
- Added resource package
- Added 'libcloud' based adaptor to access Amazon EC2 clouds
- Closed issues:
- Hotfix release: critical PBS/TORQUE adaptor fixes
- Hotfix release: critical SLURM adaptor fixes
- jd.working_directory now gets created if it doesn't exist
- Support for older Cray systems running PBS Pro 10
- Job state callback support for the PBS adaptor - others to follow
- A simple HTTP protocol file adaptor
- Fixed some issues with user-pass and X.509 security contexts
- Over 40 bugfixes and improvements:
- Added SFTP adaptor
- Added tutorial examples
- Closed issues:
- Hotfix release
- Major re-write of engine and adaptor interface
- Support for asynchronous operations
- Improved PTYWrapper for ssh/gsissh remote execution
- Added SLURM job adaptor
- Added Condor job adaptor
- Fixed errors related to pbs://localhost and sge://localhost URLs that were caused by a bug in the command-line wrappers.
- HOTFIX: credential management for SGE and PBS. Both adaptors now iterate over SSH and GSISSH contexts and also consider usernames that are part of the url, e.g., pbs+ssh://[email protected]...
- Changed documentation to Sphinx
- Removed object_type API. Python builtins can be used instead
- Updates to Filesystem API
- Added JobDescription.name attribute (as defined in DRMAA)
- Introduced stateful SSH connection substrate for PBS, SGE, etc
- Introduced support for GSISSH: pbs+gsissh://, sge+gsissh://
- Re-implementation of a (more Python-esque) attribute interface
- Fixed JobID issues, i.e., job.job_id returns 'None' in case the job is not running instead of
- Introduced dynamic, fault-tolerant plug-in loader. If anything goes wrong during loading of a specific plug-in (e.g., dependencies on 3rd party modules cannot be fulfilled), the plug-in will just get skipped and the remaining ones will still get loaded. Previously, a single problem during plug-in loading would take radical.saga down.
- Added unit-tests for SPMDVariation
- Added 'mpirun' support for local job plug-in (via SPMDVariation)
- Added some of the missing methods and flags to filesystem package
- A URL object can now be constructed from another URL object
- Fixed job.cancel()
- Wildcard support for Directory.list()
- Fixed query support for URL class (issue #61)
- Improved logging. No root logger hijacking anymore (issue #62)
- Fixed job.Description.number_of_processes (issue #63)
- Less chatty SSH plug-in (issue #51)
- job.Description now accepts strings for int values. This has been implemented for backwards compatibility
- Fixed resource.Compute.wait() timeout issue
- Removed excessive SGE/PBS plug-in logging
- job.Service can now be created from a resource.Manager
- Implemented deep copy for description objects
- Runtime now supports multiple plug-ins for the same schema
- SFTP support for local <-> remote copy operations, mkdir, get_size
- Added support for ssh re-connection after timeout (issue #29)
- Abandoned 'Exception' filenames and API inheritance. The radical.saga interface looks much cleaner now. Compatibility with previous versions has been ensured
- Improved (inline) API documentation
- Swapped urlparse with furl in saga.Url class. This hopefully fixes the problem with inconsistent parsing across different Python versions
- Added SGE (Sun Grid Engine) plug-in (issue #11)
- Removed sagacompat compatibility API
- Log source names now all start with 'radical.saga.'. This should make filtering much easier
- Moved SD package into development branch features/servicediscovery
- Hotfix - removed experimental Resource plug-in from release
- Fixed issue with plugin introspection
- Added template for job plug-in
- Hotfix
- Fixed issue: https://github.com/oweidner/radical.saga/issues/9
- Fixed issue: https://github.com/oweidner/radical.saga/issues/8
- Fixed issue: https://github.com/oweidner/radical.saga/issues/6
- First version of a bigjob plugin. See wiki for details.
- Fixed Python 2.4 compatibility issue
- Added bliss.sagacompat module for API compatibility.
- API documentation: http://oweidner.github.com/bliss/apidoc-compat/
- Added examples for 'compat' API, e.g.: https://github.com/oweidner/bliss/tree/master/examples/job-api/compat/
- Added configuration files for epydoc
- Fixed executable & argument handling for the local job plugin
- Added support for jd.output and jd.error to local job plugin
- Fixed bug in URL.get_host()
- Fixed issues with extremely short running PBS jobs in conjunction with scheduler configurations that remove the job from the queue the second it finishes execution.
- First working version of an SFTP file API plugin based on Paramiko
- Two advanced bfast examples incl. output file staging: https://github.com/oweidner/bliss/blob/master/examples/advanced/bfast_workflow_01.py https://github.com/oweidner/bliss/blob/master/examples/advanced/bfast_workflow_02.py
- Fixed issues with PBS working directory
- Added simple job API example that uses BFAST: https://github.com/oweidner/bliss/blob/master/examples/job-api/pbs_via_ssh_bfast_job.py
- Updated apidoc: http://oweidner.github.com/bliss/apidoc/
- First prototype of a job container. Example can be found here: https://github.com/oweidner/bliss/blob/master/examples/job-api/pbs_via_ssh_container.py
- Implemented CPU and Memory information via PBS service discovery
- Changed job.Description.walltime_limit to job.Description.wall_time_limit
- Fixed issue with local job plugin
- Prototype of a Service Discovery package
- PBS/SSH support for service discovery
- Fixed issue with PBS plugin job.wait()
- More or less stable job API
- First functional PBS over SSH plugin
- devel version bump
- no macos