-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
158 lines (124 loc) · 4.41 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
NAME
gcd - Grand Central Dispatch
SYNOPSIS
./gcd
Prints status and offers to submit jobs according to
`gcd_CONF_PLAN'.
./scripts/r
Universal job script that runs jobs (section FILES
UNDER JOBPATH).
DESCRIPTION
It implements functions for dispatching long running jobs,
with a file system based lock for parallel status update.
When called directly, `gcd' shows the status of the jobs and
suggests further actions.
FILES
gcd
The source code for system independent functionalities.
conf
Configuration.
gcd_CONF_PLAN=(
10 8192 1440 ProjectName "Path and/or Category"
10 4096 360 ProjectName "Path and/or Category"
)
The array, gcd_CONF_PLAN, contains a list in sets
of 5 items: the number of jobs to submit, the size
(in number of nodes), the time (in minutes), the
project name, and extra parameters for a preferred
path and / or preferred categories. Setting
`gcd_RESUBMIT' to 0 stops the job script to resubmit
itself after finishes. Other global variables can be
set here, but only without the prefix `gcd'.
util/*
Scripts to be sourced that defines useful platform
dependent utility functions.
scripts/r
The universal job scripts. It is designed to be
submitted to a queuing system that can run custom
script inside the top gcd directory. It manages
job availability and uses the files under JOBPATH
to start and monitor jobs. See section FILES UNDER
JOBPATH for details. Given platform support, it can
also be run directly, see files under `test' directory.
work/*
Job managers have their stdout and stderr redirected
to files here.
jobs/*
Paths to jobs, aka JOBPATH. Symlinks in practice.
test/*
For testing.
test/scripts/test
Run it under its directory for simple tests.
FILES UNDER JOBPATH
DESC
Free format job description.
PROP
Job properties related to scheduling, shared
among all JOBID under this JOBPATH. The file is
sourced as a shell script. It must only define
the following variables.
block_size=2048
time_limit=7200
repeat=1
category="URGENT HMC"
All are integers, except for the last, category is
a string, representing space separated categories.
The job only fits block_size nodes. The job should
finish within time_limit seconds. If repeat is
nonzero, the job will continue to be available after
finishing regardless of its last return status.
If time_limit is zero, it uses 5 percent more
than the previous max run time plus 20 minutes.
There is a special category, `URGENT', with which
jobs would get a higher priority.
LIST
All individual jobs under this JOBPATH. This file
contains the list of JOBID, one per line. Any line
that begins with `#' is ignored.
JOB
This file is sourced under the containing directory.
It prepares the job, and set up variables related
to running. It must define the following variables.
rank_per_node=32
omp_nthreads=2
update_time=1200
run=( full command to execute )
output_path=stdout
error_path=stderr
The job will start by running
"${run[@]}" >"$output_path" 2>"$error_path"
and the output_path is checked for possible hangs.
The output_path and error_path should be unique for
each job run, even for jobs with repeat=1. We
consider the job hangs if it hasn't updated its
output file in the last update_time seconds, if
update_time is positive. If update_time is zero,
it uses twice of the previous max update_time plus
5 minutes or the time_limit, whichever is smaller.
It can define additional variables that overwrites
those defined in PROP. It can use variables
given from the executing environment: gcd_JOBPATH,
gcd_JOBID, gcd_JOBSIZE, gcd_JOBBLOCK.
run/$JOBID.served
This file is created when `fetch' serves this JOBID,
and is removed when `return' gets this JOBID.
run/$JOBID.stop
This file is created when `return' gets this JOBID
with zero RETURNCODE, and PROP has repeat=0.
run/$JOBID.log
`fetch' and `return' record the time when the job
is served and returned. Job scripts record the
idle time during runs.
TODO
Find an exit strategy.
LICENSE
MIT license. See file `LICENSE'.
HISTORY
May, 2017, version 2.0
February, 2017, version 1.0
BUG
JOBID cannot begin with `#', nor can it contain the path
separator `/'.
Each job run should have unique output_path and error_path.
The path to gcd cannot contain single quotes.
Manual intervention required to stop jobs from self-replicating.