#!/usr/bin/env python
"""The main training script, and testing (for non-custom tasks).

A note about random seeds: the seed we set before drawing the episode's
starting state determines the initial configuration of objects, for both
standard ravens and any custom envs (so far). Currently, if we ask for 100
demos, the code runs through seeds 0 through 99. Then, for _test_ demos,
we offset by 10**max_order, so that with 100 demos (max_order=2) we'd start
at 10**2=100 and proceed from there. This way, if doing 20 test episodes,
ALL snapshots are evaluated on seeds {100, 101, ..., 119}. If we change to
max_order=3, which we should for an actual paper, then training is done on
seeds {0, 1, ..., 999} and testing starts at seed 1000.

With the custom deformable tasks, I have a separate load.py script. That one
also had max_order=3 (so when doing 100 demos, I was actually starting at
seed 1000, no big deal). However, I now have max_order=4 to start at 10K,
because (a) we will want to use 1000 demos eventually, and (b) for the
deformable tasks, sometimes the initial state might already be 'done', so I
ignore that data and re-sample the starting state with the next seed. Having
load.py start at seed 10K gives us a 'buffer zone' of seeds that protects
the train and test sets from overlapping.

With goal-conditioning, IF the goals are drawn by sampling from a similar
starting state distribution (as is the case with insertion-goal), then use
generate_goals.py and set max_order=5 so that there's virtually no chance
of random seed overlap.

When training on a new machine, we can run this script with "1000 demos" for
deformable tasks. It will generate data (though not necessarily 1000 demos,
since max_order determines the actual amount) and exit before doing any
training; we can then use subsequent scripts with {1, 10, 100} demos for the
actual training.
"""
import datetime
import os
import time
import argparse
import sys
import cv2
import pickle
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from ravens import Dataset, Environment, agents, tasks
# Of critical importance! Do 2 for max of 100 demos, 3 for max of 1000 demos.
MAX_ORDER = 3
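# A minimal sketch of the seed split described in the module docstring (these
# ranges are implied by MAX_ORDER and are not used elsewhere in this script):
#   train_seeds = range(10**MAX_ORDER)                      # {0, ..., 999}
#   test_seeds = range(10**MAX_ORDER, 10**MAX_ORDER + 20)   # {1000, ..., 1019}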


def rollout(agent, env, task, args):
    """Standard gym environment rollout. A few clarifications:

    (1) Originally, we did not append the LAST observation and info, since it
    wasn't needed. However, I need the last `info` because I record stuff for
    custom envs (e.g., coverage). Don't make it conditioned on 'done' because
    we need it even if done=False, in which case termination is w.r.t. max steps.
    (2) We do a slight hack to save the last observation and last info, which
    we need for goal-conditioning Transporters, and also for inspecting the
    data to check what the final images look like after the last action.
    (3) Unlike standard gym, the `obs = env.reset(task)` line returns obs={}, an
    EMPTY dict (hence len(obs)==0). Therefore, at t=0, the `episode` list is empty.
    (4) An 'action' normally has these keys: ['camera_config', 'primitive', 'params'].
    However, there's a distinction between the ORACLE policies and LEARNED policies.
    (4a) For the ORACLE at t=0, act['primitive'] = None and there is no 'params' key.
    This means that the `env.step(act)` line will NOT actually take the action!
    However, that line WILL return an image observation. So, at t=0, we (a) take no
    action, (b) do not add to the `episode` list, (c) but get an image observation +
    info, which we can use for the NEXT time step. Thus t=1 is when the first action
    takes place, so for insertion, since it's just one action, we exit the loop at
    t=1, and `len(episode) = 1`. The same applies for learned Transporter policies.
    (4b) For the learned gt_state policies, at t=0, there WILL be a 'params' key, and
    act['primitive'] is not None, hence at t=0 the first action actually takes place.
    This won't affect data collection, as we don't use gt_state for data collection,
    hence all the `episode` stuff gets ignored. However, it is a bit confusing
    because with insertion (for example) episodes can succeed in one action, but the
    code says they are 'length 0'.

    The reason for (4a) & (4b) can be seen in how these agents implement their `act`
    function. The oracle and Transporter-based policies will return 'empty' actions
    if obs={}, because they rely on images. (Well, for some tasks the oracle doesn't
    need images, but for consistency, the oracle follows the same interface.)
    Anyway, keep this in mind. Actually, the practical effect is that any gt_state
    policies should run for task.max_steps MINUS one, right? The easiest way is to
    make the rollout start at t=1 for these agents, which is what `start_t` does.

    Returns:
        total_reward: scalar reward signal from the episode, usually 1.0 for
            demonstrators of standard ravens tasks.
        episode: a list of (obs, act, info) tuples used to add to the dataset,
            which formats it for sampling later in training.
        t: time steps (i.e., actions taken) for this episode. If t=0 then
            something bad happened and we shouldn't count this episode.
        last_obs_info: tuple of the (obs, info) at the final time step, the
            one that doesn't have an action.
    """
    start_t = 0
    if args.agent in ['gt_state', 'gt_state_2_step']:
        start_t = 1
    episode = []
    total_reward = 0
    obs = env.reset(task)
    info = env.info
    for t in range(start_t, task.max_steps):
        act = agent.act(obs, info)
        if len(obs) > 0 and act['primitive']:
            episode.append((obs, act, info))
        (obs, reward, done, info) = env.step(act)
        total_reward += reward
        last_obs_info = (obs, info)
        # print(info['extras'], info['...'])  # Use this to debug if needed.
        if done:
            break
    return total_reward, episode, t, last_obs_info
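

# A sketch of a typical rollout() call, mirroring the oracle data-collection
# loop in __main__ below (assumes `env`, `task`, `args`, and `dataset` exist):
#   total_reward, episode, t, last_obs_info = rollout(task.oracle(env), env, task, args)
#   dataset.add(episode, last_obs_info)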


def has_deformables(task):
    """Somewhat misleading name. This method is used to determine if we should
    (a) be running training AFTER environment data collection, and (b)
    evaluating with test-time rollouts periodically. For (a), the reason is
    that the --disp option, which is needed to see cloth, will not let us run
    multiple Environment calls. This also applies to (b), but for (b) there's
    an extra aspect: most of these new environments have reward functions
    that are not easily interpretable, or they have extra stuff in `info`
    that would be better for us to use. In that case we should be using
    `load.py` to roll out these policies.

    Actually, even for stuff like cable-shape, where the reward really tells
    us all we need to know, it would be nice to understand failure cases
    based on the type of the target. So for now let's match on `cable-`,
    which will ignore `cable` but catch all my custom environments. The
    custom cable environments don't use --disp, but let's keep them here for
    consistency in the main-then-load paradigm.
    """
    return ('cable-' in task) or ('cloth' in task) or ('bag' in task)
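

# For reference, given the substring checks above (these follow directly from
# the return expression):
#   has_deformables('cable-shape')    -> True   (custom env, caught by 'cable-')
#   has_deformables('cable')          -> False  (standard ravens cable task)
#   has_deformables('bag-items-easy') -> True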


def is_goal_conditioned(args):
    """Be careful with checking this condition. See `generate_goals.py`. Here,
    though, we check the task name and, as an extra safety measure, check that
    the agent is also named with 'goal'.

    Update: all right, let's modify this to incorporate gt_state without too
    much extra work. :(
    """
    goal_tasks = ['insertion-goal', 'cable-shape-notarget', 'cable-line-notarget',
                  'cloth-flat-notarget', 'bag-color-goal']
    goal_task = (args.task in goal_tasks)
    if goal_task:
        assert 'goal' in args.agent or 'gt_state' in args.agent, \
            'Agent should be a goal-based agent, or gt_state agent.'
    return goal_task


def ignore_this_demo(args, demo_reward, t, last_extras):
    """In some cases, we should filter out demonstrations.

    Filter out demos where t == 0, which means the initial state was already a
    success. Also, for the bag envs, if we end up in a catastrophic state, I
    exit gracefully, and we should avoid those demos (they won't have the
    images we need for the dataset anyway).
    """
    ignore = (t == 0)
    # For bag envs.
    if 'exit_gracefully' in last_extras:
        assert last_extras['exit_gracefully']
        return True
    # Another bag env.
    if (args.task in ['bag-color-goal']) and demo_reward <= 0.5:
        return True
    # Another bag env. We can get 0.5 reward by touching the cube only (bad).
    if args.task == 'bag-items-easy' and demo_reward <= 0.5:
        return True
    # Harder bags: ignore if (a) no beads in zone, OR (b) we didn't get both
    # items. We need a separate case because we could get all beads (0.5) but
    # only one item (0.25), which sums to 0.75. However, we can also get both
    # items (0.5) and a few beads (e.g., 0.1), which sums to 0.6.
    if args.task == 'bag-items-hard':
        return (last_extras['zone_items_rew'] < 0.5 or
                last_extras['zone_beads_rew'] == 0)
    return ignore


if __name__ == '__main__':
    # Parse command line arguments.
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu', default='0')
    parser.add_argument('--disp', action='store_true')
    parser.add_argument('--task', default='hanoi')
    parser.add_argument('--agent', default='transporter')
    parser.add_argument('--num_demos', default='100')
    parser.add_argument('--num_rots', default=24, type=int)
    parser.add_argument('--hz', default=240.0, type=float)
    parser.add_argument('--gpu_mem_limit', default=None)
    parser.add_argument('--subsamp_g', action='store_true')
    parser.add_argument('--crop_bef_q', default=0, type=int, help='CoRL paper used 1')
    parser.add_argument('--save_zero', action='store_true', help='Save snapshot at 0 iterations')
    args = parser.parse_args()

    # Configure which GPU to use.
    cfg = tf.config.experimental
    gpus = cfg.list_physical_devices('GPU')
    if len(gpus) == 0:
        print('No GPUs detected. Running with CPU.')
    else:
        cfg.set_visible_devices(gpus[int(args.gpu)], 'GPU')

    # Configure how much GPU memory to use.
    if args.gpu_mem_limit is not None:
        MEM_LIMIT = 1024 * int(args.gpu_mem_limit)
        print(args.gpu_mem_limit)
        dev_cfg = [cfg.VirtualDeviceConfiguration(memory_limit=MEM_LIMIT)]
        cfg.set_virtual_device_configuration(gpus[0], dev_cfg)
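    # For example, assuming the flag is given in integer gigabytes (my reading
    # of the 1024 multiplier above), `--gpu_mem_limit 4` caps TensorFlow at
    # 4 * 1024 = 4096 MB on the configured GPU.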

    # Initialize task. Later, initialize Environment if necessary.
    task = tasks.names[args.task]()
    dataset = Dataset(os.path.join('data', args.task))
    if args.subsamp_g:
        dataset.subsample_goals = True

    # Collect training data from oracle demonstrations.
    max_demos = 10**MAX_ORDER
    task.mode = 'train'
    seed_to_add = 0  # Daniel: check carefully if resuming the bag-items tasks.

    # If continuing from prior calls, the demo index starts counting based on
    # the number of demos that exist in `data/{task}`. Make the environment
    # here, to avoid issues with cloth rendering + multiple Environment calls.
    make_new_env = (dataset.num_episodes < max_demos)
    if make_new_env:
        env = Environment(args.disp, hz=args.hz)

    # For some tasks, call reset() again with a new seed if init state is 'done'.
    while dataset.num_episodes < max_demos:
        seed = dataset.num_episodes + seed_to_add
        print(f'\nNEW DEMO: {dataset.num_episodes+1}/{max_demos}, seed {seed}\n')
        np.random.seed(seed)
        demo_reward, episode, t, last_obs_info = rollout(task.oracle(env), env, task, args)
        last_extras = last_obs_info[1]['extras']
        # Check if we should ignore or include this demo in the dataset.
        if ignore_this_demo(args, demo_reward, t, last_extras):
            seed_to_add += 1
            print(f'ignore_this_demo=True, last_i: {last_extras}, re-sample seed: {seed_to_add}')
        else:
            dataset.add(episode, last_obs_info)
            print(f'\ndemo reward: {demo_reward:0.5f}, len {t}, last_i: {last_extras}')
    if make_new_env:
        env.stop()
        del env

    if has_deformables(args.task):
        print(f'Exiting due to task={args.task}, only generating demos.')
        print('We cannot call Environment() multiple times (remotely).')
        sys.exit()

    # Evaluate on increasing orders of magnitude of demonstrations.
    num_train_runs = 3  # to measure variance over random initialization
    num_train_iters = 20000
    test_interval = 2000
    if args.save_zero:
        num_train_runs = 1  # let's keep it simple
        test_interval = 0  # this is what gets passed to agent.train()
    num_test_episodes = 20
    if not os.path.exists('test_results'):
        os.makedirs('test_results')

    # Check if it's goal-conditioned.
    goal_conditioned = is_goal_conditioned(args)

    # Do multiple training runs from scratch with TensorFlow random initialization.
    for train_run in range(num_train_runs):

        # Set up tensorboard logger.
        current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        train_log_dir = os.path.join('logs', args.agent, args.task, current_time, 'train')
        train_summary_writer = tf.summary.create_file_writer(train_log_dir)

        # Set the beginning of the agent name.
        name = f'{args.task}-{args.agent}-{args.num_demos}-{train_run}'

        # Initialize agent and limit random dataset sampling to fixed set.
        tf.random.set_seed(train_run)
        if args.agent == 'transporter':
            name = f'{name}-rots-{args.num_rots}-crop_bef_q-{args.crop_bef_q}'
            agent = agents.names[args.agent](name,
                                             args.task,
                                             num_rotations=args.num_rots,
                                             crop_bef_q=(args.crop_bef_q == 1))
        elif 'transporter-goal' in args.agent:
            assert goal_conditioned
            name = f'{name}-rots-{args.num_rots}'
            if args.subsamp_g:
                name += '-sub_g'
            else:
                name += '-fin_g'
            agent = agents.names[args.agent](name,
                                             args.task,
                                             num_rotations=args.num_rots)
        elif 'gt_state' in args.agent:
            agent = agents.names[args.agent](name,
                                             args.task,
                                             goal_conditioned=goal_conditioned)
        else:
            agent = agents.names[args.agent](name, args.task)
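        # As a hypothetical example of the naming scheme above, a default
        # transporter run with --task insertion --num_demos 100 on train_run 0
        # would be named:
        #   'insertion-transporter-100-0-rots-24-crop_bef_q-0'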

        # Limit random data sampling to a fixed set.
        np.random.seed(train_run)
        num_demos = int(args.num_demos)

        # Given `num_demos`, only sample up to that point, and not w/replacement.
        train_episodes = np.random.choice(range(max_demos), num_demos, False)
        dataset.set(train_episodes)
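        # E.g., with MAX_ORDER=3 (max_demos=1000) and --num_demos 100, this
        # fixes a reproducible 100-episode training subset of {0, ..., 999},
        # sampled without replacement and seeded by train_run.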
        performance = []
        while agent.total_iter < num_train_iters:
            # Train agent.
            tf.keras.backend.set_learning_phase(1)
            agent.train(dataset, num_iter=test_interval, writer=train_summary_writer)
            tf.keras.backend.set_learning_phase(0)

            # agent.train() concludes by calling agent.save() internally, so just exit.
            if args.save_zero:
                print('We are now exiting due to args.save_zero...')
                agent.total_iter = num_train_iters
                continue

            # Evaluate the agent ONLY in non-deformables environments.
            if has_deformables(args.task):
                continue

            # For now, until we get the evaluation working; I just want to see losses.
            if 'transporter-goal' in args.agent or args.task == 'insertion-goal' or goal_conditioned:
                continue

            task.mode = 'test'
            env = Environment(args.disp, hz=args.hz)
            for episode in range(num_test_episodes):
                seed = 10**MAX_ORDER + episode
                np.random.seed(seed)
                total_reward, _, t, _ = rollout(agent, env, task, args)
                print(f'Test (seed: {seed}): {episode} Total Reward: {total_reward:.2f}, len: {t}')
                performance.append((agent.total_iter, total_reward))
            env.stop()
            del env

        # Save results.
        fname = os.path.join('test_results', f'{name}.pkl')
        with open(fname, 'wb') as f:
            pickle.dump(performance, f)