I am trying to train with tensorflow-gpu==1.14. CUDA and cuDNN load correctly; however, once TensorFlow has finished loading, it gets stuck here:
After some time, I get this error:
2021-04-15 20:14:04.341576: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(64, 256), b.shape=(256, 15), m=64, n=15, k=256
[[{{node ppo2_model/pi_1/MatMul}}]]
[[ppo2_model/ArgMax/_443]]
(1) Internal: Blas GEMM launch failed : a.shape=(64, 256), b.shape=(256, 15), m=64, n=15, k=256
[[{{node ppo2_model/pi_1/MatMul}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/hadi/anaconda3/envs/train/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/hadi/anaconda3/envs/train/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/hadi/Downloads/train-procgen/train_procgen/train.py", line 109, in <module>
main()
File "/home/hadi/Downloads/train-procgen/train_procgen/train.py", line 106, in main
comm=comm)
File "/home/hadi/Downloads/train-procgen/train_procgen/train.py", line 75, in train_fn
max_grad_norm=0.5,
File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/baselines/ppo2/ppo2.py", line 142, in learn
obs, returns, masks, actions, values, neglogpacs, states, epinfos = runner.run() #pylint: disable=E0632
File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/baselines/ppo2/runner.py", line 29, in run
actions, values, self.states, neglogpacs = self.model.step(self.obs, S=self.states, M=self.dones)
File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py", line 93, in step
a, v, state, neglogp = self._evaluate([self.action, self.vf, self.state, self.neglogp], observation, **extra_feed)
File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py", line 75, in _evaluate
return sess.run(variables, feed_dict)
File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(64, 256), b.shape=(256, 15), m=64, n=15, k=256
[[node ppo2_model/pi_1/MatMul (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py:63) ]]
[[ppo2_model/ArgMax/_443]]
(1) Internal: Blas GEMM launch failed : a.shape=(64, 256), b.shape=(256, 15), m=64, n=15, k=256
[[node ppo2_model/pi_1/MatMul (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py:63) ]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node ppo2_model/pi_1/MatMul:
ppo2_model/flatten_1/Reshape (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py:44)
ppo2_model/pi/w/read (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py:61)
Input Source operations connected to node ppo2_model/pi_1/MatMul:
ppo2_model/flatten_1/Reshape (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py:44)
ppo2_model/pi/w/read (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py:61)
Original stack trace for 'ppo2_model/pi_1/MatMul':
File "/anaconda3/envs/train/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/anaconda3/envs/train/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Downloads/train-procgen/train_procgen/train.py", line 109, in <module>
main()
File "/Downloads/train-procgen/train_procgen/train.py", line 106, in main
comm=comm)
File "/Downloads/train-procgen/train_procgen/train.py", line 75, in train_fn
max_grad_norm=0.5,
File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/ppo2/ppo2.py", line 109, in learn
max_grad_norm=max_grad_norm, comm=comm, mpi_rank_weight=mpi_rank_weight)
File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/ppo2/model.py", line 37, in __init__
act_model = policy(nbatch_act, 1, sess)
File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py", line 175, in policy_fn
**extra_tensors
File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py", line 49, in __init__
self.pd, self.pi = self.pdtype.pdfromlatent(latent, init_scale=0.01)
File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/distributions.py", line 65, in pdfromlatent
pdparam = _matching_fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/distributions.py", line 355, in _matching_fc
return fc(tensor, name, size, init_scale=init_scale, init_bias=init_bias)
File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py", line 63, in fc
return tf.matmul(x, w)+b
File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 2647, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5925, in mat_mul
name=name)
File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
This seems to be a general issue with tf-1.14, so I am wondering whether anyone has had any luck with GPU training on this setup. I am training with the command:
mpiexec --mca opal_cuda_support 1 -np 2 python -m train_procgen.train --env_name starpilot --num_levels 200 --distribution_mode easy --test_worker_interval 2
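For scale, here is a rough sketch that estimates how much memory the failing GEMM in the traceback needs. It assumes float32 operands (the log does not state the dtype), and `gemm_footprint_bytes` is a hypothetical helper written for this post, not part of TensorFlow or baselines.

```python
import re

# Error line copied from the traceback above.
ERROR_LINE = ("Blas GEMM launch failed : a.shape=(64, 256), "
              "b.shape=(256, 15), m=64, n=15, k=256")

# Assumption: float32 operands; the log does not state the dtype.
BYTES_PER_ELEMENT = 4


def gemm_footprint_bytes(line, bytes_per_element=BYTES_PER_ELEMENT):
    """Return the bytes needed for A, B and the output of the logged GEMM."""
    match = re.search(r"m=(\d+), n=(\d+), k=(\d+)", line)
    if match is None:
        raise ValueError("no m/n/k fields found in line")
    m, n, k = (int(group) for group in match.groups())
    # A is m x k, B is k x n, and the output is m x n.
    elements = m * k + k * n + m * n
    return elements * bytes_per_element


print(gemm_footprint_bytes(ERROR_LINE))
```

Under that assumption the three matrices together take roughly 83 KiB, so the operand sizes themselves seem unlikely to be what the GPU cannot handle.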