
Cannot train with GPU #18

Open

HadiSDev opened this issue Apr 15, 2021 · 1 comment

@HadiSDev
I am trying to train with tensorflow-gpu==1.14; CUDA and cuDNN are loaded correctly. However, once TensorFlow has finished loading, training gets stuck here:

(screenshot of the console output where training hangs)

After some time, I get this error:

2021-04-15 20:14:04.341576: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(64, 256), b.shape=(256, 15), m=64, n=15, k=256
	 [[{{node ppo2_model/pi_1/MatMul}}]]
	 [[ppo2_model/ArgMax/_443]]
  (1) Internal: Blas GEMM launch failed : a.shape=(64, 256), b.shape=(256, 15), m=64, n=15, k=256
	 [[{{node ppo2_model/pi_1/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hadi/Downloads/train-procgen/train_procgen/train.py", line 109, in <module>
    main()
  File "/home/hadi/Downloads/train-procgen/train_procgen/train.py", line 106, in main
    comm=comm)
  File "/home/hadi/Downloads/train-procgen/train_procgen/train.py", line 75, in train_fn
    max_grad_norm=0.5,
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/baselines/ppo2/ppo2.py", line 142, in learn
    obs, returns, masks, actions, values, neglogpacs, states, epinfos = runner.run() #pylint: disable=E0632
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/baselines/ppo2/runner.py", line 29, in run
    actions, values, self.states, neglogpacs = self.model.step(self.obs, S=self.states, M=self.dones)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py", line 93, in step
    a, v, state, neglogp = self._evaluate([self.action, self.vf, self.state, self.neglogp], observation, **extra_feed)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py", line 75, in _evaluate
    return sess.run(variables, feed_dict)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(64, 256), b.shape=(256, 15), m=64, n=15, k=256
	 [[node ppo2_model/pi_1/MatMul (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py:63) ]]
	 [[ppo2_model/ArgMax/_443]]
  (1) Internal: Blas GEMM launch failed : a.shape=(64, 256), b.shape=(256, 15), m=64, n=15, k=256
	 [[node ppo2_model/pi_1/MatMul (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py:63) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node ppo2_model/pi_1/MatMul:
 ppo2_model/flatten_1/Reshape (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py:44)	
 ppo2_model/pi/w/read (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py:61)

Input Source operations connected to node ppo2_model/pi_1/MatMul:
 ppo2_model/flatten_1/Reshape (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py:44)	
 ppo2_model/pi/w/read (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py:61)

Original stack trace for 'ppo2_model/pi_1/MatMul':
  File "/anaconda3/envs/train/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda3/envs/train/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Downloads/train-procgen/train_procgen/train.py", line 109, in <module>
    main()
  File "/Downloads/train-procgen/train_procgen/train.py", line 106, in main
    comm=comm)
  File "/Downloads/train-procgen/train_procgen/train.py", line 75, in train_fn
    max_grad_norm=0.5,
  File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/ppo2/ppo2.py", line 109, in learn
    max_grad_norm=max_grad_norm, comm=comm, mpi_rank_weight=mpi_rank_weight)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/ppo2/model.py", line 37, in __init__
    act_model = policy(nbatch_act, 1, sess)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py", line 175, in policy_fn
    **extra_tensors
  File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py", line 49, in __init__
    self.pd, self.pi = self.pdtype.pdfromlatent(latent, init_scale=0.01)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/distributions.py", line 65, in pdfromlatent
    pdparam = _matching_fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/distributions.py", line 355, in _matching_fc
    return fc(tensor, name, size, init_scale=init_scale, init_bias=init_bias)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py", line 63, in fc
    return tf.matmul(x, w)+b
  File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 2647, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5925, in mat_mul
    name=name)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

This seems to be a general issue with tf-1.14, so I am wondering how you guys had any luck with GPU training here. I am training with this command:
mpiexec --mca opal_cuda_support 1 -np 2 python -m train_procgen.train --env_name starpilot --num_levels 200 --distribution_mode easy --test_worker_interval 2
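
For context, "Blas GEMM launch failed" in TF 1.x very often means cuBLAS could not get enough GPU memory, which is easy to hit when two MPI workers share a single GPU as in the command above. Below is a minimal TF 1.x sketch of the kind of session config that typically avoids this; it is illustrative only, not the session setup that baselines/train_procgen actually uses, and the 0.45 fraction is just an example value:

import tensorflow as tf

# Cap per-process GPU memory so two workers can coexist on one GPU,
# and grow allocations on demand instead of reserving everything up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.45  # example value: leave headroom for the other rank

sess = tf.Session(config=config)  # build and run the model under this session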

@zzuchen

zzuchen commented Apr 2, 2022

I met the same problem as you; have you solved it? My environment is CUDA 10.0 + cuDNN 7.6.4 + TensorFlow 1.14.0.
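
A quick way to tell whether this is an environment problem (CUDA 10.0 / cuDNN 7.6.4 / TF 1.14 compatibility or a driver issue) rather than something in train_procgen is to run a tiny MatMul of the same shapes on the GPU. This is only a hedged sanity check, assuming a single GPU visible as /device:GPU:0; if it already fails with CUBLAS_STATUS_EXECUTION_FAILED, the install itself is broken:

import tensorflow as tf

# Reproduce the failing op in isolation: (64, 256) x (256, 15) on the GPU.
with tf.device("/device:GPU:0"):
    a = tf.random_normal([64, 256])
    b = tf.random_normal([256, 15])
    c = tf.matmul(a, b)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c).shape)  # expect (64, 15) if cuBLAS is working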
