Model device is not being set from `train_model` #123

rusheb · 2023-03-24T16:05:07Z

Description

In train_model.py, the device argument passed to train is obtained via the get_device function. However, this device is not passed to the model. Instead, the model uses the default HookedTransformer device, which is "cuda" if available, otherwise "cpu".

Steps to reproduce

Train a model on an M1 Mac, e.g.

poetry run python scripts/train_model.py ./data/maze/g4-n10

Check the logs. device will be reported as "mps" while model.device will be set to "cpu".

Mitigation

One possible fix would be to add a device parameter to ConfigHolder.create_model() and set the device on the HookedTransformerConfig.
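A minimal sketch of that idea, using stand-in classes so it runs without TransformerLens installed. `TransformerConfigStandIn`, `ConfigHolderSketch`, and `create_model_config` are hypothetical names for illustration only; the real ConfigHolder and HookedTransformerConfig have more fields, and the actual change may look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransformerConfigStandIn:
    """Stand-in for HookedTransformerConfig (assumption: only these fields matter here)."""

    n_layers: int
    d_model: int
    # None means "let the library pick its own default device".
    device: Optional[str] = None


@dataclass
class ConfigHolderSketch:
    """Hypothetical, heavily simplified version of ConfigHolder."""

    n_layers: int = 2
    d_model: int = 64

    def create_model_config(self, device: Optional[str] = None) -> TransformerConfigStandIn:
        # Forward the caller's device instead of silently relying on the default.
        return TransformerConfigStandIn(
            n_layers=self.n_layers,
            d_model=self.d_model,
            device=device,
        )


holder = ConfigHolderSketch()
cfg = holder.create_model_config(device="cpu")
print(cfg.device)
```

The point of the sketch is only that the device chosen in train_model is threaded through to the model config, so that the reported device and model.device cannot disagree.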

I attempted this and got the following error:

Traceback (most recent call last):
  File "/Users/rusheb/code/maze-transformer/scripts/train_model.py", line 75, in <module>
    fire.Fire(train_model)
  File "/Users/rusheb/code/maze-transformer/.venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/Users/rusheb/code/maze-transformer/.venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/Users/rusheb/code/maze-transformer/.venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/Users/rusheb/code/maze-transformer/scripts/train_model.py", line 69, in train_model
    train(dataloader, cfg, logger, output_path, device)
  File "/Users/rusheb/code/maze-transformer/maze_transformer/training/training.py", line 88, in train
    loss = model(batch_on_device[:, :-1], return_type="loss")
  File "/Users/rusheb/code/maze-transformer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/rusheb/code/maze-transformer/.venv/lib/python3.10/site-packages/transformer_lens/HookedTransformer.py", line 302, in forward
    residual = block(
  File "/Users/rusheb/code/maze-transformer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/rusheb/code/maze-transformer/.venv/lib/python3.10/site-packages/transformer_lens/components.py", line 693, in forward
    self.attn(
  File "/Users/rusheb/code/maze-transformer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/rusheb/code/maze-transformer/.venv/lib/python3.10/site-packages/transformer_lens/components.py", line 408, in forward
    attn_scores = self.apply_causal_mask(
  File "/Users/rusheb/code/maze-transformer/.venv/lib/python3.10/site-packages/transformer_lens/components.py", line 471, in apply_causal_mask
    return torch.where(
RuntimeError: 0'th index 32 of x tensor does not match the other tensors

I'm not sure of the cause of this error. It might be that HookedTransformer does not support Mac (MPS) acceleration.


luciaquirke · 2023-03-25T06:50:58Z

TransformerLens does not support Mac acceleration, possibly because of the many outstanding MPS issues in PyTorch: pytorch/pytorch#77764

rusheb · 2023-03-25T08:42:45Z

In that case, does it make sense to remove the mps branch from get_device()?

luciaquirke · 2023-03-26T00:40:32Z

Yeah, maybe. I'm talking to Joseph about adding MPS support to TransformerLens, e.g. https://github.com/neelnanda-io/TransformerLens/pull/221/files, but we would need to change the pinned PyTorch version in both circuitsvis and TransformerLens for that. Maybe we should remove the MPS stuff in the meantime.
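As a rough sketch of what dropping the MPS branch could look like: the function below is hypothetical (the real get_device is not shown in this thread) and takes availability flags as arguments so it can run without PyTorch. In the actual script those flags would presumably come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def get_device_sketch(
    cuda_available: bool,
    mps_available: bool,
    allow_mps: bool = False,
) -> str:
    """Pick a device name, skipping MPS unless it is explicitly allowed.

    Hypothetical helper: the name, signature, and allow_mps flag are
    assumptions made for this example, not the project's real API.
    """
    if cuda_available:
        return "cuda"
    if mps_available and allow_mps:
        # Opt-in only, since the model library does not support MPS yet.
        return "mps"
    return "cpu"


# On an M1 Mac (no CUDA, MPS present) the default now falls back to CPU.
print(get_device_sketch(cuda_available=False, mps_available=True))
```

Keeping the opt-in flag rather than deleting the branch outright is one design choice among several; it would make it easy to re-enable MPS once upstream support lands.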

rusheb added the bug label · Mar 24, 2023
