
some question about train #8

Open · FFY0207 opened this issue Jul 6, 2024 · 6 comments

FFY0207 commented Jul 6, 2024

Epoch 1, Batch 3, Loss: 7.225614070892334
Train step: 2it [00:05, 2.95s/it]
Traceback (most recent call last):
  File "/mnt/e/code/silent_speech/transduction_model.py", line 365, in <module>
    main()
  File "/mnt/e/code/silent_speech/transduction_model.py", line 361, in main
    model = train_model(trainset, devset, device, save_sound_outputs=save_sound_outputs)
  File "/mnt/e/code/silent_speech/transduction_model.py", line 260, in train_model
    loss.backward()  # backpropagation
  File "/home/ffy/anaconda3/envs/ffy112/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/ffy/anaconda3/envs/ffy112/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: unknown error
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

What is causing this error? I lowered the batch size, but that didn't help and the error still occurs.
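A debugging sketch (not from the repository): because CUDA kernels launch asynchronously, the line the traceback points at may not be where the failure actually happened. Forcing synchronous launches before torch is imported usually gives a more precise location:

```python
import os

# Must be set before torch is imported for it to take effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402

# ...then run the training code as usual; the traceback should now point
# at the CUDA operation that actually failed.
```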

FFY0207 (Author) commented Jul 9, 2024

[screenshot of training log]

This is my training log. Why do the loss and accuracy suddenly become much worse starting at epoch 21? How should I handle this?

dgaddy (Owner) commented Jul 10, 2024

The first error sounds like some sort of hardware, driver, or PyTorch error. It is probably unrelated to the code in this repository - maybe check your CUDA and PyTorch installations.

About the loss and accuracy suddenly getting worse: are you using the same batch size as the original code, or is this with a smaller batch? A batch size that is too small is the most likely cause.
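A minimal sanity check of the CUDA/PyTorch installation, independent of this repository, would be something like:

```python
import torch

# Report the installed PyTorch build and whether it can see a GPU.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())

# Tiny forward/backward pass on the GPU. If this also fails with
# "CUDA error: unknown error", the problem is with the driver / CUDA
# toolkit / PyTorch install rather than with this repository's code.
x = torch.randn(64, 64, device="cuda", requires_grad=True)
loss = (x @ x).sum()
loss.backward()
print("backward OK, grad norm:", x.grad.norm().item())
```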

FFY0207 (Author) commented Jul 11, 2024

[screenshot of error message]
Why does evaluation.py run normally with the transduction_model.pt you provided, but the model I trained myself produces the error shown in the screenshot? Can you help me?

Gray-ly commented Aug 30, 2024

It seems you loaded the wrong model; the output dimension should be 80, which matches num_speech_features.
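A sketch of one way to check this, assuming the checkpoint file stores a state_dict (the path below is only an example):

```python
import torch

# Load the checkpoint on the CPU and list parameter shapes.
state_dict = torch.load("models/transduction_model.pt", map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))

# The final output projection should map to 80 features (num_speech_features);
# a different output size suggests the wrong checkpoint was loaded.
```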

Gray-ly commented Aug 30, 2024

> [screenshot of training log]
> This is my training log. Why do the loss and accuracy suddenly become much worse starting at epoch 21? How should I handle this?

I encountered this problem when I reproduced normalizers.pkl by running make_normalizers() in read_emg.py. Doing so obviously produced a pkl that differs from the original file in the repository. Do you know why this is? Thanks for your contribution! @dgaddy

dgaddy (Owner) commented Sep 10, 2024

> I encountered this problem when I reproduced normalizers.pkl by running make_normalizers() in read_emg.py. Doing so obviously produced a pkl that differs from the original file in the repository. Do you know why this is? Thanks for your contribution! @dgaddy

It's been quite a while so I don't really remember, but it's possible I manually adjusted the normalizers to scale down the size of the inputs or outputs. Sometimes larger input or output values can make training less stable. You could try adjusting them and see if that helps. (Adjusting the inputs seems more likely to help. You would want to increase the normalizer's feature_stddevs values to decrease the feature scales; multiplying by something like 2 or 5 seems reasonable. It might also help to compare the values in your normalizers file against the one in the repository.)
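A hedged sketch of that adjustment, assuming normalizers.pkl unpickles to a pair of normalizer objects that each expose a feature_stddevs array (run from the repository root so the class used when pickling can be imported):

```python
import pickle

# Assumption: the pickle holds a pair of normalizer objects (audio and EMG),
# each with a feature_stddevs array, as the comment above suggests.
with open("normalizers.pkl", "rb") as f:
    mfcc_norm, emg_norm = pickle.load(f)

print("EMG feature stddevs:", emg_norm.feature_stddevs)

# Increase the input (EMG) stddevs by a factor of 2-5 so the normalized
# features become smaller, then save the adjusted normalizers to a new file.
emg_norm.feature_stddevs = emg_norm.feature_stddevs * 2
with open("normalizers_scaled.pkl", "wb") as f:
    pickle.dump((mfcc_norm, emg_norm), f)
```

Printing the stddevs from both your regenerated file and the one shipped in the repository is also an easy way to see whether the original values were rescaled by hand.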
