
error in first train gen loss=0.0 #206

Open
lpscr opened this issue Feb 29, 2024 · 6 comments
lpscr commented Feb 29, 2024

Hi, thank you very much for your amazing work! It's truly incredible. I want to train from scratch. When I start the first training session, everything seems to be going well, but every epoch shows loss=0.0, disc loss=0.0, etc.; only the mel loss updates. Here is an image showing how it looks after completing the 50 epochs:

[screenshot: first-stage training log with loss=0.0]

I completed the 50 epochs in four hours on a test dataset.

"I started the second training session now, but I encountered an error. It refuses to start and prompts me with a trace error, leaving me stuck. so need closing the terminal or notebook because freeze

#in this line code
    running_loss += loss_mel.item()
    g_loss.backward()
    # debug trap: drops into an interactive debugger if g_loss is NaN
    if torch.isnan(g_loss):
        from IPython.core.debugger import set_trace
        set_trace()

    # step the per-module optimizers
    optimizer.step('bert_encoder')
    optimizer.step('bert')
    optimizer.step('predictor')
    optimizer.step('predictor_encoder')
Here's a screenshot showing how it looks when it gets stuck during the second training session.

[screenshot: second training session stuck at the debugger trace]

I also attempted to use the 'accelerate launch --mixed_precision=fp16' command both with and without '--mixed_precision', and even tried running without accelerate, using a plain python command, but encountered the same issue.

I'm using the default 'config.yml' with a batch size of 20 and a maximum length of 300. I experimented with different batch sizes (e.g., 16) and maximum lengths (e.g., 100), but the problem persisted.

I've tested on rented GPUs such as the A6000 and on my own machine with an RTX 4090 running Linux under WSL, but encountered the same issue each time. I also tried different datasets, including ones of varying durations (e.g., 1, 4, and 6 hours) from a single speaker.

For the dataset format, I have around 5000 samples for training and 500 for validation, structured in the following format:
filename.wav|transcription|speaker
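
Just to illustrate how I read that format, a line can be split on '|' like this (a minimal sketch only; the repo's own data loader does the actual parsing):

    # illustrative only: split one "filename.wav|transcription|speaker" line
    def parse_line(line):
        wav_path, transcription, speaker = line.rstrip("\n").split("|")
        return wav_path, transcription, speaker

    with open("data/train.txt", encoding="utf-8") as f:
        samples = [parse_line(l) for l in f if l.strip()]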

"I replaced the default dataset with my own, using 'data/train.txt' for training and 'data/val.txt' for validation. However, I'm unsure about the purpose of the 'OOD_text.txt' file. Should I modify or include this file in some way?

Could someone please help me understand what I might be doing wrong here?


lpscr commented Feb 29, 2024

Today I started a new training session and let it run for about 60 epochs. After the 50th epoch it began to update the gen loss, but when I ran the second-stage training it got stuck again :( Could you please check the TensorBoard to see if this behavior is normal?
Here is the TensorBoard:

[TensorBoard screenshot]


lpscr commented Mar 3, 2024

I tried training again with a new, complete 30-hour dataset; the first-stage training has now taken 3 days. I stopped at epoch 75, and when I ran the second-stage training I hit the same problem again :(

I'm not sure what I'm doing wrong. Do I need more epochs for this to work? Please, can someone tell me, because this now takes days to finish. I'm using an RTX 4090, and I don't understand why I get this error in the second-stage training.

yl4579 (Owner) commented Mar 7, 2024

I think it means your loss becomes NaN. Can you print all the variables and see if any of them is NaN?
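
For example, something like this could be dropped in just before g_loss.backward() (a minimal sketch; only loss_mel and g_loss appear in the snippet above, so any other names passed in would be whatever loss terms your training loop defines):

    import math
    import torch

    # print every loss term and flag non-finite (NaN/Inf) values
    def report_losses(**losses):
        for name, value in losses.items():
            v = value.item() if torch.is_tensor(value) else float(value)
            marker = "  <-- non-finite" if not math.isfinite(v) else ""
            print(f"{name}: {v}{marker}")

    # usage, just before g_loss.backward():
    # report_losses(loss_mel=loss_mel, g_loss=g_loss)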


lpscr commented Mar 8, 2024

@yl4579 Hi, thank you very much for the reply. There is no NaN in the losses, only 0; after epoch 50 I get values. Here is the train log:

train.log


lpscr commented Mar 8, 2024

Here is also the TensorBoard for the 30-hour dataset; I tried 76 epochs, then stopped.
[TensorBoard screenshot, 30-hour dataset]

@martinambrus

Issue #254, as well as its connected PR #253, solved this issue for me.
