Inference with noisy source #21

Open
Charlottecuc opened this issue Nov 10, 2021 · 24 comments
Labels
discussion New research topic

Comments

@Charlottecuc

Hi. I tested the model with various kinds of wave files as the source. At inference time, the model performs well with clean source files, but for less clean audio (e.g. 24 kHz speech recorded on a mobile phone, with air-conditioning noise in the background or heavy breathing, which is quite common in real-life applications), the converted speech is sometimes incomprehensible and usually contains annoying noise.

I also tried denoising these noisy source files (e.g. using Audition or other speech enhancement tools), but the converted speech became even worse.

Besides, do you think this line mel_tensor = (torch.log(1e-5 + mel_tensor) - self.mean) / self.std to some extent enlarges the noise...?

Could you please give some ideas of making the model more robust with noisy data? Thank you very much.
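
For reference, here is a small numeric sketch of what that normalization line does to a speech-level bin, a noise-level bin, and digital silence. The constants mean = -4 and std = 4 and the three energy values are assumptions chosen for illustration; the thread does not state the actual values of self.mean and self.std.

```python
import math

# Assumed constants, for illustration only (not stated in this thread).
MEAN, STD = -4.0, 4.0
EPS = 1e-5


def normalize(mel_power: float) -> float:
    """Scalar version of (log(1e-5 + mel) - mean) / std."""
    return (math.log(EPS + mel_power) - MEAN) / STD


# Hypothetical mel-bin energies: speech, noise 30 dB below it, and silence.
speech, noise, silence = 1.0, 1e-3, 0.0

linear_ratio = speech / noise                       # ratio in the linear power domain
log_gap = normalize(speech) - normalize(noise)      # speech vs. noise after normalization
floor_gap = normalize(noise) - normalize(silence)   # noise vs. silence after normalization

print(f"linear speech/noise ratio:     {linear_ratio:.0f}")
print(f"normalized speech - noise gap: {log_gap:.3f}")
print(f"normalized noise - silence gap: {floor_gap:.3f}")
```

Under these assumed values, noise that is 1000 times weaker than speech in power still sits more than one normalized unit above silence. The mean only shifts every bin and the std only rescales every bin, so those two constants do not change the ordering of bins; the compression comes from the log.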

@yl4579
Owner

yl4579 commented Nov 20, 2021

The normalization may have amplified the noise slightly, but the point of the log mel spectrogram is actually the opposite: it tries to emphasize the speech rather than the noise. The arbitrary mean and standard deviation may have some side effects, but if you make the model noise-robust during training, it should have no problem taking noisy input. As mentioned earlier here, if you corrupt your input with Audiomentations, the model should be able to deal with noisy input. Just make sure you separate your x_real and x_input and only make x_input noisy.
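
A minimal sketch of keeping x_real and x_input separate, assuming waveform-level corruption. White Gaussian noise at a chosen SNR stands in for the Audiomentations transforms here, and the names add_noise_at_snr and make_training_pair are hypothetical, not part of the repository. If the data loader also applies a random crop or random gain (not shown in this thread), loading the same file twice would give the two tensors different crops, so this sketch crops once and derives x_input from that crop.

```python
import numpy as np


def add_noise_at_snr(wave, snr_db, rng):
    """Return a copy of `wave` with white Gaussian noise at the given SNR in dB."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return (wave + noise).astype(np.float32)


def make_training_pair(wave, crop_len, snr_db, p_corrupt, rng):
    """Crop once, then build x_input from the same crop so the pair stays aligned."""
    start = rng.integers(0, len(wave) - crop_len + 1)
    x_real = wave[start:start + crop_len].astype(np.float32)  # clean target
    if rng.random() < p_corrupt:
        x_input = add_noise_at_snr(x_real, snr_db, rng)        # noisy model input
    else:
        x_input = x_real.copy()                                 # left uncorrupted
    return x_real, x_input


rng = np.random.default_rng(0)
wave = np.sin(2 * np.pi * 220 * np.arange(24000) / 24000)  # 1 s synthetic tone at 24 kHz
x_real, x_input = make_training_pair(wave, crop_len=12000, snr_db=20.0,
                                     p_corrupt=1.0, rng=rng)
```

Only x_input is ever modified; x_real stays clean so that any reconstruction loss can use it as the target.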

@yl4579 closed this as completed Nov 20, 2021
@Charlottecuc
Author

Charlottecuc commented Dec 7, 2021

@yl4579 Hi. Thank you for your reply.
Could you give any advice on the percentage of noisy training files? Or should all the x_input files be corrupted?
I did some experiments and the results are not good. I'm not sure whether I separated x_real and x_input correctly in https://github.com/yl4579/StarGANv2-VC/blob/main/losses.py
Thank you very much.

@yl4579
Owner

yl4579 commented Jan 3, 2022

@Charlottecuc Sorry for the late reply. I was pretty busy at the end of the year. You can make all x_input corrupted, but I'd recommend you set each transformation with a probability of 0.3, so there will be some samples that are not corrupted.
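
As a quick worked example of what a per-transformation probability of 0.3 implies: assuming each transformation fires independently, the share of samples that stay completely clean is the product of the individual "skip" probabilities. The helper name p_untouched is hypothetical.

```python
from math import prod


def p_untouched(probs):
    """Probability that none of the independent transforms fires."""
    return prod(1.0 - p for p in probs)


for n in (1, 2, 3):
    clean = p_untouched([0.3] * n)
    print(f"{n} transform(s) at p=0.3: {clean:.3f} clean, {1 - clean:.3f} corrupted")
```

With two transformations at p = 0.3, about half of the samples (0.7 * 0.7 = 0.49) remain uncorrupted.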

@Charlottecuc
Author

Charlottecuc commented Jan 17, 2022

> The normalization may have possibly amplified the noises slightly, yet the point of log mel spec is actually the opposite: it tries to emphasize the speech instead of the noise. Probably the arbitrary mean and standard deviation may have some side effect, but if during training you make the model noise-robust, it should have no problems taking noisy input. As mentioned earlier here, if you corrupt your input with Audiomentations, it should have no problems dealing with noisy input. Just make sure you separate your x_real and x_input and only make x_input noisy.

@yl4579 Thank you for your reply. Just to make sure: by "making the model noise-robust during training", do you mean only corrupting the inputs of the generator's cycle-consistency loss, or corrupting the inputs of the whole adversarial training process (e.g. adding something like a "denoising loss" so that the discriminator can classify clean versus noisy inputs and force the generator to produce clean outputs)? Could you give more details?

Thank you very much.

@yl4579
Owner

yl4579 commented Mar 25, 2022

@Charlottecuc I'm sorry for the late reply because this issue was closed and I didn't get any notification. Not sure if it has been resolved, but what I meant was simply corrupting the input to the encoder but asking the model to reconstruct the clean (uncorrupted) version.
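
One possible reading of this, sketched with toy stand-ins: the generator receives the corrupted x_input, while the cycle-consistency target remains the clean x_real. The function denoising_cycle_loss, the two-argument generator and style_encoder signatures, and the toy callables are all assumptions made so the sketch runs on its own; the real networks also take F0 features and masks, which are omitted here.

```python
import numpy as np


def denoising_cycle_loss(generator, style_encoder, x_real, x_input, y_org, s_trg):
    """L1 cycle loss where the generator sees x_input but is scored against x_real."""
    x_fake = generator(x_input, s_trg)       # convert the (possibly noisy) source
    s_org = style_encoder(x_real, y_org)     # source-speaker style, taken from clean audio
    x_rec = generator(x_fake, s_org)         # convert back to the source speaker
    return float(np.mean(np.abs(x_rec - x_real)))  # the target is the clean mel


# Toy stand-ins so the sketch runs: identity "generator", constant "style".
def toy_generator(x, s):
    return x


def toy_style_encoder(x, y):
    return np.zeros(4)


rng = np.random.default_rng(0)
x_real = rng.normal(size=(80, 192))                      # clean mel: 80 bins x 192 frames
x_input = x_real + 0.1 * rng.normal(size=x_real.shape)   # corrupted copy of the same mel

loss_noisy = denoising_cycle_loss(toy_generator, toy_style_encoder,
                                  x_real, x_input, 0, None)
loss_clean = denoising_cycle_loss(toy_generator, toy_style_encoder,
                                  x_real, x_real, 0, None)
```

With the identity stand-in, the loss is zero for clean input and positive for corrupted input, which is the pressure that would push a trainable generator to remove the corruption.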

@MMMMichaelzhang

mel_tensor = (torch.log(1e-5 + mel_tensor) - self.mean) / self.std
Have you fixed the noise problem? When I change to mean = 0, std = 1, the noise is gone, but the output is too loud. @Charlottecuc

@skol101

skol101 commented Jun 5, 2022

> The normalization may have possibly amplified the noises slightly, yet the point of log mel spec is actually the opposite: it tries to emphasize the speech instead of the noise. Probably the arbitrary mean and standard deviation may have some side effect, but if during training you make the model noise-robust, it should have no problems taking noisy input. As mentioned earlier here, if you corrupt your input with Audiomentations, it should have no problems dealing with noisy input. Just make sure you separate your x_real and x_input and only make x_input noisy.

@yl4579
Please, help me here:
where's x_input actually? There's only x_real in trainer.py

x_real, y_org, x_ref, x_ref2, y_trg, z_trg, z_trg2 = batch

@yl4579
Owner

yl4579 commented Jun 8, 2022

@skol101 You need to pass in a noisy version here; call it x_input. The x_input is produced in meldataset.py by adding noise and reverberation.

@skol101

skol101 commented Jun 8, 2022

I see, because I thought reverb and noise should be added right in StyleEncoder as per #6 (comment)

@Kristopher-Chen

> @Charlottecuc I'm sorry for the late reply because this issue was closed and I didn't get any notification. Not sure if it has been resolved, but what I meant was simply corrupting the input to the encoder but asking the model to reconstruct the clean (uncorrupted) version.

@Charlottecuc @yl4579 Are the noisy inputs added only when training the generator, or both the generator and the discriminator? Thank you!

@skol101

skol101 commented Jul 11, 2022

> @Charlottecuc I'm sorry for the late reply because this issue was closed and I didn't get any notification. Not sure if it has been resolved, but what I meant was simply corrupting the input to the encoder but asking the model to reconstruct the clean (uncorrupted) version.

Style encoder is being called several times in the generator, but the only time it's called with the x_real param is in the cycle-consistency loss. So I guess that's where x_input should be used (but only 30% of the time).

What do you think @Charlottecuc ?

@skol101

skol101 commented Jul 15, 2022

@yl4579

I'm either doing something wrong or adding reverb and background noise does nothing. When the source (like VCTK p303_013.wav) has breathing, the converted speech has distortions. Maybe the issue is with the HiFi-GAN vocoder, and I should try a vocoder that is more tolerant of breathing and noise.

# cycle-consistency loss
s_org = nets.style_encoder(x_input, y_org)  # style code from the (possibly corrupted) x_input
x_rec = nets.generator(x_fake, s_org, masks=None, F0=GAN_F0_fake)
loss_cyc = torch.mean(torch.abs(x_rec - x_real))  # L1 distance to the clean x_real

In meldataset.py

def __getitem__(self, idx):
    data = self.data_list[idx]
    mel_tensor, label = self._load_data(data)
    ref_data = random.choice(self.data_list)
    ref_mel_tensor, ref_label = self._load_data(ref_data)
    ref2_data = random.choice(self.data_list_per_class[ref_label])
    ref2_mel_tensor, _ = self._load_data(ref2_data)
    # x_input is the same as mel_tensor (aka x_real) but with augmenter corruptions
    # (_load_data is not shown; presumably it forwards the flag to _load_tensor)
    x_input, _ = self._load_data(data, True)
    return mel_tensor, label, ref_mel_tensor, ref2_mel_tensor, ref_label, x_input


def _load_tensor(self, data, corrupt_x_input=False):
    wave_path, label = data
    label = int(label)
    wave, sr = sf.read(wave_path)

    # Corrupt only when requested, and then only about 30% of the time
    if corrupt_x_input and random.uniform(0, 1) <= 0.3:
        augmenter = Compose(
            [
                RoomSimulator(
                    p=0.8,
                    leave_length_unchanged=True,
                ),
                AddBackgroundNoise(
                    sounds_path=BACKGROUND_NOISE_FILES,
                    min_snr_in_db=20,
                    max_snr_in_db=35,
                    p=0.5,
                ),
            ]
        )
        try:
            wave = augmenter(samples=wave, sample_rate=sr)
        except IndexError:
            # on failure the uncorrupted waveform is used as-is
            print('error index error with wav file', wave_path)
        except ValueError:
            print('error value error with wav file', wave_path)
    wave_tensor = torch.from_numpy(wave).float()
    return wave_tensor, label
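
For what it's worth, the effective corruption rate of this configuration can be worked out from the three probabilities in the snippet, assuming the outer check and the two transforms are sampled independently:

```python
# Probabilities taken from the snippet above.
p_gate = 0.3    # outer `random.uniform(0, 1) <= 0.3` check
p_room = 0.8    # RoomSimulator(p=0.8)
p_noise = 0.5   # AddBackgroundNoise(p=0.5)

# Inside the gate, a sample stays clean only if both transforms are skipped.
p_any_given_gate = 1.0 - (1.0 - p_room) * (1.0 - p_noise)

p_reverb = p_gate * p_room          # share of samples that get reverb
p_background = p_gate * p_noise     # share of samples that get background noise
p_any = p_gate * p_any_given_gate   # share of samples with any corruption at all

print(f"reverb: {p_reverb:.2f}, background noise: {p_background:.2f}, "
      f"any corruption: {p_any:.2f}")
```

So roughly 27% of x_input samples differ from x_real at all, and only about 15% contain background noise (at 20 to 35 dB SNR). That may be one reason the augmentation appears to have little effect, though this is a guess rather than a confirmed cause.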

@skol101

skol101 commented Jul 15, 2022

@Charlottecuc this issue should be reopened to discuss further.

@skol101

skol101 commented Jul 20, 2022

Wow, is this really a mystery @yl4579 ?

@Charlottecuc
Author

> > @Charlottecuc I'm sorry for the late reply because this issue was closed and I didn't get any notification. Not sure if it has been resolved, but what I meant was simply corrupting the input to the encoder but asking the model to reconstruct the clean (uncorrupted) version.
>
> Style encoder is being called several times in the generator, but the only time it's called with x_real param is in the cycle consistency loss. So I guess that's where x_input should be used (but only 30% of the time).
>
> What do you think @Charlottecuc ?

I think it's not a good idea to add data augmentation in the style encoder, since the source audio does not flow into the style encoder at inference time.
In fact, I tried almost all the denoising methods mentioned in this post and others. Some of them reduce the artifacts to some extent, but overall the model is not stable with noisy source audio. It's not like PPG-based models, for which you can easily design a clear denoising loss.

@Charlottecuc
Author

Charlottecuc commented Jul 20, 2022

> @Charlottecuc this issue should be reopened to discuss further.

The issue was closed by @yl4579 , and I am not able to reopen it.
I agree that it should be reopened.

@Charlottecuc
Author

> @yl4579
>
> I'm either doing something wrong or adding reverbs and background noises does nothing. When the source (like VCTK p303_013.wav) has breathing, the converted speech has distortions. Maybe the issue is with the HifiGan vocoder, and I shall try a vocoder more tolerant of breathing/noises.

Training a denoising HiFi-GAN does not substantially improve the results for this issue, because if you look at the mel-spectrograms, you can see that some parts are vague and unclear when the quality of the source wave is low.

@skol101

skol101 commented Jul 20, 2022

Here it was reported that added reverb/background noises did help #6 (comment)

Maybe the solution is to denoise the input wave before running inference, so something like Facebook's denoiser can be used, but this suggestion points to using a noise-trained vocoder #6 (comment)

https://github.com/facebookresearch/denoiser
https://github.com/rishikksh20/hifigan-denoiser

@Charlottecuc
Author

Charlottecuc commented Jul 20, 2022

> Here it was reported that added reverb/background noises did help #6 (comment)
>
> Maybe the solution is to denoise the input wave before proceeding with inference, so something like Facebook denoiser can be used, but this suggestion points to using noise trained vocoder #6 (comment)

If you would like to train an any-to-any model, then adding data augmentation to the style encoder will help.
If you denoise the source wave before inference, some distortions caused by noise will disappear, but new issues will appear, since most denoising models weaken the voice when removing noise, which leads to new VC distortions. This problem has been confirmed by many VC papers.
Training a denoising HiFi-GAN might help, but may not achieve what you expect for VC, because when running inference with a noisy wave there may be mispronunciations in the converted mels, which cannot be corrected afterwards by the vocoder.

@yl4579 reopened this Jul 20, 2022
@yl4579
Owner

yl4579 commented Jul 20, 2022

@Charlottecuc Sorry, I'm pretty busy with my other paper submissions, so I can't join the discussion at this point, but I have reopened the issue for further discussion and will provide some feedback after I finish my work.

@yl4579
Owner

yl4579 commented Aug 22, 2022

@Charlottecuc I do have some time now to discuss this problem. I have noticed similar problems with noisy input and have not yet come up with a good solution. The major problem with the GAN-based model is that it is difficult to design denoising loss functions, because the target is not as clear as in PPG- or TTS-based VC models (where you have an L1 reconstruction loss directly). I'm not sure if you have found a good solution to this problem, but I would suggest adding noise in the time-frequency domain by reversing the mel scale and then recomputing the mel spectrogram (or you can train a model end-to-end if you prefer).

The key here is to add noise to the converted speech and force the model to convert the noisy converted speech back to the clean output. One problem I noticed is that even if you add noise to the input during training, the model sometimes does not produce good converted examples. It somehow finds a way to trick the loss function: the converted speech is not clear, but the second conversion back to the source domain works quite well, so the cycle-consistency loss is still low. Adding noise to the converted results forces the model to denoise the noisy speech directly. Another way is to add a denoising loss directly, where the input is noisy speech with the source style vector and the target output is clean speech. However, this might make the model overfit, so the converted speech might not sound similar to the target. This is in general a challenge in this field and there's still a lot of work to be done.
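
A rough sketch of the "add noise to the converted speech" idea, under several assumptions. Instead of inverting the mel filterbank back to linear frequency, this simplified version adds random noise power directly to the mel bins after undoing the log normalization. The constants mean = -4 and std = 4, the function names, and the identity stand-in for the generator are all hypothetical. In real training the noise step would have to be written with differentiable tensor ops so that gradients still reach the first conversion.

```python
import numpy as np

MEAN, STD, EPS = -4.0, 4.0, 1e-5   # assumed normalization constants (hypothetical)


def add_noise_in_mel_domain(mel_norm, snr_db, rng):
    """Undo the log normalization, add noise power per bin, then re-normalize."""
    power = np.maximum(np.exp(mel_norm * STD + MEAN) - EPS, 0.0)   # back to mel power
    noise_level = power.mean() / (10.0 ** (snr_db / 10.0))         # average noise power
    noise = noise_level * rng.exponential(1.0, size=power.shape)   # non-negative noise
    return (np.log(EPS + power + noise) - MEAN) / STD


def noisy_cycle_loss(generator, x_real, s_org, s_trg, snr_db, rng):
    """Cycle loss where the converted mel is corrupted before converting back."""
    x_fake = generator(x_real, s_trg)                            # first conversion
    x_fake_noisy = add_noise_in_mel_domain(x_fake, snr_db, rng)  # corrupt the converted mel
    x_rec = generator(x_fake_noisy, s_org)                       # convert back to the source
    return float(np.mean(np.abs(x_rec - x_real)))                # target is the clean source


def identity_generator(x, s):
    """Toy stand-in so the sketch runs; a real generator would use the style s."""
    return x


rng = np.random.default_rng(0)
mel_power = rng.exponential(1.0, size=(80, 64))         # synthetic mel power values
x_real = (np.log(EPS + mel_power) - MEAN) / STD         # normalized log-mel
loss = noisy_cycle_loss(identity_generator, x_real, None, None, snr_db=10.0, rng=rng)
```

Because the second conversion now starts from a noisy mel, a low loss can only be reached if that conversion actually removes the noise, which is the behavior described above.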

@yl4579 added the discussion (New research topic) label Sep 18, 2022
@skol101

skol101 commented Nov 9, 2022

This does a pretty good job of removing noises from the speech https://github.com/Rikorose/DeepFilterNet

@mayank-git-hub

Another approach that works is to first train the model on a clean dataset and, once the model is trained, freeze the model parameters and add two enhancement blocks to the encoder and the style encoder to enhance the noisy voice in the feature domain using synthetically distorted data. We use the embeddings extracted from clean samples by the original frozen encoder as targets, and train the newly added enhancement blocks by minimizing the L1 distance between these targets and the outputs that the encoders with enhancement blocks produce from the distorted samples.

You can refer to our paper https://arxiv.org/pdf/2210.11096.pdf which shows the figures and results on distorted/noisy samples using StarGANv2-vc model architecture.
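
A toy sketch of the training objective described in this comment, not the paper's implementation. The frozen encoder is replaced by a fixed random projection, and the enhancement block by a residual linear map; both choices, and all names below, are assumptions made so the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the frozen, pre-trained encoder weights (never updated).
W_frozen = rng.normal(size=(16, 80)) / np.sqrt(80)


def frozen_encoder(x):
    """Frozen encoder: mel frames (80, T) -> features (16, T)."""
    return np.tanh(W_frozen @ x)


def enhancement_block(h, A):
    """Residual linear block in the feature domain; A is the only trainable part."""
    return h + A @ h


def enhancement_loss(x_clean, x_distorted, A):
    """L1 distance between clean-sample embeddings and enhanced distorted embeddings."""
    target = frozen_encoder(x_clean)                             # targets from clean samples
    output = enhancement_block(frozen_encoder(x_distorted), A)   # enhanced distorted features
    return float(np.mean(np.abs(target - output)))


x_clean = rng.normal(size=(80, 100))
x_distorted = x_clean + 0.3 * rng.normal(size=x_clean.shape)    # synthetic distortion
A0 = np.zeros((16, 16))                                         # block starts as the identity
print(enhancement_loss(x_clean, x_distorted, A0))
```

Training would then minimize enhancement_loss with respect to A only, leaving the encoder weights untouched, so behavior on clean input is preserved by construction when A starts at zero.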

@Patelraj8694

Hi @mayank-git-hub, I have a similar application to your idea: I want to do speech conversion for whispered or distorted speech. I don't have much experience with fine-tuning models. Can you help me out?

7 participants