Inference with noisy source #21

Charlottecuc · 2021-11-10T03:14:14Z

Hi. I tested the model with various kinds of wave files as source. I notice that at inference time, the model performs well with clean source files, but for those not so clean audio files (e.g. 24khz speech recorded by mobile phone, with background of air conditioning, or heavy breathing, which is quite common in real life application), the converted speech is sometime incomprehensible and usually with annoying noise.

I also tried denosing these noisy source files (e.g. using Audition, or other speech enhancement tools), but the converted speech became even worse.

Besides, do you think this line mel_tensor = (torch.log(1e-5 + mel_tensor) - self.mean) / self.std to some extent enlarges the noise...?

Could you please give some ideas of making the model more robust with noisy data? Thank you very much.

The text was updated successfully, but these errors were encountered:

yl4579 · 2021-11-20T03:39:08Z

The normalization may have possibly amplified the noises slightly, yet the point of log mel spec is actually the opposite: it tries to emphasize the speech instead of the noise. Probably the arbitrary mean and standard deviation may have some side effect, but if during training you make the model noise-robust, it should have no problems taking noisy input. As mentioned earlier here, if you corrupt your input with Audiomentations, it should have no problems dealing with noisy input. Just make sure you separate your x_real and x_input and only make x_input noisy.

Charlottecuc · 2021-12-07T12:25:40Z

@yl4579 Hi. Thank you for your reply.
Could you give any advice on the percentage of noisy training files ? Or, should all the x_input files be corrupted?
I did some experiments and the results are not good. I'm not quite sure whether I wrongly separated x_real and x_input in https://github.com/yl4579/StarGANv2-VC/blob/main/losses.py
Thank you very much.

yl4579 · 2022-01-03T06:45:30Z

@Charlottecuc Sorry for the late reply. I was pretty busy at the end of the year. You can make all x_input corrupted, but I'd recommend you set each transformation with a probability of 0.3, so there will be some samples that are not corrupted.

Charlottecuc · 2022-01-17T06:36:38Z

The normalization may have possibly amplified the noises slightly, yet the point of log mel spec is actually the opposite: it tries to emphasize the speech instead of the noise. Probably the arbitrary mean and standard deviation may have some side effect, but if during training you make the model noise-robust, it should have no problems taking noisy input. As mentioned earlier here, if you corrupt your input with Audiomentations, it should have no problems dealing with noisy input. Just make sure you separate your x_real and x_input and only make x_input noisy.

@yl4579 Thank you for your reply. Just to make sure, when you mean "making the model noise-robust during training", do you mean only corrputing the inputs of the cycle-consistency loss of generator, or, corrupting the inputs of the whole adversial training process (e.g. adding something like "denoising loss" to make the discriminator capable of classifing between clean and noisy inputs and force the generator to produce clean outputs)? Could you give more details?

Thank you very much.

yl4579 · 2022-03-25T21:43:58Z

@Charlottecuc I'm sorry for the late reply because this issue was closed and I didn't get any notification. Not sure if it has been resolved, but what I meant was simply corrupting the input to the encoder but asking the model to reconstruct the clean (uncorrupted) version.

MMMMichaelzhang · 2022-04-16T17:04:50Z

mel_tensor = (torch.log(1e-5 + mel_tensor) - self.mean) / self.std
have you fixed the noise problem? when i change mean =0, std =1 the noise gone,but it is too loud. @Charlottecuc

skol101 · 2022-06-05T19:11:04Z

The normalization may have possibly amplified the noises slightly, yet the point of log mel spec is actually the opposite: it tries to emphasize the speech instead of the noise. Probably the arbitrary mean and standard deviation may have some side effect, but if during training you make the model noise-robust, it should have no problems taking noisy input. As mentioned earlier here, if you corrupt your input with Audiomentations, it should have no problems dealing with noisy input. Just make sure you separate your x_real and x_input and only make x_input noisy.

@yl4579
Please, help me here:
where's x_input actually? There's only x_real in trainer.py

x_real, y_org, x_ref, x_ref2, y_trg, z_trg, z_trg2 = batch

yl4579 · 2022-06-08T02:50:57Z

@skol101 You need to pass in a noisy version here, call it x_input. The x_input is processed in meldataset.py with noises and reveberations.

skol101 · 2022-06-08T06:20:53Z

I see, because I thought reverb and noise should be added right in StyleEncoder as per #6 (comment)

Kristopher-Chen · 2022-06-22T02:28:15Z

@Charlottecuc I'm sorry for the late reply because this issue was closed and I didn't get any notification. Not sure if it has been resolved, but what I meant was simply corrupting the input to the encoder but asking the model to reconstruct the clean (uncorrupted) version.

@Charlottecuc @yl4579 Are the noisy inputs added only when training the generator, or both the generator and the discriminator? Thank you!

skol101 · 2022-07-11T08:15:56Z

@Charlottecuc I'm sorry for the late reply because this issue was closed and I didn't get any notification. Not sure if it has been resolved, but what I meant was simply corrupting the input to the encoder but asking the model to reconstruct the clean (uncorrupted) version.

Style encoder is being called several times in the generator, but the only time it's called with x_real param is in the cycle consistency loss. So I guess that's there x_input should be used (but only 30% of the time).

What do you think @Charlottecuc ?

skol101 · 2022-07-15T07:57:26Z

@yl4579

I'm either doing something wrong or adding reverbs and background noises does nothing. When the source (like VCTK p303_013.wav) has breathing, the converted speech has distortions. Maybe the issue is with the HifiGan vocoder, and I shall try a vocoder more tolerant of breathing/noises.

# cycle-consistency loss
    s_org = nets.style_encoder(x_input, y_org)
    x_rec = nets.generator(x_fake, s_org, masks=None, F0=GAN_F0_fake)
    loss_cyc = torch.mean(torch.abs(x_rec - x_real))

In meldataset.py

def __getitem__(self, idx):
        data = self.data_list[idx]
        mel_tensor, label = self._load_data(data)
        ref_data = random.choice(self.data_list)
        ref_mel_tensor, ref_label = self._load_data(ref_data)
        ref2_data = random.choice(self.data_list_per_class[ref_label])
        ref2_mel_tensor, _ = self._load_data(ref2_data)
        x_input, _ = self._load_data(data, True) #x_input is the same as mel_tensor (aka x_real) but with augmenter corruptions
        return mel_tensor, label, ref_mel_tensor, ref2_mel_tensor, ref_label, x_input


 def _load_tensor(self, data, corrupt_x_input=False):
        wave_path, label = data
        label = int(label)
        wave, sr = sf.read(wave_path)

        if corrupt_x_input and random.uniform(0, 1) <= 0.3:
            augmenter = Compose(
                [
                    RoomSimulator(
                        p=0.8,
                        leave_length_unchanged=True,
                    ),
                    AddBackgroundNoise(
                        sounds_path=BACKGROUND_NOISE_FILES,
                        min_snr_in_db=20,
                        max_snr_in_db=35,
                        p=0.5,
                    )
                ]
            )
            try:
                wave = augmenter(samples=wave, sample_rate=sr)
            except IndexError as error:
                print('error index error with wav file', wave_path)
            except ValueError as errorValue:
                print('error value error with wav file', wave_path)
        wave_tensor = torch.from_numpy(wave).float()
        return wave_tensor, label

skol101 · 2022-07-15T08:01:28Z

@Charlottecuc this issue should be reopened to discuss further.

skol101 · 2022-07-20T12:59:49Z

Wow, is this really a mystery @yl4579 ?

Charlottecuc · 2022-07-20T13:41:32Z

@Charlottecuc I'm sorry for the late reply because this issue was closed and I didn't get any notification. Not sure if it has been resolved, but what I meant was simply corrupting the input to the encoder but asking the model to reconstruct the clean (uncorrupted) version.

Style encoder is being called several times in the generator, but the only time it's called with x_real param is in the cycle consistency loss. So I guess that's there x_input should be used (but only 30% of the time).

What do you think @Charlottecuc ?

I think it's not a good idea to add data aug in style encoder since source audio stream will not flow into style encoder at inference time.
In fact I tried almost all the mentioned denoising methods, either suggested in this post or others. Some of them can to some extent reduce the artifacts, but in total, the model is not stable with noisy source wave. It's not like PPG-based models, on which you can easily and clearly design some denoising loss.

Charlottecuc · 2022-07-20T13:43:45Z

@Charlottecuc this issue should be reopened to discuss further.

The issue was closed by @yl4579 , and I am not able to reopen it.
I argee that it should be reopened.

Charlottecuc · 2022-07-20T13:49:54Z

@yl4579

I'm either doing something wrong or adding reverbs and background noises does nothing. When the source (like VCTK p303_013.wav) has breathing, the converted speech has distortions. Maybe the issue is with the HifiGan vocoder, and I shall try a vocoder more tolerant of breathing/noises.

# cycle-consistency loss
    s_org = nets.style_encoder(x_input, y_org)
    x_rec = nets.generator(x_fake, s_org, masks=None, F0=GAN_F0_fake)
    loss_cyc = torch.mean(torch.abs(x_rec - x_real))

In meldataset.py

def __getitem__(self, idx):
        data = self.data_list[idx]
        mel_tensor, label = self._load_data(data)
        ref_data = random.choice(self.data_list)
        ref_mel_tensor, ref_label = self._load_data(ref_data)
        ref2_data = random.choice(self.data_list_per_class[ref_label])
        ref2_mel_tensor, _ = self._load_data(ref2_data)
        x_input, _ = self._load_data(data, True) #x_input is the same as mel_tensor (aka x_real) but with augmenter corruptions
        return mel_tensor, label, ref_mel_tensor, ref2_mel_tensor, ref_label, x_input


 def _load_tensor(self, data, corrupt_x_input=False):
        wave_path, label = data
        label = int(label)
        wave, sr = sf.read(wave_path)

        if corrupt_x_input and random.uniform(0, 1) <= 0.3:
            augmenter = Compose(
                [
                    RoomSimulator(
                        p=0.8,
                        leave_length_unchanged=True,
                    ),
                    AddBackgroundNoise(
                        sounds_path=BACKGROUND_NOISE_FILES,
                        min_snr_in_db=20,
                        max_snr_in_db=35,
                        p=0.5,
                    )
                ]
            )
            try:
                wave = augmenter(samples=wave, sample_rate=sr)
            except IndexError as error:
                print('error index error with wav file', wave_path)
            except ValueError as errorValue:
                print('error value error with wav file', wave_path)
        wave_tensor = torch.from_numpy(wave).float()
        return wave_tensor, label

Training a denoising HiFi-GAN can not largely improve the results for the current issue. Because if you look at the mel-spectrograms, you can see that some parts are vague and unclear if the quality of source wave is low.

skol101 · 2022-07-20T13:55:26Z

Here it was reported that added reverb/background noises did help #6 (comment)

Maybe the solution is to denoise the input wave before proceeding with inference, so something like Facebook denoiser can be used, but this suggestion points to using noise trained vocoder #6 (comment)

https://github.com/facebookresearch/denoiser
https://github.com/rishikksh20/hifigan-denoiser

Charlottecuc · 2022-07-20T14:12:37Z

Here it was reported that added reverb/background noises did help #6 (comment)

Maybe the solution is to denoise the input wave before proceeding with inference, so something like Facebook denoiser can be used, but this suggestion points to using noise trained vocoder #6 (comment)

If you would like to train an any-to-any model, then add data aug to style encoder will help.
If you denoise the source wave before inference, some distortions caused by noise will disappear, but new issue will come, since most denosing models will weaken the voice when eliminating noises and lead to new VC distortions. The problem was confirmed by many VC papers.
Training denoising HiFi-GAN might help, but may not achieve what you expect for VC issues. Because when inferencing with noisy wave, there might be some mispronunciations in converted mels, which can not be post-corrected by vocoder.

yl4579 · 2022-07-20T14:48:26Z

@Charlottecuc Sorry I'm pretty busy with my other paper submissions so I can't join the discussion at this point, but I have reopened the issue for further discussion and will provide some feedback after I finished my work.

yl4579 · 2022-08-22T18:57:59Z

@Charlottecuc I do have some time now to discuss this problem. I have noticed similar problems with noisy input and have not yet come up with a good solution. The major problem with the GAN-based model is that it is difficult to design denoise loss functions because the target is not as clear as in PPG or TTS based VC models (in that case you have L1 reconstruction loss directly). Not sure if you have got any good solution to this problem, but I would suggest adding some noises in the time-frequency domain by reverse mel-scale and recomputing the mel scale (or you can train a model end-to-end if you prefer).

The key here is to add noise to the converted speech and force the model to convert the converted speech back to the clean output. Because one problem I noticed is that even if you add noise to the input during training, the model does not produce good converted examples sometimes. It somehow finds a way to trick the loss function so that the converted speech is not clear, but the second time conversion back to the source domain works quite well so the cycle consistency loss is still low. Adding noises to the converted results force the model to denoise the noisy speech directly. Another way is to add a denoise loss directly where the input is a noisy speech with the source style vector and the output is a clean speech. This might make the model overfit however so the converted speech might not sound similar to the target. This is in general a challenge in this field and there's still a lot of work to be done.

skol101 · 2022-11-09T09:34:08Z

This does a pretty good job of removing noises from the speech https://github.com/Rikorose/DeepFilterNet

mayank-git-hub · 2023-05-16T12:36:42Z

Another approach which works is to first train the model on a clean dataset and once the model is trained, freeze the model parameters and add two enhancement blocks to the encoder and the style encoder to enhance the noisy voice
in the feature domains using synthetically distorted data. We use the embeddings extracted from clean samples by
the original frozen encoder as targets and train the newly added enhancement blocks by minimizing the L1 distance between the target and the outputs obtained by the encoders with enhancement blocks from the distorted samples.

You can refer to our paper https://arxiv.org/pdf/2210.11096.pdf which shows the figures and results on distorted/noisy samples using StarGANv2-vc model architecture.

Patelraj8694 · 2024-05-15T14:34:37Z

Hi @mayank-git-hub , I have similar application to your idea and I want speech conversion whisper or distorted speech. I do not have that much knowledge in fine tuning model , can you help me out ?

yl4579 closed this as completed Nov 20, 2021

skol101 mentioned this issue Jun 21, 2022

About the speech rate of generated voice insunhwang89/StyleVC#2

Open

yl4579 reopened this Jul 20, 2022

yl4579 added the discussion New research topic label Sep 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inference with noisy source #21

Inference with noisy source #21

Charlottecuc commented Nov 10, 2021

yl4579 commented Nov 20, 2021 •

edited

Loading

Charlottecuc commented Dec 7, 2021 •

edited

Loading

yl4579 commented Jan 3, 2022

Charlottecuc commented Jan 17, 2022 •

edited

Loading

yl4579 commented Mar 25, 2022

MMMMichaelzhang commented Apr 16, 2022

skol101 commented Jun 5, 2022 •

edited

Loading

yl4579 commented Jun 8, 2022

skol101 commented Jun 8, 2022

Kristopher-Chen commented Jun 22, 2022

skol101 commented Jul 11, 2022

skol101 commented Jul 15, 2022 •

edited

Loading

skol101 commented Jul 15, 2022

skol101 commented Jul 20, 2022

Charlottecuc commented Jul 20, 2022

Charlottecuc commented Jul 20, 2022 •

edited

Loading

Charlottecuc commented Jul 20, 2022

skol101 commented Jul 20, 2022 •

edited

Loading

Charlottecuc commented Jul 20, 2022 •

edited

Loading

yl4579 commented Jul 20, 2022

yl4579 commented Aug 22, 2022 •

edited

Loading

skol101 commented Nov 9, 2022

mayank-git-hub commented May 16, 2023

Patelraj8694 commented May 15, 2024

Inference with noisy source #21

Inference with noisy source #21

Comments

Charlottecuc commented Nov 10, 2021

yl4579 commented Nov 20, 2021 • edited Loading

Charlottecuc commented Dec 7, 2021 • edited Loading

yl4579 commented Jan 3, 2022

Charlottecuc commented Jan 17, 2022 • edited Loading

yl4579 commented Mar 25, 2022

MMMMichaelzhang commented Apr 16, 2022

skol101 commented Jun 5, 2022 • edited Loading

yl4579 commented Jun 8, 2022

skol101 commented Jun 8, 2022

Kristopher-Chen commented Jun 22, 2022

skol101 commented Jul 11, 2022

skol101 commented Jul 15, 2022 • edited Loading

skol101 commented Jul 15, 2022

skol101 commented Jul 20, 2022

Charlottecuc commented Jul 20, 2022

Charlottecuc commented Jul 20, 2022 • edited Loading

Charlottecuc commented Jul 20, 2022

skol101 commented Jul 20, 2022 • edited Loading

Charlottecuc commented Jul 20, 2022 • edited Loading

yl4579 commented Jul 20, 2022

yl4579 commented Aug 22, 2022 • edited Loading

skol101 commented Nov 9, 2022

mayank-git-hub commented May 16, 2023

Patelraj8694 commented May 15, 2024

yl4579 commented Nov 20, 2021 •

edited

Loading

Charlottecuc commented Dec 7, 2021 •

edited

Loading

Charlottecuc commented Jan 17, 2022 •

edited

Loading

skol101 commented Jun 5, 2022 •

edited

Loading

skol101 commented Jul 15, 2022 •

edited

Loading

Charlottecuc commented Jul 20, 2022 •

edited

Loading

skol101 commented Jul 20, 2022 •

edited

Loading

Charlottecuc commented Jul 20, 2022 •

edited

Loading

yl4579 commented Aug 22, 2022 •

edited

Loading