- To generate adversarial examples for your own files, make sure each file is sampled at 16 kHz and stores its samples as signed 16-bit integers. Our method is based on a multi-objective evolutionary algorithm with three objectives: CTC loss, speech similarity, and speech signal-to-noise ratio.
- Datasets of our experiments: We use THCHS-30 and AISHELL-1. From each of the two datasets we randomly select 100 audio samples in WAV format as the targets of our adversarial attack.
- Chinese ASR system: The Chinese ASR system we attack is DeepSpeech2, developed by Baidu.
Install the DeepSpeech2 system first. One implementation of DeepSpeech2 can be found here. This project is built on the PaddlePaddle-based DeepSpeech2 project. The DeepSpeech2 paper is "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin". The project supports training and prediction on Windows and Linux, as well as inference on development boards such as the NVIDIA Jetson.
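The input format requirement stated above (16 kHz, signed 16-bit samples) can be checked before running an attack. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav_format` is hypothetical and is not part of this repository.

```python
import wave


def check_wav_format(path):
    """Return (ok, details) for a WAV file expected to be 16 kHz, 16-bit PCM.

    Hypothetical helper for illustration; not part of this repository.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()      # samples per second
        width = wav.getsampwidth()     # bytes per sample; 2 bytes = 16 bits
        channels = wav.getnchannels()
    # 16-bit PCM samples in a WAV file are stored as signed integers.
    ok = rate == 16000 and width == 2
    details = {
        "sample_rate": rate,
        "sample_width_bytes": width,
        "channels": channels,
    }
    return ok, details
```

For example, `check_wav_format("chinese_audio.wav")` returns a flag plus the detected parameters, so a file that fails the check can be resampled or converted before it is passed to the attack script.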
Copy the files sadversarial_tools.py, nsga3based.py, and adversarial_model.py into the DeepSpeech2 project directory. Then create and run an attack:
python nsga3based.py
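The attack is described above as a multi-objective evolutionary search over three objectives, and the script name suggests an NSGA-III style algorithm. As a rough illustration of how candidates are typically compared in such a search, Pareto dominance can be sketched as follows. This is not the code in nsga3based.py: the function names, the tuple layout `(ctc_loss, similarity, snr)`, the sample values, and the assumption that CTC loss is minimized while similarity and SNR are maximized are all illustrative.

```python
def dominates(a, b):
    """Return True if candidate a Pareto-dominates candidate b.

    Each candidate is a tuple (ctc_loss, similarity, snr).
    Assumption: lower ctc_loss is better; higher similarity and snr are better.
    """
    # Negate the maximized objectives so that every objective is minimized.
    a_min = (a[0], -a[1], -a[2])
    b_min = (b[0], -b[1], -b[2])
    no_worse = all(x <= y for x, y in zip(a_min, b_min))
    strictly_better = any(x < y for x, y in zip(a_min, b_min))
    return no_worse and strictly_better


def non_dominated(candidates):
    """Return the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up objective values, for illustration only.
population = [
    (0.9, 0.99, 30.0),
    (1.5, 0.95, 25.0),   # worse than the first candidate on every objective
    (0.5, 0.90, 20.0),   # better loss, worse similarity and SNR
]
front = non_dominated(population)
```

Here `front` keeps the first and third candidates, since neither is better than the other on all three objectives. A full NSGA-III implementation additionally uses reference points to keep the surviving population diverse; that part is omitted here.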
To verify that the attack succeeded, use this script to transcribe the audio with DeepSpeech2:
python recognization.py
We encourage readers to listen to our Chinese adversarial audio examples and the corresponding originals in the attacking_samples directory.
chinese_audio.wav is recognized as "想听歌曲父亲" and adversarial_audio.wav is recognized as "想听歌曲母亲" by the DeepSpeech2 system.
ctcloss: 0.94573337
final_text decoded as: 想听歌曲母亲
Audio similarity to input: 0.9966
chinese_audio_phrase.wav is recognized as "可怜好哦" and adversarial_audio_phrase.wav is recognized as "取款机" by the DeepSpeech2 system.
ctcloss: 11.0739765
final_text decoded as: 取款机
Audio similarity to input: 0.8445
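The logs above report CTC loss and audio similarity but not the third objective, the signal-to-noise ratio. As a hedged illustration, the sketch below assumes the common definition SNR = 10 * log10(signal power / noise power), where the noise is the sample-wise difference between the adversarial and original audio. The repository may compute it differently, and the function name `snr_db` is hypothetical.

```python
import math


def snr_db(original, adversarial):
    """Signal-to-noise ratio in dB of an adversarial waveform.

    Assumption: the noise is the perturbation (adversarial - original),
    and both inputs are equal-length sequences of sample values.
    """
    if len(original) != len(adversarial):
        raise ValueError("waveforms must have the same length")
    signal_power = sum(x * x for x in original)
    noise_power = sum((a - x) ** 2 for x, a in zip(original, adversarial))
    if noise_power == 0:
        return math.inf       # identical audio: no perturbation at all
    if signal_power == 0:
        return -math.inf      # silent original with a non-zero perturbation
    return 10 * math.log10(signal_power / noise_power)
```

A higher value means a quieter perturbation relative to the original speech, which is why an attack would normally try to keep this objective large.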