A toolkit for speech segmentation based on BIC (Bayesian Information Criterion) and neural networks such as BiLSTM.
- Python>=3.6
- tensorflow=1.13.1
- keras=2.2.4
- Librosa
- Numpy
- Scipy
You can use an Anaconda installation to satisfy all the required packages except Librosa.
To install Librosa, try the following command:
conda install -c conda-forge librosa
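Before running the scripts, it can help to confirm that the listed packages are visible to your interpreter. The following is a minimal sketch using only the standard library; the list of import names is an assumption based on the requirements above, and it does not check the pinned versions.

```python
import importlib.util
import sys

# Import names assumed from the requirements list above.
REQUIRED = ["tensorflow", "keras", "librosa", "numpy", "scipy"]


def missing_packages(names):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 6):
        print("Python >= 3.6 is required")
    missing = missing_packages(REQUIRED)
    print("missing packages:", missing if missing else "none")
```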
-
Run the script multi_detect_BIC.py to test the segmentation on a simple wav file:
python multi_detect_BIC.py
You will get a speech segmentation result as shown below:
-
In the Python script multi_detect_BIC.py, the following function is called after some parameter settings:
seg_point = seg.multi_segmentation("dialog4.wav",sr,frame_size,frame_shift,plot_seg=False,save_seg=True)
To save the segmented audio as wav files, set the flag:
save_seg=True
To plot the time-domain waveform with the segmentation lines overlaid, set the flag:
plot_seg=True
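To illustrate what saving the segments involves, here is a minimal sketch that cuts a wav file at a list of segmentation points and writes one file per segment. It is not the toolkit's implementation: the function name `split_wav`, the `out_prefix` parameter, and the assumption that the segmentation points are given in seconds are all hypothetical.

```python
import wave


def split_wav(path, seg_points_sec, out_prefix="segment"):
    """Cut a PCM wav file at the given times (in seconds); return the new file paths."""
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        n_frames = src.getnframes()
        # Convert times to sample indices, drop duplicates and out-of-range cuts.
        cuts = sorted({int(round(t * rate)) for t in seg_points_sec})
        bounds = [0] + [c for c in cuts if 0 < c < n_frames] + [n_frames]
        for i, (start, end) in enumerate(zip(bounds[:-1], bounds[1:])):
            src.setpos(start)
            frames = src.readframes(end - start)
            out_path = "{}_{}.wav".format(out_prefix, i)
            with wave.open(out_path, "wb") as dst:
                # The frame count in the header is corrected when the file is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
    return out_paths
```

With two segmentation points, for example, this writes three files: one before the first point, one between the points, and one after the second.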
-
A new parameter enables clustering of the segmented audio fragments with the Kmeans method. To use it, set the flag:
classify_seg=True
To determine the number of clusters, I plot a figure whose X axis is the number of clusters and whose Y axis is the "Sum of squared distances of samples to their closest cluster center" for each Kmeans clustering. Choose the best K value according to the Elbow Criterion:
From the figure shown above, I choose K = 2 as the best number of clusters:
Please input the best K value: 2
The lables for 4 speech segmentation belongs to the clusters below:
0 1 0 1
Listening to the audio files stored in the folder "save_audio" confirms that the clustering result is correct.
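The quantity plotted for the Elbow Criterion can be sketched as follows. This is a plain-Python stand-in for a Kmeans library, not the toolkit's code: the function names, the random initialization, and the toy 2-D points are assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_inertia(points, k, n_iter=50, seed=0):
    """Run basic Lloyd's k-means and return the sum of squared distances
    of the points to their closest cluster center."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[idx].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centers.append(center)  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    # Two well-separated toy groups; the curve should bend sharply at K = 2.
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    for k in range(1, 5):
        print(k, kmeans_inertia(pts, k))
```

Plotting these values against K gives the curve described above; the "elbow" is the K after which the value stops dropping sharply.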
-
The interface in 3 is changed to specify the clustering method you choose. The supported methods are now "Kmeans" and "BIC distance". The clustering method based on "BIC distance" is inspired by the reference article.
I also use a longer audio file to test the new clustering method; there are 7 segments in total in "duihua_sample.wav". The final clustering result is shown below:
There are total 2 clusters and they are listed below:
cluster 0 : ['1', '3', '5']
cluster 1 : ['0', '2', '4', '6']
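As a rough illustration of a BIC-based distance between two segments, the sketch below computes a ΔBIC value for modelling two sets of feature frames with two Gaussians versus one. It is a simplified variant, not the toolkit's code: it assumes diagonal covariances (so the penalty counts 2d parameters), whereas the referenced article uses full covariance matrices. The function names and the penalty weight `lam` are hypothetical.

```python
import math


def log_det_diag(frames):
    """Log-determinant of the diagonal covariance of a list of feature frames."""
    n = len(frames)
    dim = len(frames[0])
    total = 0.0
    for d in range(dim):
        column = [f[d] for f in frames]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        total += math.log(var + 1e-10)  # small floor avoids log(0)
    return total


def delta_bic(seg_a, seg_b, lam=1.0):
    """Delta-BIC of two segments (lists of equal-length feature frames).

    A positive value favours two separate Gaussians (different sources);
    a negative value favours merging the segments into one cluster.
    """
    n_a, n_b = len(seg_a), len(seg_b)
    n = n_a + n_b
    dim = len(seg_a[0])
    gain = 0.5 * (
        n * log_det_diag(seg_a + seg_b)
        - n_a * log_det_diag(seg_a)
        - n_b * log_det_diag(seg_b)
    )
    # Extra parameters of the second Gaussian: d means + d variances.
    penalty = 0.5 * lam * (2 * dim) * math.log(n)
    return gain - penalty
```

Segments whose pairwise value is negative are candidates for the same cluster, which is the intuition behind grouping segments by BIC distance.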
-
Train the network
python train_bilstm_model.py
-
Predict the segmentation points
python multi_detect_Nerual.py
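For a network that outputs one change score per frame, a common way to turn the scores into segmentation points is to keep the local maxima that exceed a threshold. The sketch below shows that post-processing step only; the function name, the threshold, and the frame shift value are assumptions and may differ from what the prediction script actually does.

```python
def pick_change_points(scores, threshold=0.5, frame_shift=0.01):
    """Return the times (in seconds) of local maxima in `scores` above `threshold`.

    `scores` holds one change score per frame; `frame_shift` is the hop
    between consecutive frames in seconds.
    """
    points = []
    for i in range(1, len(scores) - 1):
        is_peak = scores[i] > scores[i - 1] and scores[i] >= scores[i + 1]
        if is_peak and scores[i] > threshold:
            points.append(i * frame_shift)
    return points
```

Raising the threshold yields fewer, more confident segmentation points; lowering it yields more points at the cost of false alarms.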
Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion, by IBM T.J. Watson Research Center
Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks, by Ruiqing Yin, Herve Bredin, Claude Barras