Neumair Günther edited this page Jan 2, 2019 · 6 revisions

How it works

The Feature extractor

The Feature extractor turns PCM-encoded audio data into 8-bit mel-spectrogram features. You might wonder why the feature extractor is a separate component and why it uses only 8 bits. First, some applications, such as verifying that a hotword was spoken by a certain speaker, require two models to run on the same audio data. Keeping the feature extractor as a separate entity avoids computing the mel features twice. Second, it is a convenient way of compressing and transmitting data. One second of audio contains 40x98 mel features. You can capture audio on a lightweight system (such as an ESP32) and transmit the features to a more powerful system. This requires only 40 x 98 x 8 bit ≈ 31 kbit per second (about 3.9 kB/s), compared with 256 kbit per second for the raw 16 kHz, 16-bit audio. Quantizing to 8 bits loses almost no audible information.
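The bandwidth figures can be checked with a few lines of arithmetic. This is only a sketch: the constants come from this page (40x98 features per second at 8 bits, and the 16 kHz, 16-bit mono input described under Audio Input format), while the variable names are made up for illustration.

```python
# Figures taken from this page; the names are illustrative only.
MEL_BANDS = 40           # mel coefficients per feature frame
FRAMES_PER_SECOND = 98   # feature frames in one second of audio
BITS_PER_FEATURE = 8

SAMPLE_RATE_HZ = 16_000  # raw input: 16 kHz, 16-bit, one channel
BITS_PER_SAMPLE = 16

feature_bits_per_second = MEL_BANDS * FRAMES_PER_SECOND * BITS_PER_FEATURE
pcm_bits_per_second = SAMPLE_RATE_HZ * BITS_PER_SAMPLE

print(feature_bits_per_second)       # 31360 bits, roughly 31 kbit
print(feature_bits_per_second // 8)  # 3920 bytes, roughly 3.9 kB
print(pcm_bits_per_second)           # 256000 bits for the raw audio
print(round(pcm_bits_per_second / feature_bits_per_second, 1))  # 8.2
```

Under these numbers, sending features instead of raw PCM cuts the data rate by a factor of about eight.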

AudioRecognition

The audio recognition module detects audio events. Depending on the model, this can be a hotword, a command, or any other audio event. Currently, all models evaluate a 1-second sliding window (40x98 mel features) that advances in 200 ms steps, so 5 predictions are made per second. The recognition module returns nothing if the audio does not match a known event, and an index if a known event occurred.
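The windowing scheme can be illustrated with a small sketch. It is not the module's implementation: it slides over raw samples rather than mel features, `sliding_windows` and `toy_detector` are hypothetical names, the detector is a trivial loudness check standing in for a real model, and Python's `None` is assumed as the "nothing" return value.

```python
from collections import deque

SAMPLE_RATE_HZ = 16_000
WINDOW_SAMPLES = SAMPLE_RATE_HZ      # 1-second window
STEP_SAMPLES = SAMPLE_RATE_HZ // 5   # 200 ms step


def sliding_windows(samples):
    """Yield 1-second windows that advance 200 ms at a time."""
    window = deque(maxlen=WINDOW_SAMPLES)
    for start in range(0, len(samples) - STEP_SAMPLES + 1, STEP_SAMPLES):
        window.extend(samples[start:start + STEP_SAMPLES])
        if len(window) == WINDOW_SAMPLES:  # skip until one full second is buffered
            yield list(window)


def toy_detector(window, threshold=1000):
    """Hypothetical stand-in for a model: index 0 if loud, otherwise None."""
    return 0 if max(abs(s) for s in window) >= threshold else None


# Three seconds of silence with a short burst in the final 200 ms.
audio = [0] * (3 * SAMPLE_RATE_HZ)
audio[-100:] = [5000] * 100

results = [toy_detector(w) for w in sliding_windows(audio)]
print(len(results))  # 11: one when the first second is full, then 5 per second
print(results[-1])   # 0: only the last window contains the burst
```

Because consecutive windows overlap by 800 ms, a short event is usually visible in several successive predictions, not just one.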

Audio Input format

The feature extractor works on single-channel, 16-bit signed integer PCM data at a sample rate of 16 kHz. It expects input frames 200 ms long, that is, 3200 samples or 6400 bytes per frame.
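As a sketch of what one input frame looks like in code, the snippet below derives the frame size and decodes a raw byte buffer into signed 16-bit samples. The helper name `decode_frame` is hypothetical, and little-endian byte order is an assumption, since this page does not state the byte order.

```python
import struct

SAMPLE_RATE_HZ = 16_000
FRAME_MS = 200
BYTES_PER_SAMPLE = 2  # 16-bit samples

FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 3200 samples
FRAME_BYTES = FRAME_SAMPLES * BYTES_PER_SAMPLE     # 6400 bytes


def decode_frame(raw):
    """Decode one 200 ms frame of mono PCM into a tuple of signed ints.

    Assumes little-endian sample order ("<" in the struct format).
    """
    if len(raw) != FRAME_BYTES:
        raise ValueError(f"expected {FRAME_BYTES} bytes, got {len(raw)}")
    return struct.unpack(f"<{FRAME_SAMPLES}h", raw)


# Build a synthetic frame that alternates between -1 and 1.
raw_frame = struct.pack(f"<{FRAME_SAMPLES}h", *([-1, 1] * (FRAME_SAMPLES // 2)))
samples = decode_frame(raw_frame)
print(len(samples))  # 3200
print(samples[:2])   # (-1, 1)
```

Rejecting frames of the wrong length early makes buffer-size mistakes visible before they reach the feature extractor.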

Capturing Microphone Input

In Python, two implementations are available. cross_record.py uses pyaudio and works on multiple platforms. record.py uses arecord and works only on Linux. The arecord version causes less trouble with buffer under/overflows under heavy CPU load and should be used where possible.
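For orientation, an arecord-based capture loop might look roughly like the sketch below. It does not reproduce record.py: the names `ARECORD_CMD`, `read_frames`, and `capture_frames` are hypothetical, and the flags are simply the standard arecord options for raw 16 kHz, 16-bit, mono output on standard output.

```python
import io
import subprocess

FRAME_BYTES = 6400  # 200 ms of 16 kHz, 16-bit mono audio

# Quiet mode, headerless raw output, signed 16-bit little-endian, 16 kHz, mono.
ARECORD_CMD = ["arecord", "-q", "-t", "raw", "-f", "S16_LE", "-r", "16000", "-c", "1"]


def read_frames(stream, frame_bytes=FRAME_BYTES):
    """Yield fixed-size frames from a binary stream until it runs dry."""
    while True:
        frame = stream.read(frame_bytes)
        if len(frame) < frame_bytes:
            return  # end of stream: drop the incomplete tail
        yield frame


def capture_frames():
    """Start arecord (Linux only) and yield 200 ms frames from its output."""
    with subprocess.Popen(ARECORD_CMD, stdout=subprocess.PIPE) as proc:
        try:
            yield from read_frames(proc.stdout)
        finally:
            proc.terminate()


# Offline check with an in-memory stream instead of a microphone.
fake_stream = io.BytesIO(bytes(FRAME_BYTES * 2 + 100))
print(sum(1 for _ in read_frames(fake_stream)))  # 2
```

Keeping `read_frames` independent of the process handling means the framing logic can be exercised with any binary stream, including a recorded file.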
