- Accuracy instead of lexical error rate
- Decoding real-time audio data
- Try using some kind of feature extraction. Real-valued DFT looks promising, since it can radically decrease the amount of training data (the signal is very narrow band, only a few DFT buckets contain most of the energy).
- Train a network to identify multiple signals in audio data and mark them with their approximate frequencies.
- ???
- Profit.