Findings
Google has released a new version of the API that relaxes the 80MB file limit. I intend to reprocess the files that exceeded this limit as well as others that have accumulated since the previous run.
The following results were obtained over a corpus of 7343 audio files in WAV, MP3, and M4A formats. A few hundred were recorded with a laptop microphone in a quiet setting, but most were recorded in noisy environments with a mobile audio recorder.
- IBM Transcribed/Processed 7281/7288 (99.9%)
- Google Transcribed/Processed 3521/7343 (48.0%)
- Average number of transcript words per minute of audio, IBM : 102.0
- Average number of transcript words per minute of audio, Google : 9.8
- Processing seconds per minute of audio : about the same for both services
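For reference, the percentages and the IBM-to-Google ratio can be reproduced from the counts listed above. This is only a minimal sketch of the arithmetic; the function names `success_rate` and `words_per_minute` are hypothetical and are not taken from the project's code.

```python
def success_rate(transcribed: int, processed: int) -> float:
    """Percentage of processed files for which a transcript came back."""
    return 100.0 * transcribed / processed


def words_per_minute(word_count: int, duration_seconds: float) -> float:
    """Transcript words per minute of audio for one file."""
    return word_count / (duration_seconds / 60.0)


# Counts as reported in the list above.
ibm_rate = success_rate(7281, 7288)
google_rate = success_rate(3521, 7343)

# Ratio of the reported per-minute averages (IBM 102.0, Google 9.8).
ratio = 102.0 / 9.8

print(f"IBM {ibm_rate:.1f}%  Google {google_rate:.1f}%  ratio {ratio:.1f}x")
```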
First, note that Google returned a transcript for only 48% of the files, whereas IBM returned one for almost all of them. What explains this? It turned out that Google was unable to process many of the audio files because they exceeded a size quota of ~80MB.
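One way to see how much of the gap comes from the size quota is to split the corpus by file size before submitting anything. The sketch below assumes the quota is exactly 80 * 1000 * 1000 bytes, which is a guess at what "~80MB" means; the helper name `split_by_size` is hypothetical.

```python
import os
from typing import Iterable, List, Tuple

# Assumption: "~80MB" is taken as 80 * 10**6 bytes; the real quota may differ.
ASSUMED_LIMIT_BYTES = 80 * 1000 * 1000


def split_by_size(
    paths: Iterable[str], limit_bytes: int = ASSUMED_LIMIT_BYTES
) -> Tuple[List[str], List[str]]:
    """Return (within_limit, over_limit) lists of file paths."""
    within, over = [], []
    for path in paths:
        # os.path.getsize reports the file size in bytes.
        if os.path.getsize(path) > limit_bytes:
            over.append(path)
        else:
            within.append(path)
    return within, over
```

Files in the second list would be expected to fail on size alone, so any remaining failures in the first list need a different explanation.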
Google also failed to glean any transcript words from many other small audio files from which IBM was able to glean something, albeit possibly with many errors.
The Google Cloud SDK version for the results reported above was 138.0.0. I have since updated the Cloud SDK to version 143.0.0, and the upgrade did not make an appreciable difference.
Second, note that the average number of transcript words per minute of audio returned by IBM is about 10x that returned by Google. Of course, it could be that IBM is returning loads of gibberish while Google is biding its time until it is absolutely sure, and then nailing it spot on. However, my spot checks don't support this explanation.
IBM Watson has an option to explicitly use its narrowband model for low bit rate recordings. The narrowband model is much more robust at gleaning something usable from noisy recordings than its broadband model, although the broadband model gives better accuracy for higher quality recordings. My code tries the broadband model first, falling back to the narrowband model as necessary. Unfortunately, Google's API doesn't provide an option for handling low bit rate, noisy audio.
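The broadband-first strategy might look roughly like the sketch below. It is not the project's actual code. The `transcribe` callable is a hypothetical stand-in for whatever function submits a file to Watson with a given model name, the model identifiers are assumed to be the US English ones, and the fallback trigger (an exception or an empty transcript) is also an assumption, since the text above only says "as necessary".

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumed model identifiers (US English); adjust for other languages.
DEFAULT_MODELS = ("en-US_BroadbandModel", "en-US_NarrowbandModel")


def transcribe_with_fallback(
    audio_path: str,
    transcribe: Callable[[str, str], str],
    models: Sequence[str] = DEFAULT_MODELS,
) -> Tuple[Optional[str], str]:
    """Try each model in order and return (model, transcript) for the
    first one that yields non-empty text, or (None, "") if none does."""
    for model in models:
        try:
            text = transcribe(audio_path, model)
        except Exception:
            # Assumption: any failure means "try the next model".
            continue
        if text and text.strip():
            return model, text
    return None, ""
```

Passing the transcription function in as a parameter keeps the fallback logic independent of any particular client library, so it can be exercised with a stub before touching the real service.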