Comparison of transcript word count
Marked differences between IBM Watson and Google transcription emerge when comparing success rates and the number of words generated on audio collected in the wild, much of which had high levels of ambient noise.
The following comparisons were made over 8,415 audio files that were submitted to each service. Of those, IBM Watson generated transcripts for 7,227, while Google generated transcripts for 3,521.
Google's 3,521 transcripts contain a total of 485,334 words, an average of 137 words per transcript. IBM Watson's 7,227 transcripts contain 9,511,743 words, an average of 1,316 words per transcript. [TODO: update with April returns]
Many of Google's missing transcripts were simply due to the file size exceeding its quota. However, Google also failed to generate any transcript for many files that were within the size quota, and it produced much lower word counts per transcript for noisy or low-bit-rate recordings.
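To catch the quota failures up front, files can be screened by size before submission. A minimal sketch, assuming a ~10 MB request limit and FLAC files in a local directory (both the limit and the layout are illustrative assumptions, not the service's documented values):

```python
# Pre-flight check: flag audio files whose size exceeds the assumed
# per-request quota before submitting them for transcription.
from pathlib import Path

SYNC_QUOTA_BYTES = 10 * 1024 * 1024  # assumed ~10 MB limit; verify per service

def partition_by_quota(audio_dir):
    """Split audio files into those under and over the size quota."""
    under, over = [], []
    for path in Path(audio_dir).glob("*.flac"):
        (under if path.stat().st_size <= SYNC_QUOTA_BYTES else over).append(path)
    return under, over

under, over = partition_by_quota("audio/")  # hypothetical directory
print(f"{len(over)} of {len(under) + len(over)} files exceed the quota")
```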
One way to illustrate this is to examine the word count deciles over the transcripts that were successfully generated. The following table gives these deciles for each service.
API | min | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | max |
---|---|---|---|---|---|---|---|---|---|---|---|
Google | 1 | 1 | 2 | 3 | 5 | 8 | 12 | 19 | 58 | 459 | 4892 |
IBM | 1 | 278 | 501 | 698 | 916 | 1137 | 1409 | 1722 | 2080 | 2450 | 8490 |
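The deciles above were computed from per-transcript word counts. A minimal sketch of the computation, assuming each service's transcripts are plain-text files in a local directory (the layout and extension are illustrative) and Python 3.8+ for `statistics.quantiles`:

```python
# Illustrative sketch: word-count deciles over a directory of transcripts.
from pathlib import Path
import statistics

def word_count_deciles(transcript_dir):
    """Return (min, [d1..d9], max) of per-transcript word counts."""
    counts = sorted(len(p.read_text().split())
                    for p in Path(transcript_dir).glob("*.txt"))
    cuts = statistics.quantiles(counts, n=10)  # the 9 decile cut points
    return counts[0], cuts, counts[-1]

lo, cuts, hi = word_count_deciles("transcripts/ibm")  # hypothetical path
print(lo, [round(c) for c in cuts], hi)
```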
Currently IBM is winning at the task of extracting usable transcripts from my admittedly noisy data. I think I now have enough data to proceed to the next steps, chief among them trend analysis.
Google recently announced that its speech recognition technology has achieved a 4.9 percent word error rate. However, if you examine the dates, the accuracy of the cloud API that I used for my results probably ranges between 6.1% and 8.5%, depending on how aggressively the company rolls out its best algorithms to the cloud, and on whether the published result is for speaker-independent or trained-speaker data. I was not consistently getting anywhere near this level of accuracy even on my reference audio, which suggests that I have more tuning to do. It may also mean that I have better accuracy to look forward to in the near future, as better algorithms are rolled out to the cloud.
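For context, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis transcript and a reference transcript, divided by the reference length. Here is a minimal, illustrative sketch of how I could measure it on reference audio whose true transcript is known:

```python
# Word error rate: (substitutions + deletions + insertions) / reference length,
# computed as Levenshtein distance over word tokens.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Two words dropped from a six-word reference -> WER of 2/6.
print(word_error_rate("the cat sat on the mat", "the cat sat mat"))
```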
I am committed to progressing on this task, and I am optimistic that I'll eventually be able to extract meaningful transcripts from my data, usable for trend analysis and indexing. This is a big improvement over where the project was just a year ago, so I am especially pleased with where things stand thus far.