Use Speech Recognition to Transcribe Oral Argument Audio #440
I'm told by a friend that Trint might be an option to look at. Looks more integrated than we probably want, though.
FWIW, I tested the quality of court audio transcription (using Virginia state court audio), and posted my conclusions here. Speechmatics offered the best bang for the buck.
Really good to know where the quality is, @waldoj. Their prices are crazy, though. Our 7,500 hours of content at the price you mentioned (6¢/minute) comes to $27k. We'd need some sort of non-profit agreement...
...or money.
I sent a request to Google Cloud Speech for beta access, as it would probably be the most accurate with their natural language processing system. Unfortunately, each audio clip can be a maximum of 2 minutes long.
This project looks promising for long-term viability: Google has open-sourced the software for neural networks, and Facebook has open-sourced the hardware.
@djeraseit We've got beta access to Google Cloud Speech and we're playing with it. It seems to work, but the quality is actually very low. Right now we're attempting to process as much audio as possible before Google makes it a paid service. The max of 2 minutes was also lifted, btw.
IBM Watson has speech to text API. First 1,000 minutes per month are free. http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/speech-to-text.html |
Yep, this is in my original ticket, above:
Big news here. We've signed a contract with Google, and they are giving us credits to transcribe all our existing oral argument audio recordings using their Speech to Text API. They will also be giving us credits going forward to do this on an ongoing basis. We'll be announcing this properly once we've wrapped up this bug, but for now I want to get to work building this feature. I've outlined the process below, but could really use help getting it done, as I'm stretched quite thin:
Yay! Are you worried about the transcript quality? (Or is that question mooted by the fact that you can get a transcript with Google, so comparing it to a hypothetical more expensive system is an exercise in futility?)
My concerns are mostly mooted by getting transcripts from Google in the first place. But also, the other pretty awesome part of this deal is what's in it for Google: We're giving them all of our audio as a zip (it's like 700GB or something), and they're going to use that for their machine learning, because they need big data sources with multiple speakers. So, I'm hopeful that they'll use our data to train their system, and boom, quality will be really good. Anyway, even if that weren't true, this is an amazing offer. Everything I've heard is that they've got the best quality speech-to-text of anybody.
Some big progress here that will be landing as soon as tests complete successfully:
There's still more to do here, but this is the bulk of it. The missing pieces are:
Honestly... I'm not sure! I haven't looked at this in a while, but there's a lot of code for doing speech-to-text. I think where it wound up was that the quality wasn't good enough for a human to read, but it probably would be good enough for alerts and search. I think we still have credits with Google to convert stuff, so if you wanted to pick this up and run with it, that could be pretty great.
FWIW, I just yanked a bunch of this code in 293b3b3. That's not to say we shouldn't do this, but I'm on a code-deleting kick and I'm deleting any code that has lingered without use.
Hi @mlissner, got it. I'm thinking that trying to get really good transcripts is beyond the scope of my NLP knowledge at this point, but it would be great to come back to this later on as I work on that!
Man oh man, I haven't updated this ticket in a while. This is now quite feasible, and at high quality, by using Whisper. This would be a great project for a volunteer to run with.
I love it when enough time passes that a long-standing open issue goes from implausible to highly plausible.
FWIW, if the files are stereo, you'd want to drop them to mono. And you might also experiment with reducing the sample rate. As low as 8 kHz may get equally good results. (I'm sorry if I'm just driving by and dropping in advice that's quite obvious to you!)
Oh, that's fantastic. I'm not an audiophile or well informed on audio formats, but reducing to mono 8 kHz would probably get everything under the magic 25 MB threshold.
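Since the 25 MB cap is on file size rather than duration, a quick back-of-envelope calculation (my own numbers, not from the thread) shows what average bitrate a re-encoded file would need to fit under it:

```python
# Back-of-envelope check: what average bitrate keeps a recording under a
# 25 MB upload limit? The 25 MB figure comes from the discussion above;
# everything else here is illustrative arithmetic.

LIMIT_BYTES = 25 * 1024 * 1024  # the 25 MB API upload cap

def max_bitrate_kbps(duration_seconds: float) -> float:
    """Highest average bitrate (kbps) that still fits under the cap."""
    return LIMIT_BYTES * 8 / duration_seconds / 1000

one_hour = max_bitrate_kbps(3600)        # a typical argument length
five_hours = max_bitrate_kbps(5 * 3600)  # an unusually long recording

print(f"1 hour: {one_hour:.1f} kbps; 5 hours: {five_hours:.1f} kbps")
```

So a one-hour file has roughly 58 kbps to work with, which low-bitrate mono encodings can comfortably hit, while a five-hour file would need under ~12 kbps, i.e. very aggressive compression.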
On the OpenAI forums there are comments about doing that, and from the feedback, it seems to work.
I haven't tried it myself yet, but @flooie sent me a 5-hour, 20 MB file that returned an InternalServerError from the API, repeatedly. Maybe there was a problem with the re-encoding, or maybe there is an undocumented audio length limit?
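For intermittent server errors like that InternalServerError, a retry wrapper with exponential backoff is a common mitigation. Here is a minimal, self-contained sketch; the function names are hypothetical, not CourtListener's actual code, and a real client would also distinguish transient from permanent failures:

```python
import time

def with_retries(fn, attempts=4, base_delay=1.0, retryable=(RuntimeError,), sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on retryable errors.

    Generic sketch: if the error is deterministic (e.g. an undocumented
    length limit), retries won't help, and the caller should give up and
    re-encode or split the file instead.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Demo: a fake transcription call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("InternalServerError")
    return "transcript text"

result = with_retries(flaky, sleep=lambda s: None)  # skip real sleeping in the demo
print(result, calls["n"])
```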
Some datapoints from testing a draft of the transcription command (#4102): Using the case name as the prompt helps get names right.
With case name:
Response time is fast, in a direct linear relationship with the audio duration. As a caveat, these were sequential requests. I did some manual testing, I think around ~30 requests. I also tested the command by running it against 97 audio files; however, we got billed for 319 API requests. I don't know the reason for the difference.
I went and looked. No idea either, but we should keep an eye on this and make sure we understand it before we run up a 3× larger bill than we expect.
That's a great observation on priming the transcription with the case name! That would seem to indicate that it would be helpful to include in the prompt any metadata that might appear within the audio.
Maybe a bunch of legalese, but I can't think of a good list? I guess we'll have to see what's bad and plug holes as we see them.
I was only thinking in terms of the name of the judge, the names of the lawyers, that kind of thing, but that's a good point! If there were any specialized legal terms that transcription consistently got wrong, the prompt could prime for them.
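To make the priming idea concrete, here is a hypothetical sketch of assembling case metadata into a Whisper prompt. The helper name, field names, and example values are my own illustrations, not CourtListener code; the commented-out call at the end shows where such a prompt would plug into the OpenAI client:

```python
def build_prompt(case_name, judges=(), attorneys=(), terms=()):
    """Assemble a transcription prompt from case metadata.

    Whisper's `prompt` parameter biases the decoder toward the names and
    vocabulary it contains, which is why priming with the case name (and
    potentially judges, counsel, and legal terms) improves accuracy.
    """
    parts = [case_name]
    if judges:
        parts.append("Before: " + ", ".join(judges))
    if attorneys:
        parts.append("Counsel: " + ", ".join(attorneys))
    if terms:
        parts.append(" ".join(terms))
    return ". ".join(parts)

prompt = build_prompt(
    "Oracle America, Inc. v. Google LLC",   # example values only
    judges=["Judge O'Malley"],
    terms=["certiorari", "en banc"],
)
print(prompt)

# The prompt would then accompany the audio upload, e.g.:
#   client.audio.transcriptions.create(model="whisper-1", file=f, prompt=prompt)
```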
What about using the final opinion text to prime the audio?
We won't have that when we get recordings in the future, but even at this point we don't have them linked (though we should). V2, perhaps!
You know what is posted with oral arguments, though? Briefs.
Just to follow up here: We've generated transcripts for nearly every oral argument file in CL. We are doing some cleanup for:
OK, this is getting closer to done:
A few remaining things:
What else?
We currently have about 7,500 hours of oral argument audio without transcriptions. We need to go through these audio files and run a speech-to-text tool on them. This would have massive benefits:
The research in this area seems to be taking a few different paths, from what I've gathered. The tech industry mostly needs this for when people are talking to Siri, so most of the research is about making it better at hearing little phrases rather than complex transcriptions like we have.
The other split is between cloud-based APIs and software that you can install. Cloud-based APIs have the best quality and tend to be fairly turnkey. OTOH, installable software can be tuned to the corpus we have (legal audio), and doesn't have API limits or costs associated with it.
The good news seems to be that unified APIs seem to be bubbling to the surface. For example, here's a Python library that lets you use:
Pretty solid number of choices in a single library. On top of these, there are a few other providers that only do speech recognition, and even YouTube (where @brianwc now works) does captions on videos. We've talked to a few of the speech-to-text startups, but none have had any interest in helping out a non-profit. Start-ups, am I right?
Anyway, there's clearly a lot to do here. An MVP might be to figure out the API limits and start pushing to the cloud using as many of the APIs as needed, though that probably brings a lot of complexity and variance in quality. Even using IBM's free tool, we could knock out our current collection in about eight or nine months. More comments on this over on Hacker News, too.
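That MVP (spreading audio across several quota-capped cloud APIs at once) could be sketched as a simple quota-aware dispatcher. Provider names and quotas below are made up for illustration; real code would replace the bookkeeping with actual API calls:

```python
def dispatch(files, backends):
    """Assign files across backends, respecting per-backend quotas.

    Toy sketch of the multi-API MVP: `files` is a list of
    (filename, duration_minutes) pairs, and `backends` maps a
    (hypothetical) provider name to its remaining quota in minutes.
    Each file goes to the eligible backend with the most quota left.
    """
    assignments = {name: [] for name in backends}
    remaining = dict(backends)
    for fname, minutes in files:
        candidates = [n for n, q in remaining.items() if q >= minutes]
        if not candidates:
            raise RuntimeError(f"no backend has quota left for {fname}")
        best = max(candidates, key=lambda n: remaining[n])
        assignments[best].append(fname)
        remaining[best] -= minutes
    return assignments

files = [("a.mp3", 30), ("b.mp3", 90), ("c.mp3", 45)]
quotas = {"provider_x": 100, "provider_y": 120}  # hypothetical monthly minutes
print(dispatch(files, quotas))
```

A real version would also need per-provider rate limiting and retry handling, which is where the "complexity and variance in quality" mentioned above comes in.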
PS: I swear there used to be a bug for this, but I can't find it, so I'm loading this one with keywords like transcription, audio, oral arguments, transcribe, recognition...