Use Speech Recognition to Transcribe Oral Argument Audio #440

Open
mlissner opened this issue Feb 25, 2016 · 51 comments

@mlissner
Member

We currently have about 7500 hours of oral argument audio without transcriptions. We need to go through these audio files and run a speech-to-text tool on them. This would have massive benefits:

  • Alerts based on things said in a court!
  • Transcription search
  • Written transcriptions

From what I've gathered, the research in this area is taking a few different paths. The tech industry mostly needs this for when people talk to Siri, so most of the research focuses on hearing short phrases rather than long, complex transcriptions like ours.

The other split is between cloud-based APIs and software that you can install. Cloud-based APIs have the best quality and tend to be fairly turnkey. OTOH, installable software can be tuned to the corpus we have (legal audio), and doesn't have API limits or costs associated with it.

The good news is that unified APIs seem to be bubbling to the surface. For example, here's a Python library that lets you use all of the following (a quick sketch follows the list):

  • CMU Sphinx (an installable)
  • Google Speech Recognition
  • Wit.ai (something that Facebook apparently now owns)
  • IBM Speech to Text (1000 hours/month free, I think)
  • AT&T Speech to Text
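
If that Python library is the SpeechRecognition package (a guess based on the engine list; treat the exact calls as assumptions), a minimal sketch of running one file through two of the engines:

    # Minimal sketch, assuming the library above is the Python
    # "SpeechRecognition" package (pip install SpeechRecognition).
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("oral_argument.wav") as source:  # hypothetical file
        audio = recognizer.record(source)

    # Installable engine (CMU Sphinx, via pocketsphinx):
    print(recognizer.recognize_sphinx(audio))
    # Cloud engine (Google Speech Recognition):
    print(recognizer.recognize_google(audio))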

Pretty solid number of choices in a single library. On top of these, there are a few other providers that only do speech recognition, and even YouTube (where @brianwc now works) does captions on videos. We've talked to a few of the speech-to-text startups, but none have had any interest in helping out a non-profit. Start-ups, am I right?

Anyway, there's clearly a lot to do here. An MVP might be to figure out the API limits and start pushing to the cloud using as many of the APIs as needed, though that probably brings a lot of complexity and variance in quality. Even using IBM's free tool, we could knock out our current collection in about eight or nine months. More comments on this over on Hacker News too.

PS: I swear there used to be a bug for this, but I can't find it, so I'm loading this one with keywords like transcription, audio, oral arguments, transcribe, recognition...

@mlissner
Member Author

I'm told by a friend that Trint might be an option to look at. Looks more integrated than we probably want though.

@waldoj

waldoj commented Feb 25, 2016

FWIW, I tested the quality of court audio transcription (using Virginia state court audio), and posted my conclusions here. Speechmatics offered the best bang for the buck.

@mlissner
Member Author

Really good to know where the quality is, @waldoj. Their prices are crazy though. Our 7500 hours of content at the price you mentioned (6¢/minute) comes to $27k. We'd need some sort of non-profit agreement...

@mlissner
Member Author

...or money.

@djeraseit

I sent a request to Google Cloud Speech for beta access, as it would probably be the most accurate given their natural language processing system. Unfortunately, each audio clip can be a maximum of 2 minutes long.

@djeraseit

This project looks promising for long-term viability:
https://github.com/pannous/tensorflow-speech-recognition

Google has open-sourced the neural network software, and Facebook has open-sourced the hardware.

@mlissner
Member Author

@djeraseit We've got beta access to Google Cloud Speech and we're playing with it. It seems to work, but the quality is actually very low. Right now we're attempting to process as much audio as possible before Google makes it a pay service.

The max of 2 minutes was also lifted, btw.

@djeraseit

IBM Watson has speech to text API. First 1,000 minutes per month are free.

http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/speech-to-text.html

@mlissner
Member Author

Yep, this is in my original ticket, above:

IBM Speech to Text (1000 hours/month free, I think)

@mlissner
Member Author

mlissner commented Sep 6, 2016

Big news here. We've signed a contract with Google, and they are giving us credits to transcribe all our existing oral argument audio recordings using their Speech to Text API. They will also be giving us credits going forward to do this on an ongoing basis.

We'll be announcing this properly once we've wrapped up this bug, but for now I want to get to work building this feature. I've outlined the process below, but could really use help getting it done, as I'm stretched quite thin:

  • Add a new field for the transcripts in the database. Name it transcript_google or similar, so we know where the field's value came from.
  • Add a new search field to the Audio search endpoint.
    • Make this work for oral argument alerts (should happen automatically)
  • Figure out how to use the speech-to-text API to create celery tasks that send the correct audio files to the API.
    • Limit of 80 minutes per track.
    • Hard limit of 1.72 million seconds per day, and a rate limit of 25 seconds of uploaded audio per second in the day, which works out to 2.16 million seconds per day. Therefore, we'll hit the daily cap before the rate cap. We currently have 50,540,000 seconds of audio, so this will take about 30 days.
    • Files must be formatted as LINEAR16. Sample rates can be between 8,000 Hz and 48,000 Hz, with 16 kHz recommended (but do not resample). (A sketch of the re-encoding step follows this list.)
    • Poll for results (open question: how long are results available?)
    • Files must be in the Google cloud (cannot be uploaded as base64 or with a public URL)
    • Phrases can be provided, but only 500 per request, 10,000 chars per request, and 100 chars per phrase.
  • Send all existing data to Google to generate transcripts.
  • Update the scraper to transcribe any new content that comes in.
  • Expose the transcripts on the oral argument pages (but probably hide it behind a button/warning: "Transcripts automatically generated by Google and will have errors or be difficult to read.").
  • Write blog post, announce it, etc.
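
As a sketch of the re-encoding step (pcm_s16le is ffmpeg's 16-bit PCM codec, which is what LINEAR16 means; the filenames and the subprocess approach are assumptions):

    # Sketch: re-encode an MP3 as LINEAR16 (16-bit little-endian PCM in
    # a WAV container) without resampling. Filenames are hypothetical.
    import subprocess

    subprocess.run([
        "ffmpeg", "-i", "argument.mp3",
        "-ac", "1",           # downmix to mono (assumed; typical for speech APIs)
        "-c:a", "pcm_s16le",  # LINEAR16; no -ar flag, so no resampling
        "argument.wav",
    ], check=True)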

@waldoj

waldoj commented Sep 7, 2016

Yay! Are you worried about the transcript quality? (Or is that question mooted by the fact that you can get a transcript with Google, so comparing it to a hypothetical more expensive system is an exercise in futility?)

@mlissner
Member Author

mlissner commented Sep 7, 2016

My concerns are mostly mooted by getting transcripts from Google in the first place. But also, the other pretty awesome part of this deal is what's in it for Google: We're giving them all of our audio as a zip (it's like 700GB or something), and they're going to use that for their machine learning, because they need big data sources with multiple speakers. So, I'm hopeful that they'll use our data to train their system, and boom, quality will be really good.

Anyway, even if that weren't true, this is an amazing offer. Everything I've heard is that they've got the best quality speech to text of anybody.

@mlissner mlissner self-assigned this Oct 24, 2016
@mlissner
Member Author

mlissner commented Oct 29, 2016

Some big progress here that will be landing as soon as tests complete successfully:

  1. Our database models are updated.
  2. I've set this up to run as a celery "chain". This means that things will be processed by a series of async tasks (a rough sketch follows this list):
    • The first task will re-encode the file in LINEAR16 format, then upload it to Google Storage.
    • The second task will request that Google do speech-to-text on the uploaded file.
    • The third task will poll Google for results to the second task, with an exponential backoff. It will begin looking for results after five minutes, then ten, twenty, forty, etc., and will give up after waiting 320 minutes. (This API takes about as long to run as the uploaded file, and our longest is about 240 minutes.) Once the poll completes successfully, the result will be saved to the DB, and from there to Solr.
    • A final task comes through and does cleanup (though this isn't strictly necessary, because Google is set to auto-delete all content after seven days).
  3. I've updated Solr to have the transcript results in search results.
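
A rough sketch of that chain, where the imported helpers and the task bodies are hypothetical stand-ins, not our actual code:

    # Sketch of the chain above; the helper functions are hypothetical.
    from celery import chain, shared_task
    from cl.audio.utils import (  # hypothetical module
        reencode_linear16, upload_to_gcs, google_stt_start,
        google_stt_poll, save_transcript,
    )

    @shared_task
    def encode_and_upload(audio_pk):
        # Task 1: re-encode as LINEAR16, then push to Google Storage.
        return upload_to_gcs(reencode_linear16(audio_pk))

    @shared_task
    def request_transcription(gcs_uri):
        # Task 2: kick off the long-running recognition job.
        return google_stt_start(gcs_uri)

    @shared_task(bind=True, max_retries=7)
    def poll_for_result(self, operation_name):
        # Task 3: poll with exponential backoff (5, 10, 20, 40, ...
        # minutes), giving up after the 320-minute wait.
        result = google_stt_poll(operation_name)
        if result is None:
            raise self.retry(countdown=300 * 2 ** self.request.retries)
        save_transcript(result)  # to the DB, and from there to Solr

    chain(encode_and_upload.s(42),  # 42 is a hypothetical Audio pk
          request_transcription.s(),
          poll_for_result.s()).apply_async()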

There's still more to do here, but this is the bulk of it. The missing pieces are:

  • Add snippets to search results (these were terrible before, should be better now).
  • Kick off a script to do all the old content (or just do it in a screen session). For this, it looks like we should wait for some backend upgrades from Google. In particular, they plan to soon have timestamps in the results, which they don't currently.
  • Figure out how to handle API limits, if they happen (I don't think we'll hit these unless we further optimize the pipeline).
  • Update the scraper.
  • Expose the transcripts on the oral argument pages (I'll want to look at the quality some more before doing this though).
  • Update the FAQ to mention transcripts.

@ghost

ghost commented Dec 20, 2019

Hi @mlissner, I just heard about this issue from @azeemba at the supreme court transcripts project. Would love to contribute. What's the current state of generating transcripts?

@mlissner
Member Author

Honestly...I'm not sure! I haven't looked at this in a while, but there's a lot of code for doing speech-to-text. I think where it wound up was that the quality wasn't good enough for a human to read, but it probably would be good enough for alerts and search. I think we still have credits with Google to convert stuff, so if you wanted to pick this up and run with it, that could be pretty great.

@mlissner
Member Author

mlissner commented Jan 29, 2020

FWIW, I just yanked a bunch of this code in 293b3b3. That's not to say we shouldn't do this, but I'm on a code-deleting kick and I'm deleting any code that has lingered without use.

@ghost

ghost commented Jan 29, 2020

Hi @mlissner, got it. I'm thinking that trying to get really good transcripts is beyond the scope of my NLP knowledge at this point, but it would be great to come back to this later on as I work on that!

@mlissner
Member Author

mlissner commented Aug 3, 2023

Man oh man, I haven't updated this ticket in a while. This is now quite feasible, at high quality, using Whisper. This would be a great project for a volunteer to run with.

@mlissner mlissner moved this from 🏗 In progress to 📋 Coding Backlog in Volunteer backlog Aug 3, 2023
@waldoj

waldoj commented Aug 3, 2023

I love it when enough time passes that a long-standing open issue goes from implausible to highly plausible.

@waldoj

waldoj commented Jun 6, 2024

FWIW, if the files are stereo, you'd want to drop them to mono. And you might also experiment with reducing the sample rate. As low as 8 kHz may get equally good results.
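
For example, something like this (standard ffmpeg flags; filenames are hypothetical):

    # Sketch: downmix to mono and resample to 8 kHz. Treat the exact
    # settings as an experiment, not a recipe.
    import subprocess

    subprocess.run([
        "ffmpeg", "-i", "argument.mp3",
        "-ac", "1",      # stereo -> mono
        "-ar", "8000",   # resample to 8 kHz
        "argument_8k.mp3",
    ], check=True)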

(I'm sorry if I'm just driving by and dropping in advice that's quite obvious to you!)

@flooie
Contributor

flooie commented Jun 6, 2024

@waldoj

Oh, that's fantastic - I'm not an audiophile or well informed on audio formats, but reducing to mono 8k would probably get everything under the magic 25MB threshold.

@grossir
Contributor

grossir commented Jun 7, 2024

On the OpenAI forums there are comments about doing that, and from the feedback it seems to work:

I take a 64k stereo mp3 and mash it with OPUS in an OGG container down to 12kbps mono, also using the speech optimizations. Command line is below:
ffmpeg -i audio.mp3 -vn -map_metadata -1 -ac 1 -c:a libopus -b:a 12k -application voip audio.ogg

I haven't tried it myself yet, but @flooie sent me a 5-hour, 20MB file that repeatedly returned an InternalServerError from the API:
InternalServerError: Error code: 500 - {'error': {'message': 'The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID req_b3716aa50fcb212e6a813a807215a382 in your email.)', 'type': 'server_error', 'param': None, 'code': None}}

Maybe there was a problem with the re-encoding, or maybe there is an undocumented audio-length limit?

@grossir
Contributor

grossir commented Jun 7, 2024

Some datapoints from testing a draft of the transcription command #4102

Using the case name as the prompt helps get names right. For this example, the case is "Tafolla v. S.C.D.A.O." (a sketch of passing the prompt follows the two excerpts). Without the case name:

Tofolla, I'm not sure of the pronunciation, versus Heilig. Thank you. So Mr. Bergstein, before we start, pronounce your client's last name. Tofolla. Tofolla. That's how I do it. OK, well, you know better than we do. All right, so you have 10 minutes, but you reserve three minutes for rebuttal. Correct. So you may proceed. OK, we have two reasonable accommodation periods relevant to this case. The first one involved plaintiff's office interaction with Joseph Carroll on January 7, 2014, when we argue the jury can find that defendants violated the ADA when Carroll ordered plaintiff to archive the closed files because the order contradicted the medical note that plaintiff's doctor prepared a couple days ago. Let me ask you about that, Mr. Tofolla, because as I was reading the briefs on this, it just seemed like this case- He's Bergstein. I'm sorry. Bergstein. Thank you. You got Tofolla in my head. This case seems to be a disagreement over what the doctor notes said. It seems to me that the

With case name:

or Tafolla, I'm not sure of the pronunciation, versus Heilig. So Mr. Bergstein, before we start, pronounce your client's last name. Tafolla. Tafolla, that's how I do it. Okay, well, you know better than we do. All right, so you have ten minutes, but you reserve three minutes for rebuttal. Correct. So you may proceed. Okay, we have two reasonable accommodation periods relevant to this case. The first one involved plaintiff's office interaction with Joseph Carroll on January 7, 2014, when we argue the jury can find that defendants violated the ADA when Carroll ordered plaintiff to archive the closed files because the order contradicted the medical note that plaintiff's doctor prepared a couple days earlier. Let me ask you about that, Mr. Tafolla, because as I was reading the briefs on this, it just seemed like this case- He's Bergstein. I'm sorry. Bergstein. You got Tafolla in my head. This case seems to be a disagreement over what the doctor notes said. It seems to me that they were wil
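
For reference, a minimal sketch of passing the case name as the prompt with the OpenAI Python SDK (the filename is hypothetical; the real implementation is the draft command in #4102):

    # Sketch: prime Whisper with the case name via the `prompt` parameter.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open("tafolla_oral_argument.mp3", "rb") as f:  # hypothetical file
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            prompt="Tafolla v. S.C.D.A.O.",
        )
    print(transcript.text)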


Response time is fast, in a direct linear relationship with the audio duration. As a caveat, these were sequential requests:
[chart: response time vs. audio duration]


I did some manual testing, around ~30 requests. I also tested the command by running it against 97 audio files. However, we got billed for 319 API requests; I don't know the reason for the difference.

@mlissner
Member Author

mlissner commented Jun 7, 2024

I also tested the command by running it against 97 audio files. However, we got billed for 319 API requests; I don't know the reason for the difference.

I went and looked. No idea either, but we should keep an eye on this and make sure we understand it before we run up a 3× larger bill than we expect.

@waldoj

waldoj commented Jun 7, 2024

That's a great observation about priming the transcription with the case name! It suggests it would be helpful to include in the prompt any metadata that might appear within the audio.

@mlissner
Member Author

mlissner commented Jun 7, 2024

Maybe a bunch of legalese, but I can't think of a good list. I guess we'll have to see what's bad and plug holes as we see them.

@waldoj

waldoj commented Jun 7, 2024

I was only thinking in terms of the name of the judge, the names of the lawyers, that kind of thing, but that's a good point! If there were any specialized legal terms that transcription consistently got wrong, the prompt could prime for them.

@flooie
Contributor

flooie commented Jun 7, 2024

What about using the final opinion text for priming the audio?

@mlissner
Member Author

mlissner commented Jun 7, 2024

We won't have that when we get recordings in the future, but even at this point we don't have them linked (though we should). V2, perhaps!

@flooie
Contributor

flooie commented Jun 7, 2024

You know what is posted with oral arguments, though? Briefs.

@mlissner
Member Author

Just to follow up here: We've generated transcripts for nearly every oral argument file in CL. We are doing some cleanup for:

  • 868 audio files that didn't get processed the first time (status 0)
  • 103 that failed (status 2)
  • 1481 that hallucinated (status 3)
  • 1893 that are > 25MB (status 4)
  • 182 with missing files (status 5)

@mlissner
Copy link
Member Author

OK, this is getting closer to done:

A few remaining things:

  1. We still have 154 files (0.16%) that are too big to process.

    There seems to be a two-hour limit in the OpenAI API, so we need to either:

    • cut some of these into smaller pieces that we then merge back together;
    • only do the first two hours of these and tell the user as much;
    • try speeding them up to see if we still get good-enough quality; or
    • just throw an error for users that says, "Sorry, this file was too big to transcribe."

    I think I lean towards trying a sped-up version, then trying just the first two hours, then throwing an error if none of that works. (Rough sketches of the speed-up and chunking options are at the end of this comment.)

  2. We haven't set up a process to do this for every new item we scrape. It cost about $21k to do 90k files, so that's 23¢ each. I think we can commit to that — unless we go and get thousands of additional files!

  3. We need a new issue to discuss how we'll add these files to the UI (and we need to do so).

  4. We need to do a blog post, etc.

What else?
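
Rough sketches of the speed-up and chunking options (atempo and the segment muxer are standard ffmpeg features; filenames are hypothetical):

    # Option: speed the audio up 1.5x so it fits under the two-hour limit.
    import subprocess

    subprocess.run([
        "ffmpeg", "-i", "long_argument.mp3",
        "-filter:a", "atempo=1.5",  # faster playback without pitch change
        "long_argument_fast.mp3",
    ], check=True)

    # Option: split into one-hour chunks, transcribe each, and stitch the
    # transcripts back together in order.
    subprocess.run([
        "ffmpeg", "-i", "long_argument.mp3",
        "-f", "segment", "-segment_time", "3600",
        "-c", "copy",  # no re-encode; split at frame boundaries
        "part_%03d.mp3",
    ], check=True)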

@mlissner mlissner moved this from In Progress to Done in @grossir's backlog Jul 15, 2024
@mlissner mlissner moved this to Main Backlog in @erosendo's backlog Jul 15, 2024
@mlissner mlissner moved this from Main Backlog to CourtListener Backlog in @erosendo's backlog Aug 1, 2024
@mlissner mlissner moved this from CourtListener Backlog to Main Backlog in @erosendo's backlog Aug 1, 2024
@mlissner mlissner moved this from Main Backlog to CourtListener Backlog in @erosendo's backlog Aug 1, 2024
@mlissner mlissner moved this from CourtListener Backlog to Main Backlog in @erosendo's backlog Aug 1, 2024
@flooie flooie moved this to General Backlog in Case Law Sprint Nov 19, 2024
@mlissner mlissner added this to Sprint Nov 22, 2024
@s-taube s-taube moved this to General Backlog in Sprint Nov 25, 2024