Transcription structure #74

Open
marvinmarnold opened this issue Aug 5, 2023 · 8 comments

Comments

@marvinmarnold
Collaborator

Proposal for how to structure transcriptions:

type IDocument = {
  // autogenerated
  id: uuid;
  name: string;
  description: string;
  original_url: string;
  original_published_at: timestamptz;
  original_format: 'video';
  original_source: 'youtube';
  type: 'full_council_meeting' | 'committee_meeting';
  subtype: 'regular' | 'special' | 'criminal_justice' | 'budget' | ...
}

type ISpeaker = {
  id: uuid;
  // This is the label the transcriber will use
  slug: string;
  role: 'council_member' | 'public' | 'gov_agency' | 'civic_society';
  name: string;
}

type IDocumentFragment = {
  id: uuid;
  document_id: fk;
  speaker_id: fk;
  // number of milliseconds into the video
  // you don't think this is necessary, but I don't understand why
  timestamp: int;
  text: string;
}

In order to transcribe a city council video from YouTube, the transcriber should:

  • Record the metadata for the video overall (the fields from IDocument)
  • Record speakers throughout transcription
  • Transcribe the video, creating a new IDocumentFragment for each line transcribed

@ayyubibrahimi what tool do you think transcription should be done through? Google Sheets would be the easiest.

@ayyubibrahimi
Collaborator

Agreed.

I'll go ahead and confirm that we've landed on those 4 speaker roles.

@marvinmarnold
Collaborator Author

Great. I can make the spreadsheet pretty quickly, I think.

@marvinmarnold
Collaborator Author

@ayyubibrahimi
Collaborator

Makes sense in the long term. I think that we need to ensure we can reliably map the text that Caitlin transcribes to the data that has timestamps before finalizing a schema.

@marvinmarnold
Collaborator Author

Ya, I'm not clear on how you're thinking of doing that.

@ayyubibrahimi
Collaborator

Simple in theory. Planning to begin experimenting soon. Brief overview:

Example of a chunk of data that contains timestamps:

{
  "timestamp": "0:00-0:24",
  "page_content": "back then libraries were so important at the school that we would go there and I learned my I got education I was educated in Carver's Library so I have connections to the um school that I'm proud of but I'm so very pleased to represent District D which includes Carver High School very proud of you all and as you head back to carver-ram's way just remember that we're behind you 100 thank you thank you",
  "url": "https://www.youtube.com/watch?v=Bl-Tv5yuUTw&ab_channel=NewOrleansCityCouncil",
  "title": "City Council Meeting 2-2-2023",
  "publish_date": "2/2/2023"
}

Example of how I think the transcribed text should be formatted:

[
    {"text": "back then libraries were so", "speaker": "civic society"},
    {"text": "important at the school that", "speaker": "governmental agency"},
    {"text": "we would go there and", "speaker": "governmental agency"},
    {"text": "I learned my I got", "speaker": "city council member"},
    {"text": "education I was educated in", "speaker": "governmental agency"},
    {"text": "Carver's Library so I have", "speaker": "governmental agency"},
    {"text": "connections to the um school", "speaker": "public"},
    {"text": "that I'm proud of but", "speaker": "public"},
    {"text": "I'm so very pleased to", "speaker": "governmental agency"},
    {"text": "represent District D which includes", "speaker": "public"},
    {"text": "Carver High School very proud", "speaker": "public"},
    {"text": "of you all and as", "speaker": "governmental agency"},
    {"text": "you head back to carver-ram's", "speaker": "public"},
    {"text": "way just remember that we're", "speaker": "governmental agency"},
    {"text": "behind you 100 thank you", "speaker": "governmental agency"},
    {"text": "thank you", "speaker": "public"}
]

Because we're currently chunking data on roughly five-second intervals, the number of tokens within a chunk should be relatively consistent. If we chunk the transcribed text similarly, we should be able to perform a simple string-similarity search to match the transcribed text with the timestamps.
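The matching step could be sketched like this (my assumption is that a token-set Jaccard score is a good-enough similarity measure; any fuzzier string metric could be swapped in later):

```typescript
// Tokenize a chunk into a lowercase set of words.
function tokenSet(s: string): Set<string> {
  return new Set(s.toLowerCase().split(/\s+/).filter(Boolean));
}

// Jaccard similarity: |intersection| / |union| of the two token sets.
function jaccard(a: string, b: string): number {
  const sa = tokenSet(a);
  const sb = tokenSet(b);
  let inter = 0;
  for (const t of sa) if (sb.has(t)) inter++;
  const union = sa.size + sb.size - inter;
  return union === 0 ? 0 : inter / union;
}

// For a transcribed chunk, pick the timestamped chunk with the highest score.
function bestMatch(
  transcribed: string,
  timestamped: { timestamp: string; page_content: string }[]
): { timestamp: string; page_content: string } {
  return timestamped.reduce((best, cur) =>
    jaccard(transcribed, cur.page_content) > jaccard(transcribed, best.page_content)
      ? cur
      : best
  );
}
```

Drift between the two transcripts would lower scores across the board, but as long as the correct chunk still scores highest locally, the match holds.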

@marvinmarnold
Collaborator Author

> Because we're currently chunking data on roughly five-second intervals, the number of tokens within a chunk should be relatively consistent.

I'm not convinced. Wouldn't Caitlin's transcription need to line up almost perfectly with the YouTube one for this to work? I imagine the two will start to drift pretty fast. And if Caitlin needs to track five-second increments, why not just have her track her own timestamps? Maybe every five seconds is easier than every sound bite.

@ayyubibrahimi
Collaborator

I don't think drift is an issue if the string-similarity match has pointers to the preceding and following chunks. Alternatively, she can transcribe in 60-second increments for the sake of efficiency, and for the purposes of matching the strings, we can chunk the timestamp data on a 60-second interval. These chunks can always be preprocessed further before they're read into the model.
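The 60-second regrouping could be as simple as merging consecutive five-second chunks until a window fills. A sketch, under the assumption that each chunk carries numeric start/end seconds (the `Chunk` shape here is illustrative, not the agreed schema):

```typescript
// Illustrative chunk shape: start/end offsets in seconds plus the text.
type Chunk = { start: number; end: number; text: string };

// Merge consecutive chunks until a window of `windowSec` seconds is filled,
// then start a new merged chunk.
function regroup(chunks: Chunk[], windowSec = 60): Chunk[] {
  const out: Chunk[] = [];
  for (const c of chunks) {
    const last = out[out.length - 1];
    if (last && c.end - last.start <= windowSec) {
      last.end = c.end;
      last.text += " " + c.text;
    } else {
      out.push({ ...c });
    }
  }
  return out;
}
```

With five-second source chunks, `regroup(chunks, 60)` would fold twelve of them into each 60-second chunk, which can then be matched against Caitlin's 60-second transcription increments.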
