Generate Text-to-speech timing metadata #1876

jeffemandel · 2024-12-05T16:58:13Z

I am using TTS to generate spoken instructions that need to line up with visual cues. My suggestion is to extend OpenAiAudioSpeechMetadata with timing information. For example, if my speech prompt is "Press the button now", I'd like the call to return something like
Response.metadata.timings = [{"word": "Press", "time": 0.0}, {"word": "the", "time": 0.04}, ...]
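
To make the first idea concrete, here is a minimal Java sketch of how such per-word timings might be represented and queried on the client side. The `WordTiming` record and the `WordTimings.startOf` helper are hypothetical names invented for illustration; nothing like them exists in OpenAiAudioSpeechMetadata today, and the unit of `time` is assumed to be seconds.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical shape only: one entry per spoken word, with its assumed start time in seconds.
record WordTiming(String word, double timeSeconds) {}

final class WordTimings {
    private WordTimings() {}

    // Returns the start time of the first occurrence of the given word, ignoring case.
    static Optional<Double> startOf(List<WordTiming> timings, String word) {
        return timings.stream()
                .filter(t -> t.word().equalsIgnoreCase(word))
                .map(WordTiming::timeSeconds)
                .findFirst();
    }
}
```

With something like this, a UI could look up the start time of "button" and schedule the matching visual cue for that moment.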

Alternatively, there could be some sort of unspoken markup character that triggers this behavior:
String prompt = "The @0 green line indicates money and the @1 blue line depicts happiness";
would yield
Response.metadata.timings = [0.2, 1.2]
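
As a rough illustration of the marker idea, the sketch below strips the `@n` markers from a prompt and records where each one sat in the text that would actually be spoken. The `CueMarkers` class and the exact marker syntax (`@`, digits, one optional trailing space) are assumptions made for this example. Mapping those character offsets to times would still have to happen wherever the audio is generated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CueMarkers {
    private CueMarkers() {}

    // Assumed marker syntax: '@' followed by digits and at most one trailing space.
    private static final Pattern MARKER = Pattern.compile("@\\d+ ?");

    // The text to speak, plus the character offset of each marker in order of appearance.
    record Parsed(String spokenText, List<Integer> offsets) {}

    static Parsed parse(String prompt) {
        StringBuilder spoken = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        Matcher m = MARKER.matcher(prompt);
        int last = 0;
        while (m.find()) {
            // Copy the text before the marker, then note where the marker was.
            spoken.append(prompt, last, m.start());
            offsets.add(spoken.length());
            last = m.end();
        }
        spoken.append(prompt, last, prompt.length());
        return new Parsed(spoken.toString(), List.copyOf(offsets));
    }
}
```

The numeric label after `@` is not interpreted here; the i-th offset simply belongs to the i-th marker in the prompt.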

I suspect the second approach may be preferable, but I'm open to either. Personally, I don't mind having to spell out what I mean for certain characters: "µg" doesn't sound like "microgram", so having to write "at" because "@" isn't a spoken character is no great hindrance. I understand this would have to be implemented at the API level, but end users are the ones who will understand the motivation for the feature.
