Generate Text-to-speech timing metadata #1876
jeffemandel
started this conversation in
Ideas
I am using TTS to generate spoken instructions that need to line up with visual cues. My suggestion is that timing information could be exposed by extending OpenAiAudioSpeechMetadata. For example, if my speech prompt is "Press the button now", I'd like the call to return something like
Response.metadata.timings = [{"word": "Press", "time": 0.0}, {"word": "the", "time": 0.04}, ...]
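To make the per-word idea concrete, here is a minimal sketch of how a caller might consume such a list to find when a given word starts. The `WordTiming` record and the `firstTimeOf` helper are hypothetical names for illustration only; nothing like them exists in OpenAiAudioSpeechMetadata today, and the times are the example values from above, in seconds.

```java
import java.util.List;
import java.util.OptionalDouble;

// Hypothetical shape for the proposed per-word timings; not an existing API.
class WordTimings {

    // One spoken word and its start offset in seconds from the start of the audio.
    record WordTiming(String word, double time) {}

    // Returns the start time of the first occurrence of the word, or empty if absent.
    static OptionalDouble firstTimeOf(List<WordTiming> timings, String word) {
        return timings.stream()
                .filter(t -> t.word().equalsIgnoreCase(word))
                .mapToDouble(WordTiming::time)
                .findFirst();
    }

    public static void main(String[] args) {
        // Illustrative values taken from the example in this post.
        List<WordTiming> timings = List.of(
                new WordTiming("Press", 0.0),
                new WordTiming("the", 0.04));
        System.out.println(firstTimeOf(timings, "the"));
    }
}
```

A UI could use the returned offset to schedule the matching visual cue relative to the start of playback.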
Alternatively, there could be some sort of unspoken markup character that triggers this behavior:
String prompt = "The @0 green line indicates money and the @1 blue line depicts happiness";
would yield
Response.metadata.timings = [0.2, 1.2]
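As a rough sketch of what handling the markers could involve, the snippet below strips `@N` markers from a prompt and records where each one sat in the text that would actually be spoken. This only covers the text side; mapping those positions to audio times would still have to happen in the TTS service. The `CueMarkers` class, the `Parsed` record, and the exact marker syntax (an `@` followed by digits) are assumptions for illustration, not an existing API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: separates unspoken cue markers from the spoken text.
class CueMarkers {

    // Assumed marker syntax: '@' followed by digits, plus any trailing whitespace.
    private static final Pattern MARKER = Pattern.compile("@\\d+\\s*");

    // The text to speak, and the character offset in that text of each marker, in order.
    record Parsed(String spokenText, List<Integer> offsets) {}

    static Parsed parse(String prompt) {
        StringBuilder spoken = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        Matcher m = MARKER.matcher(prompt);
        int last = 0;
        while (m.find()) {
            // Copy the text before the marker, then note where the marker was.
            spoken.append(prompt, last, m.start());
            offsets.add(spoken.length());
            last = m.end();
        }
        spoken.append(prompt.substring(last));
        return new Parsed(spoken.toString(), offsets);
    }

    public static void main(String[] args) {
        Parsed p = parse("The @0 green line indicates money and the @1 blue line depicts happiness");
        System.out.println(p.spokenText());
        System.out.println(p.offsets());
    }
}
```

This sketch ignores the marker's number and assumes markers appear in the order they should be reported. It would also treat any literal "@" followed by digits as a marker, which is the trade-off of reserving a character.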
I suspect the second approach may be preferable, but I'm open to either. Personally, I don't mind having to spell things out for TTS: "µg" isn't read as "microgram", so having to write "at" because "@" would no longer be a spoken character is no great hindrance. I understand this would have to be implemented at the API level, but end users are the ones who understand the motivation for the feature.