That's a pretty general question. Interruption handling is tricky because it usually requires echo cancellation of the voice agent's own TTS output. Latency, on the other hand, is a budgeting exercise across the whole pipeline: a fast STT system should transcribe in under 100 ms on a strong GPU, a decent TTS system adds around 200 ms, and the rest of the delay comes from LLM generation and speech end detection.
Most basic speech endpoint detection methods rely on waiting for a certain amount of silence, which naturally adds latency.
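For concreteness, here is a minimal sketch of that baseline approach using the `webrtcvad` package. The sample rate, frame size, and 700 ms timeout are illustrative assumptions; every millisecond of that timeout is added directly to the response latency.

```python
# Baseline silence-based endpointing sketch using the `webrtcvad` package.
# Assumes 16 kHz, 16-bit mono PCM audio in 30 ms frames (480 samples = 960 bytes).
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
SILENCE_TIMEOUT_MS = 700  # illustrative value; all of it is added latency

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) to 3 (most)

def endpoint_stream(frames):
    """Yield True for a frame once SILENCE_TIMEOUT_MS of consecutive
    non-speech has elapsed, i.e. the user's turn is considered over."""
    silence_ms = 0
    for frame in frames:  # each frame: 30 ms of raw PCM bytes
        if vad.is_speech(frame, SAMPLE_RATE):
            silence_ms = 0
        else:
            silence_ms += FRAME_MS
        yield silence_ms >= SILENCE_TIMEOUT_MS  # True -> hand transcript to the LLM
```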
For better latency:

- Make sure your STT, LLM, and TTS are each as fast as possible.
- Use a more advanced speech endpoint detection method: adjust the silence threshold based on the real-time transcription (e.g., commit early when the partial transcript ends in terminal punctuation), or analyze pitch, since people often lower their pitch when finishing a thought and raise it for questions. A sketch of the punctuation-based variant follows this list.
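As a rough sketch of the punctuation-based idea: shorten the endpoint timeout whenever the streaming STT partial already looks like a finished sentence. The function and parameter names below are illustrative and not tied to any particular STT SDK.

```python
# Hedged sketch: pick a shorter silence timeout when the streaming STT
# partial already ends in terminal punctuation. `partial_transcript` and
# the timeout values are illustrative assumptions, not a specific API.
TERMINAL = (".", "!", "?")

def silence_timeout_ms(partial_transcript: str,
                       base_ms: int = 700,
                       fast_ms: int = 250) -> int:
    """Choose the endpoint timeout based on the latest partial transcript."""
    text = partial_transcript.rstrip()
    if text.endswith(TERMINAL):
        # The STT model already thinks the sentence is complete: commit sooner.
        return fast_ms
    return base_ms
```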
For interruption handling:
Remove the agent's TTS playback from the microphone input first (acoustic echo cancellation), then apply a volume-based threshold on the residual signal to decide whether the user is actually talking over the agent.
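As a hedged illustration of that second step, here is a barge-in detector that assumes echo cancellation has already happened upstream (e.g., in the OS or a WebRTC capture stack), so only the volume threshold remains. The threshold and frame-count values are made-up starting points, not recommendations.

```python
# Barge-in sketch: assumes the mic signal has already had the agent's TTS
# removed by an echo canceller; we only threshold the residual energy here.
import numpy as np

RMS_THRESHOLD = 0.02   # linear full-scale; tune per mic and room
MIN_FRAMES = 5         # require sustained energy to avoid false barge-ins

def barge_in_detector(frames, tts_playing):
    """frames: iterable of float32 mono chunks; tts_playing: callable -> bool.
    Yields "interrupt" when the user speaks over the agent's TTS output."""
    hot = 0
    for frame in frames:
        rms = float(np.sqrt(np.mean(frame ** 2)))
        if tts_playing() and rms > RMS_THRESHOLD:
            hot += 1
            if hot >= MIN_FRAMES:
                yield "interrupt"  # stop TTS playback and start listening
                hot = 0
        else:
            hot = 0
```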
Are there any examples of an end-to-end speech-to-speech pipeline with better latency and interruption handling?