
How to handle interruptions better while building speech to speech pipeline? #156

Open
mehul-fabrichq opened this issue Dec 2, 2024 · 1 comment

@mehul-fabrichq

Are there any examples of an end-to-end speech-to-speech pipeline with better latency and interruption handling?

@KoljaB
Owner

KoljaB commented Dec 4, 2024

That's a pretty general question. Interruption handling is tricky because it usually requires echo cancellation of the voice agent's own TTS output. Latency, on the other hand, is about balancing the stages of the pipeline. A fast STT system should transcribe in under 100 ms on a strong GPU, and a decent TTS system adds around 200 ms to first audio. The rest of the delay comes from LLM generation and speech-end detection.

Most basic speech endpoint detection methods rely on waiting for a certain amount of silence, which naturally adds latency.

For better latency:

- Make sure your STT, LLM, and TTS are each as fast as possible.
- Use a more advanced speech endpoint detection method, like adjusting the silence threshold based on the real-time transcription (e.g., detecting end punctuation) or analyzing frequency changes. People often lower their pitch when finishing a thought or raise it for questions.
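As a minimal sketch of the transcript-based idea: shorten the silence timeout when the live transcript already ends in terminal punctuation, and wait longer otherwise. The function name and the timeout values here are illustrative, not part of any specific library.

```python
import re

# Illustrative timeouts; tune for your STT latency and speaking style.
BASE_SILENCE_S = 0.8   # default wait after speech stops (mid-sentence pause)
FAST_SILENCE_S = 0.25  # shorter wait when the sentence looks finished

def silence_timeout(partial_transcript: str) -> float:
    """Return how long to wait in silence before declaring end-of-speech,
    based on whether the real-time transcript ends in terminal punctuation."""
    text = partial_transcript.rstrip()
    if re.search(r"[.!?]$", text):
        return FAST_SILENCE_S  # sentence appears complete -> end the turn sooner
    return BASE_SILENCE_S      # likely mid-sentence -> give the speaker more time
```

A pitch-based variant would feed the same decision from a frequency analysis of the last few hundred milliseconds of audio instead of (or in addition to) the transcript.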

For interruption handling:

Remove the TTS feedback from the microphone input (echo cancellation), then apply a volume-based threshold on the cleaned signal to detect barge-in.
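A rough sketch of the volume-threshold step, assuming echo cancellation has already removed the agent's TTS from the mic frames: treat several consecutive frames above an RMS threshold as a user interruption. The class name, threshold, and frame counts are made up for illustration.

```python
import numpy as np

RMS_THRESHOLD = 0.02  # tune to your mic gain and noise floor
HOLD_FRAMES = 5       # require sustained loudness, not a single spike

def rms(frame: np.ndarray) -> float:
    """Root-mean-square energy of one audio frame (float samples)."""
    return float(np.sqrt(np.mean(frame.astype(np.float64) ** 2)))

class InterruptionDetector:
    def __init__(self, threshold: float = RMS_THRESHOLD, hold: int = HOLD_FRAMES):
        self.threshold = threshold
        self.hold = hold
        self.loud_frames = 0

    def process(self, echo_cancelled_frame: np.ndarray) -> bool:
        """Feed one echo-cancelled mic frame. Returns True once the user
        has been loud for `hold` consecutive frames (a barge-in)."""
        if rms(echo_cancelled_frame) > self.threshold:
            self.loud_frames += 1
        else:
            self.loud_frames = 0  # reset on any quiet frame
        return self.loud_frames >= self.hold
```

On a detected interruption you would stop TTS playback and flush the pending LLM/TTS output; the hold count keeps short noises from cutting the agent off.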
