Supporting whisper.cpp? #376

Closed
tachyonicbytes opened this issue May 4, 2023 · 5 comments

@tachyonicbytes

Are there any plans to support OpenAI Whisper automatic speech recognition? How hard would it be to do that? (I am unfamiliar with the codebase.)

From a performance standpoint, it currently seems to be one of the best engines, although I wouldn't necessarily trust OpenAI marketing.

From a licensing standpoint, it is FOSS, so that should not be a problem.

@drmfinlay
Member

Hello @tachyonicbytes,

Support for OpenAI Whisper has come up before, I think in the Gitter chat room. There are no current plans to support it in Dragonfly, at least not on its own. Shervin Emami (@shervinemami) managed to get it working together with Dragonfly's Kaldi engine last year. He was able to use Whisper, instead of Kaldi, for the dictation parts of grammar rules. If I remember correctly, this improved the recognition accuracy of those parts. See daanzu/kaldi-active-grammar#73 for more on that.

In order to use Whisper for the command parts too, it would be necessary to write a dedicated Dragonfly-Whisper engine implementation. However, impressive as Whisper is, its natural language ASR models are quite unsuitable for the typical Dragonfly command phrases defined in speech grammars. Unless I am mistaken, there is no way to trim Whisper's recognition search tree in real time — to have the software strictly consider only those hypotheses which fit active Dragonfly grammars.
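To make the missing capability concrete, here is a minimal, purely hypothetical sketch (not part of Dragonfly or Whisper; all names are illustrative) of what "trimming the search tree" would mean: during beam-search decoding, a partial hypothesis would survive only if it is still a prefix of some phrase an active grammar allows.

```python
# Hypothetical sketch of grammar-constrained pruning. Phrases and
# function names are illustrative, not Dragonfly or Whisper API.

ACTIVE_PHRASES = [
    ["go", "right", "two"],
    ["go", "left", "three"],
    ["save", "file"],
]

def is_viable(hypothesis):
    """Return True if this partial word sequence can still grow into
    an allowed command phrase."""
    return any(phrase[:len(hypothesis)] == hypothesis
               for phrase in ACTIVE_PHRASES)

def prune(beam):
    """Discard hypotheses that no active grammar phrase can complete."""
    return [h for h in beam if is_viable(h)]

beam = [["go", "right"], ["go", "write"], ["save"]]
print(prune(beam))  # "go write" is dropped; no grammar phrase fits it
```

Kaldi-based engines can apply this kind of constraint inside the decoder itself; Whisper's decoder exposes no such hook, which is the crux of the problem described above.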

If it becomes possible to do that, and if commands are recognisable with a high degree of accuracy and speed, then an engine implementation for Whisper might be worth considering. But those are two big ifs! I don't think the folks at OpenAI are capable of such sorcery. :-)

@LexiconCode
Member

LexiconCode commented May 4, 2023

I went ahead and made an inquiry. Thanks for the verbiage, Danesprite. Opening discussion: ggerganov/whisper.cpp#870

There's an early implementation, "guided mode":
https://github.com/ggerganov/whisper.cpp/blob/master/examples/command/README.md

Example
https://github.com/ggerganov/whisper.cpp/tree/master/examples/command

@drmfinlay
Member

Thank you for investigating further, Aaron. I was unaware of guided mode. It is a start, but would not be adequate without significant changes.

Since this mode takes a flat list of commands, a Dragonfly-Whisper implementation would have to output every possible command phrase to a text file. It would be simple enough to do this for a spec string like go right <N>. But, for the complex use of continuous command recognition in, say, Caster, it would be utterly impractical. This problem would be solved if guided mode could recognise commands efficiently from some sort of grammar file.
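The scale problem can be sketched quickly. Below is a hypothetical expansion of a spec string into the flat phrase list guided mode expects; the `expand` helper and its spec syntax are illustrative, not Dragonfly API.

```python
# Hypothetical sketch: expanding a spec like "go right <N>" into every
# concrete phrase for a flat command list. Not real Dragonfly code.
import itertools

def expand(spec, extras):
    """Expand a spec string into all concrete command phrases.

    extras maps an element name (e.g. "N") to its possible spoken
    values."""
    choices = [extras[part.strip("<>")] if part.startswith("<") else [part]
               for part in spec.split()]
    return [" ".join(combo) for combo in itertools.product(*choices)]

phrases = expand("go right <N>", {"N": [str(n) for n in range(1, 11)]})
print(len(phrases))  # 10 phrases: "go right 1" ... "go right 10"

# Continuous command recognition chains commands within one utterance,
# so the flat list grows multiplicatively: sequences of up to three
# commands drawn from a 100-command set already need over a million
# entries.
print(100 + 100**2 + 100**3)  # 1010100
```

A single spec is trivial to enumerate; it is the cross-product over chained commands that makes the flat-list approach collapse for a grammar set like Caster's.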

As you say in the linked discussion, Dragonfly also needs the ability to activate and deactivate command phrases. Without this, contexts wouldn't work properly. Another issue is that it would not be possible to recognise the dictation parts of commands in the same utterance.
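For illustration only, here is a tiny hypothetical sketch of what context-dependent activation means in practice: the set of valid commands changes with the foreground application, so a guided-mode backend would have to rewrite its flat command list on every context switch. The dictionary and function below are made up for this example.

```python
# Hypothetical sketch of context-dependent command activation.
# Grammar names and the lookup scheme are illustrative only.

GRAMMARS = {
    "editor": ["save file", "go right one"],
    "browser": ["new tab", "close tab"],
    "global": ["wake up", "go to sleep"],  # always active
}

def active_commands(context):
    """Commands valid in the current context, plus global ones."""
    return GRAMMARS.get(context, []) + GRAMMARS["global"]

print(active_commands("editor"))
print(active_commands("browser"))
```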

This all seems unnecessary to me, really. Dragonfly already has several engine implementations that do these things well. Whisper, in my opinion, is just not the right tool for this type of work.

@shervinemami
Contributor

I've used Whisper in Dragonfly for dictation while using KaldiAG for commands, and I definitely agree with Danesprite that Whisper isn't suited to command mode, even if you're willing to put a lot of effort into customising it. Whisper works great on full sentences and is an excellent choice for long dictation, but it struggles with anything shorter than a few words: even dictating something as short as "hi how are you?" is very unreliable. This leads me to expect it would really struggle if used specifically for single-word commands.

@drmfinlay
Member

Thanks, Shervin. Your point about accuracy for short phrases is important. Whisper's models were not trained for this purpose.

@tachyonicbytes, if you haven't already, I would suggest trying out Dragonfly's KaldiAG engine. It is open source and fairly accurate, with low latency. The documentation for it is here. I think you'll find it is good enough.

drmfinlay added a commit that referenced this issue May 2, 2024
Re: #139, #376, #383.

Add a Q and A on implementing a custom Dragonfly engine externally
and a Q and A on whether Dragonfly will add support for new speech
recognition engines.
drmfinlay added a commit that referenced this issue May 4, 2024
Re: #139, #376, #383.

I've added a section on new engines back into the CONTRIBUTING.rst
file and given criteria for new engine implementations.