Voice stuttering, Macbook Pro M1 16GB, what to change? #2
Comments
As far as I know, the M1 does not yet reach the inference speed needed for realtime with Coqui TTS using the XTTS 2 model. I don't have a Mac, so I can't really experiment with ways to improve the situation. If Metal shaders (MPS) are available on your Mac, the following might move the model to the GPU (open coqui_engine.py and replace the device selection code with this):

    if torch.cuda.is_available():
        logging.info("CUDA available, GPU inference used.")
        device = torch.device("cuda")
    elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
        logging.info("MPS available, GPU inference used.")
        device = torch.device("mps")
    else:
        logging.info("CUDA and MPS not available, CPU inference used.")
        device = torch.device("cpu")

Raising stream_chunk_size from 20 to 200 may also help by producing bigger synthesized chunks. You can also make coqui_engine synthesize only full sentences:

    # Collect every synthesized chunk of the sentence first ...
    chunklist = []
    for i, chunk in enumerate(chunks):
        chunk = postprocess_wave(chunk)
        chunklist.append(chunk.tobytes())
    # ... and only then send them to the playback side.
    for chunk in chunklist:
        conn.send(('success', chunk))

I know none of these are great solutions, but I think Coqui just does not fully support M1 currently. I guess general PyTorch support for Mac will get better with future versions (I think I read they are working on this). The libraries used (faster_whisper, llama and Coqui TTS) all use torch currently, and I'm not sure whether switching to a completely different inference provider like tinygrad or TensorFlow is an option for the future. Maybe Coqui will optimize their synthesis further; they improved a lot in the past. Two months ago I was not able to get realtime speed on my environment (RTX 2080, AMD Ryzen, 32GB DDR4). I hope it will get better soon.
I've just realized that Coqui also offers an excellent XTTS streaming server project. Their approach differs somewhat from mine. Therefore, if you can get their implementation working and experience similar stuttering issues, it likely indicates that the problem is beyond my control. However, if their version runs smoothly, it suggests that there might be a specific issue with my RealtimeTTS implementation for Mac, which I can then focus on resolving.
Can you tell me where I can find this coqui_engine.py?
Another option is to just use the default Mac voices(?). By the way, I plan to move to a MacBook M3; any idea if that is supported?
coqui_engine.py is in your site-packages installation folder. You can find the location with "pip show realtimetts". You might need to install with "pip install -e realtimetts" to be able to edit it; this is how it works on Windows, not sure how it is on Mac.

    def __init__(self,
                 model_name = "tts_models/multilingual/multi-dataset/xtts_v2",
                 cloning_reference_wav: str = "female.wav",
                 language = "en",
                 speed = 1.0,
                 thread_count = 6,        # <-
                 stream_chunk_size = 20,  # <- these will allow for better customization for slower machines
                 full_sentences = False,  # <-
                 level=logging.WARNING
                 ):

I would not put too much hope into this though. Local neural TTS is still very demanding, and these patches are basically me grasping at straws to make the experience better.
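If "pip show" is not convenient, the file can probably also be located from Python. This is a minimal sketch, assuming the package imports as `RealtimeTTS`; the helper name `find_engine_files` is made up for illustration and is not part of the library. It searches the installed package folder recursively because the exact subfolder may differ between versions.

```python
import importlib.util
from pathlib import Path
from typing import List


def find_engine_files(package: str = "RealtimeTTS",
                      filename: str = "coqui_engine.py") -> List[Path]:
    """Return every file called `filename` inside the installed `package`.

    Hypothetical helper for illustration, not part of RealtimeTTS.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        # Package is not installed (or has no regular __init__.py).
        return []
    package_dir = Path(spec.origin).parent
    # Search recursively, since the file may live in a subfolder.
    return sorted(package_dir.rglob(filename))


if __name__ == "__main__":
    matches = find_engine_files()
    if not matches:
        print("RealtimeTTS (or coqui_engine.py) was not found in this environment.")
    for path in matches:
        print(path)
```

An empty result most likely means the script runs in a different Python environment than the one RealtimeTTS was installed into.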
A new version is released now with many more options in the Coqui engine constructor (please upgrade with "pip install --upgrade RealtimeTTS").
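As a rough sketch of how those constructor options might be combined for a slower machine: the parameter names `thread_count`, `stream_chunk_size` and `full_sentences` come from the constructor shown earlier in this thread, while the helper `slow_machine_options` and the chosen values are only assumptions for illustration, not recommended settings.

```python
import os
from typing import Any, Dict, Optional


def slow_machine_options(cpu_threads: Optional[int] = None) -> Dict[str, Any]:
    """Build keyword arguments for CoquiEngine on a slower machine.

    Hypothetical helper; the values are starting points to experiment with.
    """
    if cpu_threads is None:
        # Assumption: use all logical cores reported by the OS (at least 1).
        cpu_threads = os.cpu_count() or 1
    return {
        "thread_count": cpu_threads,   # CPU threads used for synthesis
        "stream_chunk_size": 200,      # bigger chunks than the default of 20
        "full_sentences": True,        # no mid-sentence gaps, but more latency
    }


if __name__ == "__main__":
    options = slow_machine_options()
    print(options)
    # Usage, assuming RealtimeTTS is installed:
    #   from RealtimeTTS import CoquiEngine, TextToAudioStream
    #   engine = CoquiEngine(**options)
    #   TextToAudioStream(engine).feed("Hello from a slower machine.").play()
```

Setting `full_sentences` to True trades latency for smoothness: each sentence is synthesized completely before playback starts.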
Upgrading RealtimeTTS broke it. When I run it now, I get an error:

This is how voice number 3 sounds like
Error: The operator 'aten::upsample_linear1d.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable
OK, I made another new version with an optional use_mps parameter for the Coqui engine, so you can at least use it somehow. Maybe setting the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 together with MPS also works, but I read in the torch forum that it does not work for every missing op. As far as I know, they are working on Metal shaders and are adding ops with every new release, so I guess in the near future we will get MPS-accelerated TTS for Mac.
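A minimal sketch of setting the fallback variable from Python instead of the shell. It assumes the variable has to be in place before torch is first imported, which is why it sits at the very top. `pick_device_name` is a hypothetical helper that only mirrors the CUDA, then MPS, then CPU order discussed above; its `use_mps` argument is modeled on the new engine parameter but is not the library's own code.

```python
import os

# Assumption: set the variable before the first `import torch`.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device_name(cuda_ok: bool, mps_ok: bool, use_mps: bool = True) -> str:
    """Hypothetical helper: prefer CUDA, then MPS (if allowed), else CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok and use_mps:
        return "mps"
    return "cpu"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed; device would be:",
              pick_device_name(False, False))
    else:
        # torch.backends.mps is assumed to exist (recent torch versions).
        name = pick_device_name(torch.cuda.is_available(),
                                torch.backends.mps.is_available())
        print("Selected device:", torch.device(name))
```

Ops without an MPS implementation then run on the CPU instead, so this may avoid the error without making synthesis any faster.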
Thanks, updating RealtimeTTS did work, but the stuttering is still there, even with the ...FALLBACK variable set to 1.
Thanks for reporting. RealtimeTTS 0.2.6 should fix the sentence end bug. It was caused by the emoji at the end. There is not much more I can do about the stuttering on Mac right now. It will stop when full_sentences is set to True, with the downside of noticeably greater latency for the first chunk of every sentence. Playing with thread_count and stream_chunk_size might improve it, but I doubt it will resolve it. We need full PyTorch GPU support for Mac. Currently the dependent libraries (faster_whisper, llama and Coqui TTS) all use torch only, so switching to tinygrad, TensorFlow etc. is not an option right now. I guess we have to wait until either Coqui squeezes out better synthesis performance (they did so a lot in the past; some weeks ago I wasn't able to get realtime speed on my 2080), or torch implements the aten::upsample_linear1d.out op in Metal shaders and the GPU support for Mac becomes sufficient for a realtime factor < 1.
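For reference, the realtime factor is the time needed to synthesize a piece of audio divided by the duration of that audio. A small sketch of the arithmetic follows; the function names are illustrative only.

```python
def realtime_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def keeps_up_with_playback(synthesis_seconds: float, audio_seconds: float) -> bool:
    """True when audio is generated faster than it is played back."""
    return realtime_factor(synthesis_seconds, audio_seconds) < 1.0


if __name__ == "__main__":
    # 3 s of compute for 2 s of audio: playback runs dry and stutters.
    print(realtime_factor(3.0, 2.0), keeps_up_with_playback(3.0, 2.0))
    # 1 s of compute for 4 s of audio: synthesis stays ahead of playback.
    print(realtime_factor(1.0, 4.0), keeps_up_with_playback(1.0, 4.0))
```

With a factor above 1, each streamed chunk finishes playing before the next one is ready, which is what is heard as stuttering.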
Thanks for your feedback! I would like to understand better what you are doing here. Is there a comprehensive overview that is not too technical? I'm especially interested in the speech-to-text engine and how well it handles input (also in languages like Dutch or French). Ideally the output would then go through this great TTS, but for now why not use the default Mac voices (they are OK too)?
I combine other people's work by somehow gluing their libraries together 😉 For STT it's basically a combination of a fast transcription library (faster_whisper) with two good voice activity detection libraries (Silero VAD and webrtcvad). This allows spoken sentences to be detected quite well, which in turn makes it possible to handle large amounts of streamed audio. For TTS it's mostly preparation of the input texts to get sentence fragments that the engines can synthesize well (reading from the LLM input stream until such a fragment is found, or splitting longer fed texts into those fragments). The STT should work with Dutch. Set the language to "nl" (I guess?) and maybe switch to a larger model if the word error rate is too high. Regarding the default Mac voices, I use pyttsx3 for the SystemEngine, which was supposed to offer the native voices. I heard from another user, though, that SystemEngine also caused issues on Mac. I do not know yet whether they appear on every Mac and can be solved.
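To illustrate the idea of reading from a text stream until a synthesizable fragment is found, here is a deliberately simplified sketch. It is not the RealtimeTTS implementation; the function name and the rule "sentence punctuation followed by whitespace" are assumptions chosen for brevity.

```python
import re
from typing import Iterable, Iterator

# Assumed rule: a fragment ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"[.!?]+\s")


def sentences_from_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed text pieces and yield complete sentences."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit every complete sentence currently in the buffer.
        match = SENTENCE_END.search(buffer)
        while match:
            sentence = buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
            match = SENTENCE_END.search(buffer)
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    llm_tokens = ["Hel", "lo there. How", " are you", "? Fine"]
    for sentence in sentences_from_stream(llm_tokens):
        print(sentence)
```

Each yielded sentence could then be handed to the TTS engine while the LLM is still producing the rest of its answer.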
An update: I have my MacBook Pro M3 36GB now. I tried the default install and the stuttering is less, but still there. Any suggestions to improve it?
Hi again, I did a reinstall using git pull (on the MacBook Pro M3). It seems to go better, but not yet fully without stuttering. Note that I haven't installed CUDA, since the link you give in the install instructions points to Windows or Linux only. This does impact performance, right? Your input would be highly appreciated (any input that could improve the setup on a MacBook Pro M3), since I really love this fully local open source voice chatting. Once I get it working without stuttering, I would love to add tools like web browsing.
Cuda not available
The CoquiEngine also supports a parameter named use_deepspeed. If you can get deepspeed installed on Mac, this is supposed to accelerate the synthesis (although I do not know if it works on CPU too). Otherwise I think only the thread_count parameter has the potential to speed things up. I'd love to help more, but I'm quite lost here too. Maybe the community has some ideas about how to get Coqui TTS faster on CPU only? Or, even better, any way to get Coqui TTS working on the GPU for Mac?
I read that deepspeed only works with CUDA (NVIDIA), thus not on a Mac.
Love this project! Was playing around with it.
The voice works fine, but stutters. It starts correctly "This is how ..." then stops "voice x", stops "sounds like".
What would you recommend to change?
Thanks for your input!