Voice stuttering, Macbook Pro M1 16GB, what to change? #2

Open
stevenbaert opened this issue Nov 21, 2023 · 17 comments

Comments

@stevenbaert

Love this project! I was playing around with it.
The voice works fine but stutters: it starts correctly with "This is how ...", then stops, says "voice x", stops again, says "sounds like".
What would you recommend changing?

Thanks for your input!

@KoljaB
Owner

KoljaB commented Nov 21, 2023

To my knowledge the M1 does not yet achieve the inference speed needed for realtime synthesis with Coqui TTS using the XTTS v2 model. I don't have a Mac, so I can't really experiment with ways to improve the situation.

If Metal Performance Shaders (MPS) are available on your Mac, moving the model to the GPU could help (open coqui_engine.py and replace the device selection code with this):

	if torch.cuda.is_available():
		logging.info("CUDA available, GPU inference used.")
		device = torch.device("cuda")
	elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
		logging.info("MPS available, GPU inference used.")
		device = torch.device("mps")
	else:
		logging.info("CUDA and MPS not available, CPU inference used.")
		device = torch.device("cpu")

Raising stream_chunk_size from 20 to 200 may also help, since it produces bigger synthesized chunks.

You can also make coqui_engine synthesize only full sentences:

	chunklist = []

	# collect all audio chunks of the sentence first ...
	for i, chunk in enumerate(chunks):
		chunk = postprocess_wave(chunk)
		chunklist.append(chunk.tobytes())

	# ... then send them in one go, so only complete sentences get played
	for chunk in chunklist:
		conn.send(('success', chunk))

I know none of these are great solutions, but I think Coqui just does not fully support the M1 currently. I guess general PyTorch support for Mac will get better with future versions (I think I read they are working on this). The libraries used (faster_whisper, llama and Coqui TTS) all rely on torch currently; I'm not sure whether switching to a completely different model inference provider like tinygrad or TensorFlow is an option for the future.

Maybe Coqui will optimize their synthesis further; they improved a lot in the past. Two months ago I was not able to get realtime speed on my environment (RTX 2080, AMD Ryzen, 32 GB DDR4). I hope it will get better soon.

@KoljaB
Owner

KoljaB commented Nov 21, 2023

I've just realized that Coqui also offers an excellent XTTS streaming server project. Their approach differs somewhat from mine. Therefore, if you can get their implementation working and experience similar stuttering issues, it likely indicates that the problem is beyond my control. However, if their version runs smoothly, it suggests that there might be a specific issue with my RealtimeTTS implementation for Mac, which I can then focus on resolving.

@stevenbaert
Author

Can you tell me where I can find this coqui_engine.py?

@stevenbaert
Author

Another option would be to just use the default Mac voices(?) By the way, I plan to move to a MacBook M3; any idea if that is supported?

@KoljaB
Owner

KoljaB commented Nov 22, 2023

coqui_engine.py is in your site-packages installation folder; you can locate it with "pip show realtimetts". You might need to install with "pip install -e realtimetts" to be able to edit it; that is how it works on Windows, I'm not sure how it is on Mac.
I'll update the package within the next two days with a new coqui_engine.py that adds new constructor parameters together with Metal shader support for Mac:

    def __init__(self, 
                 model_name = "tts_models/multilingual/multi-dataset/xtts_v2",
                 cloning_reference_wav: str = "female.wav",
                 language = "en",
                 speed = 1.0,
                 thread_count = 6,       # <-
                 stream_chunk_size = 20, # <- these will allow for better customization for slower machines
                 full_sentences = False, # <-
                 level=logging.WARNING
                 ):
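For illustration, a minimal usage sketch with these parameters (TextToAudioStream is the regular RealtimeTTS entry point; the concrete values are just starting points for experimentation, not tuned settings):

    from RealtimeTTS import TextToAudioStream, CoquiEngine

    engine = CoquiEngine(
        thread_count=6,         # more worker threads can help on slower CPUs
        stream_chunk_size=200,  # bigger chunks: fewer, longer synthesis steps
        full_sentences=True,    # less stuttering, but higher first-chunk latency
    )
    stream = TextToAudioStream(engine)
    stream.feed("This is how voice number 3 sounds like.")
    stream.play()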

I would not put too much hope into this though; local neural TTS is still very demanding, and these patches are basically me grasping at every straw to make the experience better.

@KoljaB
Owner

KoljaB commented Nov 22, 2023

New version released now with way more options in the Coqui engine constructor (please upgrade with "pip install --upgrade RealtimeTTS").

@stevenbaert
Author

Upgrading RealtimeTTS broke it. When I run it now, I get an error:

This is how voice number 3 sounds like
/opt/homebrew/lib/python3.11/site-packages/TTS/tts/layers/xtts/stream_generator.py:138: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
warnings.warn(
CoquiEngine: General synthesis error: The operator 'aten::upsample_linear1d.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS. occured trying to synthesize text This is how voice number 3 sounds like
Traceback: Traceback (most recent call last):
File "/opt/homebrew/lib/python3.11/site-packages/RealtimeTTS/engines/coqui_engine.py", line 230, in _synthesize_worker
for i, chunk in enumerate(chunks):
File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/TTS/tts/models/xtts.py", line 678, in inference_stream
wav_gen = self.hifigan_decoder(gpt_latents, g=speaker_embedding.to(self.device))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/TTS/tts/layers/xtts/hifigan_decoder.py", line 688, in forward
z = torch.nn.functional.interpolate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/functional.py", line 4006, in interpolate
return torch._C._nn.upsample_linear1d(input, output_size, align_corners, scale_factors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: The operator 'aten::upsample_linear1d.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

Error: The operator 'aten::upsample_linear1d.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
Exception in thread Thread-4 (synthesize_worker):
Traceback (most recent call last):
File "/opt/homebrew/Cellar/[email protected]/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/opt/homebrew/Cellar/[email protected]/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/opt/

@KoljaB
Owner

KoljaB commented Nov 22, 2023

OK, I made another new version with an optional use_mps parameter for the Coqui engine, so you can at least use it somehow.

Maybe setting the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 together with MPS also works... but I read in the torch forum that it does not work for every missing op. As far as I know they are working on Metal shaders and adding ops with every new release, so I guess in the near future we will get MPS-accelerated TTS for Mac.
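If you want to set it from Python instead of the shell, a minimal sketch (the flag only has an effect when it is set before torch is imported):

    import os

    # must happen before the first torch import, otherwise torch ignores it
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

    import torch
    print(torch.backends.mps.is_available())  # sanity check on Apple Silicon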

@stevenbaert
Author

Thanks, updating RealtimeTTS did work, but the stuttering is still there, also with the ...FALLBACK variable set to 1.
Note that I also get these warnings now:

John: Hi there.
<<< Lina: Nice to meet you too. What brings you to this bar tonight? Asking for a friend who's also curious about the poker player sitting next to us. 😉
WARNING:root:Error fixing sentence end punctuation: string index out of range, Text: ""
RealTimeSTT: root - WARNING - Error fixing sentence end punctuation: string index out of range, Text: ""

@KoljaB
Owner

KoljaB commented Nov 23, 2023

Thanks for reporting. RealtimeTTS 0.2.6 should fix the sentence-end bug; it was caused by the emoji at the end.

There is not much more I can do against the stuttering on Mac right now. It will stop when setting full_sentences to True, with the downside of noticeably greater latency for the first chunk of every sentence. Playing with thread_count and stream_chunk_size might improve it, but I doubt it will resolve it. We need full PyTorch GPU support for Mac; the dependent libraries (faster_whisper, llama and Coqui TTS) all use torch, so switching to tinygrad, TensorFlow etc. is currently not an option.

I guess we have to wait until either Coqui squeezes out better synthesis performance (they did so a lot in the past; some weeks ago I wasn't able to get realtime speed on my 2080) or torch implements the aten::upsample_linear1d.out op for Metal shaders, at which point GPU support for Mac might be sufficient for a realtime factor < 1.

@stevenbaert
Author

Thanks for your feedback! I would like to understand better what you are doing here. Is there a comprehensive overview that is not too technical? I'm especially interested in the speech-to-text engine and how well it handles input (also in languages like Dutch or French?). For the output, ideally this great TTS, but for now, why not use the default Mac voices (they are OK too)?

@KoljaB
Owner

KoljaB commented Nov 23, 2023

I combine other people's work by somehow gluing their libraries together 😉

For STT it's basically the combination of a fast transcription library (faster_whisper) with two good voice activity detection libraries (Silero VAD and WebRTC VAD). This allows spoken sentences to be detected quite well, which in turn makes it possible to handle large amounts of streamed audio.

For TTS it's mostly the preparation of the input texts to get sentence fragments that the engines can synthesize well (reading LLM output streams until such a fragment is found, or splitting longer fed-in texts into those fragments).

The STT should work with Dutch. Set the language to "nl" (I guess?) and maybe switch to a larger model if the word error rate is too high.
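For example, roughly like this (a sketch; model and language are regular AudioToTextRecorder parameters, the concrete values are guesses you'd have to validate):

    from RealtimeSTT import AudioToTextRecorder

    recorder = AudioToTextRecorder(
        model="large-v2",  # a larger faster_whisper model lowers the word error rate
        language="nl",     # Dutch; use "fr" for French
    )

    # blocks until a spoken sentence was detected, then returns the transcript
    print(recorder.text())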

Regarding default Mac voices: I use pyttsx3 for the SystemEngine, which is supposed to offer the native voices. I heard from another user, though, that SystemEngine also caused issues on Mac. I do not know yet whether they appear on every Mac and whether they can be solved.
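If you want to try them anyway, a minimal sketch (no model download needed; SystemEngine just wraps the OS voices through pyttsx3):

    from RealtimeTTS import TextToAudioStream, SystemEngine

    engine = SystemEngine()  # native voices via pyttsx3 (NSSpeechSynthesizer on macOS)
    stream = TextToAudioStream(engine)
    stream.feed("Testing the native system voice.")
    stream.play()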

@stevenbaert
Author

An update: I have my MacBook Pro M3 36GB now and tried the default install. The stuttering is less, but still there. Any suggestions to improve things?

@stevenbaert
Author

Hi again, I did a reinstall using git pull (on the MacBook Pro M3).

It seems to go better, but not yet fully without stuttering. Note that I haven't installed CUDA, since the link you give in the install instructions points to Windows or Linux only. That impacts performance, right?

Your input would be highly appreciated (anything that could improve the setup on a MacBook Pro M3), since I really love this fully local, open-source voice chatting. Once I get it working without stuttering, I would love to add tools like web browsing.

Cuda not available
llama_cpp_lib: return llama_cpp
Initializing LLM llama.cpp model ...
llama.cpp model initialized
Initializing TTS CoquiEngine ...

Using model: xtts
Initializing STT AudioToTextRecorder ...
objc[56742]: Class AVFFrameReceiver is implemented in both /opt/homebrew/lib/python3.11/site-packages/av/.dylibs/libavdevice.59.7.100.dylib (0x2ae4b0778) and /opt/homebrew/Cellar/ffmpeg/6.1.1_3/lib/libavdevice.60.3.100.dylib (0x280e60370). One of the two will be used. Which one is undefined.
objc[56742]: Class AVFAudioReceiver is implemented in both /opt/homebrew/lib/python3.11/site-packages/av/.dylibs/libavdevice.59.7.100.dylib (0x2ae4b07c8) and /opt/homebrew/Cellar/ffmpeg/6.1.1_3/lib/libavdevice.60.3.100.dylib (0x280e603c0). One of the two will be used. Which one is undefined.
[2024-02-10 12:12:44.419] [ctranslate2] [thread 1113247] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.

@stevenbaert
Author

Note that my GPU is not used at all while a conversation is running:
[Screenshot 2024-02-10 at 12:21:36]

@KoljaB
Owner

KoljaB commented Feb 10, 2024

The CoquiEngine also supports a parameter named use_deepspeed. If you can get DeepSpeed installed on Mac, this is supposed to accelerate the synthesis (although I do not know if it works on CPU too). Otherwise I think only the thread_count parameter has the potential to speed things up.
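In code that would look roughly like this (a sketch; both parameters exist on CoquiEngine, but the values are untested guesses for Apple Silicon):

    from RealtimeTTS import CoquiEngine

    engine = CoquiEngine(
        use_deepspeed=True,  # only takes effect with a working DeepSpeed install
        thread_count=8,      # try matching the number of performance cores
    )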

I'd love to help more but I'm quite lost here too.

Maybe the community has some ideas about how to get Coqui TTS faster on CPU only? Or, even better, a way to get Coqui TTS working on the GPU on a Mac?

@scalar27

I read that DeepSpeed only works with CUDA (NVIDIA), and thus not on a Mac.
