
Error running on Mac M2 #15

Open
vitorcalvi opened this issue Jun 5, 2024 · 6 comments

Comments

@vitorcalvi

First of all, awesome repo. I've tried all possible installation combinations; all of them failed. Any suggestions? @KoljaB
Machine: Mac M2

Terminal output:

Using model: xtts
Initializing STT AudioToTextRecorder ...
[2024-06-05 15:39:29.914] [ctranslate2] [thread 1054526] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.

Select voice (1-5): 1
This is how voice number 1 sounds like
/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/site-packages/TTS/tts/layers/xtts/stream_generator.py:138: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
warnings.warn(
General synthesis error: isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of:

  • (Tensor elements, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
  • (Number element, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
  • (Tensor elements, Number test_element, *, bool assume_unique, bool invert, Tensor out)
occured trying to synthesize text This is how voice number 1 sounds like
Traceback: Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/site-packages/RealtimeTTS/engines/coqui_engine.py", line 279, in _synthesize_worker
    for i, chunk in enumerate(chunks):
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/site-packages/TTS/tts/models/xtts.py", line 643, in inference_stream
    gpt_generator = self.gpt.get_generator(
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/site-packages/TTS/tts/layers/xtts/gpt.py", line 603, in get_generator
    return self.gpt_inference.generate_stream(
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/site-packages/TTS/tts/layers/xtts/stream_generator.py", line 186, in generate
    model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/site-packages/transformers/generation/utils.py", line 473, in _prepare_attention_mask_for_generation
    torch.isin(elements=inputs, test_elements=pad_token_id).any()
TypeError: isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of:
  • (Tensor elements, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
  • (Number element, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
  • (Tensor elements, Number test_element, *, bool assume_unique, bool invert, Tensor out)

Error: isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of:

  • (Tensor elements, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
  • (Number element, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
  • (Tensor elements, Number test_element, *, bool assume_unique, bool invert, Tensor out)

Exception in thread Thread-4 (synthesize_worker):
Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/site-packages/RealtimeTTS/text_to_stream.py", line 201, in synthesize_worker
    self.engine.synthesize(sentence)
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/site-packages/RealtimeTTS/engines/coqui_engine.py", line 411, in synthesize
    status, result = self.parent_synthesize_pipe.recv()
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
Accept voice (y/n):

GPT4o output:

It appears that there are several warnings and errors related to the process of initializing the STT (Speech-to-Text) AudioToTextRecorder and selecting the voice. Here are the issues and their potential resolutions:
Compute Type Warning:

[warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
Resolution: This is a warning indicating that the model initially designed to use float16 precision has been converted to float32 because the device or backend doesn't support float16 efficiently. This is usually not a critical issue, but if you want to optimize performance, consider using hardware that supports float16 or adjust the model configuration to use float32 from the start.
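For illustration only, here is a minimal sketch of requesting float32 from the start with faster-whisper (the library that typically sits in front of the ctranslate2 backend emitting this warning); the model name and audio file below are placeholders, not values from this project:

```python
# Hedged sketch: request float32 explicitly so ctranslate2 does not have to
# convert float16 weights on a device without efficient float16 support.
from faster_whisper import WhisperModel

model = WhisperModel("tiny.en", device="cpu", compute_type="float32")  # "tiny.en" is a placeholder
segments, info = model.transcribe("sample.wav")  # "sample.wav" is a placeholder
for segment in segments:
    print(segment.text)
```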
Pretrained Model Configuration Warning:

UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
Resolution: Update your code to use a generation configuration file as suggested in the warning. This will ensure compatibility with future versions of the library.
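As a hedged sketch of what the warning asks for (the stand-in model, prompt, and parameter values are placeholders, not this project's code):

```python
# Hedged sketch: pass an explicit GenerationConfig to generate() instead of
# mutating the pretrained model's config, as the deprecation warning suggests.
# "gpt2" is only a stand-in model, not the XTTS GPT used by Coqui.
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

gen_config = GenerationConfig(max_new_tokens=20, do_sample=True, temperature=0.7)
input_ids = tokenizer("This is how voice number 1", return_tensors="pt").input_ids
output_ids = model.generate(input_ids, generation_config=gen_config)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```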
General Synthesis Error:

General synthesis error: isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of: * (Tensor elements, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out) * (Number element, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out) * (Tensor elements, Number test_element, *, bool assume_unique, bool invert, Tensor out) occured trying to synthesize text This is how voice number 1 sounds like
Resolution: This error indicates a type mismatch in the function call to isin(). Make sure that the arguments passed to isin() are of the correct type as specified in the error message. The elements should either be both Tensors or one should be a Tensor and the other a Number.
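To make the mismatch concrete, a small standalone reproduction (this only illustrates the call transformers makes internally; the actual remedy in this thread is the transformers downgrade suggested further down):

```python
# Hedged illustration of the mismatch: newer transformers passes a plain int
# as test_elements, which some torch versions reject with the TypeError above.
import torch

inputs = torch.tensor([[5, 7, 0, 0]])  # placeholder token ids
pad_token_id = 0                        # plain Python int

# torch.isin(elements=inputs, test_elements=pad_token_id)   # can raise the TypeError shown above
mask = torch.isin(elements=inputs, test_elements=torch.tensor(pad_token_id))  # Tensor/Tensor overload
print(mask.any())
```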

To proceed, you may need to:
Verify and update the model and its configuration to ensure compatibility with the current hardware and software environment.
Make sure that all function calls, particularly those involving Tensors, are using the correct types as expected by the functions.

If you need further assistance or specific code examples to resolve these issues, please provide more details about your setup and the code you're running.

@KoljaB
Owner

KoljaB commented Jun 5, 2024

This is due to the new transformers library introducing an incompatibility with Coqui TTS (see here).
Please downgrade to an older transformers version: pip install transformers==4.38.2 or upgrade RealtimeTTS to the latest version: pip install realtimetts==0.4.1
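If you want to confirm which combination actually ended up installed after running either command, a quick check (package names taken from the pip commands above; distribution-name lookup may be case-sensitive on some setups):

```python
# Quick sanity check of the installed versions after applying either fix.
from importlib.metadata import version

print("transformers:", version("transformers"))
print("RealtimeTTS:", version("realtimetts"))
```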

@vitorcalvi
Author

Thanks for the answer :)
Tested both solutions; only the older transformers version works.

Two more issues:
-- > Using model: xtts
Initializing STT AudioToTextRecorder ...
[2024-06-05 16:20:24.534] [ctranslate2] [thread 27773] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.

-- Select voice (1-5): 1
This is how voice number 1 sounds like
/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.9/site-packages/TTS/tts/layers/xtts/stream_generator.py:138: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
warnings.warn(


@KoljaB
Owner

KoljaB commented Jun 5, 2024

Thank you for the feedback. Both warnings are absolutely normal and should not lead to any issues.

@vitorcalvi
Author

@KoljaB thank you. I forgot another issue: speech cuts out every 1.5 to 2 seconds. Any suggestions?

@KoljaB
Owner

KoljaB commented Jun 5, 2024

By the way, on a Mac M2 you may want to create CoquiEngine with full_sentences=True in the constructor, because most Macs aren't fast enough for realtime synthesis with Coqui TTS (no GPU use possible).

coqui_engine = CoquiEngine(cloning_reference_wav="female.wav", language="en", speed=1.0, full_sentences=True)
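For context, a minimal sketch of how that engine would be wired into a stream (the sample text is just a placeholder; the exact reference wav and parameters depend on your setup):

```python
# Hedged sketch: feed the full-sentence CoquiEngine into a TextToAudioStream,
# so whole sentences are synthesized before playback instead of small chunks.
from RealtimeTTS import TextToAudioStream, CoquiEngine

coqui_engine = CoquiEngine(
    cloning_reference_wav="female.wav",
    language="en",
    speed=1.0,
    full_sentences=True,
)

stream = TextToAudioStream(coqui_engine)
stream.feed("This is how voice number 1 sounds like")  # placeholder text
stream.play()
```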

@vitorcalvi
Author

Works like a charm, but as you said, Macs aren't fast enough for realtime synthesis with Coqui TTS, and the machine comes under heavy load.
Mac has the MLX framework, and there's another TTS library, MeloTTS, mentioned in the repo below:
https://github.com/huwprosser/jarvis-mlx

Thanks bro, see you
