
Coqui Engine takes breaks mid-sentence to load. #7

Open
tomwarias opened this issue Feb 25, 2024 · 6 comments

Comments

@tomwarias

Coqui Engine takes breaks mid-sentence to load. It sometimes pauses between words, or even in the middle of saying a word. I tried adjusting the settings, but nothing works. My machine has an i7 10th gen CPU and an RTX 3060.

@KoljaB
Owner

KoljaB commented Feb 25, 2024

Your GPU should be fast enough for realtime. Is PyTorch installed with CUDA?

@tomwarias
Author

Yes, I followed every step of the readme. I may have a problem with CUDA, because my GPU isn't used by the LLM model either, but I don't know how to solve it. I use Windows.

@KoljaB
Owner

KoljaB commented Feb 26, 2024

I guess PyTorch has no CUDA support. Please check with:

import torch
print(torch.cuda.is_available())

If not available, please try to install the latest torch build with CUDA support:

pip install torch==2.2.0+cu118 torchaudio==2.2.0+cu118 --index-url https://download.pytorch.org/whl/cu118

(you may need to adjust 118 to your CUDA version; this is for CUDA 11.8)
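
To verify the new install (a quick sanity check; the device name will differ on other machines):

import torch
print(torch.__version__)              # should end in "+cu118"
print(torch.cuda.is_available())      # should now print True
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 3060"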

To use the GPU for the LLM under Windows, you need to compile llama-cpp-python for CUBLAS:

  • Set environment variables:
    set CMAKE_ARGS=-DLLAMA_CUBLAS=on
    set FORCE_CMAKE=1
  • It may also be necessary to copy all four MSBuildExtensions files for your CUDA version (11.8 or 12.3) from:
    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\extras\visual_studio_integration\MSBuildExtensions

    to
    C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations

After that, install and compile llama-cpp-python with:

pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose

After that you can set n_gpu_layers in the creation parameters of llama.cpp to define how many layers of the LLM's neural network should be offloaded to the GPU.
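
For example (a minimal sketch; the model path is a placeholder you'd point at your own GGUF file):

from llama_cpp import Llama

llm = Llama(
    model_path="models/your-model.gguf",  # placeholder path
    n_gpu_layers=20,   # layers offloaded to the GPU; -1 offloads all of them
    verbose=True,      # the startup log shows whether layers actually land on CUDA
)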

@tomwarias
Author

I did it and it still does that. I am also unable to download llama_cpp with set CMAKE_ARGS=-DLLAMA_CUBLAS=on.

@KoljaB
Owner

KoljaB commented Feb 27, 2024

What's the result of print(torch.cuda.is_available())? Both torch and llama.cpp have to run with CUDA (GPU support) to achieve realtime speed.

The above installation procedure for llama.cpp works on my Windows 10 system; if it fails on yours, I'm not sure how I can offer further support. llama.cpp is not my library, and this can be a complex issue.

@KoljaB
Owner

KoljaB commented Sep 24, 2024

Hello Tom,

could you please try (on Python 3.10 - I used 3.10.9):

pip install torch==2.1.2+cu121 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install https://github.com/daswer123/deepspeed-windows-wheels/releases/download/11.2/deepspeed-0.11.2+cuda121-cp310-cp310-win_amd64.whl
pip install https://github.com/oobabooga/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu122torch2.1.2cxx11abiFALSE-cp310-cp310-win_amd64.whl
pip install transformers==4.38.2
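
To confirm the stack imports cleanly (a hedged sanity check; the exact version strings may vary slightly by build):

import torch, deepspeed, flash_attn, transformers
print(torch.__version__)         # expect 2.1.2+cu121
print(deepspeed.__version__)     # expect 0.11.2
print(flash_attn.__version__)    # expect 2.5.6
print(transformers.__version__)  # expect 4.38.2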

And then try these wheels for llama.cpp:

# llama-cpp-python (CPU only, AVX2)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.89+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.89+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.89+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.89+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"

# llama-cpp-python (CUDA, no tensor cores)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.89+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.89+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.89+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.89+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"

# llama-cpp-python (CUDA, tensor cores)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.89+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.89+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.89+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.89+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
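
These lines use pip environment markers, so the whole block can be dropped into a requirements.txt and pip will pick the wheel matching your OS and Python version. To install a single wheel directly instead, for example on Windows with Python 3.10 and an RTX 3060 (which has tensor cores):

pip install https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.89+cu121-cp310-cp310-win_amd64.whl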

I think this should get you a running llama.cpp version with RealtimeTTS support, including CUDA, deepspeed and flash attention.

If you then change this line:

coqui_engine = CoquiEngine(cloning_reference_wav="female.wav", language="en", speed=1.0)

into this one:

coqui_engine = CoquiEngine(cloning_reference_wav="female.wav", language="en", speed=1.0, use_deepspeed=True)

you should have a very fast realtime-capable coqui engine.
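
For context, here's a minimal end-to-end sketch of how that engine plugs into a stream (assuming the TextToAudioStream API from this repo and the female.wav reference used above):

from RealtimeTTS import TextToAudioStream, CoquiEngine

coqui_engine = CoquiEngine(
    cloning_reference_wav="female.wav",
    language="en",
    speed=1.0,
    use_deepspeed=True,  # requires the deepspeed wheel installed above
)

stream = TextToAudioStream(coqui_engine)
stream.feed("This sentence should now play without pauses mid-word.")
stream.play()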

It would be great if you could give me feedback on whether that worked.
