A real-time voice-interactive digital human that supports both an end-to-end voice solution (GLM-4-Voice-THG) and a cascaded solution (ASR-LLM-TTS-THG). Customizable avatar and voice, with voice cloning and first-packet latency as low as 3 seconds.
Online Demo: https://www.modelscope.cn/studios/AI-ModelScope/video_chat
For detailed technical introduction, please refer to this article.
- Add voice cloning functionality to the TTS module
- Add `edge-tts` support to the TTS module
- Add local inference for the Qwen LLM module
- Support GLM-4-Voice and provide both ASR-LLM-TTS-THG and MLLM-THG generation methods
- Integrate vLLM for inference acceleration with GLM-4-Voice
- Integrate gradio-webrtc (pending support for audio-video synchronization) to improve video stream stability
- ASR (Automatic Speech Recognition): FunASR
- LLM (Large Language Model): Qwen
- End-to-end MLLM (Multimodal Large Language Model): GLM-4-Voice
- TTS (Text to Speech): GPT-SoVITS, CosyVoice, edge-tts
- THG (Talking Head Generation): MuseTalk
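Conceptually, the cascaded solution chains these modules in sequence. A minimal sketch of the data flow (the objects and method names are illustrative assumptions, not the repo's actual API):

```python
# Illustrative data flow of the cascaded solution; asr/llm/tts/thg stand in
# for the FunASR, Qwen, TTS, and MuseTalk modules respectively.
def run_cascade(user_audio: bytes, asr, llm, tts, thg, avatar: str):
    text = asr.transcribe(user_audio)   # ASR: speech -> text
    reply = llm.chat(text)              # LLM: text -> response text
    wav = tts.synthesize(reply)         # TTS: response text -> speech
    return thg.generate(avatar, wav)    # THG: speech + avatar -> talking-head video
```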
- Cascaded solution (ASR-LLM-TTS-THG): ~8 GB GPU memory, first-packet latency ~3 s (single A100 GPU).
- End-to-end voice solution (MLLM-THG): ~20 GB GPU memory, first-packet latency ~7 s (single A100 GPU).
Developers who do not require the end-to-end MLLM solution can use the `cascade_only` branch:

```bash
$ git checkout cascade_only
```
- Ubuntu 22.04
- Python 3.10
- CUDA 12.2
- PyTorch 2.3.0
```bash
$ git lfs install
$ git clone https://www.modelscope.cn/studios/AI-ModelScope/video_chat.git
```
```bash
$ conda create -n metahuman python=3.10
$ conda activate metahuman
$ cd video_chat
$ pip install -r requirement.txt
```
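Optionally, a quick sanity check that PyTorch and CUDA are set up as expected (version numbers per the environment list above):

```python
import torch

print(torch.__version__)           # expect 2.3.0
print(torch.cuda.is_available())   # expect True on a CUDA 12.2 machine
```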
The Creative Space repository tracks the weight files with `git lfs`. If you cloned via `git clone https://www.modelscope.cn/studios/AI-ModelScope/video_chat.git`, no additional setup is needed.
Refer to this link.
Directory structure:
```
./weights/
├── dwpose
│   └── dw-ll_ucoco_384.pth
├── face-parse-bisent
│   ├── 79999_iter.pth
│   └── resnet18-5c106cde.pth
├── musetalk
│   ├── musetalk.json
│   └── pytorch_model.bin
├── sd-vae-ft-mse
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── whisper
    └── tiny.pt
```
Refer to this link.
Add the following code in `app.py` to download the weights:

```python
from modelscope import snapshot_download

snapshot_download('ZhipuAI/glm-4-voice-tokenizer', cache_dir='./weights')
snapshot_download('ZhipuAI/glm-4-voice-decoder', cache_dir='./weights')
snapshot_download('ZhipuAI/glm-4-voice-9b', cache_dir='./weights')
```
The LLM and TTS modules offer multiple inference options. If local machine performance is limited, you can use Alibaba Cloud's Qwen API and CosyVoice API instead. Configure the API key in `app.py` (line 14):

```python
os.environ["DASHSCOPE_API_KEY"] = "INPUT YOUR API-KEY HERE"
```
If you are not using an API key, update the relevant code as follows:
The `src/llm.py` file provides the `Qwen` and `Qwen_API` classes for local inference and API calls respectively. Options for local inference:

- Use `Qwen` for local inference.
- Use `vLLM` for accelerated inference with `Qwen_API(api_key="EMPTY", base_url="http://localhost:8000/v1")` (a usage sketch follows below). Installation:

```bash
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python use_existing_torch.py
$ pip install -r requirements-build.txt
$ pip install -e . --no-build-isolation
```
Refer to this guide for deployment.
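Once the server is deployed, a minimal sketch of pointing the LLM module at the local vLLM endpoint (the serving command and model name below are assumptions; the `Qwen_API` arguments are the ones shown above):

```python
# Assumes an OpenAI-compatible vLLM server is already running, e.g.:
#   $ python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-7B-Instruct
# (the model name here is an assumption; use your deployed checkpoint)
from src.llm import Qwen_API

# api_key="EMPTY" plus a local base_url routes requests to the local vLLM
# server instead of Alibaba Cloud DashScope.
llm = Qwen_API(api_key="EMPTY", base_url="http://localhost:8000/v1")
```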
The `src/tts.py` file provides `GPT_SoVits_TTS` and `CosyVoice_API` for local inference and API calls respectively. Use `Edge_TTS` for a free TTS service.
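For reference, a minimal standalone example of the underlying `edge-tts` library (the `Edge_TTS` wrapper in `src/tts.py` may expose a different interface):

```python
import asyncio
import edge_tts

async def main():
    # "en-US-AriaNeural" is one of edge-tts's built-in voices.
    communicate = edge_tts.Communicate("Hello, nice to meet you!", "en-US-AriaNeural")
    await communicate.save("demo.wav")

asyncio.run(main())
```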
```bash
$ python app.py
```
- Add the recorded avatar video to `/data/video/`.
- Modify `avatar_list` in the `Muse_Talk` class in `/src/thg.py` to include `(avatar_name, bbox_shift)` (see the sketch after this list). Refer to this link for details on `bbox_shift`.
- Add the avatar name to the `avatar_name` options in Gradio in `app.py`, restart the service, and wait for initialization to complete.
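A sketch of the corresponding `avatar_list` entry (the avatar name and `bbox_shift` value are placeholders, not real defaults):

```python
# In the Muse_Talk class in /src/thg.py; illustrative entry only.
avatar_list = [
    ("my_avatar", 5),  # (avatar_name, bbox_shift); tune bbox_shift per the linked guide
]
```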
`GPT-SoVits` supports custom voice cloning. To add a voice permanently:

- Add a reference audio clip (3-10 s, named `x.wav`) to `/data/audio/` (see the example after this list).
- Add the voice name (format: `x (GPT-SoVits)`) to the `avatar_voice` options in Gradio in `app.py` and restart the service.
- Set the TTS option to `GPT-SoVits` and start interacting.
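For example, to register a cloned voice named `alice` (the name is a placeholder):

```bash
$ cp alice.wav data/audio/
# then add "alice (GPT-SoVits)" to the avatar_voice choices in app.py
```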
- Missing resources: download the missing files indicated by the error messages.
- Stuttering video playback: this depends on Gradio's video-streaming performance; await upstream optimizations.
- Model loading issues: check that the weight files were downloaded completely.