# Digital Human Conversation Demo

A real-time, voice-interactive digital human supporting both an end-to-end voice solution (GLM-4-Voice-THG) and a cascaded solution (ASR-LLM-TTS-THG). The avatar and voice are customizable, voice cloning is supported, and first-packet latency can be as low as 3 seconds.

Online Demo: https://www.modelscope.cn/studios/AI-ModelScope/video_chat

For a detailed technical introduction, please refer to this article.

简体中文 | English

## TODO

- Add voice cloning functionality to the TTS module
- Add edge-tts support to the TTS module
- Add local inference for the Qwen LLM module
- Support GLM-4-Voice and provide both ASR-LLM-TTS-THG and MLLM-THG generation methods
- Integrate vLLM for inference acceleration with GLM-4-Voice
- Integrate gradio-webrtc (pending support for audio-video synchronization) to improve video stream stability

## Technical Stack

## Local Deployment

### 0. GPU Memory Requirements

- Cascaded solution (ASR-LLM-TTS-THG): ~8 GB of GPU memory, first-packet latency ~3 s (single A100).
- End-to-end voice solution (MLLM-THG): ~20 GB of GPU memory, first-packet latency ~7 s (single A100).

Developers who do not need the end-to-end MLLM solution can use the cascade_only branch:

```bash
$ git checkout cascade_only
```

### 1. Environment Setup

- Ubuntu 22.04
- Python 3.10
- CUDA 12.2
- PyTorch 2.3.0

```bash
$ git lfs install
$ git clone https://www.modelscope.cn/studios/AI-ModelScope/video_chat.git
$ conda create -n metahuman python=3.10
$ conda activate metahuman
$ cd video_chat
$ pip install -r requirement.txt
```
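After installation, a quick sanity check (a minimal sketch, assuming only the dependencies listed above) confirms that PyTorch sees the GPU:

```python
# check_env.py - verify the Python / PyTorch / CUDA setup
import sys
import torch

print("Python:", sys.version.split()[0])    # expect 3.10.x
print("PyTorch:", torch.__version__)        # expect 2.3.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```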

### 2. Weight Downloads

#### 2.1 Via Creative Space (Recommended)

The Creative Space repository tracks the weight files with git lfs.

If you cloned the repo via `git clone https://www.modelscope.cn/studios/AI-ModelScope/video_chat.git`, no additional setup is needed.

#### 2.2 Manual Downloads

##### 2.2.1 MuseTalk

Refer to this link.
Directory structure:

```
./weights/
├── dwpose
│   └── dw-ll_ucoco_384.pth
├── face-parse-bisent
│   ├── 79999_iter.pth
│   └── resnet18-5c106cde.pth
├── musetalk
│   ├── musetalk.json
│   └── pytorch_model.bin
├── sd-vae-ft-mse
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── whisper
    └── tiny.pt
```
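Before starting the service, a small helper (hypothetical, not part of the repo) can verify that the tree above is complete:

```python
# check_weights.py - hypothetical helper that checks the MuseTalk weight layout above
from pathlib import Path

EXPECTED = [
    "dwpose/dw-ll_ucoco_384.pth",
    "face-parse-bisent/79999_iter.pth",
    "face-parse-bisent/resnet18-5c106cde.pth",
    "musetalk/musetalk.json",
    "musetalk/pytorch_model.bin",
    "sd-vae-ft-mse/config.json",
    "sd-vae-ft-mse/diffusion_pytorch_model.bin",
    "whisper/tiny.pt",
]

missing = [p for p in EXPECTED if not (Path("./weights") / p).is_file()]
print("All weights present." if not missing else f"Missing: {missing}")
```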
##### 2.2.2 GPT-SoVITS

Refer to this link.

##### 2.2.3 GLM-4-Voice

Add the following code to app.py to download the weights:

```python
from modelscope import snapshot_download

# Download the GLM-4-Voice tokenizer, decoder, and LLM weights into ./weights
snapshot_download('ZhipuAI/glm-4-voice-tokenizer', cache_dir='./weights')
snapshot_download('ZhipuAI/glm-4-voice-decoder', cache_dir='./weights')
snapshot_download('ZhipuAI/glm-4-voice-9b', cache_dir='./weights')
```

### 3. Additional Configuration

The LLM and TTS modules offer multiple inference options:

#### 3.1 Using an API Key (Default)

For the LLM and TTS modules, if local machine performance is limited, you can use Alibaba Cloud's Qwen API and CosyVoice API instead. Configure the API key in `app.py` (line 14):

```python
os.environ["DASHSCOPE_API_KEY"] = "INPUT YOUR API-KEY HERE"
```
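To confirm the key works before launching the full app, a minimal standalone test against the Qwen API can look like the following (a sketch; `qwen-turbo` is an assumed model name, any DashScope chat model works):

```python
# test_dashscope.py - sketch: verify the DashScope API key with one Qwen call
import dashscope  # reads DASHSCOPE_API_KEY from the environment

resp = dashscope.Generation.call(
    model="qwen-turbo",  # assumption: substitute any DashScope chat model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    result_format="message",
)
print(resp.output.choices[0].message.content)
```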

#### 3.2 Without an API Key

If you are not using an API key, update the relevant code as follows:

##### 3.2.1 LLM Module

The `src/llm.py` file provides the `Qwen` and `Qwen_API` classes for local inference and API calls respectively. Two options for local inference:

1. Use `Qwen` for local inference.
2. Use vLLM for accelerated inference, connecting via `Qwen_API(api_key="EMPTY", base_url="http://localhost:8000/v1")` (a client-side sketch follows below). To install vLLM:

   ```bash
   $ git clone https://github.com/vllm-project/vllm.git
   $ cd vllm
   $ python use_existing_torch.py
   $ pip install -r requirements-build.txt
   $ pip install -e . --no-build-isolation
   ```

Refer to this guide for deployment.
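Once the vLLM server is up, `Qwen_API` talks to it through its OpenAI-compatible endpoint. A standalone equivalent (a sketch using the `openai` client; the served model name is an assumption) looks like:

```python
# query_vllm.py - sketch: query a local vLLM OpenAI-compatible server
# assumes the server was started on port 8000, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-7B-Instruct
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="Qwen/Qwen2-7B-Instruct",  # assumption: whatever model you actually served
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```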

##### 3.2.2 TTS Module

The `src/tts.py` file provides the `GPT_SoVits_TTS` and `CosyVoice_API` classes for local inference and API calls respectively. `Edge_TTS` is also available as a free TTS service.
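For reference, a minimal standalone Edge-TTS call looks like the following (a sketch; the voice name is an assumption, any voice listed by `edge-tts --list-voices` works):

```python
# tts_demo.py - sketch: free TTS via the edge-tts package
import asyncio
import edge_tts

async def main():
    tts = edge_tts.Communicate("Hello, digital human!", voice="en-US-AriaNeural")
    await tts.save("hello.mp3")  # writes the synthesized speech to an mp3 file

asyncio.run(main())
```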

### 4. Start the Service

```bash
$ python app.py
```

### 5. Customize Digital Humans (Optional)

#### 5.1 Custom Avatar

1. Place the recorded avatar video in `/data/video/`.
2. Add an `(avatar_name, bbox_shift)` entry to `avatar_list` in the `Muse_Talk` class in `/src/thg.py` (see the sketch after this list). Refer to this link for details on `bbox_shift`.
3. Add the avatar name to the `avatar_name` options of the Gradio interface in `app.py`, restart the service, and wait for initialization to complete.
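Step 2 amounts to appending one tuple, roughly like this (illustrative only; the entries and `bbox_shift` values depend on your videos):

```python
# src/thg.py (paraphrased excerpt) - illustrative only
avatar_list = [
    ("existing_avatar", 5),  # (avatar_name, bbox_shift), one entry per avatar
    ("my_avatar", 0),        # hypothetical new avatar: /data/video/my_avatar.mp4
]
```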

#### 5.2 Custom Voice

GPT-SoVits supports custom voice cloning. To add a voice permanently:

1. Add a reference audio clip (3-10 s, named `x.wav`; a duration check is sketched below) to `/data/audio/`.
2. Add the voice name (format: `x (GPT-SoVits)`) to the `avatar_voice` options of the Gradio interface in `app.py` and restart the service.
3. Set the TTS option to GPT-SoVits and start interacting.
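Since the reference clip must be 3-10 s, a quick duration check (hypothetical helper, standard library only) before copying it in:

```python
# check_ref_audio.py - hypothetical helper: verify a reference clip is 3-10 s long
import wave

with wave.open("data/audio/x.wav", "rb") as f:
    seconds = f.getnframes() / f.getframerate()

print(f"{seconds:.1f} s:", "OK" if 3 <= seconds <= 10 else "outside the 3-10 s range")
```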

### 6. Known Issues

1. Missing resources: download the missing resources as indicated by the error messages.
2. Video stream stuttering during playback: awaiting Gradio's optimizations for video streaming.
3. Model loading failures: check that all weights were downloaded completely.