SpeakSense ASR Server

English | 中文

A high-performance ASR (Automatic Speech Recognition) server implementation using Whisper, supporting both gRPC and REST APIs.

Overview

This project provides a server implementation for speech-to-text transcription using OpenAI's Whisper model, optimized for different platforms and hardware acceleration options.

Features

  • gRPC Server
    • Streaming Transcription
  • Web API
    • Task Management
    • Task Status
    • Create Task by URL
    • Create Task by Local File
    • API Key Management (Authentication)
  • Scheduled Tasks
    • Audio File Download
    • Transcription
    • HTTP Callback
  • Authentication
  • Multiple Platform Support
    • MacOS (Metal)
    • Linux (CUDA)
    • Windows (CUDA)

Quick Start

Prerequisites

  • Rust toolchain (1.70 or later)
  • For CUDA support: CUDA toolkit 11.x or later
  • For Metal support (MacOS): XCode and Metal SDK
  • etcd server, running locally or otherwise accessible (optional; only needed for go-micro microservice integration)

Installation

  1. Clone the repository:

git clone https://github.com/bean-du/SpeakSense
cd SpeakSense

  2. Download the Whisper model:

./script/download-ggml-model.sh

  3. Build the project:
# Standard build
cargo build --release

# With CUDA support
cargo build --release --features cuda

# With Metal support (MacOS)
cargo build --release --features metal

Environment Variables

  • ASR_SQLITE_PATH: SQLite database path (default: sqlite://./asr_data/database/storage.db?mode=rwc)
  • ASR_AUDIO_PATH: audio storage path (default: ./asr_data/audio/)
  • ETCD_DEFAULT_ENDPOINT: etcd endpoint (default: http://localhost:2379)
  • ASR_MODEL_PATH: Whisper model path (default: ./models/ggml-large-v3.bin)
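For example, you can override these variables before launching the server. The values shown below are simply the documented defaults:

```shell
# Set SpeakSense configuration via environment variables (documented defaults shown).
export ASR_SQLITE_PATH="sqlite://./asr_data/database/storage.db?mode=rwc"
export ASR_AUDIO_PATH="./asr_data/audio/"
export ETCD_DEFAULT_ENDPOINT="http://localhost:2379"
export ASR_MODEL_PATH="./models/ggml-large-v3.bin"

# Confirm the values the server will see at startup.
echo "model: ${ASR_MODEL_PATH}"
echo "audio: ${ASR_AUDIO_PATH}"
```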

Running the Server

Standard Run (CPU)

cargo run --release

Run with CUDA Support

cargo run --release --features cuda

Run with Metal Support (MacOS)

First, set the Metal resources path:

export GGML_METAL_PATH_RESOURCES="./resources"
cargo run --release --features metal

Docker Compose Quick Start

Note: the Docker image currently supports Linux with CUDA on x86_64 only. The easiest way to get started is using Docker Compose:

  1. Create required directories:

mkdir -p models asr_data/audio asr_data/database

  2. Download the Whisper model:

./script/download-ggml-model.sh

  3. Start the server:

# Standard version
docker-compose up -d

# With CUDA support
ASR_FEATURES=cuda docker-compose up -d

# With Metal support (MacOS)
ASR_FEATURES=metal docker-compose up -d

  4. Check the logs:

docker-compose logs -f

  5. Stop the server:

docker-compose down

The server will be available at:

  • REST API: http://localhost:7200
  • gRPC: http://127.0.0.1:7300

Docker Compose Configuration

The default configuration includes:

  • Automatic volume mapping for models and data persistence
  • GPU support (when using CUDA feature)
  • Optional etcd service
  • Environment variable configuration

You can customize the configuration by:

  1. Modifying environment variables in docker-compose.yml
  2. Adding or removing services as needed
  3. Adjusting resource limits and port mappings
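As an illustration of these customizations, an environment and volume override might look like the hypothetical fragment below. The service name `speaksense` and the container-side paths are assumptions; match them to the actual docker-compose.yml (the ports are the ones used in the usage examples):

```yaml
# Hypothetical docker-compose.yml fragment; adjust to the real service definition.
services:
  speaksense:
    environment:
      - ASR_MODEL_PATH=/app/models/ggml-large-v3.bin
      - ASR_SQLITE_PATH=sqlite:///app/asr_data/database/storage.db?mode=rwc
    volumes:
      - ./models:/app/models          # model persistence
      - ./asr_data:/app/asr_data      # audio and database persistence
    ports:
      - "7200:7200"   # REST API
      - "7300:7300"   # gRPC
```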

Usage Examples

gRPC Client Test

# Use local wav file
cargo run --example asr_client -- -i 2.wav

# Specify server address
cargo run --example asr_client -- -i test/2.wav -s http://127.0.0.1:7300

# Specify device id
cargo run --example asr_client -- -i input.wav -d test-device

REST API Examples

Create Transcription Task

curl -X POST http://localhost:7200/api/v1/asr/tasks \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"audio_url": "https://example.com/audio.wav"}'

Check Task Status

curl http://localhost:7200/api/v1/asr/tasks/{task_id} \
  -H "Authorization: Bearer your-api-key"

Configuration

Model Selection

The server supports various Whisper model sizes. You can download different models from Hugging Face: https://huggingface.co/ggerganov/whisper.cpp/tree/main
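For instance, a smaller model such as `ggml-base.en` can be more practical on CPU-only machines. A minimal sketch, assuming the standard whisper.cpp model naming on Hugging Face (the download line is left commented out because the files are large):

```shell
# Hypothetical example: select a smaller English-only model for CPU-only use.
MODEL="base.en"
MODEL_FILE="ggml-${MODEL}.bin"
MODEL_URL="https://huggingface.co/ggerganov/whisper.cpp/resolve/main/${MODEL_FILE}"

# Uncomment to download (several hundred MB):
# curl -L -o "models/${MODEL_FILE}" "${MODEL_URL}"

# Point the server at the chosen model.
export ASR_MODEL_PATH="./models/${MODEL_FILE}"
echo "${ASR_MODEL_PATH}"
```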

Performance Tuning

  • For CUDA: Adjust batch size and worker threads based on your GPU memory
  • For Metal: Ensure proper resource path configuration

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Acknowledgments

  • OpenAI Whisper
  • whisper.cpp
  • whisper-rs
