Semantic Sommeliers is an advanced audio processing system designed to handle complex tasks such as transcription, story detection, and instruction alignment within audio files. Utilizing models like Whisper and WhisperX, the system can transcribe audio, identify stories, and synchronize instructions with the session data effectively.
- Audio Transcription: Leverages OpenAI's Whisper and custom WhisperX models for accurate speech-to-text capabilities.
- Story Detection: Identifies and timestamps stories within audio sessions using semantic similarity analysis.
- Instruction Synchronization: Aligns instructional audio files with session data, using cross-correlation to find exact timings.
- Dynamic Configuration: Allows for varied audio processing settings through external configuration.
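To illustrate the cross-correlation idea behind instruction synchronization, here is a minimal pure-Python sketch. It is not the project's implementation: the function names `best_offset` and `offset_to_seconds` and the toy signals are hypothetical, and a real pipeline would more likely operate on resampled audio arrays with a library routine such as `numpy.correlate` or `scipy.signal.correlate`.

```python
def best_offset(session, instruction):
    """Return the sample offset where `instruction` best matches `session`.

    Slides the shorter signal over the longer one and scores each
    position with a dot product (unnormalized cross-correlation).
    """
    if len(instruction) > len(session):
        raise ValueError("instruction must not be longer than session")

    best_lag, best_score = 0, float("-inf")
    for lag in range(len(session) - len(instruction) + 1):
        window = session[lag:lag + len(instruction)]
        score = sum(s * i for s, i in zip(window, instruction))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def offset_to_seconds(offset, sample_rate):
    """Convert a sample offset to seconds."""
    return offset / sample_rate


# Toy example: the "instruction" pattern is embedded at sample 5.
instruction = [1.0, -2.0, 3.0, -1.0]
session = [0.0] * 5 + instruction + [0.0] * 6
lag = best_offset(session, instruction)
start_seconds = offset_to_seconds(lag, sample_rate=10)
```

Because the score here is unnormalized, loud regions of a real recording can dominate the result; normalizing each window is a common refinement.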
Before you begin, ensure you have the following installed:
- Python 3.10 or later
- Poetry Python Package Manager
- FFmpeg for audio processing
- pip
- Conda (optional)
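The prerequisites above can be checked with a short script before installing. This is a sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, and it only verifies that the executables are on `PATH`, not their versions.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10), tools=("poetry", "ffmpeg", "pip")):
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if sys.version_info < min_python:
        found = ".".join(map(str, sys.version_info[:3]))
        wanted = ".".join(map(str, min_python))
        problems.append(f"Python {wanted}+ required, found {found}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("MISSING:", problem)
```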
To set up the Semantic Sommeliers system on your local machine:
- Clone the repository:
    git clone https://github.com/your-username/semantic-sommeliers.git
    cd semantic-sommeliers
- Install Poetry by following the instructions on the Poetry website.
- Create the virtual environment with Python 3.10:
    poetry env use python3.10
- Install dependencies:
    poetry install
- Activate the virtual environment:
    source $(poetry env info --path)/bin/activate
- Install torch with GPU support:
    pip install torch==2.0.1+cu117 -f https://download.pytorch.org/whl/torch_stable.html
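After installing torch, it can be useful to confirm that the GPU build is actually usable. The snippet below is a small sketch (the function name `describe_torch_install` is hypothetical) that reports the installed version and whether CUDA is available, and that degrades gracefully when torch is missing.

```python
import importlib.util


def describe_torch_install():
    """Report whether torch is installed and whether CUDA is usable."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"

    import torch  # imported lazily so the check works without torch

    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        return f"torch {torch.__version__} with CUDA ({name})"
    return f"torch {torch.__version__} without a usable CUDA device"


print(describe_torch_install())
```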
Alternatively, to set up the environment with Conda and pip:
- Clone the repository:
    git clone https://github.com/your-repository/semantic_sommeliers.git
    cd semantic_sommeliers
- Set up the Python environment (using Conda):
    conda create -n semantic_sommeliers python=3.10
    conda activate semantic_sommeliers
- Install dependencies:
    pip install -r requirements.txt
To run individual experiments with a specific session:
python main.py --session_name [session_filename.wav] --transcript_tool [whisper|whisperx] [optional parameters]
Optional parameters and their default values from config.py are:
- --new_sample_rate: Sample rate for audio processing (default is set in config.py)
- --highcut: Highcut frequency for filtering (default is set in config.py)
- --lowcut: Lowcut frequency for filtering (default is set in config.py)
- --normalization: Enable or disable volume normalization (default is set in config.py)
- --filtering: Enable or disable filtering (default is set in config.py)
- --seconds_threshold, --story_absolute_peak_height, etc.: Other thresholds and heights as specified in config.py
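As a rough illustration of how options like these can take their defaults from a configuration module, here is an `argparse` sketch. It makes several assumptions that the README does not confirm: the `config` object is a stand-in for config.py with placeholder values, both `--session_name` and `--transcript_tool` are treated as required, and the boolean options are exposed as `--flag` / `--no-flag` pairs.

```python
import argparse
from types import SimpleNamespace

# Stand-in for config.py; these values are placeholders, not the
# project's actual defaults.
config = SimpleNamespace(
    new_sample_rate=16000,
    lowcut=300.0,
    highcut=3400.0,
    normalization=True,
    filtering=True,
)


def build_parser():
    """Build a parser whose defaults come from the config stand-in."""
    parser = argparse.ArgumentParser(description="Process one session file.")
    parser.add_argument("--session_name", required=True)
    parser.add_argument("--transcript_tool", required=True,
                        choices=["whisper", "whisperx"])
    parser.add_argument("--new_sample_rate", type=int,
                        default=config.new_sample_rate)
    parser.add_argument("--lowcut", type=float, default=config.lowcut)
    parser.add_argument("--highcut", type=float, default=config.highcut)
    # Assumption: booleans are exposed as --flag / --no-flag pairs.
    parser.add_argument("--normalization",
                        action=argparse.BooleanOptionalAction,
                        default=config.normalization)
    parser.add_argument("--filtering",
                        action=argparse.BooleanOptionalAction,
                        default=config.filtering)
    return parser


# Options that are not passed keep the value from the config stand-in.
args = build_parser().parse_args(
    ["--session_name", "session_01.wav", "--transcript_tool", "whisperx",
     "--highcut", "4000", "--no-filtering"]
)
```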
To automatically process all session files located in your data/sessions directory, run the batch_run.py script. This script reads all .wav files in the sessions directory and processes them using the default settings specified in config.py:
python batch_run.py --audio_list path/to/audio_list.txt --error_log path/to/error_log.txt
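For readers curious how such a batch wrapper might work, the following sketch runs main.py once per listed session and records failures in an error log. It is an illustration, not the contents of the actual script: the helper names are hypothetical, and it assumes the audio list contains one session filename per line.

```python
import subprocess
import sys
from pathlib import Path


def read_audio_list(list_path):
    """Return the non-empty lines of the audio list, stripped."""
    lines = Path(list_path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_command(session_name, transcript_tool="whisperx"):
    """Build the main.py invocation for one session file."""
    return [sys.executable, "main.py",
            "--session_name", session_name,
            "--transcript_tool", transcript_tool]


def run_batch(list_path, error_log_path):
    """Run every listed session; append failures to the error log."""
    failures = 0
    with open(error_log_path, "a", encoding="utf-8") as log:
        for session_name in read_audio_list(list_path):
            result = subprocess.run(build_command(session_name),
                                    capture_output=True, text=True)
            if result.returncode != 0:
                failures += 1
                log.write(f"{session_name}: exit {result.returncode}\n")
                log.write(result.stderr)
    return failures
```

Running each session in its own subprocess means one failing file does not stop the rest of the batch.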
Modify 'config.py' to change the default settings used by the scripts. These settings include audio processing parameters such as sample rate, filter settings, normalization, and detection thresholds. Changes in 'config.py' affect both individual and batch processing unless a parameter is explicitly overridden on the command line.
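The precedence rule described here (explicit command-line values win, everything else falls back to 'config.py') can be sketched with `collections.ChainMap`. The names `CONFIG_DEFAULTS` and `effective_settings` are hypothetical, and the values are placeholders rather than the project's real defaults.

```python
from collections import ChainMap

# Placeholder defaults standing in for config.py (not the real values).
CONFIG_DEFAULTS = {
    "new_sample_rate": 16000,
    "normalization": True,
    "filtering": True,
}


def effective_settings(cli_overrides, defaults=CONFIG_DEFAULTS):
    """Combine settings so explicit command-line values win over defaults."""
    # Drop options the user did not pass (represented here as None).
    given = {k: v for k, v in cli_overrides.items() if v is not None}
    # ChainMap looks up keys in `given` first, then in `defaults`.
    return dict(ChainMap(given, defaults))


settings = effective_settings({"new_sample_rate": 8000, "filtering": None})
```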
- 'main.py' : Main script for running individual experiments.
- 'batch_run.py' : Wrapper script for running experiments in batch mode.
- 'utility/utility.py' : Contains all utility functions for audio loading, transcription, and other core functionalities.
- 'utils/general_util.py' : Contains utility functions for audio loading, transcription, and other core functionalities.
- 'utils/audio_util.py' : Contains functions specific to audio processing tasks.
- 'utils/text_util.py' : Contains functions specific to text processing tasks.
- 'config.py' : Configuration file for setting default parameters.
Contributions to improve Semantic Sommeliers are welcome. Please follow the existing code style and add unit tests for any new or changed functionality.
Distributed under the GNU Lesser General Public License v2.1 (LGPL 2.1). See LICENSE for more information.