diff --git a/CHANGELOG.md b/CHANGELOG.md index 0441aed0d..46ed41604 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -17,6 +17,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - Removed unused file transfer how to guides - Removed `pennylane` as a requirement from notebooks' requirements.txt as it comes with `covalent` +### Docs + +- Added voice cloning tutorial + ## [0.233.0-rc.0] - 2024-01-07 ### Authors diff --git a/doc/source/tutorials/tutorials.rst b/doc/source/tutorials/tutorials.rst index e426f1cca..d026a7174 100644 --- a/doc/source/tutorials/tutorials.rst +++ b/doc/source/tutorials/tutorials.rst @@ -75,6 +75,8 @@ Advanced - :doc:`Scalable API backends for LLM and generative AI <./0_ClassicalMachineLearning/genai/source>` * - Federated learning - :doc:`Federated learning <./federated_learning/source>` + * - Voice cloning + - :doc:`Voice cloning <./voice_cloning/source>` --------------------------------- diff --git a/doc/source/tutorials/voice_cloning/Dockerfile_gcp b/doc/source/tutorials/voice_cloning/Dockerfile_gcp new file mode 100644 index 000000000..01fc77aad --- /dev/null +++ b/doc/source/tutorials/voice_cloning/Dockerfile_gcp @@ -0,0 +1,43 @@ +# Copyright 2023 Agnostiq Inc. +# +# This file is part of Covalent. +# +# Licensed under the GNU Affero General Public License 3.0 (the "License"). +# A copy of the License may be obtained with this software package or at +# +# https://www.gnu.org/licenses/agpl-3.0.en.html +# +# Use of this file is prohibited except in compliance with the License. Any +# modifications or derivative works of this file must retain this copyright +# notice, and modified files must contain a notice indicating that they have +# been altered from the originals. +# +# Covalent is distributed in the hope that it will be useful, but WITHOUT +# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or +# FITNESS FOR A PARTICULAR PURPOSE. See the License for more details. +# +# Relief from the License may be granted by purchasing a commercial license. 
+
+ARG COVALENT_BASE_IMAGE=python:3.9-slim-buster
+FROM ${COVALENT_BASE_IMAGE}
+
+# Install dependencies
+ARG COVALENT_TASK_ROOT=/covalent
+WORKDIR ${COVALENT_TASK_ROOT}
+
+ARG COVALENT_PACKAGE_VERSION=0.229.0rc0
+ARG PRE_RELEASE=""
+
+COPY requirements.txt requirements.txt
+RUN apt-get update && \
+    apt-get install -y build-essential wget && \
+    pip install -r requirements.txt
+
+COPY covalent_gcpbatch_plugin/exec.py ${COVALENT_TASK_ROOT}
+
+ENV PYTHONPATH ${COVALENT_TASK_ROOT}:${PYTHONPATH}
+
+# Path where the storage bucket will be mounted inside the container
+ENV GCPBATCH_TASK_MOUNTPOINT /mnt/disks/covalent
+
+ENTRYPOINT [ "python", "exec.py" ]
diff --git a/doc/source/tutorials/voice_cloning/assets/streamlit_gcp.gif b/doc/source/tutorials/voice_cloning/assets/streamlit_gcp.gif
new file mode 100644
index 000000000..bc3691a7e
Binary files /dev/null and b/doc/source/tutorials/voice_cloning/assets/streamlit_gcp.gif differ
diff --git a/doc/source/tutorials/voice_cloning/assets/workflow.gif b/doc/source/tutorials/voice_cloning/assets/workflow.gif
new file mode 100644
index 000000000..283fb35e4
Binary files /dev/null and b/doc/source/tutorials/voice_cloning/assets/workflow.gif differ
diff --git a/doc/source/tutorials/voice_cloning/requirements.txt b/doc/source/tutorials/voice_cloning/requirements.txt
new file mode 100644
index 000000000..534f3b8e0
--- /dev/null
+++ b/doc/source/tutorials/voice_cloning/requirements.txt
@@ -0,0 +1,12 @@
+covalent==0.229.0rc0
+covalent-gcpbatch-plugin==0.11.0
+librosa==0.10.0
+pydub==0.25.1
+pytube==15.0.0
+scipy==1.11.3
+soundfile==0.12.1
+streamlit==1.28.1
+torch==2.1.0
+torchaudio==2.1.0
+transformers==4.33.3
+TTS==0.19.1
diff --git a/doc/source/tutorials/voice_cloning/source.ipynb b/doc/source/tutorials/voice_cloning/source.ipynb
new file mode 100644
index 000000000..754c998da
--- /dev/null
+++ b/doc/source/tutorials/voice_cloning/source.ipynb
@@ -0,0 +1,401 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# YouTube Video Summarization with Voice Cloning"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Overview\n",
+    "\n",
+    "Our script performs several tasks:\n",
+    "\n",
+    "1. downloads and processes a YouTube video,\n",
+    "2. transcribes the audio from the YouTube video,\n",
+    "3. summarizes the transcription, and\n",
+    "4. converts the summary to speech using the user's voice.\n",
+    "\n",
+    "The script leverages Covalent to execute these tasks, either locally or on a cloud platform like GCP.\n",
+    "\n",
+    "## Import Dependencies\n",
+    "\n",
+    "First, we import the necessary Python libraries:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "import covalent as ct\n",
+    "import streamlit as st\n",
+    "\n",
+    "from pytube import YouTube\n",
+    "from pydub import AudioSegment\n",
+    "from transformers import pipeline\n",
+    "from TTS.api import TTS"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This section loads libraries for audio processing, machine learning models, and Covalent for workflow management.\n",
+    "\n",
+    "To provision the cloud resources used by the GCP Batch executor configured below, deploy them with the Covalent CLI:\n",
+    "\n",
+    "```bash\n",
+    "covalent deploy up gcpbatch\n",
+    "```"
+   ]
+  },
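+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Also make sure a Covalent server is running locally, so that workflows can be dispatched and monitored in the UI (served at http://localhost:48008 by default):\n",
+    "\n",
+    "```bash\n",
+    "covalent start\n",
+    "```"
+   ]
+  },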
\n", + "In the example below we utilize [Google Cloud Batch](https://cloud.google.com/batch) using our [gcp batch executor](https://docs.covalent.xyz/docs/user-documentation/api-reference/executors/gcp/). " + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "audio_deps = [\n", + " \"transformers==4.33.3\", \"pydub==0.25.1\",\n", + " \"torchaudio==2.1.0\", \"librosa==0.10.0\",\n", + " \"torch==2.1.0\"\n", + "]\n", + "text_deps = [\"transformers==4.33.3\", \"torch==2.1.0\"]\n", + "tts_deps = audio_deps + [\"TTS==0.19.1\"]\n", + "\n", + "executor = ct.executor.GCPBatchExecutor(\n", + " container_image_uri=\"docker.io/filipbolt/covalent-gcp-0.229.0rc0\",\n", + " vcpus=4,\n", + " memory=8192,\n", + " time_limit=3000,\n", + " poll_freq=1,\n", + " retries=1\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Alternatively, you may use Covalent Cloud to execute this workflow by doing:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import covalent_cloud as cc\n", + "\n", + "cc_executor = cc.CloudExecutor(num_cpus=4, env=\"genai-env\", memory=8192, time_limit=3000)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define Covalent Tasks\n", + "Each step of our workflow is encapsulated in a Covalent task. Here's an example:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "@ct.electron\n", + "def download_video(url):\n", + " yt = YouTube(url)\n", + " # download file\n", + " out_file = yt.streams.filter(\n", + " only_audio=True, file_extension=\"mp4\"\n", + " ).first().download(\".\")\n", + "\n", + " # rename downloaded file\n", + " os.rename(out_file, \"audio.mp4\")\n", + " with open(\"audio.mp4\", \"rb\") as f:\n", + " file_content = f.read()\n", + " return file_content\n", + "\n", + "\n", + "@ct.electron\n", + "def load_audio(input_file_content):\n", + " input_path = os.path.join(os.getcwd(), \"file.mp4\")\n", + " # write to file\n", + " with open(input_path, \"wb\") as f:\n", + " f.write(input_file_content)\n", + "\n", + " audio_content = AudioSegment.from_file(input_path, format=\"mp4\")\n", + " return audio_content\n", + "\n", + "\n", + "@ct.electron(executor=executor, deps_pip=audio_deps)\n", + "def transcribe_audio(audio_content):\n", + " # Export the audio as a WAV file\n", + " audio_content.export(\"audio_file.wav\", format=\"wav\")\n", + "\n", + " pipe = pipeline(\n", + " task=\"automatic-speech-recognition\",\n", + " # model=\"openai/whisper-small\",\n", + " model=\"openai/whisper-large-v3\",\n", + " chunk_length_s=30, max_new_tokens=2048,\n", + " )\n", + " transcription = pipe(\"audio_file.wav\")\n", + " return transcription['text']\n", + "\n", + "\n", + "@ct.electron(executor=executor, deps_pip=text_deps)\n", + "def summarize_transcription(transcription):\n", + " summarizer = pipeline(\n", + " \"summarization\",\n", + " model=\"facebook/bart-large-cnn\",\n", + " )\n", + " summary = summarizer(\n", + " transcription, min_length=5, max_length=100,\n", + " do_sample=False, truncation=True\n", + " )[0][\"summary_text\"]\n", + " return summary\n", + "\n", + "\n", + "@ct.electron(executor=executor, deps_pip=tts_deps)\n", + "def text_to_speech_voice_clone(text, speaker_content, output_file):\n", + " with open(\"speaker.wav\", \"wb\") as f:\n", + " f.write(speaker_content)\n", + "\n", + " # agree to service agreement 
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Define Covalent Tasks\n",
+    "Each step of our workflow is encapsulated in a Covalent task. Here are the task definitions:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "@ct.electron\n",
+    "def download_video(url):\n",
+    "    yt = YouTube(url)\n",
+    "    # download file\n",
+    "    out_file = yt.streams.filter(\n",
+    "        only_audio=True, file_extension=\"mp4\"\n",
+    "    ).first().download(\".\")\n",
+    "\n",
+    "    # rename downloaded file\n",
+    "    os.rename(out_file, \"audio.mp4\")\n",
+    "    with open(\"audio.mp4\", \"rb\") as f:\n",
+    "        file_content = f.read()\n",
+    "    return file_content\n",
+    "\n",
+    "\n",
+    "@ct.electron\n",
+    "def load_audio(input_file_content):\n",
+    "    input_path = os.path.join(os.getcwd(), \"file.mp4\")\n",
+    "    # write to file\n",
+    "    with open(input_path, \"wb\") as f:\n",
+    "        f.write(input_file_content)\n",
+    "\n",
+    "    audio_content = AudioSegment.from_file(input_path, format=\"mp4\")\n",
+    "    return audio_content\n",
+    "\n",
+    "\n",
+    "@ct.electron(executor=executor, deps_pip=audio_deps)\n",
+    "def transcribe_audio(audio_content):\n",
+    "    # Export the audio as a WAV file\n",
+    "    audio_content.export(\"audio_file.wav\", format=\"wav\")\n",
+    "\n",
+    "    pipe = pipeline(\n",
+    "        task=\"automatic-speech-recognition\",\n",
+    "        # a smaller, faster alternative: model=\"openai/whisper-small\"\n",
+    "        model=\"openai/whisper-large-v3\",\n",
+    "        chunk_length_s=30, max_new_tokens=2048,\n",
+    "    )\n",
+    "    transcription = pipe(\"audio_file.wav\")\n",
+    "    return transcription['text']\n",
+    "\n",
+    "\n",
+    "@ct.electron(executor=executor, deps_pip=text_deps)\n",
+    "def summarize_transcription(transcription):\n",
+    "    summarizer = pipeline(\n",
+    "        \"summarization\",\n",
+    "        model=\"facebook/bart-large-cnn\",\n",
+    "    )\n",
+    "    summary = summarizer(\n",
+    "        transcription, min_length=5, max_length=100,\n",
+    "        do_sample=False, truncation=True\n",
+    "    )[0][\"summary_text\"]\n",
+    "    return summary\n",
+    "\n",
+    "\n",
+    "@ct.electron(executor=executor, deps_pip=tts_deps)\n",
+    "def text_to_speech_voice_clone(text, speaker_content, output_file):\n",
+    "    with open(\"speaker.wav\", \"wb\") as f:\n",
+    "        f.write(speaker_content)\n",
+    "\n",
+    "    # agree to the Coqui service agreement programmatically\n",
+    "    os.environ['COQUI_TOS_AGREED'] = \"1\"\n",
+    "\n",
+    "    tts = TTS(\"tts_models/multilingual/multi-dataset/xtts_v1\")\n",
+    "    tts.tts_to_file(\n",
+    "        text=text,\n",
+    "        file_path=output_file,\n",
+    "        speaker_wav=\"speaker.wav\",\n",
+    "        language=\"en\"\n",
+    "    )\n",
+    "    with open(output_file, \"rb\") as f:\n",
+    "        file_content = f.read()\n",
+    "    return file_content\n",
+    "\n",
+    "\n",
+    "@ct.electron\n",
+    "def load_wav_file(wav_file):\n",
+    "    with open(wav_file, \"rb\") as f:\n",
+    "        file_content = f.read()\n",
+    "    return file_content"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We use the `@ct.electron` decorator to define tasks like `download_video`, `load_audio`, `transcribe_audio`, etc."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Orchestrate the Workflow\n",
+    "The `@ct.lattice` decorator defines the workflow that orchestrates the entire process:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "@ct.lattice\n",
+    "def workflow(url, user_voice_file, output_file):\n",
+    "    video_content = download_video(url)\n",
+    "    audio_content = load_audio(video_content)\n",
+    "\n",
+    "    user_voice_content = load_wav_file(user_voice_file)\n",
+    "\n",
+    "    # Use Google Cloud Batch to transcribe, summarize and re-voice\n",
+    "    transcription = transcribe_audio(audio_content)\n",
+    "    summary = summarize_transcription(transcription)\n",
+    "    output_file_content = text_to_speech_voice_clone(\n",
+    "        summary, user_voice_content, output_file\n",
+    "    )\n",
+    "    return summary, transcription, output_file_content"
+   ]
+  },
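+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before wiring up the web interface below, you can sanity-check the lattice by dispatching it directly and waiting for the result; the URL and file paths here are placeholders:\n",
+    "\n",
+    "```python\n",
+    "dispatch_id = ct.dispatch(workflow)(\n",
+    "    \"https://www.youtube.com/watch?v=...\",  # any YouTube URL\n",
+    "    \"speaker.wav\",  # a short WAV sample of the target voice\n",
+    "    \"summary.wav\",  # where the cloned-voice summary is written\n",
+    ")\n",
+    "result = ct.get_result(dispatch_id, wait=True)\n",
+    "summary, transcription, audio_bytes = result.result\n",
+    "```"
+   ]
+  },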
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Streamlit Interface\n",
+    "We use Streamlit to create an interactive web interface for the script:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "2024-01-08 14:37:20.421 \n",
+      "  \u001b[33m\u001b[1mWarning:\u001b[0m to view this Streamlit app on a browser, run it with the following\n",
+      "  command:\n",
+      "\n",
+      "    streamlit run /home/filip/miniconda3/envs/google/lib/python3.9/site-packages/ipykernel_launcher.py [ARGUMENTS]\n",
+      "2024-01-08 14:37:20.422 Session state does not function when running a script without `streamlit run`\n"
+     ]
+    }
+   ],
+   "source": [
+    "import streamlit as st\n",
+    "\n",
+    "# Function to display results\n",
+    "def display_results(summary, transcription, audio_file_content):\n",
+    "    display_summary(summary)\n",
+    "    display_full_transcription(transcription)\n",
+    "    display_audio_summary(audio_file_content)\n",
+    "\n",
+    "# Function to display the summary\n",
+    "def display_summary(summary):\n",
+    "    st.subheader(\"YouTube transcription summary:\")\n",
+    "    st.text(summary)\n",
+    "\n",
+    "# Function to display the full transcription with a toggle\n",
+    "def display_full_transcription(transcription):\n",
+    "    st.subheader(\"YouTube full transcription\")\n",
+    "    if st.checkbox(\"Show/Hide\", False):\n",
+    "        st.text(transcription)\n",
+    "\n",
+    "# Function to display the audio summary\n",
+    "def display_audio_summary(audio_file_content):\n",
+    "    st.subheader(\"Summary in your own voice:\")\n",
+    "    st.audio(audio_file_content, format=\"audio/wav\")\n",
+    "\n",
+    "\n",
+    "# Streamlit app layout\n",
+    "def main():\n",
+    "    st.title(\"Summarize YouTube videos in your own voice using AI\")\n",
+    "    speaker_file, speaker_file_path = upload_speaker_file()\n",
+    "    youtube_url = st.text_input(\"Enter valid YouTube URL\")\n",
+    "\n",
+    "    if st.button(\"Process\"):\n",
+    "        process_input(speaker_file, speaker_file_path, youtube_url)\n",
+    "    elif \"transcription\" in st.session_state:\n",
+    "        display_results(\n",
+    "            st.session_state[\"summary\"],\n",
+    "            st.session_state[\"transcription\"],\n",
+    "            st.session_state[\"audio_file_content\"]\n",
+    "        )\n",
+    "\n",
+    "# Function to upload speaker file\n",
+    "def upload_speaker_file():\n",
+    "    speaker_file = st.file_uploader(\"Upload an audio file (WAV)\", type=[\"wav\"])\n",
+    "    if speaker_file:\n",
+    "        st.audio(speaker_file, format=\"audio/wav\")\n",
+    "        speaker_file_path = \"speaker.wav\"\n",
+    "        with open(speaker_file_path, \"wb\") as f:\n",
+    "            f.write(speaker_file.getbuffer())\n",
+    "        return speaker_file, speaker_file_path\n",
+    "    return None, None\n",
+    "\n",
+    "# Function to process the input\n",
+    "def process_input(speaker_file, speaker_file_path, youtube_url):\n",
+    "    if speaker_file and youtube_url:\n",
+    "        audio_file_full_path = os.path.join(os.getcwd(), \"audio.wav\")\n",
+    "        speaker_file_full_path = os.path.join(os.getcwd(), speaker_file_path)\n",
+    "\n",
+    "        dispatch_id = ct.dispatch(workflow)(\n",
+    "            youtube_url, speaker_file_full_path, audio_file_full_path\n",
+    "        )\n",
+    "        with st.spinner(f\"Processing... job dispatch id: {dispatch_id}\"):\n",
+    "            result = ct.get_result(dispatch_id, wait=True)\n",
+    "\n",
+    "        if result:\n",
+    "            summary, transcription, output_file_content = result.result\n",
+    "            st.session_state[\"transcription\"] = transcription\n",
+    "            st.session_state[\"summary\"] = summary\n",
+    "            st.session_state[\"audio_file_content\"] = output_file_content\n",
+    "            display_results(summary, transcription, output_file_content)\n",
+    "        else:\n",
+    "            st.error(\"Something went wrong. Please try again.\")\n",
+    "\n",
+    "main()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Running the Script\n",
+    "To run the script, execute it via Streamlit:\n",
+    "\n",
+    "```bash\n",
+    "streamlit run your_script.py\n",
+    "```\n",
+    "\n",
+    "In the Covalent UI, you should see a workflow like the following:\n",
+    "\n",
+    "![alt text](assets/workflow.gif)\n",
+    "\n",
+    "Using the Streamlit app then looks like this:\n",
+    "\n",
+    "![alt text](assets/streamlit_gcp.gif)\n",
+    "\n",
+    "### Customizing the Workflow\n",
+    "You can tailor this script to your specific needs:\n",
+    "\n",
+    "- Modify the Covalent task functions for different processing requirements, such as swapping out one of the models.\n",
+    "- Adjust the Covalent executor settings based on your cloud resource needs.\n",
+    "\n",
+    "### Conclusion\n",
+    "\n",
+    "This tutorial demonstrates using Covalent to orchestrate a YouTube transcription, summarization, and voice-cloning workflow. Covalent's cloud computing abstraction simplifies executing complex workflows, making it a powerful tool for developers and researchers in AI/ML fields."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.18"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}