From 6d487507db43414b58ee8f21a8488659227892ab Mon Sep 17 00:00:00 2001 From: thomasgnuttall Date: Tue, 3 Dec 2024 18:34:11 +0100 Subject: [PATCH] removed old webbook --- webbook/resources/exploring-performance.ipynb | 1502 ----------------- 1 file changed, 1502 deletions(-) delete mode 100644 webbook/resources/exploring-performance.ipynb diff --git a/webbook/resources/exploring-performance.ipynb b/webbook/resources/exploring-performance.ipynb deleted file mode 100644 index 404f412..0000000 --- a/webbook/resources/exploring-performance.ipynb +++ /dev/null @@ -1,1502 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Exploring Carnatic Performance" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Thomas Nuttall, Genís Plaja-Roglans, Lara Pearson, Brindha Manickavasakan, Xavier Serra." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This notebook demonstrates the wide range of tools available as part of the compIAM package. We demonstrate their use on a single performance from the Saraga Audiovisual Dataset, the multi-modal portion of the wider Saraga Dataset {cite}`saraga`. The tools showcased here are not accompanied by exhaustive usage documentation; that can be found in their respective pages elsewhere in this webbook, linked in each section." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. Import Dependencies and Data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Due to restrictions in accessing the Saraga API through the GitHub-hosted webbook, we access the data through a custom shared Google Drive created specifically for this tutorial. Users wishing to work with audio from Saraga should follow the instructions [here](https://mtg.github.io/saraga/)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "## Installing (if needed) and importing compiam into the project\n", - "import importlib.util\n", - "%pip install -U compiam==0.4.1 # Install compiam v0.4.1\n", - "%pip install essentia\n", - "%pip install \"torch==1.13\"\n", - "%pip install \"tensorflow==2.15.0\" \"keras<3\"\n", - "\n", - "import compiam\n", - "import essentia.standard as estd\n", - "\n", - "# Import extras and suppress warnings to keep the tutorial clean\n", - "import os\n", - "import shutil\n", - "import gdown\n", - "import zipfile\n", - "\n", - "import numpy as np\n", - "import IPython.display as ipd\n", - "import matplotlib.pyplot as plt\n", - "\n", - "from pprint import pprint\n", - "\n", - "import warnings\n", - "warnings.filterwarnings('ignore')\n", - "\n", - "AUDIO_PATH = os.path.join(\"..\", \"audio\", \"demos\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "*NOTE:* If working on Colab, please uncomment and run the cell below; otherwise, don't!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#AUDIO_PATH = \"./\" ## Run if working on Colab!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Since Saraga Audiovisual is a new dataset, recently published at ISMIR 2024 (San Francisco, USA), it is not yet available through mirdata and compIAM. Instead, we will manually download and load an example concert recording from this dataset, a concert performed by Dr.
Brindha Manickavasakan during the December Season in Chennai in 2023. Dr. Manickavasakan is also a collaborator in the ongoing work on computational melodic analysis of Carnatic Music, a collaboration between researchers from the Music Technology Group and Dr. Lara Pearson from the Max Planck Institute for Empirical Aesthetics." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "url = \"https://drive.google.com/uc?id=1iR0bfxDLQbH8fEeHU_GFsg2kh7brZ0HZ&export=download\"\n", - "output = os.path.join(AUDIO_PATH, \"dr-brindha-manickavasakan.zip\")\n", - "gdown.download(url, output, quiet=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Unzip file\n", - "with zipfile.ZipFile(output, 'r') as zip_ref:\n", - " zip_ref.extractall(AUDIO_PATH)\n", - "\n", - "# Delete zip file after extraction\n", - "os.remove(output)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Alright, the data is downloaded and uncompressed. Let's get the path to it and analyse a rendition from the concert." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. Loading and visualising the data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We work with a single performance from a concert by Brindha Manickavasakan at the Arkay Convention Center, recorded in 2023 in Chennai, South India. The composition is Bhavanuta by Tyaagaraaja in raga mohanam." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "rendition = \"Bhavanuta\"\n", - "folder_path = os.path.join(AUDIO_PATH, 'dr-brindha-manickavasakan', rendition)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For hundreds of performances in the Saraga dataset, the audio stems corresponding to each instrument/performer are available. For this performance, these are the lead vocal, the mridangam (left and right microphone), the violin, and the tanpura. The full mix of all instruments is also available.\n", - "\n", - "Let us select the preview versions of the multitrack audio, which are shortened and compressed versions of the rendition that are easier to handle for a first listen." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "audio_path_pre = os.path.join(folder_path, \"preview\", f\"{rendition}.mp3\")\n", - "mrid_left_path_pre = os.path.join(folder_path, \"preview\", f\"{rendition}.mridangam-left.mp3\")\n", - "mrid_right_path_pre = os.path.join(folder_path, \"preview\", f\"{rendition}.mridangam-right.mp3\")\n", - "violin_path_pre = os.path.join(folder_path, \"preview\", f\"{rendition}.multitrack-violin.mp3\")\n", - "vocal_path_pre = os.path.join(folder_path, \"preview\", f\"{rendition}.multitrack-vocal.mp3\")\n", - "tanpura_path_pre = os.path.join(folder_path, \"preview\", f\"{rendition}.tanpura.mp3\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1.1 Multitrack player" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can use the compIAM waveform player to visualise and listen to all of the tracks at the same time, panning or changing the volume of each as required."
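The `Player` shown below is interactive; as a non-interactive approximation of its volume controls, one can also build a simple weighted mixdown of the preview stems with numpy. This is a minimal sketch, assuming the preview MP3 paths defined above exist locally and that librosa is available; the gain values are arbitrary, for illustration only.

```python
import librosa
import numpy as np
import IPython.display as ipd

SR = 44100  # common sample rate for all stems

# Load each preview stem as mono at the common sample rate
stem_paths = {
    "vocal": vocal_path_pre,
    "violin": violin_path_pre,
    "mridangam-left": mrid_left_path_pre,
    "mridangam-right": mrid_right_path_pre,
    "tanpura": tanpura_path_pre,
}
stems = {name: librosa.load(path, sr=SR, mono=True)[0] for name, path in stem_paths.items()}

# Illustrative per-stem gains (what the player's sliders control interactively)
gains = {"vocal": 1.0, "violin": 0.6, "mridangam-left": 0.5,
         "mridangam-right": 0.5, "tanpura": 0.3}

# Trim to the shortest stem and sum the weighted signals
n_samples = min(len(x) for x in stems.values())
mix = sum(gains[name] * x[:n_samples] for name, x in stems.items())
mix = mix / np.max(np.abs(mix))  # normalise to avoid clipping

ipd.Audio(mix, rate=SR)
```

The `Player` used in the next cells provides the same kind of control interactively, with per-track panning and volume sliders.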
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from compiam.visualisation.waveform_player import Player" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# list of paths to load and listen\n", - "all_audio_paths = [\n", - " vocal_path_pre,\n", - " violin_path_pre,\n", - " mrid_left_path_pre,\n", - " mrid_right_path_pre,\n", - " tanpura_path_pre\n", - "]\n", - "# List of labels for each path\n", - "all_names = [\"Vocal\", \"Violin\", \"Mridangam left\", \"Mridangam right\", \"Tanpura\"]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "Player(all_names, all_audio_paths)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1.2 Video and Gesture Tracks" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The Saraga Audiovisual dataset includes videos of the performances and gesture tracks extracted using MMPose for the lead performer [3]. Let's take a look at a sample for this performance." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import cv2\n", - "import IPython.display as ipd\n", - "from IPython.core.display import HTML" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vid_out_path = f'{folder_path}/output_segment.mp4'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load keypoints and scores\n", - "keypoints_file = f\"{folder_path}/singer/Brindha_Manickavasakan_Segment1_0-513_kpts.npy\"\n", - "scores_file = f\"{folder_path}/singer/Brindha_Manickavasakan_Segment1_0-513_scores.npy\"\n", - "video_file = f\"{folder_path}/{rendition}.mov\" # Replace with your video file" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "keypoints = np.load(keypoints_file)\n", - "scores = np.load(scores_file)\n", - "\n", - "# Skeleton for 135 keypoints\n", - "# Skeleton for 135 keypoints (MMPose)\n", - "skeleton = [\n", - " (0, 1), (1, 2), # Eyes (left to right)\n", - " (0, 3), (0, 4), # Nose to ears (left and right)\n", - " (5, 6), # Shoulders (left and right)\n", - " (5, 7), (7, 9), # Left arm (shoulder -> elbow -> wrist)\n", - " (6, 8), (8, 10),\n", - " (11,12), # Right arm (shoulder -> elbow -> wrist)\n", - " (5, 11), (6, 12), # Shoulders to hips\n", - " (11, 13), (13, 15), # Left leg (hip -> knee -> ankle)\n", - " (12, 14), (14, 16) # Right leg (hip -> knee -> ankle)\n", - "]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Open video file\n", - "cap = cv2.VideoCapture(video_file)\n", - "fps = int(cap.get(cv2.CAP_PROP_FPS)) # Frames per second\n", - "frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))\n", - "frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))\n", - "\n", - "# Define start and end frames for the 20-second segment\n", - "start_time = 10 # Start time in seconds (adjust as needed)\n", - "end_time = start_time + 20 # End time in seconds\n", - "start_frame = int(start_time * fps)\n", - "end_frame = int(end_time * fps)\n", - "\n", - "# Output video writer\n", - "out = cv2.VideoWriter(vid_out_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (frame_width, frame_height))\n", - "\n", - "# Process the selected frames\n", 
- "frame_idx = 0\n", - "while cap.isOpened():\n", - " ret, frame = cap.read()\n", - " if not ret:\n", - " break\n", - "\n", - " if start_frame <= frame_idx < end_frame:\n", - " # Get keypoints and scores for the current frame\n", - " if frame_idx < len(keypoints):\n", - " frame_keypoints = keypoints[frame_idx]\n", - " frame_scores = scores[frame_idx]\n", - "\n", - " # Draw keypoints and skeleton\n", - " for i, (x, y) in enumerate(frame_keypoints):\n", - " # Only draw if confidence score is above threshold\n", - " if frame_scores[i] > 0.5: # Adjust threshold as needed\n", - " cv2.circle(frame, (int(x), int(y)), 5, (0, 255, 0), -1)\n", - "\n", - " # Draw skeleton\n", - " for connection in skeleton:\n", - " start, end = connection\n", - " if frame_scores[start] > 0.5 and frame_scores[end] > 0.5:\n", - " x1, y1 = frame_keypoints[start]\n", - " x2, y2 = frame_keypoints[end]\n", - " cv2.line(frame, (int(x1), int(y1)), (int(x2), int(y2)), (255, 0, 0), 2)\n", - "\n", - " # Write frame to output video\n", - " out.write(frame)\n", - "\n", - " frame_idx += 1\n", - "\n", - " # Stop processing after the end frame\n", - " if frame_idx >= end_frame:\n", - " break\n", - "\n", - "# Release resources\n", - "cap.release()\n", - "out.release()\n", - "cv2.destroyAllWindows()\n", - "print(\"20-second video segment processing complete. Output saved as 'output_segment.mp4'\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#video_html = f\"\"\"\n", - "#\n", - "#\"\"\"\n", - "#ipd.display(HTML(video_html))\n", - "\n", - "from IPython.core.display import Video\n", - "\n", - "Video(vid_out_path, embed=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. Feature Extraction" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this section we extract various audio and musical features from the raw performance audio; the singer tonic, raga, predominant pitch track of the lead vocal melody, source separated vocal audio, downbeat, and repeated melodic patterns.\n", - "\n", - "Let's first get the path of the full and uncompressed mixture and vocal tracks." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "audio_path = os.path.join(folder_path, f\"{rendition}.wav\")\n", - "vocal_path = os.path.join(folder_path, f\"{rendition}.multitrack-vocal.wav\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.1 Tonic Identification" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The tonic of the lead singer is useful for normalising pitch and comparing with other performers. Here we can use the TonicIndianMultiPitch tool which is available through Essentia and compIAM." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Importing the tool\n", - "from compiam.melody.tonic_identification import TonicIndianMultiPitch\n", - "\n", - "# We first initialize the tool we have just imported\n", - "tonic_multipitch = TonicIndianMultiPitch()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "tonic = tonic_multipitch.extract(audio_path)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(f'Performer tonic: {round(tonic, 2)} Hz')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can quickly listen to the estimated tonic on top of the original audio to perceptually evaluate whether it sounds reasonable for the chosen rendition." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Let's get the audio for the track\n", - "sr = 44100\n", - "audio_mix = estd.MonoLoader(filename=audio_path, sampleRate=sr)()\n", - "\n", - "# Let's synthesize a tambura\n", - "synthesized_tambura = 0.75*np.sin(\n", - " 2*np.pi*float(tonic)*np.arange(0, len(audio_mix)//sr, 1/sr)\n", - ")\n", - "# Adding some harmonics\n", - "synthesized_tambura += 0.25*np.sin(\n", - " 2*np.pi*float(tonic)*2*np.arange(0, len(audio_mix)//sr, 1/sr)\n", - ")\n", - "synthesized_tambura += 0.5*np.sin(\n", - " 2*np.pi*float(tonic)*3*np.arange(0, len(audio_mix)//sr, 1/sr)\n", - ")\n", - "synthesized_tambura += 0.125*np.sin(\n", - " 2*np.pi*float(tonic)*4*np.arange(0, len(audio_mix)//sr, 1/sr)\n", - ")\n", - "\n", - "# We take just a minute of music (60 sec * 44100)\n", - "audio_tonic = audio_mix[:60*sr] + synthesized_tambura[:60*sr]*0.02" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# And we play it!\n", - "ipd.Audio(\n", - " data=audio_tonic[None],\n", - " rate=sr,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "That sounds good! This is the tonic of the recording. This is really valuable information that allows us to characterise and normalize the melodies, and it may also give relevant information about the artist and the performed concert.\n", - "\n", - "For further reference, please visit the [tonic identification page](tonic-identification)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.2 Raga Recognition" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Whilst raga metadata is available as part of Saraga, compIAM contains a raga identifier, DeepSRGM. We can automatically identify the raga using this tool. Be aware that this model was trained only on the ragas Bhairav, Madhukauns, Mōhanaṁ, Hamsadhvāni, Varāḷi, Dēś, Kamās, Yaman kalyāṇ, Bilahari, and Ahira bhairav, and hence it can only assign those classes."
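Since DeepSRGM can only output one of the ten classes listed above, a quick sanity check before running the model below is to confirm that the raga we expect for this rendition (mohanam, from the concert metadata) is actually covered. This is a small illustrative sketch: the list of supported ragas is transcribed from the paragraph above, and the exact label spellings used by compIAM may differ.

```python
# Ragas supported by DeepSRGM, transcribed from the text above
# (hypothetical spellings; the labels returned by compIAM may differ)
DEEPSRGM_RAGAS = [
    "bhairav", "madhukauns", "mohanam", "hamsadhvani", "varali",
    "des", "kamas", "yaman kalyan", "bilahari", "ahira bhairav",
]

expected_raga = "mohanam"  # raga of this rendition, from the concert metadata

if expected_raga in DEEPSRGM_RAGAS:
    print(f"'{expected_raga}' is among the model's classes; a prediction is meaningful.")
else:
    print(f"'{expected_raga}' is not covered by DeepSRGM; any prediction would be wrong by construction.")
```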
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from compiam import load_model\n", - "\n", - "# This model uses tensorflow in the backend!\n", - "deepsrgm = load_model(\"melody:deepsrgm\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "# Computing features\n", - "feat = deepsrgm.get_features(vocal_path)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "# Predict raga using subset of features from the very beginning of audio for faster prediction\n", - "predicted_raga = deepsrgm.predict(feat[:4])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(f'Raga: {predicted_raga}')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "# Alternative: build DeepSRGM input features from the full mixture,\n", - "# using FTANet-Carnatic pitch estimates normalised by the extracted tonic\n", - "from compiam import load_model\n", - "from compiam.melody.pitch_extraction import Melodia\n", - "\n", - "melodia = Melodia() \n", - "ftanet_carnatic = load_model(\"melody:ftanet-carnatic\")\n", - "PITCH_EXTRACTION_SR = 44100\n", - "\n", - "freqs = ftanet_carnatic.predict(audio_mix)[:, 1]\n", - "tonic = tonic_multipitch.extract(audio_mix)\n", - "\n", - "k = 4\n", - "N = 200\n", - "new_feat = []\n", - "\n", - "# Convert pitch to (scaled) cents above the tonic\n", - "feature = np.round(1200 * np.log2(freqs / tonic) * (k / 100)).clip(0)\n", - "\n", - "if len(feature) <= 5000:\n", - " raise ValueError(\"Audio signal is not long enough for a proper estimation. Please provide a longer audio.\")\n", - "# Sample N random subsequences of 5000 frames each\n", - "for i in range(N):\n", - " c = np.random.randint(0, len(feature) - 5000)\n", - " new_feat.append(feature[c : c + 5000])\n", - "new_feat = np.array(new_feat)\n", - "\n", - "raga = deepsrgm.predict(new_feat[:4]) # Let's again only take alap frames" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(f'Raga: {raga}')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.3 Music Source Separation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Music source separation is the task of automatically estimating the individual elements in a musical mixture. Apart from its creative uses, it can be an important building block in research pipelines, acting as a very handy pre-processing step {cite}`plaja_separation_2023`. To carefully analyse the singing voice and its components, it is normally beneficial to have it isolated from the rest of the instruments. \n", - "\n", - "There are several models in the literature that address this problem, most of them based on deep learning architectures. Some of them provide pre-trained weights, such as Demucs {cite}`demucs` or Spleeter {cite}`spleeter`, the latter being broadly used in Carnatic Music computational research. However, these systems have two problems: (1) the training data of these models does not normally include Carnatic Music examples, therefore there are instruments and practices which are completely unseen by these models, and (2) these models have a restricted set of target elements, namely (_vocals_, _bass_, _drums_, and _other_), which does not fit Carnatic Music arrangements at all.
\n", - "\n", - "To address problem (1), there have been a few attempts at using the multi-stem data presented above to develop Carnatic-tailored source separation systems, although the available multi-stem recordings are collected from mixing consoles at live performances, and therefore the individual tracks are noisy (they have background leakage from the rest of the sources). We can test one example of these systems here: {cite}``." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "# Make sure proper TF version is installed\n", - "%pip install \"tensorflow==2.15.0\" \"keras<3\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "from compiam import load_model\n", - "\n", - "# This model uses tensorflow in the backend!\n", - "separation_model = load_model(\"separation:cold-diff-sep\")\n", - "SEPARATION_SR = separation_model.sample_rate" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "audio_mix = estd.MonoLoader(filename=audio_path, sampleRate=SEPARATION_SR)()\n", - "separation_input = audio_mix[:SEPARATION_SR*20] # Get 20s\n", - "separated_vocals = separation_model.separate(\n", - " separation_input,\n", - " input_sr=SEPARATION_SR\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ipd.Audio(separated_vocals, rate=SEPARATION_SR)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's try to be a bit more restrictive using the configuration of the separation algorithm." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "separated_vocals = separation_model.separate(\n", - " separation_input,\n", - " input_sr=SEPARATION_SR,\n", - " clusters=8,\n", - " scheduler=7,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ipd.Audio(separated_vocals, rate=SEPARATION_SR)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Although there is still a long way to go on this problem, the ongoing efforts on improving the separation of the singing voice (and also the rest of the instrumentation!) for Carnatic Music set an interesting baseline to build on top of." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For further reference, please visit the [music source separation page](singing-voice-extraction)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.4 Pitch Extraction" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The f0 pitch track of the predominant vocal melody has proved useful for a range of computational analysis tasks in Indian Art Music. We can extract this for our performance using Melodia {cite}`salamon_pitch_2012`, a broadly used knowledge-based method. We will also test a recently published DL model that achieves the same goal: FTA-Net, which has been trained specifically for the Carnatic Music use case and is included in compIAM as well {cite}`plaja_pitch_2023`."
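Before running the pitch extractors below, it can be handy to keep the separated excerpt from the previous section on disk, so that it can later be fed to them in place of the clean multitrack vocal stem (useful when only a mixture is available). A minimal sketch, assuming the soundfile package is installed and that `separated_vocals` from the cells above is a mono array at `SEPARATION_SR`:

```python
import os
import numpy as np
import soundfile as sf

# Write the separated excerpt next to the other rendition files
separated_path = os.path.join(folder_path, f"{rendition}.separated-vocal.wav")
sf.write(separated_path, np.asarray(separated_vocals).squeeze(), SEPARATION_SR)
print(f"Saved separated vocals to {separated_path}")
```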
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "# Make sure proper TF version is installed\n", - "%pip install \"tensorflow==2.15.0\" \"keras<3\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "from compiam import load_model\n", - "from compiam.melody.pitch_extraction import Melodia\n", - "\n", - "melodia = Melodia() \n", - "ftanet_carnatic = load_model(\"melody:ftanet-carnatic\")\n", - "PITCH_EXTRACTION_SR = 44100" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To extract an example pitch track, we first load 30s of the mixture recording, and run prediction with both methods. Once the methods are initialized, running pitch extraction is easily done in one line of code." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "import librosa\n", - "import librosa.display\n", - "import numpy as np\n", - "import matplotlib.pyplot as plt\n", - "\n", - "audio_mix = estd.MonoLoader(filename=audio_path, sampleRate=PITCH_EXTRACTION_SR)()\n", - "prediction_input = audio_mix[PITCH_EXTRACTION_SR*60:PITCH_EXTRACTION_SR*80]\n", - "\n", - "# Predominant extraction models\n", - "melodia_pitch_track = melodia.extract(prediction_input, input_sr=PITCH_EXTRACTION_SR)\n", - "ftanet_pitch_track = ftanet_carnatic.predict(prediction_input, input_sr=PITCH_EXTRACTION_SR)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's now plot both pitch tracks and compare!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "fig, ax = plt.subplots(nrows=1, ncols=1, sharex=True, figsize=(15, 12))\n", - "D = librosa.amplitude_to_db(np.abs(librosa.stft(prediction_input)), ref=np.max)\n", - "img = librosa.display.specshow(D, y_axis='linear', x_axis='time', sr=PITCH_EXTRACTION_SR, ax=ax);\n", - "ax.set_ylim(0, 2000)\n", - "ax.set_xlim(0, 8) # Visualising 8 seconds\n", - "plt.plot(\n", - " melodia_pitch_track[:, 0], melodia_pitch_track[:, 1],\n", - " color=\"white\", label=\"Melodia\",\n", - ")\n", - "plt.plot(\n", - " ftanet_pitch_track[:, 0], ftanet_pitch_track[:, 1],\n", - " color=\"black\",label=\"FTANet-Carnatic\",\n", - ")\n", - "plt.legend()\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ipd.Audio(prediction_input[:8*PITCH_EXTRACTION_SR], rate=PITCH_EXTRACTION_SR)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "See that the violin is fooling the Melodia algorithm which is getting all vocal pitch values one octave above, while FTA-Net is able to get the right pitch values. This is a very common issue when analysing melody in the context of Carnatic Music: the presence of violin shadowing the singing voice is an enormous challenge for the vocal models and algorithms.\n", - "\n", - "Let's now extract the entire pitch track from the available vocal stem." 
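To put a rough number on the octave issue just described, we can resample the Melodia estimates onto the FTA-Net time grid and count the voiced frames where Melodia sits roughly one octave (about 1200 cents) above FTA-Net. This is only a quick sanity check on the excerpt analysed above, not a formal evaluation; it assumes both pitch tracks are arrays of (time, frequency) rows, as used in the plotting cell.

```python
import numpy as np

# Align Melodia estimates to the FTA-Net time stamps
ft_time, ft_freq = ftanet_pitch_track[:, 0], ftanet_pitch_track[:, 1]
mel_freq = np.interp(ft_time, melodia_pitch_track[:, 0], melodia_pitch_track[:, 1])

# Keep frames where both methods report a voiced (non-zero) pitch
voiced = (ft_freq > 0) & (mel_freq > 0)
cents_diff = 1200 * np.log2(mel_freq[voiced] / ft_freq[voiced])

# Frames where Melodia is ~one octave above FTA-Net (within +/- 50 cents)
octave_up = np.abs(cents_diff - 1200) < 50
print(f"Voiced frames compared: {voiced.sum()}")
print(f"Melodia one octave above FTA-Net: {100 * octave_up.mean():.1f}% of compared frames")
```

With that rough check in mind, we now extract the full pitch track from the vocal stem.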
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "pitch_track = ftanet_carnatic.predict(vocal_path, input_sr=PITCH_EXTRACTION_SR)\n", - "pitch = pitch_track[:, 1] # Pitch in Hz\n", - "time = pitch_track[:, 0] # Time in seconds\n", - "timestep = time[2] - time[1] # Time in seconds between elements of pitch track" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We interpolate small gaps owing to glottal stops or consonant sounds." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from compiam.utils.pitch import interpolate_below_length" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pitch = interpolate_below_length(\n", - " pitch, # track to interpolate\n", - " 0, # value to interpolate \n", - " 200*0.001/timestep # maximum gap in number sequence elements to interpolate for\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's visualise our pitch plot using the plot_pitch function from compIAM.visualisation.pitch. We can manually change the yticks to correspond to theoretical svara positions by passing a custom dictionary of {ytick labels : y values}. Since we know the raga (from section 2.2) and the singer tonic (from section 2.1). We can use get_svara_pitch_carnatic to query for the svaras relevant to that raga, and pass that dictionary and the tonic to plot_pitch to alter the pitch plot." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from compiam.visualisation.pitch import plot_pitch, flush_matplotlib\n", - "from compiam.utils import ipy_audio\n", - "from compiam.utils import get_svara_pitch_carnatic" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# let's load the audio also\n", - "audio = estd.MonoLoader(filename=audio_path, sampleRate=PITCH_EXTRACTION_SR)()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "t1 = 304 # in seconds\n", - "t2 = 324 # in seconds\n", - "t1s = round(t1/timestep) # in sequence elements\n", - "t2s = round(t2/timestep) # in sequence elements" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "this_pitch = pitch[t1s:t2s]\n", - "this_time = time[t1s:t2s]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "silence_mask = this_pitch == 0" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "svara_pitch = get_svara_pitch_carnatic('mohanam', tonic=tonic)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "plot_pitch(\n", - " this_pitch,\n", - " this_time,\n", - " mask=silence_mask,\n", - " yticks_dict=svara_pitch,\n", - " tonic=tonic,\n", - " cents=True,\n", - " title=f'Excerpt of {rendition} by Brindha Manickavasakan'\n", - ")\n", - "ipy_audio(audio, t1, t2, sr=PITCH_EXTRACTION_SR)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Pitch curves are a really important feature for the computational analysis of Carnatic Music. 
The prominent ornamentation of notes and the transitions between notes are better represented with continuous pitch value arrays. However, since the lead melodic instruments are normally found mixed with accompanying instruments, we require methods that are able to capture the melodies in the presence of background music. \n", - "\n", - "Pitch curves, or tracks, can be used for a range of tasks that aim at extracting relevant melodic information from the performed melodies. Several classification and tagging tasks (e.g. raga classification) also build on top of melodic features, normally including pitch information, whether explicitly or embedded in other representations.\n", - "\n", - "For further reference, please visit the [pitch extraction page](melody-extraction)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. Rhythm analysis: Percussion onset detection" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# We import the tool\n", - "from compiam.rhythm.meter import AksharaPulseTracker\n", - "\n", - "# Let's initialize an instance of the tool\n", - "apt = AksharaPulseTracker()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "predicted_aksharas = apt.extract(audio_path)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from compiam.visualisation.audio import plot_waveform" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pulses = predicted_aksharas['aksharaPulses']\n", - "predicted_beats_dict = {\n", - " time_step: idx for idx, time_step in enumerate(pulses)\n", - "}\n", - "\n", - "# And we plot!\n", - "plot_waveform(\n", - " input_data=audio_path,\n", - " t1=272,\n", - " t2=276,\n", - " labels=predicted_beats_dict,\n", - ");" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "See some demos of related beat tracking and percussion pattern research carried out within the context of the CompMusic project." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import YouTubeVideo\n", - "\n", - "YouTubeVideo(\"wvrGhXFXtv8\", width=800, height=450)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. Melodic analysis: Melodic pattern discovery" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Melodic patterns are important building blocks in Carnatic music."
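Before moving to pattern discovery, one more look at the pulse estimates from the rhythm section above: the median inter-pulse interval gives a rough akshara period and hence an approximate akshara rate for the rendition. A minimal sketch, assuming the `aksharaPulses` values extracted above are pulse positions in seconds:

```python
import numpy as np

pulse_times = np.asarray(pulses, dtype=float)  # assumed to be in seconds

# The median inter-pulse interval is robust to occasional missed or spurious pulses
inter_pulse = np.diff(pulse_times)
akshara_period = np.median(inter_pulse)

print(f"Number of detected pulses: {len(pulse_times)}")
print(f"Approximate akshara period: {akshara_period:.3f} s")
print(f"Approximate akshara rate: {60 / akshara_period:.1f} pulses per minute")
```

With the pulse-level picture in place, we turn to the melodic patterns themselves.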
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "# Pattern Extraction for a Given Audio\n", - "from compiam import load_model\n", - "\n", - "# Feature Extraction: CAE features\n", - "cae = load_model(\"melody:cae-carnatic\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "# returns magnitude and phase\n", - "ampl, _ = cae.extract_features(vocal_path)\n", - "ampl" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from compiam.melody.pattern import self_similarity" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from compiam.utils.pitch import (\n", - " extract_stability_mask,\n", - " pitch_seq_to_cents,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pitch_cents = pitch_seq_to_cents(pitch, tonic=tonic)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "stability_mask = extract_stability_mask(\n", - " pitch=pitch_cents, # pitch track\n", - " min_stab_sec=1.0, # minimum cummulative length of stable windows to warrant annotation\n", - " hop_sec=0.2, # hop length in seconds\n", - " var=60, # minimum variation from the mean in each window to be considered stable\n", - " timestep=timestep # time in seconds between consecutice elements in \n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "silence_mask = pitch==0" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "stability_mask = extract_stability_mask(\n", - " pitch=pitch_cents, # pitch track\n", - " min_stab_sec=1.0, # minimum cummulative length of stable windows to warrant annotation\n", - " hop_sec=0.2, # hop length in seconds\n", - " var=60, # minimum variation from the mean in each window to be considered stable\n", - " timestep=timestep # time in seconds between consecutice elements in \n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "t1 = 304 # in seconds\n", - "t2 = t1 + 10 # in seconds\n", - "\n", - "t1s = round(t1/timestep) # in sequence elements\n", - "t2s = round(t2/timestep) # in sequence elements\n", - "\n", - "this_pitch = pitch[t1s:t2s]\n", - "this_time = time[t1s:t2s]\n", - "this_silence_mask = silence_mask[t1s:t2s]\n", - "this_stable_mask = stability_mask[t1s:t2s]\n", - "\n", - "# Get pitch plot\n", - "fig, ax = plot_pitch(\n", - " this_pitch,\n", - " this_time, \n", - " mask=this_silence_mask,\n", - " tonic=tonic,\n", - " yticks_dict=svara_pitch,\n", - " cents=True,\n", - " title=f'Excerpt of {rendition} by Brindha Manickavasakan'\n", - ")\n", - "\n", - "# On alternative axis plot stable mask values\n", - "ax2 = ax.twinx()\n", - "ax2.plot(this_time, this_stable_mask, 'g', linewidth=1, alpha=1, linestyle='--')\n", - "ax2.set_yticks([0,1])\n", - "ax2.set_ylabel(\"Is stable region?\")\n", - " \n", - "# Accompanying audio\n", - "ipy_audio(audio, t1, t2, sr=sr)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "exclusion_mask = np.logical_or(silence_mask==1, stability_mask==1)" - ] - 
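Before computing the self-similarity matrix in the next cells, it is worth checking how much of the recording the exclusion mask actually removes: if almost everything is flagged as silent or stable, little material is left for pattern discovery. A small sketch using the arrays defined above:

```python
import numpy as np

masked = np.asarray(exclusion_mask).astype(bool)
seconds_total = len(masked) * timestep
seconds_masked = masked.sum() * timestep

print(f"Excluded (silent or stable): {100 * masked.mean():.1f}% of frames")
print(f"That is {seconds_masked:.1f} s of {seconds_total:.1f} s; "
      f"{seconds_total - seconds_masked:.1f} s remain for pattern discovery")
```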
}, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Mask of regions not interested in cite kaustuv, the papers that use \n", - "ss = self_similarity(\n", - " ampl, # features\n", - " exclusion_mask=exclusion_mask, # exclusion mask\n", - " timestep=timestep, # time in seconds between elements of exlcusion mask\n", - " hop_length=cae.hop_length, # window size in audio frames\n", - " sr=cae.sr # sample rate of audio\n", - ")\n", - "\n", - "# Sparsely computed self similarity matrix \n", - "X = ss[0]\n", - "# Mapping of index between theoretical full matrix and sparse one\n", - "orig_sparse_lookup = ss[1]\n", - "# Mapping of index between sparse matrix and theoretical full matrix one\n", - "sparse_orig_lookup = ss[2]\n", - "# Indices of boundaries between split regions in full matrix\n", - "boundaries_orig = ss[3]\n", - "# Indices of boundaries between split regions in sparse matrix\n", - "boundaries_sparse = ss[4]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "fig, ax = plt.subplots(figsize=(10,10))\n", - "plt.title(f'Self similarity matrix for {rendition}', fontsize=9)\n", - "ax.imshow(X[2000:5000,2000:5000], interpolation='nearest')\n", - "plt.axis('off')\n", - "plt.tight_layout()\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from compiam.melody.pattern import segmentExtractor" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Emphasize Diagonal\n", - "se = segmentExtractor(\n", - " X, # self sim matrix\n", - " window_size=cae.hop_length, # window size\n", - " cache_dir='.cache/' # cache directory for faster computation in future\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "for i in np.arange(0.05, 0.15, 0.01):\n", - " X_proc = se.emphasize_diagonals(bin_thresh=i)\n", - " se.display_matrix(X_proc[2000:5000,2000:5000], title=f'bin_thresh={round(i,2)}', figsize=(5,5))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "X_proc = se.emphasize_diagonals(bin_thresh=0.11)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "all_segments = se.extract_segments(\n", - " timestep=timestep, # timestep between\n", - " boundaries=boundaries_sparse, # boundaries of sparse regions (for conversion)\n", - " lookup=sparse_orig_lookup, # To convert between sparse and true indices\n", - " break_mask=exclusion_mask) # mask corresponding to break points, any segment that traverse these points are broken into two" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"Format: [(x0, y0), (x1, y1)]...\")\n", - "all_segments[:10]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from compiam.utils import add_center_to_mask" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "exclusion_mask_center = add_center_to_mask(exclusion_mask) # center of masked regions is annotated as \"2\"\n", - "anchor_mask = np.array([1 if i==2 else 0 for i in exclusion_mask_center])" - ] - }, - { - "cell_type": "code", - 
"execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, - "outputs": [], - "source": [ - "# Returns patterns in units of pitch sequence elements\n", - "starts_seq, lengths_seq = se.group_segments(\n", - " all_segments, # segments from se.extract_segments()\n", - " anchor_mask, # Extend patterns to these points\n", - " pitch, # pitch track\n", - " min_pattern_length_seconds=2, # minimum pattern length,\n", - " thresh_dtw=None\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "starts_sec = [[x*timestep for x in p] for p in starts_seq]\n", - "lengths_sec = [[x*timestep for x in l] for l in lengths_seq]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(f\"Number of groups: {len(starts_sec)}\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from compiam.visualisation.pitch import plot_subsequence" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# kwargs for plot_pitch\n", - "plot_kwargs = {\n", - " 'yticks_dict': svara_pitch,\n", - " 'cents':True,\n", - " 'tonic':tonic,\n", - " 'figsize':(15,4)\n", - "}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "i = 1 # Choose pattern group\n", - "\n", - "S = starts_seq[i] # get group\n", - "L = lengths_seq[i] # get lengths\n", - "\n", - "for j,s in enumerate(S):\n", - " l = L[j] # this pattern length\n", - " ss = starts_sec[i][j] # this pattern start in seconds\n", - " ls = lengths_sec[i][j] # this pattern length in seconds\n", - " ipd.display(ipy_audio(audio, ss, ss+ls, sr=sr)) # display audio\n", - " # display pitch plot\n", - " plot_subsequence(s, l, pitch, time, timestep, path=None, plot_kwargs=plot_kwargs)\n", - " plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "full_concert_path = os.path.join(AUDIO_PATH, 'dr-brindha-manickavasakan') \n", - "shutil.rmtree(full_concert_path)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.20" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -}