diff --git a/AI/Day5/README.md b/AI/Day5/README.md
index 989ea8d..ee2ca55 100644
--- a/AI/Day5/README.md
+++ b/AI/Day5/README.md
@@ -1,22 +1,18 @@
# ~ PoC AI Pool 2024 ~
-- ## Day 5: GNNs, NLP, and more
- - ### Module 1: Graph Neural Networks
- - **Notebook:** [`gnn.ipynb`](./gnn.ipynb)
- - ### Module 2: Natural Language Processing
+- ## Day 5: NLP
+ - ### Module 1: Natural Language Processing
- **Notebook:** [`nlp.ipynb`](./nlp.ipynb)
---
**The finish line is near !**
-On today's menu, we'll explore various topics within the field; such as graph neural networks and natural language processing.
+On today's menu, we'll explore the field of natural language processing.
> Here's a list of resources that we believe can be useful to follow along (and that we ourselves used to learn these topics before writing the exercises):
## Module 1
-- [Maxime Labonne](https://mlabonne.github.io/blog/)
- - [Hands-On GNNs](https://mlabonne.github.io/blog/book.html)
- - [GNN articles](https://mlabonne.github.io/blog/posts/2022_02_20_Graph_Convolution_Network.html)
+- [Introduction to Natural Language Processing - Data Science Dojo](https://youtube.com/watch?v=s5zuplW8ua8)
## Module 2
diff --git a/AI/Day5/gnn.ipynb b/AI/Day5/gnn.ipynb
deleted file mode 100644
index bc053fd..0000000
--- a/AI/Day5/gnn.ipynb
+++ /dev/null
@@ -1,31 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "## imports"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# ~ PoC AI Pool 2024 ~\n",
- "- ## Day 5: GNNs, NLP, and more\n",
- " - ### Module 1: Graph Neural Networks\n",
- "-----\n",
- "Welcome to the final day of your PoC AI Pool !"
- ]
- }
- ],
- "metadata": {
- "language_info": {
- "name": "python"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/AI/Day5/nlp.ipynb b/AI/Day5/nlp.ipynb
new file mode 100644
index 0000000..b73408b
--- /dev/null
+++ b/AI/Day5/nlp.ipynb
@@ -0,0 +1,516 @@
+{
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# ~ PoC AI Pool 2024 ~\n",
+ "- ## Day 5: NLP\n",
+ " - ### Module 1: Emotion Recognition with NLP\n",
+ "-----\n",
+ "Welcome to the final day of your PoC AI Pool !\n",
+ "\n",
+ "In this module, we'll see a different way of using PyTorch to to build a Natural Language Processing neural network which is capable of detecting the language of a given sentence."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. Data Cleaning"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import seaborn as sns\n",
+ "import sklearn\n",
+ "import torch\n",
+ "import torch.nn as nn"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's import the language dataset from the `datasets` package π¦ :\n",
+ "\n",
+ ">Datasets is a library for easily accessing and sharing datasets for Audio π, Computer Vision ποΈ , and Natural Language Processing (NLP) π tasks.\n",
+ "\n",
+ "We will be using the [papluca/language-identification](https://huggingface.co/datasets/papluca/language-identification)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from datasets import load_dataset\n",
+ "\n",
+ "dataset = load_dataset(\"papluca/language-identification\")"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The below code will transform your dataset into a pandas Dataframe which we will use for the rest of this module."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def filter_dataset(data, languages):\n",
+ " return data.filter(lambda x: languages.__contains__(x['labels']))\n",
+ "\n",
+ "def process_dataset(data):\n",
+ " return data.map(lambda x: {'data': (x['labels'], x['text'])})['data']\n",
+ "\n",
+ "languages = {\n",
+ " 'fr': 'french',\n",
+ " 'en': 'english',\n",
+ " 'es': 'spanish',\n",
+ " 'de': 'german'\n",
+ "}\n",
+ "\n",
+ "filtered_data = filter_dataset(dataset['train'], list(languages.keys()))\n",
+ "processed_data = process_dataset(filtered_data)\n",
+ "\n",
+ "df = pd.DataFrame(processed_data, columns=[\"languages\", \"text\"])\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Your output should look like this:\n",
+ "\n",
+ "![](images/expected_output_lang.png)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### 1. Cleaning the data π§Ή\n",
+ "\n",
+ "\n",
+ "\n",
+ "First off, you need to clean the data using natural language processing techniques.\n",
+ "\n",
+ "However you achieve this, your cleaned data should be available inside a pandas dataframe."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As long as you've cleaned it correctly, it doesn't matter what your result is.\n",
+ "\n",
+ "As an example, the sentence \"May The Force be with you.\" might become \"may force\" when cleaned.\\\n",
+ "If your result looks like that, it means you've implemented the cleaning process correctly. π"
+ ]
+ },
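+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you're unsure where to start, here is a minimal sketch of what a cleaning function *could* look like, assuming lowercasing, punctuation stripping and stop-word removal (`clean_example` is purely illustrative, not the expected solution):\n",
+ "\n",
+ "```python\n",
+ "import re\n",
+ "import nltk\n",
+ "from nltk.corpus import stopwords\n",
+ "\n",
+ "example_stop_words = set(stopwords.words([\"french\", \"english\", \"spanish\", \"german\"]))\n",
+ "\n",
+ "def clean_example(sentence):\n",
+ "    # keep only letters (accented ones included) and whitespace\n",
+ "    sentence = re.sub(r\"[^a-zA-ZÀ-ÿ\\s]\", \" \", sentence.lower())\n",
+ "    # tokenize, then drop the stop words\n",
+ "    tokens = [w for w in nltk.word_tokenize(sentence) if w not in example_stop_words]\n",
+ "    return \" \".join(tokens)\n",
+ "\n",
+ "clean_example(\"May The Force be with you.\")  # -> 'may force'\n",
+ "```"
+ ]
+ },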
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import nltk\n",
+ "from nltk.corpus import stopwords\n",
+ "nltk.download(\"stopwords\")\n",
+ "nltk.download(\"popular\")\n",
+ "\n",
+ "import re"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "languages = [languages[language] for language in languages.keys()]\n",
+ "stop_words = stopwords.words(languages)\n",
+ "\n",
+ "def clean(sentence):\n",
+ " \"\"\"\n",
+ " You should clean the data inside this function by using\n",
+ " different nlp techniques.\n",
+ " \"\"\"\n",
+ "\n",
+ " clean_data = sentence\n",
+ "\n",
+ " # Enter your code here\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ " #\n",
+ "\n",
+ " return clean_data\n",
+ "\n",
+ "df[\"clean\"] = df[\"text\"].apply(clean)\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "##### 2. Count Vectorizer π»\n",
+ "\n",
+ "\n",
+ "Now, in order to prepare the data for usage inside a neural network, you need to vectorize each word in the vocabulary and replace all usages inside your data with the corresponding tensors.\n",
+ "\n",
+ "- Step 1: Build a vocabulary containing each word in the dataset (each word must only appear once)\n",
+ "- Step 2: Vectorize each sentence in the dataset π‘ -> π’ by replacing it with an array containing the number of occurences of each word in the vocabulary inside the sentence.\n",
+ "- Step 3: Vectorize your labels (for example, you can replace french π«π· with index 0, spanish πͺπΈ with index 1, etc... )\n",
+ "\n",
+ "If you implement all of these steps correctly, you will have a vectorized dataset which will be processable inside a neural network ! \n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You might first want to create a vocabulary comprised of all the words in your cleaned data.\n",
+ "\n",
+ ">Build a vocabulary containing each word in the dataset (each word must only appear once)"
+ ]
+ },
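+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a hint, here is one possible shape for such a function (a sketch assuming a sentence is either a cleaned string or an already-tokenized list; `build_vocab_example` is only an illustration):\n",
+ "\n",
+ "```python\n",
+ "def build_vocab_example(sentences):\n",
+ "    vocab = set()\n",
+ "    for sentence in sentences:\n",
+ "        # accept either a whitespace-separated string or a list of tokens\n",
+ "        tokens = sentence.split() if isinstance(sentence, str) else sentence\n",
+ "        vocab.update(tokens)\n",
+ "    return sorted(vocab)\n",
+ "```"
+ ]
+ },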
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def build_vocab(sentences):\n",
+ " \"\"\"\n",
+ " This method should return a vocabulary of all unique words in our dataframe\n",
+ " \"\"\"\n",
+ " ### Enter your code here\n",
+ "\n",
+ "\n",
+ " \n",
+ "\n",
+ " ###\n",
+ "\n",
+ " return None"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If the `build_vocab()` function is implemented properly, you should be able to run the code below π and see how many words were removed thanks to cleaning."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vocab_vanilla = build_vocab(df[\"text\"].apply(nltk.word_tokenize))\n",
+ "vocab = build_vocab(df[\"clean\"])\n",
+ "\n",
+ "print(f\"Number of words in unprocessed data: {len(vocab_vanilla)}\")\n",
+ "print(f\"Number of words in processed data: {len(vocab)}\")\n",
+ "\n",
+ "vocab"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now, for the fun part: implement the Count Vectorizer\n",
+ "\n",
+ ">Vectorize each sentence in the dataset π‘ -> π’ by replacing it with an array containing the number of occurences of each word in the vocabulary inside the sentence."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "word2idx = {}\n",
+ "\n",
+ "for index, word in enumerate(vocab):\n",
+ " word2idx[word] = index\n",
+ "\n",
+ "def vectorize(sentences):\n",
+ " vectorized = []\n",
+ "\n",
+ " ### Enter your code here\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ " ###\n",
+ "\n",
+ " return vectorized\n",
+ "\n",
+ "df[\"vectorized\"] = vectorize(df[\"clean\"])"
+ ]
+ },
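+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you're stuck, the body of `vectorize` could follow this kind of logic (a sketch assuming each cleaned sentence is a whitespace-separated string, reusing the `word2idx` mapping from above):\n",
+ "\n",
+ "```python\n",
+ "def vectorize_example(sentences):\n",
+ "    vectorized = []\n",
+ "    for sentence in sentences:\n",
+ "        # one counter per vocabulary word\n",
+ "        counts = [0] * len(word2idx)\n",
+ "        for word in sentence.split():\n",
+ "            if word in word2idx:\n",
+ "                counts[word2idx[word]] += 1\n",
+ "        vectorized.append(counts)\n",
+ "    return vectorized\n",
+ "```"
+ ]
+ },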
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now for the label vectorization:\n",
+ "\n",
+ ">Vectorize your labels (for example, you can replace french π«π· with index 0, spanish πͺπΈ with index 1, etc... )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Label Vectorizer\n",
+ "\n",
+ "languages_dict = {\n",
+ " \"fr\": 0,\n",
+ " \"en\": 1,\n",
+ " \"es\": 2,\n",
+ " \"de\": 3,\n",
+ "}\n",
+ "\n",
+ "labels = []\n",
+ "\n",
+ "# Enter your code here\n",
+ "\n",
+ "#\n",
+ "\n",
+ "labels"
+ ]
+ },
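+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For reference, the whole label vectorization can boil down to a single comprehension (assuming the `languages` column of the DataFrame holds the two-letter codes):\n",
+ "\n",
+ "```python\n",
+ "# one possible way to fill `labels`\n",
+ "labels = [languages_dict[language] for language in df[\"languages\"]]\n",
+ "```"
+ ]
+ },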
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Neural Network π§ \n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In order to process the data with PyTorch, let's convert it into tensors:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "x = torch.FloatTensor(df[\"vectorized\"])\n",
+ "y = torch.LongTensor(labels)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now, you need to create your neural network and train a model on our data.\n",
+ "\n",
+ "- Step 1: Build a network in [PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) (your model can be simple as long as it does the job)\n",
+ "- Step 2: Split your data into train and test subsets (you can use [sklearn's method](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for this)\n",
+ "- Step 3: Train a model on your data until you reach a good accuracy (above 90%)"
+ ]
+ },
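+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To give you an idea of the scale required, here is one possible (deliberately simple) architecture; the hidden size and learning rate are arbitrary choices, not requirements, and `ExampleNetwork` assumes the `vocab` built earlier:\n",
+ "\n",
+ "```python\n",
+ "import torch\n",
+ "import torch.nn as nn\n",
+ "\n",
+ "class ExampleNetwork(nn.Module):\n",
+ "    def __init__(self, vocab_size, num_classes=4, hidden_size=128):\n",
+ "        super().__init__()\n",
+ "        # a bag-of-words vector in -> one score per language out\n",
+ "        self.layers = nn.Sequential(\n",
+ "            nn.Linear(vocab_size, hidden_size),\n",
+ "            nn.ReLU(),\n",
+ "            nn.Linear(hidden_size, num_classes),\n",
+ "        )\n",
+ "\n",
+ "    def forward(self, x):\n",
+ "        return self.layers(x)\n",
+ "\n",
+ "# typical choices for multi-class classification\n",
+ "model = ExampleNetwork(vocab_size=len(vocab))\n",
+ "criterion = nn.CrossEntropyLoss()\n",
+ "optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)\n",
+ "```"
+ ]
+ },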
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "### Neural Network\n",
+ "\n",
+ "class Network(nn.Module):\n",
+ " def __init__(self):\n",
+ " super(Network, self).__init__()\n",
+ "\n",
+ " def forward(self, x):\n",
+ " pass\n",
+ "\n",
+ "###\n",
+ "\n",
+ "model = Network()\n",
+ "\n",
+ "criterion = None\n",
+ "optimizer = None\n",
+ "\n",
+ "from torch.utils.data import Dataset, DataLoader\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "class MyData(Dataset):\n",
+ " \"\"\"\n",
+ " This class will be useful when working with batches\n",
+ " \"\"\"\n",
+ "\n",
+ " def __init__(self, x, y):\n",
+ " self.data = x\n",
+ " self.target = y\n",
+ "\n",
+ " def __getitem__(self, index):\n",
+ " x = self.data[index]\n",
+ " y = self.target[index]\n",
+ "\n",
+ " return x, y\n",
+ "\n",
+ " def __len__(self):\n",
+ " return len(self.data)\n",
+ "\n",
+ "### Training and Testing\n",
+ "\n",
+ "def training_loop(x, y):\n",
+ " x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)\n",
+ "\n",
+ " train_dataset = MyData(x_train, y_train)\n",
+ " test_dataset = MyData(x_test, y_test)\n",
+ "\n",
+ " train_dataset = DataLoader(train_dataset, batch_size=32)\n",
+ " test_dataset = DataLoader(test_dataset, batch_size=32)\n",
+ "\n",
+ " # Enter your code here\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ " #\n",
+ "\n",
+ " train_accuracy = None\n",
+ " test_accuracy = None\n",
+ "\n",
+ " return train_accuracy, test_accuracy\n",
+ "\n",
+ "###\n",
+ "\n",
+ "# Store the predictions for all of our data as well as the % of training and testing accuracy inside `predictions`, `train_accuracy` and `test_accuracy`\n",
+ "train_accuracy, test_accuracy = training_loop(x, y)\n",
+ "\n",
+ "print(f\"Train accuracy: {train_accuracy}\")\n",
+ "print(f\"Test accuracy: {test_accuracy}\")"
+ ]
+ },
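+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In case the training loop itself gives you trouble, its core usually looks something like the sketch below (assuming the `model`, `criterion`, `optimizer` and `train_loader` defined above):\n",
+ "\n",
+ "```python\n",
+ "num_epochs = 10  # arbitrary; raise it if the accuracy is too low\n",
+ "\n",
+ "for epoch in range(num_epochs):\n",
+ "    for batch_x, batch_y in train_loader:\n",
+ "        optimizer.zero_grad()                      # reset the gradients\n",
+ "        loss = criterion(model(batch_x), batch_y)  # forward pass + loss\n",
+ "        loss.backward()                            # backpropagation\n",
+ "        optimizer.step()                           # weight update\n",
+ "```"
+ ]
+ },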
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If all went well, your accuracy should be close to 100%. π―\n",
+ "\n",
+ "Now, let's see how well the model guesses a language:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "### Prediction\n",
+ "\n",
+ "idx2lang = {\n",
+ " 0: \"fr\",\n",
+ " 1: \"en\",\n",
+ " 2: \"es\",\n",
+ " 3: \"de\",\n",
+ "}\n",
+ "\n",
+ "def predict(x):\n",
+ " predictions = []\n",
+ "\n",
+ " return predictions\n",
+ "\n",
+ "predictions = predict(x)\n",
+ "\n",
+ "df[\"predictions\"] = predictions\n",
+ "\n",
+ "df"
+ ]
+ },
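+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If `predict` leaves you puzzled, the usual recipe is to take the argmax of the model's scores and map each index back through `idx2lang` (a sketch, assuming the trained `model` from above):\n",
+ "\n",
+ "```python\n",
+ "def predict_example(x):\n",
+ "    # no gradients needed at inference time\n",
+ "    with torch.no_grad():\n",
+ "        scores = model(x)\n",
+ "    # highest score -> class index -> language code\n",
+ "    return [idx2lang[i] for i in scores.argmax(dim=1).tolist()]\n",
+ "```"
+ ]
+ },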
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sns.countplot(x='value', hue=\"variable\", data=df[['languages', 'predictions']].melt())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Awesome ! π\n",
+ "\n",
+ "You've successfully created a language detection AI using Natural Language Processing and neural networks."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def predict_sentence(sentence):\n",
+ " return predict(vectorize([clean(sentence)]))\n",
+ "\n",
+ "predict_sentence(\"J'ai rΓ©ussi Γ implΓ©menter une intelligence artificielle !\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.6"
+ },
+ "orig_nbformat": 4,
+ "vscode": {
+ "interpreter": {
+ "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
+ }
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}