diff --git a/AI/Day5/README.md b/AI/Day5/README.md
index 989ea8d..ee2ca55 100644
--- a/AI/Day5/README.md
+++ b/AI/Day5/README.md
@@ -1,22 +1,18 @@
 # ~ PoC AI Pool 2024 ~
-- ## Day 5: GNNs, NLP, and more
-  - ### Module 1: Graph Neural Networks
-    - **Notebook:** [`gnn.ipynb`](./gnn.ipynb)
-  - ### Module 2: Natural Language Processing
+- ## Day 5: NLP
+  - ### Module 1: Natural Language Processing
     - **Notebook:** [`nlp.ipynb`](./nlp.ipynb)
 
 ---
 
 **The finish line is near !**
 
-On today's menu, we'll explore various topics within the field; such as graph neural networks and natural language processing.
+On today's menu, we'll explore the field of natural language processing.
 
 > Here's a list of resources that we believe can be useful to follow along (and that we've ourselves used to learn these topics before being able to write the subjects):
 
 ## Module 1
-- [Maxime Labonne](https://mlabonne.github.io/blog/)
-  - [Hands-On GNNs](https://mlabonne.github.io/blog/book.html)
-  - [GNN articles](https://mlabonne.github.io/blog/posts/2022_02_20_Graph_Convolution_Network.html)
+- [Introduction to Natural Language Processing - Data Science Dojo](https://youtube.com/watch?v=s5zuplW8ua8)
 
 ## Module 2
diff --git a/AI/Day5/gnn.ipynb b/AI/Day5/gnn.ipynb
deleted file mode 100644
index bc053fd..0000000
--- a/AI/Day5/gnn.ipynb
+++ /dev/null
@@ -1,31 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "## imports"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# ~ PoC AI Pool 2024 ~\n",
-    "- ## Day 5: GNNs, NLP, and more\n",
-    "  - ### Module 1: Graph Neural Networks\n",
-    "-----\n",
-    "Welcome to the final day of your PoC AI Pool !"
-   ]
-  }
- ],
- "metadata": {
-  "language_info": {
-   "name": "python"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/AI/Day5/nlp.ipynb b/AI/Day5/nlp.ipynb
new file mode 100644
index 0000000..b73408b
--- /dev/null
+++ b/AI/Day5/nlp.ipynb
@@ -0,0 +1,516 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# ~ PoC AI Pool 2024 ~\n",
+    "- ## Day 5: NLP\n",
+    "  - ### Module 1: Language Detection with NLP\n",
+    "-----\n",
+    "Welcome to the final day of your PoC AI Pool!\n",
+    "\n",
+    "In this module, we'll see a different way of using PyTorch: we'll build a natural language processing neural network capable of detecting the language of a given sentence."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Data Cleaning"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import seaborn as sns\n",
+    "import sklearn\n",
+    "import torch\n",
+    "import torch.nn as nn"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's import the language dataset from the `datasets` package πŸ“¦:\n",
+    "\n",
+    ">Datasets is a library for easily accessing and sharing datasets for Audio πŸ”‰, Computer Vision πŸ‘οΈ, and Natural Language Processing (NLP) πŸ“– tasks.\n",
+    "\n",
+    "We will be using the [papluca/language-identification](https://huggingface.co/datasets/papluca/language-identification) dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datasets import load_dataset\n",
+    "\n",
+    "dataset = load_dataset(\"papluca/language-identification\")"
+   ]
+  },
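+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before going further, it can help to peek at a raw example. `datasets` exposes each row as a dictionary; in this dataset the language code lives under the `labels` key and the sentence under `text` (the same keys the filtering code below relies on). This is only a sanity check, so feel free to skip it:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sanity check: inspect the available splits and the first training example.\n",
+    "print(dataset)\n",
+    "print(dataset[\"train\"][0])"
+   ]
+  },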
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The code below transforms the dataset into a pandas DataFrame, which we will use for the rest of this module."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def filter_dataset(data, languages):\n",
+    "    # Keep only the rows whose label is one of the languages we care about\n",
+    "    return data.filter(lambda x: x['labels'] in languages)\n",
+    "\n",
+    "def process_dataset(data):\n",
+    "    # Pair each label with its sentence\n",
+    "    return data.map(lambda x: {'data': (x['labels'], x['text'])})['data']\n",
+    "\n",
+    "languages = {\n",
+    "    'fr': 'french',\n",
+    "    'en': 'english',\n",
+    "    'es': 'spanish',\n",
+    "    'de': 'german'\n",
+    "}\n",
+    "\n",
+    "filtered_data = filter_dataset(dataset['train'], list(languages.keys()))\n",
+    "processed_data = process_dataset(filtered_data)\n",
+    "\n",
+    "df = pd.DataFrame(processed_data, columns=[\"languages\", \"text\"])\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Your output should look like this:\n",
+    "\n",
+    "![](images/expected_output_lang.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 1. Cleaning the data 🧹\n",
+    "\n",
+    "First off, you need to clean the data using natural language processing techniques.\n",
+    "\n",
+    "However you achieve this, your cleaned data should be available inside a pandas DataFrame."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As long as you've cleaned it correctly, it doesn't matter what your exact result is.\n",
+    "\n",
+    "As an example, the sentence \"May The Force be with you.\" might become \"may force\" when cleaned.\\\n",
+    "If your result looks like that, it means you've implemented the cleaning process correctly. πŸ‘"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import nltk\n",
+    "from nltk.corpus import stopwords\n",
+    "nltk.download(\"stopwords\")\n",
+    "nltk.download(\"popular\")\n",
+    "\n",
+    "import re"
+   ]
+  },
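+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you're unsure where to start, here's a minimal sketch of the kind of pipeline `clean()` could chain together: lowercasing, stripping punctuation with a regex, tokenizing, and removing stop words. It's only an illustration on the example sentence above (English-only for simplicity); your `clean()` should handle all four languages:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Illustration only: one possible cleaning pipeline on a single sentence.\n",
+    "example = \"May The Force be with you.\"\n",
+    "no_punct = re.sub(r\"[^\\\\w\\\\s]\", \"\", example.lower())  # lowercase + strip punctuation\n",
+    "tokens = nltk.word_tokenize(no_punct)  # split into words\n",
+    "print(\" \".join(w for w in tokens if w not in stopwords.words(\"english\")))  # -> \"may force\""
+   ]
+  },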
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "language_names = list(languages.values())\n",
+    "stop_words = stopwords.words(language_names)\n",
+    "\n",
+    "def clean(sentence):\n",
+    "    \"\"\"\n",
+    "    You should clean the data inside this function by using\n",
+    "    different NLP techniques.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    clean_data = sentence\n",
+    "\n",
+    "    # Enter your code here\n",
+    "\n",
+    "    #\n",
+    "\n",
+    "    return clean_data\n",
+    "\n",
+    "df[\"clean\"] = df[\"text\"].apply(clean)\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 2. Count Vectorizer πŸ’»\n",
+    "\n",
+    "Now, in order to prepare the data for usage inside a neural network, you need to vectorize each word in the vocabulary and replace all usages inside your data with the corresponding tensors.\n",
+    "\n",
+    "- Step 1: Build a vocabulary containing each word in the dataset (each word must only appear once)\n",
+    "- Step 2: Vectorize each sentence in the dataset πŸ”‘ -> πŸ”’ by replacing it with an array containing the number of occurrences of each word in the vocabulary inside the sentence.\n",
+    "- Step 3: Vectorize your labels (for example, you can replace french πŸ‡«πŸ‡· with index 0, spanish πŸ‡ͺπŸ‡Έ with index 1, etc.)\n",
+    "\n",
+    "If you implement all of these steps correctly, you will have a vectorized dataset that a neural network can process!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You might first want to create a vocabulary comprised of all the words in your cleaned data.\n",
+    "\n",
+    ">Build a vocabulary containing each word in the dataset (each word must only appear once)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def build_vocab(sentences):\n",
+    "    \"\"\"\n",
+    "    This method should return a vocabulary of all unique words in our DataFrame\n",
+    "    \"\"\"\n",
+    "    ### Enter your code here\n",
+    "\n",
+    "    ###\n",
+    "\n",
+    "    return None"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If the `build_vocab()` function is implemented properly, you should be able to run the code below πŸ‘‡ and see how many words were removed thanks to cleaning."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vocab_vanilla = build_vocab(df[\"text\"].apply(nltk.word_tokenize))\n",
+    "vocab = build_vocab(df[\"clean\"])\n",
+    "\n",
+    "print(f\"Number of words in unprocessed data: {len(vocab_vanilla)}\")\n",
+    "print(f\"Number of words in processed data: {len(vocab)}\")\n",
+    "\n",
+    "vocab"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, for the fun part: implement the Count Vectorizer.\n",
+    "\n",
+    ">Vectorize each sentence in the dataset πŸ”‘ -> πŸ”’ by replacing it with an array containing the number of occurrences of each word in the vocabulary inside the sentence."
+   ]
+  },
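+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a toy illustration of what count vectorization produces (using a made-up four-word vocabulary, independent of your `vocab`): each sentence becomes an array as long as the vocabulary, where position *i* counts how often vocabulary word *i* appears in the sentence."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Toy example: count how often each vocabulary word occurs in a sentence.\n",
+    "toy_vocab = [\"may\", \"force\", \"hello\", \"world\"]\n",
+    "toy_sentence = \"may force may\".split()\n",
+    "print([toy_sentence.count(word) for word in toy_vocab])  # -> [2, 1, 0, 0]"
+   ]
+  },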
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "word2idx = {}\n",
+    "\n",
+    "for index, word in enumerate(vocab):\n",
+    "    word2idx[word] = index\n",
+    "\n",
+    "def vectorize(sentences):\n",
+    "    vectorized = []\n",
+    "\n",
+    "    ### Enter your code here\n",
+    "\n",
+    "    ###\n",
+    "\n",
+    "    return vectorized\n",
+    "\n",
+    "df[\"vectorized\"] = vectorize(df[\"clean\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now for the label vectorization:\n",
+    "\n",
+    ">Vectorize your labels (for example, you can replace french πŸ‡«πŸ‡· with index 0, spanish πŸ‡ͺπŸ‡Έ with index 1, etc.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Label Vectorizer\n",
+    "\n",
+    "languages_dict = {\n",
+    "    \"fr\": 0,\n",
+    "    \"en\": 1,\n",
+    "    \"es\": 2,\n",
+    "    \"de\": 3,\n",
+    "}\n",
+    "\n",
+    "labels = []\n",
+    "\n",
+    "# Enter your code here\n",
+    "\n",
+    "#\n",
+    "\n",
+    "labels"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Neural Network 🧠"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In order to process the data with PyTorch, let's convert it into tensors:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "x = torch.FloatTensor(df[\"vectorized\"].tolist())\n",
+    "y = torch.LongTensor(labels)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, you need to create your neural network and train a model on your data (the sketch after these steps shows the generic shape of a training loop).\n",
+    "\n",
+    "- Step 1: Build a network in [PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) (your model can be simple as long as it does the job)\n",
+    "- Step 2: Split your data into train and test subsets (you can use [sklearn's method](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for this)\n",
+    "- Step 3: Train a model on your data until you reach a good accuracy (above 90%)"
+   ]
+  },
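+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you've never written a PyTorch training loop, the generic pattern looks like this. It's a sketch on random dummy data with made-up sizes (10 features, 4 classes) and an arbitrary `nn.Linear` standing in for your `Network`; the point is the order of operations, which you can adapt inside `training_loop` below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Generic PyTorch training-loop pattern on dummy data (illustration only).\n",
+    "dummy_x = torch.randn(8, 10)         # 8 samples, 10 made-up features\n",
+    "dummy_y = torch.randint(0, 4, (8,))  # 8 labels among 4 classes\n",
+    "\n",
+    "dummy_model = nn.Linear(10, 4)       # stand-in for your Network\n",
+    "dummy_criterion = nn.CrossEntropyLoss()\n",
+    "dummy_optimizer = torch.optim.Adam(dummy_model.parameters(), lr=1e-3)\n",
+    "\n",
+    "for epoch in range(3):\n",
+    "    dummy_optimizer.zero_grad()                             # reset gradients\n",
+    "    loss = dummy_criterion(dummy_model(dummy_x), dummy_y)   # forward pass\n",
+    "    loss.backward()                                         # backpropagation\n",
+    "    dummy_optimizer.step()                                  # update weights\n",
+    "    print(f\"epoch {epoch}: loss = {loss.item():.4f}\")"
+   ]
+  },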
"\n", + "\n", + "\n", + " #\n", + "\n", + " train_accuracy = None\n", + " test_accuracy = None\n", + "\n", + " return train_accuracy, test_accuracy\n", + "\n", + "###\n", + "\n", + "# Store the predictions for all of our data as well as the % of training and testing accuracy inside `predictions`, `train_accuracy` and `test_accuracy`\n", + "train_accuracy, test_accuracy = training_loop(x, y)\n", + "\n", + "print(f\"Train accuracy: {train_accuracy}\")\n", + "print(f\"Test accuracy: {test_accuracy}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If all went well, your accuracy should be close to 100%. πŸ’―\n", + "\n", + "Now, let's see how well the model guesses a language:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "### Prediction\n", + "\n", + "idx2lang = {\n", + " 0: \"fr\",\n", + " 1: \"en\",\n", + " 2: \"es\",\n", + " 3: \"de\",\n", + "}\n", + "\n", + "def predict(x):\n", + " predictions = []\n", + "\n", + " return predictions\n", + "\n", + "predictions = predict(x)\n", + "\n", + "df[\"predictions\"] = predictions\n", + "\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sns.countplot(x='value', hue=\"variable\", data=df[['languages', 'predictions']].melt())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Awesome ! πŸ˜„\n", + "\n", + "You've successfully created a language detection AI using Natural Language Processing and neural networks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def predict_sentence(sentence):\n", + " return predict(vectorize([clean(sentence)]))\n", + "\n", + "predict_sentence(\"J'ai rΓ©ussi Γ  implΓ©menter une intelligence artificielle !\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + }, + "orig_nbformat": 4, + "vscode": { + "interpreter": { + "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" + } + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}