diff --git a/open-machine-learning-jupyter-book/_toc.yml b/open-machine-learning-jupyter-book/_toc.yml
index 478f83569..9abcc8a12 100644
--- a/open-machine-learning-jupyter-book/_toc.yml
+++ b/open-machine-learning-jupyter-book/_toc.yml
@@ -230,6 +230,7 @@ parts:
- file: assignments/deep-learning/difussion-model/denoising-difussion-model
- file: assignments/deep-learning/object-detection/car-object-detection
- file: assignments/deep-learning/overview/basic-classification-classify-images-of-clothing
+ - file: assignments/deep-learning/nlp/getting-start-nlp-with-classification-task
- file: slides/introduction
sections:
- file: slides/python-programming/python-programming-introduction
diff --git a/open-machine-learning-jupyter-book/assignments/deep-learning/nlp/getting-start-nlp-with-classification-task.ipynb b/open-machine-learning-jupyter-book/assignments/deep-learning/nlp/getting-start-nlp-with-classification-task.ipynb
new file mode 100644
index 000000000..457d84b13
--- /dev/null
+++ b/open-machine-learning-jupyter-book/assignments/deep-learning/nlp/getting-start-nlp-with-classification-task.ipynb
@@ -0,0 +1,1152 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "6e14052d-5eec-449a-b6d5-bcc100456aba",
+ "metadata": {},
+ "source": [
+ "# Getting started with NLP for a classification task"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "259d0ecc-b734-44ef-b989-49a6e127a944",
+ "metadata": {},
+ "source": [
+ "One area where deep learning has dramatically improved in the last couple of years is natural language processing (NLP). Computers can now generate text, translate automatically from one language to another, analyze comments, label words in sentences, and much more.\n",
+ "\n",
+ "Perhaps the most practically useful application of NLP is classification -- that is, automatically classifying a document into some category. This can be used, for instance, for:\n",
+ "\n",
+ "- Sentiment analysis (e.g., are people saying positive or negative things about your product?)\n",
+ "- Author identification (what author most likely wrote some document)\n",
+ "- Legal discovery (which documents are in scope for a trial)\n",
+ "- Organizing documents by topic\n",
+ "- Triaging inbound emails\n",
+ "- ...and much more!\n",
+ "\n",
+ "Today, we are tasked with comparing two words or short phrases and scoring them on how similar they are, based on which patent class they were used in. A score of 1 means the two inputs have identical meaning, and 0 means they have totally different meanings. For instance, \"abatement\" and \"eliminating process\" have a score of 0.5, meaning they're somewhat similar, but not identical.\n",
+ "\n",
+ "It turns out that this can be represented as a classification problem. How? By representing the question like this:\n",
+ "\n",
+ "> For the following text...: \"TEXT1: abatement; TEXT2: eliminating process\" ...choose a category of meaning similarity: \"Different; Similar; Identical\".\n",
+ "\n",
+ "In this assignment we'll see how to solve the Patent Phrase Matching problem by treating it as a classification task, representing the data in a way very similar to that shown above."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "12389d2f-c08d-4941-a759-d8bbd8fa44ba",
+ "metadata": {},
+ "source": [
+ "## Import and EDA"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "id": "4e4e3c06-4292-40a7-bb1a-66207159c604",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "from datasets import Dataset,DatasetDict\n",
+ "from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer\n",
+ "import warnings\n",
+ "\n",
+ "warnings.filterwarnings(\"ignore\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f3085076-3dbc-4eab-941c-9c7e2a4b740e",
+ "metadata": {},
+ "source": [
+ "First of all, let's import the dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "cbd450f6-31dd-46c9-918e-e0f3984d1436",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " id anchor target context score\n",
+ "0 37d61fd2272659b1 abatement abatement of pollution A47 0.50\n",
+ "1 7b9652b17b68b7a4 abatement act of abating A47 0.75\n",
+ "2 36d72442aefd8232 abatement active catalyst A47 0.25\n",
+ "3 5296b0c19e1ce60e abatement eliminating process A47 0.50\n",
+ "4 54c1e3b9184cb5b6 abatement forest region A47 0.00"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.read_csv('https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/deep-learning/nlp/phrase_matching_train.csv')\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8c3283b2-2a18-447f-8400-f6d6efc95f4d",
+ "metadata": {},
+ "source": [
+ "As you can see, there are 5 columns, where **anchor** and **target** are a pair of phrases, **context** is the patent class they are used in, and **score** is the similarity score between the anchor and the target."
+ ]
+ },
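+ {
+ "cell_type": "markdown",
+ "id": "2b9c51af-7d3e-4f0a-9c11-0a54a1e7d101",
+ "metadata": {},
+ "source": [
+ "Before going further, it can help to look at how the `score` label is distributed. The cell below is a small optional check (the rest of the notebook does not depend on it); it should show that the scores only take a handful of discrete values between 0 and 1, which is what lets us treat similarity as a set of categories."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2b9c51af-7d3e-4f0a-9c11-0a54a1e7d102",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Optional sanity check: how is the similarity score distributed?\n",
+ "df.score.value_counts().sort_index()"
+ ]
+ },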
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "69b58088-6d15-4b11-85dd-bf5b86d9df64",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " id anchor target context\n",
+ "count 36473 36473 36473 36473\n",
+ "unique 36473 733 29340 106\n",
+ "top 37d61fd2272659b1 component composite coating composition H01\n",
+ "freq 1 152 24 2186"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.describe(include='object')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "00f5b5c7-bcb7-4bf0-875e-45fbf8a2ba65",
+ "metadata": {},
+ "source": [
+ "We can see that in the 36473 rows, there are 733 unique anchors, 106 contexts, and nearly 30000 targets. Some anchors are very common, with \"component composite coating\" for instance appearing 152 times.\n",
+ "\n",
+ "Earlier, I suggested we could represent the input to the model as something like \"TEXT1: abatement; TEXT2: eliminating process\". We'll need to add the context to this too. In Pandas, we can just use `+` to concatenate, like so:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "a89091d6-771a-4b7e-a540-b20d6f03e9aa",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " id anchor target context score \\\n",
+ "0 37d61fd2272659b1 abatement abatement of pollution A47 0.50 \n",
+ "1 7b9652b17b68b7a4 abatement act of abating A47 0.75 \n",
+ "2 36d72442aefd8232 abatement active catalyst A47 0.25 \n",
+ "3 5296b0c19e1ce60e abatement eliminating process A47 0.50 \n",
+ "4 54c1e3b9184cb5b6 abatement forest region A47 0.00 \n",
+ "\n",
+ " input \n",
+ "0 TEXT1: A47; TEXT2: abatement of pollution; ANC... \n",
+ "1 TEXT1: A47; TEXT2: act of abating; ANC1: abate... \n",
+ "2 TEXT1: A47; TEXT2: active catalyst; ANC1: abat... \n",
+ "3 TEXT1: A47; TEXT2: eliminating process; ANC1: ... \n",
+ "4 TEXT1: A47; TEXT2: forest region; ANC1: abatement "
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor\n",
+ "df.head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0fc9e71f-aed8-40b0-821d-c352331f4a89",
+ "metadata": {},
+ "source": [
+ "## Tokenization"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0d9c20bb-a146-497b-a503-edfc6372c084",
+ "metadata": {},
+ "source": [
+ "Transformers uses a `Dataset` object for storing a dataset, of course! We can create one from our DataFrame like so:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "a3dcc05b-e170-43ed-92f5-8805d3dbac7f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Dataset({\n",
+ " features: ['id', 'anchor', 'target', 'context', 'score', 'input'],\n",
+ " num_rows: 36473\n",
+ "})"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ds = Dataset.from_pandas(df)\n",
+ "ds"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "107fe0be-b9a4-4c30-a141-41fcfb3beab4",
+ "metadata": {},
+ "source": [
+ "But we can't pass the texts directly into a model. A deep learning model expects numbers as inputs, not English sentences! So we need to do two things:\n",
+ "\n",
+ "- Tokenization: Split each text up into words (or actually, as we'll see, into tokens)\n",
+ "- Numericalization: Convert each word (or token) into a number.\n",
+ "\n",
+ "The details about how this is done actually depend on the particular model we use. So first we'll need to pick a model. There are thousands of models available, but a reasonable starting point for nearly any NLP problem is to use this (replace \"small\" with \"large\" for a slower but more accurate model, once you've finished exploring):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "9bd4fdd3-f881-486b-9d75-ccbb6286b525",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model_nm = 'microsoft/deberta-v3-small'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3f60c3b5-a0a5-4dd5-b512-549cb4242579",
+ "metadata": {},
+ "source": [
+ "`AutoTokenizer` will create a tokenizer appropriate for a given model:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "96aca66f-0663-4dfd-97f6-169f8a0fcb1d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "tokz = AutoTokenizer.from_pretrained(model_nm)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5335cfec-7a71-481a-8c18-e934a90194ad",
+ "metadata": {},
+ "source": [
+ "Here's an example of how the tokenizer splits a text into \"tokens\" (which are like words, but can be sub-word pieces, as you see below):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "6fd61beb-a5a8-4f89-9659-8470faf62fb4",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['▁G',\n",
+ " \"'\",\n",
+ " 'day',\n",
+ " '▁folks',\n",
+ " ',',\n",
+ " '▁I',\n",
+ " \"'\",\n",
+ " 'm',\n",
+ " '▁Jeremy',\n",
+ " '▁from',\n",
+ " '▁fast',\n",
+ " '.',\n",
+ " 'ai',\n",
+ " '!']"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tokz.tokenize(\"G'day folks, I'm Jeremy from fast.ai!\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0a2a6a56-2e10-403a-9533-b65afb974425",
+ "metadata": {},
+ "source": [
+ "Uncommon words will be split into pieces, just like `ornithorhynchus` in the example below. The start of a new word is represented by `▁`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "92e0189f-c8f1-405f-83fb-bd01c483710c",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['▁A',\n",
+ " '▁platypus',\n",
+ " '▁is',\n",
+ " '▁an',\n",
+ " '▁or',\n",
+ " 'ni',\n",
+ " 'tho',\n",
+ " 'rhynch',\n",
+ " 'us',\n",
+ " '▁an',\n",
+ " 'at',\n",
+ " 'inus',\n",
+ " '.']"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tokz.tokenize(\"A platypus is an ornithorhynchus anatinus.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a4e253e5-e098-4253-b401-57f0cf753325",
+ "metadata": {},
+ "source": [
+ "## Numericalization"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "83ee57e9-54d7-45cb-a23d-68156b0350f8",
+ "metadata": {},
+ "source": [
+ "After tokenization, we need to convert each token into a number, because the model only accepts numbers as input. But how?\n",
+ "We need a large token dictionary (the model's vocabulary) that maps each token to a number!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "7d56000f-9137-4f7f-b3b6-93624efddc50",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vocab = tokz.get_vocab()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f062775c-83b5-48cb-be62-70d9e1355d05",
+ "metadata": {},
+ "source": [
+ "`vocab` is the token dictionary that comes with the `deberta-v3-small` model. You can print it out to inspect it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "dd749eeb-17da-4bd9-b229-c96270a0d415",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'input_ids': [1, 336, 114224, 269, 299, 289, 4840, 34765, 102530, 1867, 299, 2401, 26835, 260, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tokz(\"A platypus is an ornithorhynchus anatinus.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "95028988-a530-46d2-96a9-b11c2e9ab3d7",
+ "metadata": {},
+ "source": [
+ "Using this token dictionary, the tokenizer converts the original token sequence into a sequence of numbers. `input_ids` contains the token IDs we need, `token_type_ids` indicates which sentence (segment) each token belongs to, and `attention_mask` indicates which positions are real tokens rather than padding.\n",
+ "\n",
+ "Here's a simple function which tokenizes our inputs:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "8c6da089-a612-499b-8a00-e448a0eee212",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def tok_func(x): return tokz(x[\"input\"])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "681d176d-f055-42fd-87de-88dcee8ec18a",
+ "metadata": {},
+ "source": [
+ "To run this quickly in parallel on every row in our dataset, use map:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "fec0c346-3726-42d2-87c6-2260b7d4c80c",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "42f7bd84ff414f0cb23b9b1bb44b8ad5",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Map: 0%| | 0/36473 [00:00, ? examples/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "Dataset({\n",
+ " features: ['id', 'anchor', 'target', 'context', 'score', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],\n",
+ " num_rows: 36473\n",
+ "})"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tok_ds = ds.map(tok_func, batched=True)\n",
+ "tok_ds"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ad41633c-b609-4c71-b51e-7be9861fa05e",
+ "metadata": {},
+ "source": [
+ "This adds new fields to our dataset: `input_ids`, `token_type_ids` and `attention_mask`. For instance, here are the input and the IDs for the first row of our data:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "7fad8274-4847-49f2-83ab-c3ae6abe6313",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',\n",
+ " [1,\n",
+ " 54453,\n",
+ " 435,\n",
+ " 294,\n",
+ " 336,\n",
+ " 5753,\n",
+ " 346,\n",
+ " 54453,\n",
+ " 445,\n",
+ " 294,\n",
+ " 47284,\n",
+ " 265,\n",
+ " 6435,\n",
+ " 346,\n",
+ " 23702,\n",
+ " 435,\n",
+ " 294,\n",
+ " 47284,\n",
+ " 2])"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "row = tok_ds[0]\n",
+ "row['input'], row['input_ids']"
+ ]
+ },
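+ {
+ "cell_type": "markdown",
+ "id": "9d4f6a12-8b3c-4e5d-a711-c3d5e7f9a301",
+ "metadata": {},
+ "source": [
+ "If you're curious how these IDs relate back to tokens, you can map them in the other direction. This is just an optional check:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9d4f6a12-8b3c-4e5d-a711-c3d5e7f9a302",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Optional: convert the IDs of the first row back into their tokens.\n",
+ "tokz.convert_ids_to_tokens(row['input_ids'])"
+ ]
+ },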
+ {
+ "cell_type": "markdown",
+ "id": "f5fcd90d-f60b-4ca7-9101-e788bb3be03e",
+ "metadata": {},
+ "source": [
+ "Finally, we need to prepare our labels. Transformers always assumes that your labels are in a column named `labels`, but in our dataset the column is currently called `score`. Therefore, we need to rename it:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "bb8e3ef2-0574-411f-ab17-3b35af0d7519",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Dataset({\n",
+ " features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],\n",
+ " num_rows: 36473\n",
+ "})"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tok_ds = tok_ds.rename_columns({'score':'labels'})\n",
+ "tok_ds"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "17dc8f85-fc53-425f-ae26-fe9c1429a667",
+ "metadata": {},
+ "source": [
+ "Now that we've prepared our tokens and labels, we need to create our validation set."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c124199a-7978-4c01-9924-c47fb63ca2fb",
+ "metadata": {},
+ "source": [
+ "## Test and validation sets"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "ad9891e8-9f0e-46da-a9e2-7c357673f337",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "9feb595356464f7597d8bc8907474a2a",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Map: 0%| | 0/36 [00:00, ? examples/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "Dataset({\n",
+ " features: ['id', 'anchor', 'target', 'context', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],\n",
+ " num_rows: 36\n",
+ "})"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "eval_df = pd.read_csv('https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/deep-learning/nlp/phrase_matching_test.csv')\n",
+ "eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor\n",
+ "eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)\n",
+ "eval_ds"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ddaad27f-c72f-4867-bc54-b6b3e83329ea",
+ "metadata": {},
+ "source": [
+ "This is the test set. Possibly the most important idea in machine learning is that of having separate training, validation, and test data sets."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "263cec95-5e4b-4fec-bc06-7e55a3a9e25f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "DatasetDict({\n",
+ " train: Dataset({\n",
+ " features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],\n",
+ " num_rows: 27354\n",
+ " })\n",
+ " test: Dataset({\n",
+ " features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],\n",
+ " num_rows: 9119\n",
+ " })\n",
+ "})"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "dds = tok_ds.train_test_split(0.25, seed=42)\n",
+ "dds"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fae92113-59e2-475d-b472-4f4f7c338f39",
+ "metadata": {},
+ "source": [
+ "The `test` split above is our validation set. We use `train_test_split` to carve it out of the training data, holding out 25% of the rows."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c23e0df9-6feb-451a-8d26-7515cc21e6d9",
+ "metadata": {},
+ "source": [
+ "## Training our model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c8294281-4b31-4446-a4a4-2c99ba4bd593",
+ "metadata": {},
+ "source": [
+ "Before starting training, we need to set some hyperparameters for our model. Here's a concise explanation:\n",
+ "\n",
+ "- **Batch Size (`bs`):** 128 examples processed in each iteration.\n",
+ "- **Epochs (`epochs`):** The model will be trained through the entire dataset 4 times.\n",
+ "- **Learning Rate (`lr`):** The step size for adjusting model weights during optimization is set to 8e-5.\n",
+ "- **TrainingArguments (`args`):**\n",
+ " - **Warmup Ratio:** 10% of training steps used for learning rate warm-up.\n",
+ " - **Learning Rate Scheduler:** Cosine learning rate scheduler.\n",
+ " - **Mixed Precision (`fp16`):** Training with mixed-precision for faster computation.\n",
+ " - **Evaluation Strategy:** Model evaluation after each epoch.\n",
+ " - **Batch Sizes:** 128 examples per training device, 256 for evaluation.\n",
+ " - **Number of Training Epochs:** Training for 4 epochs.\n",
+ " - **Weight Decay:** L2 regularization with a rate of 0.01.\n",
+ " - **Report To:** No reports sent to external experiment trackers (set to 'none')."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "a0134c74-63b1-472f-8a8e-ad9eefae0a2b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bs = 128\n",
+ "epochs = 4\n",
+ "lr = 8e-5\n",
+ "args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,\n",
+ " evaluation_strategy=\"epoch\", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,\n",
+ " num_train_epochs=epochs, weight_decay=0.01, report_to='none')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d0458916-7686-4833-84ec-c66823ad6fdf",
+ "metadata": {},
+ "source": [
+ "Now, we can initialize a pre-trained sequence classification model and set up a training environment using Hugging Face's `Trainer`. The model is loaded with `AutoModelForSequenceClassification.from_pretrained` and configured with the training parameters in the `Trainer` object."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "id": "32d7e88e-d163-47b8-9b7b-8f910e500fa2",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']\n",
+ "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
+ ]
+ }
+ ],
+ "source": [
+ "model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)\n",
+ "trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],\n",
+ " tokenizer=tokz)"
+ ]
+ },
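+ {
+ "cell_type": "markdown",
+ "id": "7e8a9b10-2c4d-4f6e-b833-d4e6f8a0b401",
+ "metadata": {},
+ "source": [
+ "The Kaggle competition this data comes from evaluates submissions with the Pearson correlation coefficient. The `Trainer` above does not report it, but if you would like to track it during evaluation, you could define a metric function like the optional sketch below and pass it in with `compute_metrics=corr_d` when constructing the `Trainer` (the name `corr_d` is just an example; it is not used in the run that follows):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7e8a9b10-2c4d-4f6e-b833-d4e6f8a0b402",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Optional sketch: Pearson correlation as an evaluation metric.\n",
+ "# Pass `compute_metrics=corr_d` to the Trainer to have it reported after\n",
+ "# each epoch (it is not used in the training run below).\n",
+ "def corr_d(eval_pred):\n",
+ "    predictions, labels = eval_pred\n",
+ "    return {'pearson': np.corrcoef(predictions.flatten(), labels)[0][1]}"
+ ]
+ },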
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "664277e2-2771-44aa-8695-d448c38e86f6",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " <progress value='856' max='856' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
+ " [856/856 00:53, Epoch 4/4]\n",
+ " <table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: left;\">\n",
+ " <th>Epoch</th>\n",
+ " <th>Training Loss</th>\n",
+ " <th>Validation Loss</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <td>1</td>\n",
+ " <td>No log</td>\n",
+ " <td>0.026275</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <td>2</td>\n",
+ " <td>No log</td>\n",
+ " <td>0.021973</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <td>3</td>\n",
+ " <td>0.039600</td>\n",
+ " <td>0.022443</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <td>4</td>\n",
+ " <td>0.039600</td>\n",
+ " <td>0.023286</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ " </table>"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "trainer.train();"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "3c35b83e-0320-4ea6-89df-342f9d3fb36e",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "array([[-1.50489807e-03],\n",
+ " [ 4.90570068e-03],\n",
+ " [-5.05447388e-04],\n",
+ " [ 2.69412994e-04],\n",
+ " [-1.44767761e-03],\n",
+ " [ 4.85897064e-04],\n",
+ " [-1.81484222e-03],\n",
+ " [ 8.22067261e-04],\n",
+ " [ 4.36019897e-03],\n",
+ " [ 4.40216064e-03],\n",
+ " [-6.16550446e-04],\n",
+ " [-4.18424606e-05],\n",
+ " [-1.20639801e-03],\n",
+ " [ 3.18288803e-04],\n",
+ " [-6.15119934e-04],\n",
+ " [-8.05377960e-04],\n",
+ " [-2.66265869e-03],\n",
+ " [ 2.60114670e-04],\n",
+ " [ 3.48281860e-03],\n",
+ " [ 1.68323517e-03],\n",
+ " [ 1.38378143e-03],\n",
+ " [-2.48527527e-03],\n",
+ " [ 7.53879547e-04],\n",
+ " [ 8.55922699e-04],\n",
+ " [-2.27355957e-03],\n",
+ " [-2.88581848e-03],\n",
+ " [ 3.29780579e-03],\n",
+ " [ 9.42707062e-04],\n",
+ " [ 4.26769257e-04],\n",
+ " [-1.19447708e-04],\n",
+ " [-2.77519226e-03],\n",
+ " [ 5.27381897e-04],\n",
+ " [-8.44001770e-04],\n",
+ " [ 4.88281250e-04],\n",
+ " [-2.11715698e-04],\n",
+ " [-1.00421906e-03]])"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "preds = trainer.predict(eval_ds).predictions.astype(float)\n",
+ "preds"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "138c6cd1-e0cb-405e-8b15-952e90c6a954",
+ "metadata": {},
+ "source": [
+ "Look out: some of our predictions are < 0 or > 1! This happens because the model has a single unconstrained regression-style output, so nothing forces it to stay in the [0, 1] range. Let's clip those out-of-bounds predictions:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "id": "d6cfff69-3178-4afb-858e-5653a938e3af",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[0. ],\n",
+ " [0.0049057 ],\n",
+ " [0. ],\n",
+ " [0.00026941],\n",
+ " [0. ],\n",
+ " [0.0004859 ],\n",
+ " [0. ],\n",
+ " [0.00082207],\n",
+ " [0.0043602 ],\n",
+ " [0.00440216],\n",
+ " [0. ],\n",
+ " [0. ],\n",
+ " [0. ],\n",
+ " [0.00031829],\n",
+ " [0. ],\n",
+ " [0. ],\n",
+ " [0. ],\n",
+ " [0.00026011],\n",
+ " [0.00348282],\n",
+ " [0.00168324],\n",
+ " [0.00138378],\n",
+ " [0. ],\n",
+ " [0.00075388],\n",
+ " [0.00085592],\n",
+ " [0. ],\n",
+ " [0. ],\n",
+ " [0.00329781],\n",
+ " [0.00094271],\n",
+ " [0.00042677],\n",
+ " [0. ],\n",
+ " [0. ],\n",
+ " [0.00052738],\n",
+ " [0. ],\n",
+ " [0.00048828],\n",
+ " [0. ],\n",
+ " [0. ]])"
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "preds = np.clip(preds, 0, 1)\n",
+ "preds"
+ ]
+ },
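+ {
+ "cell_type": "markdown",
+ "id": "1f2e3d4c-5b6a-4798-9c55-e5f7a9b1c501",
+ "metadata": {},
+ "source": [
+ "Finally, if you would like to keep these predictions alongside their ids (for example, to build a Kaggle-style submission file), here is a minimal sketch; the output file name is just an example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1f2e3d4c-5b6a-4798-9c55-e5f7a9b1c502",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A minimal sketch: pair each test id with its clipped prediction and save it.\n",
+ "# The file name 'submission.csv' is just an example.\n",
+ "submission = pd.DataFrame({'id': eval_df['id'], 'score': preds.flatten()})\n",
+ "submission.to_csv('submission.csv', index=False)\n",
+ "submission.head()"
+ ]
+ },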
+ {
+ "cell_type": "markdown",
+ "id": "464c75ad-e4c3-4c6a-a776-c98011ed5eba",
+ "metadata": {
+ "jp-MarkdownHeadingCollapsed": true
+ },
+ "source": [
+ "## Acknowledgments\n",
+ "\n",
+ "Thanks to [Jeremy Howard](https://www.kaggle.com/jhoward) for creating [Getting started with NLP for absolute beginners](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners). It inspires the majority of the content in this chapter."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "vmamba",
+ "language": "python",
+ "name": "vmamba"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/open-machine-learning-jupyter-book/deep-learning/cnn/cnn-deepdream.ipynb b/open-machine-learning-jupyter-book/deep-learning/cnn/cnn-deepdream.ipynb
index 49ef8353b..bc578fe39 100644
--- a/open-machine-learning-jupyter-book/deep-learning/cnn/cnn-deepdream.ipynb
+++ b/open-machine-learning-jupyter-book/deep-learning/cnn/cnn-deepdream.ipynb
@@ -352,7 +352,7 @@
"source": [
"## Your turn! 🚀\n",
"\n",
- "TBD."
+ "You can practice your CNN skills by following the assignment [sign language digits classification with cnn](../../assignments/deep-learning/cnn/sign-language-digits-classification-with-cnn.ipynb)."
]
},
{
diff --git a/open-machine-learning-jupyter-book/deep-learning/cnn/cnn-vgg.ipynb b/open-machine-learning-jupyter-book/deep-learning/cnn/cnn-vgg.ipynb
index 4e67c7e72..188b31cdb 100644
--- a/open-machine-learning-jupyter-book/deep-learning/cnn/cnn-vgg.ipynb
+++ b/open-machine-learning-jupyter-book/deep-learning/cnn/cnn-vgg.ipynb
@@ -440,8 +440,7 @@
"metadata": {},
"source": [
"## Your turn! 🚀\n",
- "\n",
- "TBD."
+ "\n",
+ "You can practice your CNN skills by following the assignment [object recognition in images using cnn](../../assignments/deep-learning/cnn/object-recognition-in-images-using-cnn.ipynb)."
]
},
{
diff --git a/open-machine-learning-jupyter-book/deep-learning/cnn/cnn.ipynb b/open-machine-learning-jupyter-book/deep-learning/cnn/cnn.ipynb
index 29bb544d7..57605d88c 100644
--- a/open-machine-learning-jupyter-book/deep-learning/cnn/cnn.ipynb
+++ b/open-machine-learning-jupyter-book/deep-learning/cnn/cnn.ipynb
@@ -861,8 +861,7 @@
"metadata": {},
"source": [
"## Your turn! 🚀\n",
- "\n",
- "TBD."
+ "\n",
+ "You can practice your CNN skills by following the assignment [how to choose cnn architecture mnist](../../assignments/deep-learning/cnn/how-to-choose-cnn-architecture-mnist.ipynb)."
]
},
{
diff --git a/open-machine-learning-jupyter-book/deep-learning/nlp.ipynb b/open-machine-learning-jupyter-book/deep-learning/nlp.ipynb
index 839b75e67..efcff10fd 100644
--- a/open-machine-learning-jupyter-book/deep-learning/nlp.ipynb
+++ b/open-machine-learning-jupyter-book/deep-learning/nlp.ipynb
@@ -786,8 +786,7 @@
},
"source": [
"## Your turn! 🚀\n",
- "\n",
- "TBD."
+ "\n",
+ "You can practice your NLP skills by following the assignment [getting start nlp with classification task](../assignments/deep-learning/nlp/getting-start-nlp-with-classification-task.ipynb)."
]
},
{
diff --git a/open-machine-learning-jupyter-book/deep-learning/rnn.ipynb b/open-machine-learning-jupyter-book/deep-learning/rnn.ipynb
index d436b6048..ccb86f547 100644
--- a/open-machine-learning-jupyter-book/deep-learning/rnn.ipynb
+++ b/open-machine-learning-jupyter-book/deep-learning/rnn.ipynb
@@ -448,7 +448,7 @@
"source": [
"## Your turn! 🚀\n",
"\n",
- "Practice the Recurrent Neural Networks by following this TBD."
+ "You can practice your RNN skills by following the assignment [google stock price prediction rnn](../assignments/deep-learning/rnn/google-stock-price-prediction-rnn.ipynb)."
]
},
{
diff --git a/open-machine-learning-jupyter-book/deep-learning/time-series.ipynb b/open-machine-learning-jupyter-book/deep-learning/time-series.ipynb
index 2947c75e1..35bd8f304 100644
--- a/open-machine-learning-jupyter-book/deep-learning/time-series.ipynb
+++ b/open-machine-learning-jupyter-book/deep-learning/time-series.ipynb
@@ -1700,12 +1700,18 @@
"\n",
"## Your turn! 🚀\n",
"\n",
- "TBD.\n",
+ "You can practice your time series skills by following the assignment [time series forecasting assignment](../assignments/deep-learning/time-series-forecasting-assignment.ipynb).\n",
"\n",
"## Acknowledgments\n",
"\n",
"Thanks to [kaggle](https://www.kaggle.com/) for creating the open-source course [Time Series](https://www.kaggle.com/learn/time-series). It inspires the majority of the content in this chapter.\n"
]
}
],
"metadata": {