Commit

update llm
penjc committed Apr 8, 2024
1 parent 031c6f2 commit e24faa5
Showing 2 changed files with 184 additions and 212 deletions.
@@ -47,7 +47,7 @@
"In addition, the decoder includes an additional attention mechanism that focuses on the encoder's output to incorporate context information during sequence generation.\n",
"Overall, the encoder-decoder architecture based on the Transformer structure allows for effective semantic abstraction by leveraging attention mechanisms, position-wise feedforward layers, residual connections, and layer normalization. This architecture enables the model to capture complex dependencies between words in the input sequence and generate meaningful outputs for various sequence-to-sequence tasks.\n",
"\n",
":::{figure} https://media.geeksforgeeks.org/wp-content/uploads/20230531140926/Transformer-python-(1).png\n",
":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/llm/Transformer-python-%281%29.png\n",
"Transformer-based encoder and decoder Architecture\n",
":::\n",
"\n",
@@ -62,7 +62,13 @@
"\n",
"The Embedding Layer in the Transformer model is responsible for converting discrete token indices into continuous vector representations. Each token index is mapped to a high-dimensional vector, which is learned during the training process. These embeddings capture semantic and syntactic information about the tokens.\n",
"\n",
"Implementation in PyTorch:"
"Implementation in PyTorch:\n",
"\n",
"We define a PositionalEncoder class that inherits from nn.Module.\n",
"The constructor initializes the positional encoding matrix (pe) based on the given d_model (dimension of the model) and max_seq_len (maximum sequence length).\n",
"The forward method scales the input embeddings (x) by the square root of the model dimension and adds the positional encoding matrix (pe) to the input embeddings.\n",
"Note that we're using PyTorch's Variable and autograd to ensure that the positional encoding is compatible with the autograd mechanism for backpropagation.\n",
"Finally, the PositionalEncoder class can be used within a larger PyTorch model to incorporate positional information into word embeddings."
]
},
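For reference, here is a minimal illustration of the embedding step described in the cell above; the vocabulary size and `d_model` below are placeholder values, not values taken from the notebook:

```python
import torch
import torch.nn as nn

# Map discrete token indices to learned d_model-dimensional vectors.
vocab_size, d_model = 10000, 512            # placeholder sizes, not the notebook's values
embedding = nn.Embedding(vocab_size, d_model)

tokens = torch.tensor([[5, 42, 7, 9]])      # (batch=1, seq_len=4) token indices
vectors = embedding(tokens)                 # -> shape (1, 4, 512)
print(vectors.shape)
```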
{
@@ -148,27 +154,17 @@
" assert math.isclose(output[0, 0, 0].item(), expected_first_element.item(), rel_tol=1e-6)\n"
]
},
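The implementation cell above is collapsed to its final assertion in this view, so the following is only a sketch of a sinusoidal `PositionalEncoder` consistent with the description; it registers the encoding as a buffer rather than wrapping it in the legacy `Variable` API mentioned in the text, and the details may differ from the notebook's code:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoder(nn.Module):
    """Add fixed sinusoidal positional encodings to (scaled) token embeddings."""

    def __init__(self, d_model, max_seq_len=80):
        super().__init__()
        self.d_model = d_model
        # Precompute the (max_seq_len, d_model) encoding matrix once.
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # A buffer is saved with the module but excluded from gradient updates.
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x):
        # Scale embeddings by sqrt(d_model), then add the encodings for this length.
        x = x * math.sqrt(self.d_model)
        seq_len = x.size(1)
        return x + self.pe[:, :seq_len]
```

Applied to the embedding output, e.g. `PositionalEncoder(d_model)(embedding(tokens))`, this yields embeddings with positional information added.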
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this code:\n",
"\n",
"We define a PositionalEncoder class that inherits from nn.Module.\n",
"The constructor initializes the positional encoding matrix (pe) based on the given d_model (dimension of the model) and max_seq_len (maximum sequence length).\n",
"The forward method scales the input embeddings (x) by the square root of the model dimension and adds the positional encoding matrix (pe) to the input embeddings.\n",
"Note that we're using PyTorch's Variable and autograd to ensure that the positional encoding is compatible with the autograd mechanism for backpropagation.\n",
"Finally, the PositionalEncoder class can be used within a larger PyTorch model to incorporate positional information into word embeddings."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Attention Layer\n",
"The Attention Layer in the Transformer model enables the model to focus on different parts of the input sequence when processing each token. It computes attention scores between each pair of tokens in the input sequence and generates a context vector for each token based on the importance of other tokens. This mechanism allows the model to capture long-range dependencies in the input sequence effectively.\n",
"\n",
"Implementation in PyTorch:"
"Implementation in PyTorch:\n",
"\n",
"The MultiHeadAttention class defines a multi-head self-attention layer.\n",
"The forward method performs linear operations to divide inputs into multiple heads, computes attention scores, and aggregates the outputs of multiple heads."
]
},
{
@@ -264,16 +260,6 @@
" self.assertEqual(output.shape, (batch_size, seq_length, d_model))\n"
]
},
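Likewise, the attention cell is collapsed to its final shape check. Below is a compact sketch of a multi-head self-attention layer along the lines described above; the masking and dropout details are assumptions rather than the notebook's exact code:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, heads, d_model, dropout=0.1):
        super().__init__()
        assert d_model % heads == 0
        self.d_k = d_model // heads
        self.h = heads
        # One projection per role; each is split into h heads after the linear map.
        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        bs = q.size(0)
        # Project, then reshape to (bs, heads, seq_len, d_k).
        q = self.q_linear(q).view(bs, -1, self.h, self.d_k).transpose(1, 2)
        k = self.k_linear(k).view(bs, -1, self.h, self.d_k).transpose(1, 2)
        v = self.v_linear(v).view(bs, -1, self.h, self.d_k).transpose(1, 2)
        # Scaled dot-product attention scores between every pair of positions.
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn = self.dropout(F.softmax(scores, dim=-1))
        # Weighted sum of values, then merge heads back to (bs, seq_len, d_model).
        context = torch.matmul(attn, v).transpose(1, 2).contiguous().view(bs, -1, self.h * self.d_k)
        return self.out(context)
```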
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this implementation:\n",
"\n",
"The MultiHeadAttention class defines a multi-head self-attention layer.\n",
"The forward method performs linear operations to divide inputs into multiple heads, computes attention scores, and aggregates the outputs of multiple heads."
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -282,7 +268,10 @@
"\n",
"The Position-wise Feedforward Layer in the Transformer model applies a simple feedforward neural network independently to each position in the sequence. It consists of two linear transformations with a non-linear activation function (commonly ReLU) applied in between. This layer helps capture complex interactions between different dimensions of the input embeddings.\n",
"\n",
"Implementation in PyTorch:"
"Implementation in PyTorch:\n",
"\n",
"The FeedForward class defines a feedforward layer.\n",
"The forward method applies ReLU activation to the output of the first linear transformation, followed by dropout, and then performs the second linear transformation to produce the final output."
]
},
{
@@ -347,16 +336,6 @@
" self.assertEqual(output.shape, (batch_size, seq_length, d_model))"
]
},
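A short sketch of the position-wise feedforward block as described; the hidden width `d_ff=2048` is the conventional default from the Transformer paper, not necessarily the notebook's value:

```python
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    """Position-wise feedforward block: Linear -> ReLU -> Dropout -> Linear."""

    def __init__(self, d_model, d_ff=2048, dropout=0.1):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # Applied independently at every position in the sequence.
        x = self.dropout(F.relu(self.linear_1(x)))
        return self.linear_2(x)
```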
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this implementation:\n",
"\n",
"The FeedForward class defines a feedforward layer.\n",
"The forward method applies ReLU activation to the output of the first linear transformation, followed by dropout, and then performs the second linear transformation to produce the final output."
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -369,7 +348,10 @@
"Layer Normalization:\n",
"Layer Normalization is a technique used to stabilize the training of deep neural networks by normalizing the activations of each layer. In the Transformer model, layer normalization is applied after each sub-layer (such as attention and feedforward layers) and before the residual connection. It normalizes the activations along the feature dimension, allowing the model to learn more robust representations and accelerate convergence during training.\n",
"\n",
"Implementation in PyTorch:"
"Implementation in PyTorch:\n",
"\n",
"The NormLayer class defines a layer normalization layer.\n",
"The forward method computes the layer normalization using the given input tensor x."
]
},
{
@@ -434,16 +416,6 @@
" self.assertEqual(output.shape, (batch_size, seq_length, d_model))"
]
},
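A sketch of layer normalization with a learnable gain (`alpha`) and bias, matching the description of `NormLayer`; the epsilon value is an assumption:

```python
import torch
import torch.nn as nn

class NormLayer(nn.Module):
    """Layer normalization over the feature dimension with learnable gain and bias."""

    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(d_model))
        self.bias = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
        # Normalize each position's feature vector to zero mean and unit variance.
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.alpha * (x - mean) / (std + self.eps) + self.bias
```

Within an encoder or decoder block this normalization is typically paired with a residual connection, for example `x = x + dropout(sublayer(norm(x)))` in a pre-norm arrangement.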
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this implementation:\n",
"\n",
"The NormLayer class defines a layer normalization layer.\n",
"The forward method computes the layer normalization using the given input tensor x."
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -459,7 +431,11 @@
"Multi-Head Attention sub-layer that attends to the encoder's output.\n",
"FeedForward sub-layer. Again, each sub-layer is followed by Residual Connection and Layer Normalization.\n",
"\n",
"Below are the Python implementations for the Encoder and Decoder structures:"
"Below are the Python implementations for the Encoder and Decoder structures:\n",
"\n",
"The EncoderLayer and DecoderLayer classes define encoder and decoder layers, respectively.\n",
"The Encoder and Decoder classes define encoder and decoder modules, respectively, composed of multiple layers of encoder or decoder layers.\n",
"These classes follow the architecture described in the text, including the use of multi-head attention, feedforward layers, residual connections, and layer normalization."
]
},
{
@@ -579,17 +555,6 @@
" return self.norm(x)\n"
]
},
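The encoder/decoder cell is also collapsed here. The sketch below reuses the modules sketched earlier (`PositionalEncoder`, `MultiHeadAttention`, `FeedForward`, `NormLayer`) and follows a pre-norm layout; it is one plausible reading of the description, not the notebook's exact code:

```python
import copy
import torch.nn as nn

def get_clones(module, n):
    # n independent copies of a layer, each with its own parameters.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])

class EncoderLayer(nn.Module):
    """Self-attention then feedforward, each with pre-norm, dropout, and a residual."""

    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        self.norm_1, self.norm_2 = NormLayer(d_model), NormLayer(d_model)
        self.attn = MultiHeadAttention(heads, d_model, dropout)
        self.ff = FeedForward(d_model, dropout=dropout)
        self.dropout_1, self.dropout_2 = nn.Dropout(dropout), nn.Dropout(dropout)

    def forward(self, x, mask=None):
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn(x2, x2, x2, mask))   # residual around self-attention
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.ff(x2))                   # residual around feedforward
        return x

class DecoderLayer(nn.Module):
    """Masked self-attention, cross-attention over encoder outputs, then feedforward."""

    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        self.norm_1, self.norm_2, self.norm_3 = NormLayer(d_model), NormLayer(d_model), NormLayer(d_model)
        self.attn_1 = MultiHeadAttention(heads, d_model, dropout)  # self-attention on the target side
        self.attn_2 = MultiHeadAttention(heads, d_model, dropout)  # attends to the encoder output
        self.ff = FeedForward(d_model, dropout=dropout)
        self.dropout_1, self.dropout_2, self.dropout_3 = nn.Dropout(dropout), nn.Dropout(dropout), nn.Dropout(dropout)

    def forward(self, x, e_outputs, src_mask=None, trg_mask=None):
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn_1(x2, x2, x2, trg_mask))
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.attn_2(x2, e_outputs, e_outputs, src_mask))
        x2 = self.norm_3(x)
        x = x + self.dropout_3(self.ff(x2))
        return x

class Encoder(nn.Module):
    """Embedding + positional encoding followed by a stack of n_layers encoder layers."""

    def __init__(self, vocab_size, d_model, n_layers, heads, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pe = PositionalEncoder(d_model)
        self.layers = get_clones(EncoderLayer(d_model, heads, dropout), n_layers)
        self.norm = NormLayer(d_model)

    def forward(self, src, mask=None):
        x = self.pe(self.embed(src))
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)
```

A `Decoder` module would mirror `Encoder`, stacking `DecoderLayer` instances and passing the encoder output (`e_outputs`) and masks through each layer before a final normalization.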
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In these implementations:\n",
"\n",
"The EncoderLayer and DecoderLayer classes define encoder and decoder layers, respectively.\n",
"The Encoder and Decoder classes define encoder and decoder modules, respectively, composed of multiple layers of encoder or decoder layers.\n",
"These classes follow the architecture described in the text, including the use of multi-head attention, feedforward layers, residual connections, and layer normalization."
]
},
{
"cell_type": "markdown",
"metadata": {},