beginner_source/bettertransformer_tutorial.rst translation (#916)
* beginner_source/bettertransformer_tutorial.rst translation
uddk6215 authored Oct 15, 2024
1 parent 279d079 commit 83e6559
Showing 1 changed file with 68 additions and 72 deletions.
140 changes: 68 additions & 72 deletions beginner_source/bettertransformer_tutorial.rst
@@ -1,59 +1,61 @@
Fast Transformer Inference with Better Transformer
Fast Transformer Inference Using Better Transformer
===============================================================

**Author**: `Michael Gschwind <https://github.com/mikekgfb>`__
**μ €μž**: `마이클 κ·Έμ‰¬λΉˆλ“œ <https://github.com/mikekgfb>`__
**Translator**: `μ΄μ§„ν˜ <https://github.com/uddk6215>`__

This tutorial introduces Better Transformer (BT) as part of the PyTorch 1.12 release.
In this tutorial, we show how to use Better Transformer for production
inference with torchtext. Better Transformer is a production ready fastpath to
accelerate deployment of Transformer models with high performance on CPU and GPU.
The fastpath feature works transparently for models based either directly on
PyTorch core ``nn.module`` or with torchtext.

이 νŠœν† λ¦¬μ–Όμ—μ„œλŠ” PyTorch 1.12 λ²„μ „μ˜ μΌλΆ€λ‘œ Better Transformer (BT)λ₯Ό μ†Œκ°œν•©λ‹ˆλ‹€.
μ—¬κΈ°μ„œλŠ” torchtextλ₯Ό μ‚¬μš©ν•΄ μƒμš©ν™”λœ μ œν’ˆ μˆ˜μ€€μ˜ μΆ”λ‘ μ—μ„œ Better Transformerλ₯Ό μ μš©ν•˜λŠ” 방법을 λ³΄μ—¬μ€λ‹ˆλ‹€.
Better TransformerλŠ” μƒμš© μ œν’ˆ μˆ˜μ€€μœΌλ‘œ λ°”λ‘œ μ μš©κ°€λŠ₯ν•œ fastpathμž…λ‹ˆλ‹€.
μ΄λŠ”, CPU와 GPUμ—μ„œ κ³ μ„±λŠ₯으둜 더 λΉ λ₯΄κ²Œ Transformer λͺ¨λΈμ„ 배포할 수 μžˆκ²Œλ” ν•΄μ€λ‹ˆλ‹€.
이 fastpath κΈ°λŠ₯은 PyTorch μ½”μ–΄ nn.module을 직접 기반으둜 ν•˜κ±°λ‚˜ torchtextλ₯Ό μ‚¬μš©ν•˜λŠ” λͺ¨λΈμ— λŒ€ν•΄ μ΄ν•΄ν•˜κΈ° 쉽고 λͺ…ν™•ν•˜κ²Œ μž‘λ™ν•©λ‹ˆλ‹€.

Models that can be accelerated by the Better Transformer fastpath are those using the PyTorch core torch.nn.module classes TransformerEncoder, TransformerEncoderLayer,
and MultiHeadAttention.
In addition, torchtext has been updated to use these core library modules to benefit from fastpath acceleration.
(Additional modules may support fastpath execution in the future.)

Models which can be accelerated by Better Transformer fastpath execution are those
using the following PyTorch core ``torch.nn.module`` classes ``TransformerEncoder``,
``TransformerEncoderLayer``, and ``MultiHeadAttention``. In addition, torchtext has
been updated to use the core library modules to benefit from fastpath acceleration.
(Additional modules may be enabled with fastpath execution in the future.)

Better Transformer offers two types of acceleration:
Better Transformer offers two types of acceleration:

* Native multihead attention (MHA) implementation for CPU and GPU to improve overall execution efficiency.
* Exploiting sparsity in NLP inference. Because of variable input lengths, input
tokens may contain a large number of padding tokens for which processing may be
skipped, delivering significant speedups.
* A native multihead attention (MHA) implementation for CPU and GPU that improves overall execution efficiency.
* Exploiting sparsity in NLP inference. Because of variable input lengths, the input
  tokens may include a large number of padding tokens; processing of these tokens can be skipped, delivering significant speedups.

Fastpath execution is subject to some criteria. Most importantly, the model
must be executed in inference mode and operate on input tensors that do not collect
gradient tape information (e.g., running with torch.no_grad).
Fastpath execution must satisfy a few criteria. Most importantly, the model
must be executed in inference mode and operate on input tensors that do not collect gradient tape information (e.g., running with torch.no_grad).
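
A minimal, self-contained sketch of these criteria (the toy ``TransformerEncoder`` below is a hypothetical stand-in for illustration, not part of this tutorial):

.. code-block:: python

    import torch
    import torch.nn as nn

    # Hypothetical stand-in: a TransformerEncoder, one of the
    # fastpath-eligible core modules listed above.
    layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2)
    x = torch.rand(2, 16, 512)  # (batch, sequence, embedding)

    encoder.eval()             # criterion 1: inference mode
    with torch.no_grad():      # criterion 2: no gradient tape collected
        y = encoder(x)         # eligible for fastpath execution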

To follow this example in Google Colab, `click here
To follow this example in Google Colab, `click here
<https://colab.research.google.com/drive/1KZnMJYhYkOMYtNIX5S3AGIYnjyG0AojN?usp=sharing>`__.

Better Transformer Features in This Tutorial


이 νŠœν† λ¦¬μ–Όμ—μ„œ Better Transformer의 κΈ°λŠ₯λ“€
--------------------------------------------

* Load pretrained models (created before PyTorch version 1.12 without Better Transformer)
* Run and benchmark inference on CPU with and without BT fastpath (native MHA only)
* Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA only)
* Enable sparsity support
* Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA + sparsity)
* 사전 ν›ˆλ ¨λœ λͺ¨λΈ λ‘œλ“œ (Better Transformer 없이 PyTorch 버전 1.12 이전에 μƒμ„±λœ λͺ¨λΈ)
* CPUμ—μ„œ BT fastpathλ₯Ό μ‚¬μš©ν•œ κ²½μš°μ™€ μ‚¬μš©ν•˜μ§€ μ•Šμ€ 경우의 μΆ”λ‘ μ˜ μ‹€ν–‰ 및 벀치마크 (λ„€μ΄ν‹°λΈŒ MHA만 ν•΄λ‹Ή)
* (ꡬ성 κ°€λŠ₯ν•œ)λ””λ°”μ΄μŠ€μ—μ„œ BT fastpathλ₯Ό μ‚¬μš©ν•œ κ²½μš°μ™€ μ‚¬μš©ν•˜μ§€ μ•Šμ€ 경우의 μΆ”λ‘ μ˜ μ‹€ν–‰ 및 벀치마크 (λ„€μ΄ν‹°λΈŒ MHA만 ν•΄λ‹Ή)
* sparsity 지원 ν™œμ„±ν™”
* (ꡬ성 κ°€λŠ₯ν•œ) λ””λ°”μ΄μŠ€μ—μ„œ BT fastpathλ₯Ό μ‚¬μš©ν•œ κ²½μš°μ™€ μ‚¬μš©ν•˜μ§€ μ•Šμ€ 경우의 μΆ”λ‘ μ˜ μ‹€ν–‰ 및 벀치마크 (λ„€μ΄ν‹°λΈŒ MHA + ν¬μ†Œμ„±)


Additional Information

Additional Information
-----------------------
Additional information about Better Transformer may be found in the PyTorch.Org blog
`A Better Transformer for Fast Transformer Inference
더 λ‚˜μ€ νŠΈλžœμŠ€ν¬λ¨Έμ— λŒ€ν•œ μΆ”κ°€ μ •λ³΄λŠ” PyTorch.Org λΈ”λ‘œκ·Έμ—μ„œ 확인할 수 μžˆμŠ΅λ‹ˆλ‹€.
`고속 트랜슀포머 좔둠을 μœ„ν•œ Better Transformer
<https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference//>`__.



1. Setup
1. Setup

1.1 Load pretrained models
1.1 Load pretrained models

We download the XLM-R model from the predefined torchtext models by following the instructions in
`torchtext.models <https://pytorch.org/text/main/models.html>`__. We also set the DEVICE to execute
on-accelerator tests. (Enable GPU execution for your environment as appropriate.)
Following the instructions in `torchtext.models <https://pytorch.org/text/main/models.html>`__, we download the XLM-R model from the predefined torchtext models.
We also set DEVICE to run the on-accelerator tests. (Enable GPU execution for your environment as appropriate.)

.. code-block:: python
@@ -74,9 +76,9 @@ on-accelerator tests. (Enable GPU execution for your environment as appropriate
model = xlmr_large.get_model(head=classifier_head)
transform = xlmr_large.transform()
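
The collapsed lines above contain the imports and DEVICE selection; a sketch of that setup, assuming the torchtext bundle API described in `torchtext.models`:

.. code-block:: python

    import torch
    import torchtext

    # Run on GPU when available (adjust for your environment).
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Predefined XLM-R large bundle with a two-class classification head;
    # input_dim=1024 matches the large encoder's embedding width.
    xlmr_large = torchtext.models.XLMR_LARGE_ENCODER
    classifier_head = torchtext.models.RobertaClassificationHead(num_classes=2, input_dim=1024)
    model = xlmr_large.get_model(head=classifier_head)
    transform = xlmr_large.transform()
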
1.2 Dataset Setup
1.2 Dataset Setup

We set up two types of inputs: a small input batch and a big input batch with sparsity.
두 가지 μœ ν˜•μ˜ μž…λ ₯을 μ„€μ •ν•˜κ² μŠ΅λ‹ˆλ‹€. μž‘μ€ μž…λ ₯ λ°°μΉ˜μ™€ sparsityλ₯Ό 가진 큰 μž…λ ₯ λ°°μΉ˜μž…λ‹ˆλ‹€.

.. code-block:: python
@@ -104,7 +106,7 @@ We set up two types of inputs: a small input batch and a big input batch with sp
St. Petersburg, used only by the elite."""
]
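
For reference, both batches are plain lists of strings; a sketch (long passage abridged, values assumed from the surrounding context):

.. code-block:: python

    small_input_batch = [
        "Hello world",
        "How are you!",
    ]

    # Mixing short strings with one long passage means that, after padding,
    # most tokens in the short rows are padding -- the sparsity that the
    # BT fastpath can exploit.
    big_input_batch = [
        "Hello world",
        "How are you!",
        """`Well, Prince, so Genoa and Lucca are now just family estates of the
    Buonapartes. [...] St. Petersburg, used only by the elite.""",
    ]
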
Next, we select either the small or large input batch, preprocess the inputs and test the model.
λ‹€μŒμœΌλ‘œ, μž‘μ€ μž…λ ₯ 배치 λ˜λŠ” 큰 μž…λ ₯ 배치 쀑 ν•˜λ‚˜λ₯Ό μ„ νƒν•˜κ³ , μž…λ ₯을 μ „μ²˜λ¦¬ν•œ ν›„ λͺ¨λΈμ„ ν…ŒμŠ€νŠΈν•©λ‹ˆλ‹€.

.. code-block:: python
@@ -114,23 +116,23 @@ Next, we select either the small or large input batch, preprocess the inputs and
output = model(model_input)
output.shape
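
The collapsed line above builds ``model_input``; a sketch of that preprocessing step (it relies on ``model`` and ``transform`` from section 1.1, and assumes XLM-R's padding index of 1):

.. code-block:: python

    from torchtext.functional import to_tensor

    input_batch = big_input_batch  # or small_input_batch

    # transform() tokenizes and numericalizes each string; to_tensor()
    # pads the ragged token lists into a single dense tensor.
    model_input = to_tensor(transform(input_batch), padding_value=1)
    output = model(model_input)
    print(output.shape)  # (batch_size, num_classes)
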
Finally, we set the benchmark iteration count:
Finally, we set the benchmark iteration count:

.. code-block:: python
ITERATIONS=10
2. Execution
2. Execution

2.1 Run and benchmark inference on CPU with and without BT fastpath (native MHA only)
2.1 CPUμ—μ„œ BT fastpathλ₯Ό μ‚¬μš©ν•œ κ²½μš°μ™€ μ‚¬μš©ν•˜μ§€ μ•Šμ€ 경우의 μΆ”λ‘ μ˜ μ‹€ν–‰ 및 벀치마크 (λ„€μ΄ν‹°λΈŒ MHA만 ν•΄λ‹Ή)

We run the model on CPU, and collect profile information:
CPUμ—μ„œ λͺ¨λΈμ„ μ‹€ν–‰ν•˜κ³  ν”„λ‘œνŒŒμΌ 정보λ₯Ό μˆ˜μ§‘ν•©λ‹ˆλ‹€:

* The first run uses traditional ("slow path") execution.
* The second run enables BT fastpath execution by putting the model in inference mode using `model.eval()` and disables gradient collection with `torch.no_grad()`.
* The first run uses traditional ("slow path") execution.
* The second run enables BT fastpath execution by putting the model in inference mode with `model.eval()` and disabling gradient collection with `torch.no_grad()` (a sketch of this comparison follows the code block below).

You can see an improvement (whose magnitude will depend on the CPU model) when the model is executing on CPU. Notice that the fastpath profile shows most of the execution time
in the native `TransformerEncoderLayer` implementation `aten::_transformer_encoder_layer_fwd`.
CPUμ—μ„œ λͺ¨λΈμ„ μ‹€ν–‰ν•  λ•Œ μ„±λŠ₯이 ν–₯μƒλœ 것을 λ³Ό 수 μžˆμ„ κ²λ‹ˆλ‹€.(ν–₯상 μ •λ„λŠ” CPU λͺ¨λΈμ— 따라 λ‹€λ¦…λ‹ˆλ‹€)
fastpath ν”„λ‘œνŒŒμΌμ—μ„œ λŒ€λΆ€λΆ„μ˜ μ‹€ν–‰ μ‹œκ°„μ΄ λ„€μ΄ν‹°λΈŒ `TransformerEncoderLayer`의 μ €μˆ˜μ€€ 연산을 κ΅¬ν˜„ν•œ `aten::_transformer_encoder_layer_fwd`에 μ†Œμš”λ˜λŠ” 것을 μ£Όλͺ©ν•˜μ„Έμš”:

.. code-block:: python
@@ -152,29 +154,28 @@ in the native `TransformerEncoderLayer` implementation `aten::_transformer_encod
print(prof)
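
A sketch of what the collapsed benchmark above contains, comparing the two runs on CPU (it relies on ``model``, ``model_input``, and ``ITERATIONS`` from the earlier steps):

.. code-block:: python

    print("slow path:")
    with torch.autograd.profiler.profile(use_cuda=False) as prof:
        for i in range(ITERATIONS):
            output = model(model_input)
    print(prof)

    model.eval()  # inference mode is required for the BT fastpath

    print("fast path:")
    with torch.autograd.profiler.profile(use_cuda=False) as prof:
        with torch.no_grad():  # no gradient tape -> fastpath eligible
            for i in range(ITERATIONS):
                output = model(model_input)
    print(prof)
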
2.2 Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA only)
2.2 Run and benchmark inference on (configurable) DEVICE with and without the BT fastpath (native MHA only)

We check the BT sparsity setting:
We check the BT sparsity setting:

.. code-block:: python
model.encoder.transformer.layers.enable_nested_tensor
We disable the BT sparsity:
μ΄λ²ˆμ—” BT sparsity을 λΉ„ν™œμ„±ν™”ν•©λ‹ˆλ‹€.

.. code-block:: python
model.encoder.transformer.layers.enable_nested_tensor=False
We run the model on DEVICE, and collect profile information for native MHA execution on DEVICE:
DEVICEμ—μ„œ λͺ¨λΈμ„ μ‹€ν–‰ν•˜κ³ , DEVICEμ—μ„œμ˜ λ„€μ΄ν‹°λΈŒ MHA 싀행에 λŒ€ν•œ ν”„λ‘œνŒŒμΌ 정보λ₯Ό μˆ˜μ§‘ν•©λ‹ˆλ‹€:

* The first run uses traditional ("slow path") execution.
* The second run enables BT fastpath execution by putting the model in inference mode using `model.eval()`
and disables gradient collection with `torch.no_grad()`.
* The first run uses traditional ("slow path") execution.
* The second run enables BT fastpath execution by putting the model in inference mode with `model.eval()` and disabling gradient collection with `torch.no_grad()` (sketched after the code block below).

When executing on a GPU, you should see a significant speedup, in particular for the small input batch setting:
GPUμ—μ„œ μ‹€ν–‰ν•  λ•Œ, 특히 μž‘μ€ μž…λ ₯ 배치둜 μ„€μ •ν•œ 경우 속도가 크게 ν–₯μƒλ˜λŠ” 것을 λ³Ό 수 μžˆμ„ κ²λ‹ˆλ‹€.

.. code-block:: python
@@ -199,20 +200,20 @@ When executing on a GPU, you should see a significant speedup, in particular for
print(prof)
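
A sketch of the collapsed DEVICE benchmark above: the model and inputs move to DEVICE first, and the profiler times CUDA kernels; the slow-path/fastpath comparison itself mirrors section 2.1:

.. code-block:: python

    model.to(DEVICE)
    model_input = model_input.to(DEVICE)

    # Fastpath run; the slow-path run is identical minus eval()/no_grad().
    model.eval()
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        with torch.no_grad():
            for i in range(ITERATIONS):
                output = model(model_input)
    print(prof)
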
2.3 Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA + sparsity)
2.3 Run and benchmark inference on (configurable) DEVICE with and without the BT fastpath (native MHA + sparsity)

We enable sparsity support:
We enable sparsity support:

.. code-block:: python
model.encoder.transformer.layers.enable_nested_tensor = True
We run the model on DEVICE, and collect profile information for native MHA and sparsity support execution on DEVICE:
DEVICEμ—μ„œ λͺ¨λΈμ„ μ‹€ν–‰ν•˜κ³ , DEVICEμ—μ„œμ˜ λ„€μ΄ν‹°λΈŒ MHA와 sparsity 지원 싀행에 λŒ€ν•œ ν”„λ‘œνŒŒμΌ 정보λ₯Ό μˆ˜μ§‘ν•©λ‹ˆλ‹€:

* The first run uses traditional ("slow path") execution.
* The second run enables BT fastpath execution by putting the model in inference mode using `model.eval()` and disables gradient collection with `torch.no_grad()`.
* The first run uses traditional ("slow path") execution.
* The second run enables BT fastpath execution by putting the model in inference mode with `model.eval()` and disabling gradient collection with `torch.no_grad()`.

When executing on a GPU, you should see a significant speedup, in particular for the large input batch setting which includes sparsity:
GPUμ—μ„œ μ‹€ν–‰ν•  λ•Œ, 특히 sparsityλ₯Ό ν¬ν•¨ν•˜λŠ” 큰 μž…λ ₯ 배치 μ„€μ •μ—μ„œ μƒλ‹Ήν•œ 속도 ν–₯상을 λ³Ό 수 μžˆμ„ κ²λ‹ˆλ‹€.

.. code-block:: python
@@ -237,15 +238,10 @@ When executing on a GPU, you should see a significant speedup, in particular for
print(prof)
Summary
Summary
-------

In this tutorial, we have introduced fast transformer inference with
Better Transformer fastpath execution in torchtext using PyTorch core
Better Transformer support for Transformer Encoder models. We have
demonstrated the use of Better Transformer with models trained prior to
the availability of BT fastpath execution. We have demonstrated and
benchmarked the use of both BT fastpath execution modes, native MHA execution
and BT sparsity acceleration.



이 νŠœν† λ¦¬μ–Όμ—μ„œλŠ” torchtextμ—μ„œ PyTorch μ½”μ–΄μ˜ 트랜슀포머 인코더 λͺ¨λΈμ„ μœ„ν•œ Better Transformer 지원을 ν™œμš©ν•˜μ—¬,
Better Transformerλ₯Ό μ΄μš©ν•œ 고속 트랜슀포머 좔둠을 μ†Œκ°œν–ˆμŠ΅λ‹ˆλ‹€.
BT fastpath 싀행이 κ°€λŠ₯해지기 이전에 ν›ˆλ ¨λœ λͺ¨λΈμ—μ„œ Better Transformer의 μ‚¬μš©μ„ μ‹œμ—°ν–ˆμŠ΅λ‹ˆλ‹€.
λ˜ν•œ BT fastpath μ‹€ν–‰μ˜ 두 가지 λͺ¨λ“œμΈ λ„€μ΄ν‹°λΈŒ MHA μ‹€ν–‰κ³Ό BT sparsity κ°€μ†ν™”μ˜ μ‚¬μš©μ„ μ‹œμ—° 및 벀치마크λ₯Ό ν•΄λ³΄μ•˜μŠ΅λ‹ˆλ‹€.
