From 11c3839790fb8555918699b015a93fc8ad9c91d3 Mon Sep 17 00:00:00 2001
From: Wonhyeong Seo
Date: Tue, 3 Oct 2023 01:55:33 +0900
Subject: [PATCH] 🌐 [i18n-KO] Translated `tokenizer_summary.md` to Korean (#26243)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* docs: ko: toknenizer_summary.md

Co-Authored-By: Sohyun Sim <96299403+sim-so@users.noreply.github.com>
Co-Authored-By: Juntae <79131091+sronger@users.noreply.github.com>
Co-Authored-By: Injin Paek <71638597+eenzeenee@users.noreply.github.com>

* update review

* fix: resolve suggestions

Co-Authored-By: Nayeon Han
Co-Authored-By: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* fix: resolve suggestions

Co-authored-by: Hyeonseo Yun <0525yhs@gmail.com>

---------

Co-authored-by: HanNayeoniee
Co-authored-by: Sohyun Sim <96299403+sim-so@users.noreply.github.com>
Co-authored-by: Juntae <79131091+sronger@users.noreply.github.com>
Co-authored-by: Injin Paek <71638597+eenzeenee@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Hyeonseo Yun <0525yhs@gmail.com>
---
 docs/source/ko/_toctree.yml         |   4 +-
 docs/source/ko/tokenizer_summary.md | 253 ++++++++++++++++++++++++++++
 2 files changed, 255 insertions(+), 2 deletions(-)
 create mode 100644 docs/source/ko/tokenizer_summary.md

diff --git a/docs/source/ko/_toctree.yml b/docs/source/ko/_toctree.yml
index 50b9218b21cd71..e086bc4adc9f3c 100644
--- a/docs/source/ko/_toctree.yml
+++ b/docs/source/ko/_toctree.yml
@@ -174,8 +174,8 @@
     title: How 🤗 Transformers solve tasks
   - local: model_summary
     title: The Transformer model family
-  - local: in_translation
-    title: (Translation in progress) Summary of the tokenizers
+  - local: tokenizer_summary
+    title: Summary of the tokenizers
   - local: attention
     title: Attention mechanisms
   - local: pad_truncation
diff --git a/docs/source/ko/tokenizer_summary.md b/docs/source/ko/tokenizer_summary.md
new file mode 100644
index 00000000000000..5c6b9a6b73ca5f
--- /dev/null
+++ b/docs/source/ko/tokenizer_summary.md
@@ -0,0 +1,253 @@
+
+# Summary of the tokenizers[[summary-of-the-tokenizers]]
+
+[[open-in-colab]]
+
+On this page, we will have a closer look at tokenization.
+
+As we saw in [the preprocessing tutorial](preprocessing), tokenizing a text means splitting it into words or subwords, which are then converted to ids through a look-up table.
+Converting words or subwords to ids is straightforward, so this document focuses on splitting a text into words or subwords (i.e. tokenizing a text).
+More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: [Byte-Pair Encoding (BPE)](#byte-pair-encoding), [WordPiece](#wordpiece), and [SentencePiece](#sentencepiece), and show examples of which tokenizer type is used by which model.
+
+On each model page, you can look at the documentation of the associated tokenizer to see which tokenizer type the pretrained model used.
+For instance, if we look at [`BertTokenizer`], we can see that the model uses [WordPiece](#wordpiece).
+
+## Introduction[[introduction]]
+
+Splitting a text into smaller chunks is harder than it looks, and there are multiple ways of doing it.
+For instance, let's look at the sentence `"Don't you love 🤗 Transformers? We sure do."`
+
+A simple way of tokenizing this sentence is to split it by spaces.
+The tokenized result is:
+
+```
+["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]
+```
+This is a sensible first step, but if we look at the tokens `"Transformers?"` and `"do."`, we notice that the punctuation is attached to the words `"Transformers"` and `"do"`.
+We should take the punctuation into account so that a model does not have to learn a different representation of a word for every punctuation symbol that could follow it. Otherwise, the number of representations the model has to learn would explode.
+
+Taking punctuation into account, the tokenized result is:
+
+```
+["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
+```
+
+Better. However, the way the word `"Don't"` was tokenized still needs work.
+`"Don't"` stands for `"do not"`, so it would be better tokenized as `["Do", "n't"]`.
+This is where things start getting complicated, and it is one of the reasons each model has its own tokenizer type.
+Depending on the rules we apply for tokenizing a text, a different tokenized output is generated for the same text.
+A pretrained model only performs properly if you feed it an input that was tokenized with the same rules that were used to tokenize its training data.
+
+[spaCy](https://spacy.io/) and [Moses](http://www.statmt.org/moses/?n=Development.GetStarted) are two popular rule-based tokenizers. Applied to our example, *spaCy* and *Moses* would output something like:
+
+```
+["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
+```
+
+As can be seen, both space and punctuation tokenization and rule-based tokenization are used here.
+Space and punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined as splitting sentences into words.
+While it is the most intuitive way to split a text into smaller chunks, this tokenization method can lead to problems for massive text corpora.
+In this case, space and punctuation tokenization usually generates a very big vocabulary (the set of all unique words and tokens used).
+*E.g.*, [Transformer XL](model_doc/transformerxl) uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!
+
+Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which increases both the memory and time complexity.
+In general, transformers models rarely have a vocabulary size greater than 50,000, especially if they are pretrained on only a single language.
+So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters?
+
+While character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder for the model to learn meaningful input representations.
+
+*E.g.*, learning a meaningful context-independent representation for the letter `"t"` is much harder than learning a context-independent representation for the word `"today"`.
+Character tokenization is therefore often accompanied by a loss of performance, so to get the best of both worlds, transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.
+
+## Subword tokenization[[subword-tokenization]]
+
+Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, while rare words should be decomposed into meaningful subwords.
+For instance, `"annoyingly"` might be considered a rare word and could be decomposed into `"annoying"` and `"ly"`.
+Both `"annoying"` and `"ly"` appear frequently as stand-alone subwords, while the meaning of `"annoyingly"` is kept by the composite of `"annoying"` and `"ly"`.
+This is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.
+
+Subword tokenization allows the model to have a reasonable vocabulary size while still learning meaningful context-independent representations.
+In addition, subword tokenization enables the model to process words it has never seen before by decomposing them into known subwords.
+
+For instance, the [`~transformers.BertTokenizer`] tokenizes `"I have a new GPU!"` as follows:
+
+```py
+>>> from transformers import BertTokenizer
+
+>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
+>>> tokenizer.tokenize("I have a new GPU!")
+["i", "have", "a", "new", "gp", "##u", "!"]
+```
+
+Because we are using the uncased model, the sentence was lowercased first.
+We can see that the words `["i", "have", "a", "new"]` are present in the tokenizer's vocabulary, but the word `"gpu"` is not.
+Consequently, the tokenizer splits `"gpu"` into two known subwords: `["gp", "##u"]`.
+`"##"` means that the rest of the token should be attached to the previous one without a space (for decoding, i.e. reversing the tokenization).
+
+As another example, [`~transformers.XLNetTokenizer`] tokenizes our earlier example sentence as follows:
+```py
+>>> from transformers import XLNetTokenizer
+
+>>> tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
+>>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
+["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."]
+```
+
+We will get back to the meaning of `"▁"` when we look at [SentencePiece](#sentencepiece).
+As one can see, the rare word `"Transformers"` has been split into the subwords `"Transform"` and `"ers"`.
+
+Let's now look at how the different subword tokenization algorithms work.
+Note that all of these tokenization algorithms rely on some form of training, which is usually done on the corpus the corresponding model will be trained on.
+
+### Byte-Pair Encoding (BPE)[[bytepair-encoding-bpe]]
+
+Byte-Pair Encoding (BPE) was introduced in [Neural Machine Translation of Rare Words with Subword Units (Sennrich et
+al., 2015)](https://arxiv.org/abs/1508.07909).
+BPE relies on a pre-tokenizer that splits the training data into words.
+Pre-tokenization can be as simple as space tokenization, e.g. [GPT-2](model_doc/gpt2), [RoBERTa](model_doc/roberta).
+More advanced pre-tokenization includes rule-based tokenization, which is used to count the frequency of each word in the training corpus.
+Examples are [XLM](model_doc/xlm) and [FlauBERT](model_doc/flaubert), which use Moses for most languages, and [GPT](model_doc/gpt), which uses spaCy and ftfy.
+
+
+After pre-tokenization, a set of unique words has been created and the frequency with which each word occurs in the training data has been determined.
+Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules that form a new symbol from two symbols of the base vocabulary.
+It repeats this process until the vocabulary has reached the desired vocabulary size.
+Note that the desired vocabulary size is a hyperparameter to define before training the tokenizer.
+
+As an example, let's assume that after pre-tokenization, the following set of words including their frequencies has been determined:
+
+```
+("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
+```
+
+Consequently, the base vocabulary is `["b", "g", "h", "n", "p", "s", "u"]`. Splitting all words into symbols of the base vocabulary, we obtain:
+
+```
+("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
+```
+
+BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently.
+In the example above, `"h"` followed by `"u"` is present _10 + 5 = 15_ times (10 times in `"hug"`, 5 times in `"hugs"`).
+
+However, the most frequent symbol pair is `"u"` followed by `"g"`, occurring _10 + 5 + 5 = 20_ times in total.
+Thus, the first pair the tokenizer merges is `"u"` followed by `"g"`. `"ug"` is added to the vocabulary, and the set of words becomes:
+
+```
+("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
+```
+
+BPE then identifies the next most common symbol pair.
+It is `"u"` followed by `"n"`, which occurs 16 times, so the pair is merged to `"un"` and added to the vocabulary.
+The next most frequent symbol pair is `"h"` followed by `"ug"`, occurring 15 times.
+Again the pair is merged, and `"hug"` is added to the vocabulary.
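The three merges walked through above for the `hug`/`pug`/`pun`/`bun`/`hugs` example can be reproduced with a short Python sketch. This is only an illustration under simplifying assumptions: the helper names `most_frequent_pair` and `merge_pair` are hypothetical, words are stored as tuples of symbols, and the sketch is not the implementation used by 🤗 Tokenizers.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0]  # ((left, right), count)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` in every word by the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Word frequencies from the example; each word starts as a tuple of characters.
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):  # three merges, as in the walkthrough
    pair, count = most_frequent_pair(words)
    merges.append((pair, count))
    words = merge_pair(words, pair)

print(merges)
print(words)
```

In a real tokenizer the loop would run until the target vocabulary size is reached rather than for a fixed three steps, and ties between equally frequent pairs would need an explicit rule.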
+
+At this stage, the vocabulary is `["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]` and our set of unique words is represented as:
+
+```
+("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
+```
+
+Assuming that Byte-Pair Encoding training stops at this point, the learned merge rules are then applied to new words (as long as those new words do not include symbols that are missing from the base vocabulary).
+For instance, the word `"bug"` would be tokenized as `["b", "ug"]`, but `"mug"` would be tokenized as `["<unk>", "ug"]` since the symbol `"m"` is not in the base vocabulary.
+In general, single letters such as `"m"` are not replaced by the `"<unk>"` symbol because the training data usually includes at least one occurrence of each letter, but it can happen for very special characters such as emojis.
+
+As mentioned earlier, the vocabulary size (i.e. the base vocabulary size + the number of merges) is a hyperparameter to choose.
+For instance, [GPT](model_doc/gpt) has a vocabulary size of 40,478, since it has 478 base characters and training was stopped after 40,000 merges.
+
+#### Byte-level BPE[[bytelevel-bpe]]
+
+A base vocabulary that includes all possible base characters can become quite large (e.g. if all unicode characters are considered as base characters).
+To have a better base vocabulary, [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) uses bytes as the base vocabulary.
+This limits the base vocabulary to size 256 while ensuring that every base character is included in the vocabulary.
+With some additional rules to deal with punctuation, the GPT-2 tokenizer can tokenize every text without the need for the `<unk>` symbol.
+[GPT-2](model_doc/gpt) has a vocabulary size of 50,257, which corresponds to the 256 byte-level base tokens, a special end-of-text token, and the symbols learned with 50,000 merges.
+
+### WordPiece[[wordpiece]]
+
+WordPiece is the subword tokenization algorithm used for [BERT](model_doc/bert), [DistilBERT](model_doc/distilbert), and [Electra](model_doc/electra).
+The algorithm was outlined in [Japanese and Korean Voice Search (Schuster et al., 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) and is very similar to BPE.
+WordPiece first initializes the vocabulary to include every character present in the training data and then progressively learns a given number of merge rules.
+In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
+
+So what does this mean exactly?
+Referring to the previous example, maximizing the likelihood of the training data is equivalent to finding the symbol pair whose probability, divided by the product of the probabilities of its first and second symbols, is the greatest among all symbol pairs.
+*E.g.* `"u"` followed by `"g"` would only be merged if the probability of `"ug"` divided by the probabilities of `"u"` and `"g"` were greater than for any other symbol pair.
+Intuitively, WordPiece is slightly different from BPE in that it evaluates what it _loses_ by merging two symbols to make sure the merge is _worth it_.
+
+### Unigram[[unigram]]
+
+Unigram is a subword tokenization algorithm introduced in [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, 2018)](https://arxiv.org/pdf/1804.10959.pdf).
+In contrast to BPE or WordPiece, Unigram initializes its base vocabulary to a large number of symbols and progressively trims it down to obtain a smaller vocabulary.
+The base vocabulary could, for instance, correspond to all pre-tokenized words and the most common substrings.
+Unigram is not used directly for any of the models in transformers, but it is used in conjunction with [SentencePiece](#sentencepiece).
+
+At each training step, the Unigram algorithm defines a loss (often defined as the negative log-likelihood) over the training data given the current vocabulary and a unigram language model.
+Then, for each symbol in the vocabulary, the algorithm computes how much the overall loss would increase if the symbol were removed from the vocabulary.
+Unigram then removes p percent (with p usually being 10% or 20%) of the symbols whose loss increase is the lowest, i.e. those symbols that least affect the overall loss over the training data.
+This process is repeated until the vocabulary has reached the desired size.
+The Unigram algorithm always keeps the base characters so that any word can be tokenized.
+Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of tokenizing new text after training.
+
+As an example, if a trained Unigram tokenizer has the vocabulary:
+
+```
+["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"],
+```
+
+`"hugs"` could be tokenized as `["hug", "s"]`, `["h", "ug", "s"]`, or `["h", "u", "g", "s"]`.
+
+So which one should we choose?
+Unigram saves the probability of each token in the training corpus in addition to the vocabulary, so that the probability of each possible tokenization can be computed after training.
+In practice the algorithm simply picks the most likely tokenization, but it also offers the possibility to sample a tokenization according to these probabilities.
+Those probabilities are defined by the loss the tokenizer is trained on.
+
+Assuming that the training data consists of the words \\(x_{1}, \dots, x_{N}\\) and that the set of all possible tokenizations for a word \\(x_{i}\\) is defined as \\(S(x_{i})\\), the overall loss is defined as:
+
+$$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )$$
+
+### SentencePiece[[sentencepiece]]
+
+All tokenization algorithms described so far have the same problem: they assume that the input text uses spaces to separate words.
+However, not all languages use spaces to separate words.
+One possible solution is to use language-specific pre-tokenizers. For example, [XLM](model_doc/xlm) uses specific Chinese, Japanese, and Thai pre-tokenizers.
+To solve this problem more generally, [SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018)](https://arxiv.org/pdf/1808.06226.pdf) treats the input as a raw stream, so that the space is included in the set of characters to use.
+It then uses the BPE or unigram algorithm to construct the appropriate vocabulary.
+
+The [`XLNetTokenizer`] uses SentencePiece, which is why the `"▁"` character was included in the vocabulary in the example above.
+Decoding text tokenized with SentencePiece is easy, since all tokens can simply be concatenated and `"▁"` replaced by a space.
+
+All models in transformers that use a SentencePiece tokenizer use it in combination with unigram.
+Examples of models using SentencePiece are [ALBERT](model_doc/albert), [XLNet](model_doc/xlnet), [Marian](model_doc/marian), and [T5](model_doc/t5).
\ No newline at end of file
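The decoding rule described in the SentencePiece section (concatenate all tokens, then replace `"▁"` with a space) can be sketched in a few lines of Python. This is a minimal illustration, not the library's decoder: the function name `decode_sentencepiece` is hypothetical, and stripping the single leading space is an assumption about how the marker on the first token should be handled.

```python
MARKER = "\u2581"  # the "▁" character SentencePiece uses in place of a space


def decode_sentencepiece(tokens):
    """Join the tokens, turn each marker back into a space, and drop the leading space."""
    text = "".join(tokens).replace(MARKER, " ")
    # Assumption: the marker on the first token does not correspond to a real space.
    return text[1:] if text.startswith(" ") else text


# Token list taken from the XLNetTokenizer example earlier in the document.
tokens = ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers",
          "?", "▁We", "▁sure", "▁do", "."]
decoded = decode_sentencepiece(tokens)
print(decoded)
```

Because the space is an ordinary symbol in the vocabulary, no language-specific rules are needed to put the sentence back together.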