Release Pretrain Dataset
Beomi authored Aug 22, 2020
1 parent 3b101b2 commit fff9e4f
Showing 1 changed file with 17 additions and 0 deletions.
17 changes: 17 additions & 0 deletions README.md
@@ -1,5 +1,15 @@
# KcBERT: Korean comments BERT

** Updates on 2020.08.22 **

Pretrain dataset released: https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments

The dataset cleaned for training (processed with the `clean` function below) has been released on Kaggle!

Download it yourself and try training on a variety of tasks :)

---

Most publicly released Korean BERT models are trained on well-curated data such as Korean Wikipedia, news articles, and books. In contrast, comment-style datasets such as NSMC are uncurated: they are colloquial, full of neologisms, and frequently contain typos and other expressions that do not appear in formal writing.

KcBERT is a pretrained BERT model built for datasets with these characteristics: comments and replies were collected from Naver News, and both the tokenizer and the BERT model were trained from scratch.
@@ -94,6 +104,13 @@ def clean(x):
    return x
```

### Cleaned Data (Released on Kaggle)

The original data, cleaned with the `clean` function above, is available as a 12GB txt file from the Kaggle dataset below :)

https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments
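
Because the cleaned corpus is about 12GB, it is usually better to read it line by line than to load it all at once. The following is a minimal sketch, not part of the original repository. It assumes the downloaded file is UTF-8 text with one comment per line. The helper names `iter_comments` and `preview` are hypothetical, as is the file path in the usage note.

```python
from itertools import islice
from pathlib import Path
from typing import Iterator, List


def iter_comments(path: Path, encoding: str = "utf-8") -> Iterator[str]:
    """Yield comments one at a time so the whole file never sits in memory.

    Assumption: one comment per line; blank lines are skipped.
    """
    with path.open("r", encoding=encoding) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.strip():
                yield line


def preview(path: Path, n: int = 5) -> List[str]:
    """Return the first n non-blank lines of the corpus for a quick look."""
    return list(islice(iter_comments(path), n))
```

For example, `preview(Path("path/to/downloaded_corpus.txt"), 3)` would return the first three comments, where the path is a placeholder for wherever the Kaggle file was saved.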


## Tokenizer Train

The tokenizer was trained with Huggingface's [Tokenizers](https://github.com/huggingface/tokenizers) library.