Train Data Release: v2022.3Q

@Beomi Beomi released this 07 Nov 06:55
· 2 commits to master since this release
0da95b2

Quarterly Aggregated Korean News Comments Dataset: v2022.3Q

Dataset Spec

  • v2022.3Q = 2022 Q3 release
  • Includes datasets from v2019.1Q through v2022.3Q
  • Total lines (excluding blank lines): 345,452,030
  • Date Range: 2019.01 ~ 2022.09

Difference from TrainData_v1

  • Comments and their replies within the same thread are separated by a single line break (\n)
  • Different threads are separated by a blank line (\n\n)
  • Duplicate comments within the same day are removed (only the first occurrence is kept)
  • Beyond that, the text is left as raw as possible, with minimal cleaning
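
The layout described above can be read back into threads with a few lines of Python. This is a minimal sketch, not the release's own tooling: the function names `parse_threads`, `count_non_blank_lines`, and `dedupe_keep_first` are hypothetical, the sample text is made up, and it assumes that no single comment contains a line break of its own.

```python
from typing import Iterable, List


def parse_threads(text: str) -> List[List[str]]:
    """Split raw text into threads; each thread is a list of comments.

    Assumes threads are separated by a blank line ("\n\n") and comments
    within a thread by a single line break ("\n").
    """
    threads = []
    for block in text.split("\n\n"):
        # Drop empty strings left over from trailing or extra line breaks.
        comments = [line for line in block.split("\n") if line.strip()]
        if comments:
            threads.append(comments)
    return threads


def count_non_blank_lines(text: str) -> int:
    """Count lines that contain something other than whitespace."""
    return sum(1 for line in text.split("\n") if line.strip())


def dedupe_keep_first(comments: Iterable[str]) -> List[str]:
    """Drop repeated comments, keeping the first occurrence in order.

    Illustrates the "only the first occurrence is kept" rule; the
    release's actual per-day deduplication code is not shown in the source.
    """
    seen = set()
    kept = []
    for comment in comments:
        if comment not in seen:
            seen.add(comment)
            kept.append(comment)
    return kept


# Made-up sample: one thread with a reply, then a second thread.
sample = "first comment\nreply to first\n\nsecond thread comment\n"
threads = parse_threads(sample)
non_blank = count_non_blank_lines(sample)
deduped = dedupe_keep_first(["a", "b", "a", "c", "b"])
```

For the full release, reading the file in streaming fashion (line by line, starting a new thread on each blank line) would avoid loading all 345,452,030 lines into memory at once.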