---
title: "1st Progress Report"
author: "Soobin Choi"
date: "3 November 2022"
output:
  github_document:
    toc: TRUE
---
# Progress Report
## Progress in general
I was able to import both the Korean Learner Corpus (KLC) and PELIC data from GitHub and finished sorting out what I need for my project, as documented in the file 'interim check'.
I also tokenized the KLC text, but the problem I face here is that I am not sure how to count the tokens in a single text. I am thinking of using `count()` and then summing the number of tokens that share the same user ID, but I haven't tried it yet, so I may run into errors or unwanted results. I am also not sure how to count the morphemes in KLC. I believe I can just count the number of slashes (/) in the corpus, since morpheme boundaries are marked with slashes, but I haven't figured this out yet.
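The count-then-sum and slash-counting ideas can be sketched in base R. The column names and the morpheme annotation format below are made-up illustrations, not the actual KLC layout:

```r
# Hypothetical KLC-style data: user_id and text column names are assumptions,
# and the slash/plus annotation format here is only illustrative.
klc <- data.frame(
  user_id = c("A", "A", "B"),
  text = c("나/NP+는/JX 학생/NNG+이/VCP+다/EF",
           "한국어/NNG+를/JKO 공부/NNG+하/XSV+ㄴ다/EF",
           "책/NNG+을/JKO 읽/VV+었/EP+다/EF"),
  stringsAsFactors = FALSE
)

# Token count per text: split on whitespace and count the pieces.
klc$n_tokens <- lengths(strsplit(klc$text, "\\s+"))

# Morpheme count per text: count the slashes marking morpheme boundaries.
klc$n_morphemes <- lengths(gregexpr("/", klc$text, fixed = TRUE))

# Sum token counts per user ID (the count-then-sum idea).
tokens_per_user <- aggregate(n_tokens ~ user_id, data = klc, FUN = sum)
```

The same per-user sum could be done in dplyr with `group_by(user_id)` and `summarise(sum(n_tokens))`.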
Luckily, the PELIC text is already tokenized, and the number of tokens in each text is included in the data, which will definitely facilitate my project. The problem I faced when mutating the PELIC data is that the information I need is scattered across several data sets, so I had to import almost all of them and merge them using the `join()` family of functions.
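A minimal sketch of that merging step, using base R's `merge()` (which behaves like dplyr's join functions); the table and column names are made up:

```r
# Hypothetical scattered PELIC-style tables (names are assumptions).
answers <- data.frame(answer_id = 1:3,
                      user_id   = c("u1", "u2", "u1"),
                      text_len  = c(120, 45, 80))
students <- data.frame(user_id = c("u1", "u2"),
                       native_language = c("Korean", "Arabic"))

# Keep every row of `answers`, attaching student info where it exists
# (equivalent to dplyr's left_join(answers, students, by = "user_id")).
pelic <- merge(answers, students, by = "user_id", all.x = TRUE)
```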
Once I figure out how to count the tokens in KLC, I believe (and hope) there will not be many issues left.
## Sharing plan
Regarding the Korean Learner Corpus, I am not sure at this moment because the data owner does not specify a license in the repository. Since the repository is publicly available, it might be okay to use the data and share the results, but I think the better approach is to reach out to the owner and ask for permission.
## Updates
### Progress report 1 (11-09-2022)
#### KLC
- sorting out the data based on nationality
- counting tokens using `unnest_tokens()`
#### PELIC
- merging datasets, keeping only the necessary columns
- filtering data based on text length (longer than 10)
- filtering based on native language (Korean only)
### Progress report 2 (11-14-2022)
I am struggling with collecting word types from the tokens: I am not sure how to use the `unique()` or `duplicated()` function on the tokens, and because of this I could not move further.
Also, in the case of PELIC, I need to make another column that contains only the lemma and POS, which I am still figuring out. I need Dan's help here.
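One possible way to build a combined lemma+POS column; the `lemma` and `pos` column names below are assumptions, not the actual PELIC layout:

```r
# Hypothetical PELIC-style token table (column names are made up).
pelic_tokens <- data.frame(lemma = c("write", "essay", "good"),
                           pos   = c("VB", "NN", "JJ"))

# Combine lemma and POS into a single column, e.g. "write_VB".
pelic_tokens$lemma_pos <- paste(pelic_tokens$lemma, pelic_tokens$pos, sep = "_")
```

In a dplyr pipeline, the same step would be `mutate(lemma_pos = paste(lemma, pos, sep = "_"))`.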
#### KLC
- tokenizing and counting the tokens for each essay
- I'm stuck here because I do not know how to count word types: how can I apply `unique()` or `distinct()` to values stored as a list?
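For a list column, `unique()` can be applied inside each list element, for example with `sapply()`; the data below is made up:

```r
# Hypothetical table with a list column of tokens (one vector per essay).
essays <- data.frame(essay_id = 1:2)
essays$tokens <- list(c("i", "like", "i", "read"),
                      c("she", "reads", "books"))

# Word types per essay: number of unique tokens in each list element.
essays$n_types <- sapply(essays$tokens, function(x) length(unique(x)))
```

dplyr's `n_distinct()` would do the same job inside `rowwise()` + `mutate()`.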
#### PELIC
- changing `full_join()` to `left_join()` to make the data smaller
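The size difference can be illustrated with base R's `merge()`, where `all.x = TRUE` corresponds to `left_join()` and `all = TRUE` to `full_join()`; the tables are made up:

```r
# left_join keeps only rows from the left table; full_join also keeps
# unmatched rows from the right table, so the result can grow.
left  <- data.frame(id = c(1, 2),    x = c("a", "b"))
right <- data.frame(id = c(2, 3, 4), y = c("p", "q", "r"))

left_joined <- merge(left, right, by = "id", all.x = TRUE)  # rows from `left` only
full_joined <- merge(left, right, by = "id", all   = TRUE)  # rows from both tables
```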
### Progress report 3 (12-01-2022)
#### KLC
- almost done with data wrangling; still need to sort out the necessary code
- data visualization and statistical analysis are still needed
#### PELIC
- only the second part of the data wrangling is left, but R crashes repeatedly due to the huge size of the data
- still, I do not think this will be a big problem for my presentation next Tuesday
- I hope I still remember the stats I learned last semester