forked from WFU-TLC/default_website
-
Notifications
You must be signed in to change notification settings - Fork 0
/
NB_v4_new_data.Rmd
426 lines (337 loc) · 19.6 KB
/
NB_v4_new_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
---
title: "Naive Bayes Classifier"
date: "`r format(Sys.time(), '%b %d, %Y')`"
bibliography: library.bib
biblio-style: apalike
---
```{r, child="_setup.Rmd"}
```
This is the current version of the Naive Bayes workflow used to classify TA comments based on Subject and Structure. Features of **Version 2r.2c**:
* Data rows used for training and test datasets are assigned randomly but reproducibly using 'set.seed'.
* Detailed final output matrix tables.
* Back-analysis evaluates both a testing sub-set and the entire input dataset.
* Outputs a tidy data table in CSV format with unmodified original comments, and hand-scored and NB-predicted Subject and Structure assignments.
####\
#Initial Setup
Calling required libraries.
```{r libraries}
library(tidyverse)
library(tm)
library(tidytext)
library(e1071)
library(caret)
library(ggplot2)
```
Import the initial dataset. Then extract comments and classification data as subsets of CSV.
```{r input data}
#Read in TA comments from CSV file.
base_dataVERB <- read_csv(file='data/coded_full_comments_dataset_Spring18anonVERB.csv')
#Select rows representing the sub-groups to compare.
comments_subsetVERB <- filter(base_dataVERB,code.subject=="1_basic"|code.subject=="2_writing"|code.subject=="3_technical"|code.subject=="4_logic")
#Reduce larger dataframe to 2 required columns of data, and put columns in order needed.
comments_rawVERB <- comments_subsetVERB %>% select(23,24,22)
#Rename the columns.
names(comments_rawVERB)[1] <- "subject"
names(comments_rawVERB)[2] <- "structure"
names(comments_rawVERB)[3] <- "text"
#Change "subject" element from character to a factor for analysis.
comments_rawVERB$subject <- factor(comments_rawVERB$subject)
str(comments_rawVERB$subject)
table(comments_rawVERB$subject)
#Change "structure" element from character to a factor for analysis.
comments_rawVERB$structure <- factor(comments_rawVERB$structure)
str(comments_rawVERB$structure)
table(comments_rawVERB$structure)
```
Randomize data used for training and testing sets. This block of code generates a reproducible vector of ~10,000 randomly assorted numbers using set.seed, then writes the vector back to data table.
```{r randomizer}
#Adjust the number so it matches the number of rows in the original data frame
#Next, set the seed for random numbers, and randomize the order of the vector
vector <- 1:9340 #second value must match number of rows in working dataset
set.seed(123) #change this seed value, and the data are randomized in a new sequence.
vector <- sample(vector)
glimpse(vector)
#Append the vector as a new column to "comments_rawVERB" with name "randomizer"
#Sort rows of dataframe according to "randomizer" column using dplyer 'arrange' function.
comments_rawVERB$randomizer <- c(vector)
comments_rawVERB_R <-comments_rawVERB %>% arrange(randomizer)
print(comments_rawVERB_R)
```
Convert data set to a volative corpus using "tm" library.
```{r corpusmaker}
#Create the volatile corpus that contains the "text" vector from data frame.
comments_corpusVERB_R <- VCorpus(VectorSource(comments_rawVERB_R$text))
print(comments_corpusVERB_R)
#Check out the first few text comments in the new corpus, which is basically a list that can be manipulated with list operations.
inspect(comments_corpusVERB_R[1:3])
#Use "as.character" function to see what a single text comment looks like.
as.character(comments_corpusVERB_R[[5]])
```
####\
#Text Data Transformations
The text data transforms in this code block are intentially separate "tm_map"" and "content_transformer" functions. This simplifies the process of testing how various combinations affect classification. Converting the text to lower case and removing punctuation are the baseline transformations. Switch between "{r}" and "{}" to turn other transformation blocks off or on as desired. **Removing extra white spaces must be the last transformation, and is required.**
```{r textcleanup}
#Convert text to all lower case letters.
comments_corpus_cleanVERB_R <- tm_map(comments_corpusVERB_R, content_transformer(tolower))
#Remove punctuation using the "removePunctuation" function.
#This step removes evidence of questions, so may remove data.
#THIS ALSO REMOVES EVIDENCE OF POINTER ITEMS
comments_corpus_cleanVERB_R <- tm_map(comments_corpus_cleanVERB_R, removePunctuation)
#Remove numerals.
#This step may remove evidence indicating technical comments
comments_corpus_cleanVERB_R <- tm_map(comments_corpus_cleanVERB_R, removeNumbers)
#Stopword removal.
#This is a standard cleanup step, but again may remove useful terms.
#The "stopwords" file is a convention; other files can be substituted for it
comments_corpus_cleanVERB_R <- tm_map(comments_corpus_cleanVERB_R, removeWords, stopwords())
#The option to apply stemming has been removed from this version. Stemming reduced accuracy in all trials. Future iterations should examine lemmatization instead.
#Final step removes white space from the document. This is NOT an optional step.
comments_corpus_cleanVERB_R <- tm_map(comments_corpus_cleanVERB_R, stripWhitespace)
#Look at an example of cleaned text comment to see if cleaned comment matches what is expected. Change the value inside double brackets to look at a different comment.
as.character((comments_corpus_cleanVERB_R[[5]]))
```
####\
#Tokenize the N-Grams
This code block uses tm's NLP commands to assess different n-grams. For single words, change the "#" variable in "(words(x), #)" to "1". Use other values to change the n-gram size. Data are stored in document-term matrix (dtm) format.
```{r tokenizer}
NLP_Tokenizer <- function(x) {
unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)
}
comments_dtm_1gram <- DocumentTermMatrix(comments_corpus_cleanVERB_R, control=list(tokenize = NLP_Tokenizer))
comments_dtm_1gram_tidy <- tidy(comments_dtm_1gram)
inspect(comments_dtm_1gram)
```
####\
#Split Data to Training, Testing Sets
This code splits data into 75% training and 25% testing sets, so that after Naive Bayes algorithm is trained it can be tested on unseen data.
```{}
#This block is for convenience. It is not required if the number of training, test rows is known.
.75 * 9340 #Number of rows in the dataset; product is #rows for training
.25 * 9340 #Product is #rows for testing set
```
This code divides the previously randomized comments into training, testing, and re-analysis subsets of data.
```{r divider}
comments_dtm_all <- comments_dtm_1gram[1:9340, ]
comments_dtm_train <- comments_dtm_1gram[1:7005, ]
comments_dtm_test <- comments_dtm_1gram[7006:9340, ]
```
This saves 3 vectors containing labels for the rows in the training and testing vectors
```{r labeler}
comments_all_labels_subject <- comments_rawVERB_R[1:9340,]$subject
comments_train_labels_subject <- comments_rawVERB_R[1:7005, ]$subject
comments_test_labels_subject <- comments_rawVERB_R[7006:9340,]$subject
comments_all_labels_structure <- comments_rawVERB_R[1:9340,]$structure
comments_train_labels_structure <- comments_rawVERB_R[1:7005, ]$structure
comments_test_labels_structure <- comments_rawVERB_R[7006:9340,]$structure
```
This checks the proportions of sub-categories in training and testing groups. The proportion of each sub-category should be similar (<1% difference) between the full dataset, and training and testing data subsets.
```{r proportion_check}
prop.table(table(comments_all_labels_subject))
prop.table(table(comments_train_labels_subject))
prop.table(table(comments_test_labels_subject))
prop.table(table(comments_all_labels_structure))
prop.table(table(comments_train_labels_structure))
prop.table(table(comments_test_labels_structure))
```
####\
#Preparing Data for Naive Bayes Training
Remove words from the document-term matrix that appear less than 5 times.
```{r common_words}
comments_freq_words <- findFreqTerms(comments_dtm_train, 5)
str(comments_freq_words)
```
Limit document-term matrix to words in the comments_freq_words vector. We are using all of the rows, but we want to limit the columns to these words in the frequency vector.
```{r limit_dtm}
comments_dtm_freq_all <- comments_dtm_all[ , comments_freq_words]
comments_dtm_freq_train <- comments_dtm_train[ , comments_freq_words]
comments_dtm_freq_test <- comments_dtm_test[ , comments_freq_words]
```
The e1071 Naive Bayes classifier works with categorical features, so the matrix must be converted "yes" and "no" categorical variables. This is done using a **convert_counts function** and applying it to the data. This replaces values greater than 0 with yes, and values not greater than 0 with no.
```{r convert_function}
convert_counts2 <- function(x) {
x <- ifelse(x > 0, "Yes", "No")
}
```
The resulting matrices have cells indicating "yes" or "no" if the word represented by the column appears in the text comment represented by the row.
The block below is an **alternate version** of the ```convert_function``` code block that KEEPs word frequencies while still using the same variable names and code structure. The processed matrices have cells with word frequencies instead of categorical yes/no values. This block **only** works when performing binary classification (ingroup/outgroup). The classifier fails when frequency values are used with data in multiple categories.
```{}
convert_counts2 <- function(x) {
y <- x
}
```
This block applies whichever "convert_counts2" function is currently active.
```{r apply_conversion}
comments_all <- apply(comments_dtm_freq_all, MARGIN = 2, convert_counts2)
comments_train <- apply(comments_dtm_freq_train, MARGIN = 2, convert_counts2)
comments_test <- apply(comments_dtm_freq_test, MARGIN = 2, convert_counts2)
```
####\
#Train Model, Predict & Evaluate for Subject
This block predicts whether a message is likely to be in group 1_basic, 2_writing, 3_technical, or 4_logic.
```{r bayes_training subject}
# Train the classifier on training data, then test on test dataset, and full dataset.
comments_classifier_subject <- naiveBayes(comments_train, comments_train_labels_subject, laplace=1)
comments_test_pred_subject <- predict(comments_classifier_subject, comments_test)
comments_all_pred_subject <- predict(comments_classifier_subject, comments_all)
```
```{r bayes_training structure}
comments_classifier_structure <- naiveBayes(comments_train, comments_train_labels_structure, laplace=1)
comments_test_pred_structure <- predict(comments_classifier_structure, comments_test)
comments_all_pred_structure <- predict(comments_classifier_structure, comments_all)
```
####\
A truth table lists how many of the predicted categories are in their correct categories. Adding diagonally and dividing by total count gives the percent accuracy at predicting Subject.
```{r}
# Create a truth table by tabulating the predicted class labels with the actual class labels
table("Predictions"= comments_test_pred_subject, "Actual Subject Test" = comments_test_labels_subject )
table("Predictions"= comments_all_pred_subject, "Actual Subject All" = comments_all_labels_subject )
#table("Predictions"= comments_test_pred_structure, "Actual Structure Test" = comments_test_labels_structure )
#table("Predictions"= comments_all_pred_structure, "Actual Structure All" = comments_all_labels_structure )
#An alternative is to use the Cross Table command (from "gmodels" package) to generate a more detailed output table with frequency values for each individual cell.
#CrossTable(comments_test_pred_subject, comments_test_labels_subject, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual'))
```
The faster method is to calculate a confusion matrix, which displays BOTH the truth table, and the final overall accuracy. A confusion matrix can be queried for specific values.
```{r reportOut}
# Prepare the confusion matrix.
conf.mat1 <- confusionMatrix(comments_test_pred_subject, comments_test_labels_subject)
conf.mat2 <- confusionMatrix(comments_all_pred_subject, comments_all_labels_subject)
#conf.mat3 <- confusionMatrix(comments_test_pred_structure, comments_test_labels_structure)
#conf.mat4 <- confusionMatrix(comments_all_pred_structure, comments_all_labels_structure)
#These commands show how to print out individual pieces of the confusion matrix. They show both the overall and by-group accuracy values for Subject x Test Subset.
print ("Test, Subject")
conf.mat1
# This block outputs the by-group accuracy values in a single column for Subject x Test Subset.
print ("Test, Subject")
conf.mat1$byClass
# This block outputs a single line table with overall accuracy, kappa, lower and higher limits of accuracy, accuracy null value, P value for accuracy, and Mcnemar P value for Subject x Test Subset.
print ("Test, Subject")
conf.mat1$overall
# This block outputs JUST the overall accuracy value for Subject x Test Subset.
print ("Test, Subject")
conf.mat1$overall['Accuracy']
# This block outputs both the overall and by-group accuracy values for Subject x All Data.
print ("All, Subject")
conf.mat2
# This block outputs by-group accuracy values in a single column for Subject x All Data.
print ("All, Subject")
conf.mat2$byClass
# This block outputs a single line table with overall accuracy, kappa, lower and higher limits of accuracy, accuracy null value, P value for accuracy, and Mcnemar P value for Subject x All Data.
print ("All, Subject")
conf.mat2$overall
# This block outputs JUST the overall accuracy value for Subject x All Data.
print ("All, Subject")
conf.mat2$overall['Accuracy']
# This block outputs both the overall and by-group accuracy values for Structure x Test Subset.
#print ("Test, Structure")
#conf.mat3
# This block outputs by-group accuracy values in a single column for Structure x Test Subset.
#print ("Test, Structure")
#conf.mat3$byClass
# This block outputs a single line table with overall accuracy, kappa, lower and higher limits of accuracy, accuracy null value, P value for accuracy, and Mcnemar P value for Structure x Test Subset.
#print ("Test, Structure")
#conf.mat3$overall
# This block outputs JUST the overall accuracy value for Structure x Test Subset.
#conf.mat3$overall['Accuracy']
#print ("Test, Structure")
# This block outputs both the overall and by-group accuracy values for Structure x All Data.
#print ("All, Structure")
#conf.mat4
# This block outputs by-group accuracy values in a single column for Structure x All Data.
#print ("All, Structure")
#conf.mat4$byClass
# This block outputs a single line table with overall accuracy, kappa, lower and higher limits of accuracy, accuracy null value, P value for accuracy, and Mcnemar P value for Structure x All Data.
#print ("All, Structure")
#conf.mat4$overall
# This block outputs JUST the overall accuracy value for Structure x All Data.
#print ("All, Structure")
#conf.mat4$overall['Accuracy']
```
####\
#Output for Review, Analysis
The code block below creates a CSV file containing the **original** unmodified TA comments, the subject and structure labels assigned by a human observer, and the corresponding predicted categories for each comment based on the Naive Bayes estimator. Results are exported to a CSV file. By default the CSV file should be saved to the current working directory.
```{r writeCSV}
#Creates a data frame that compares predicted and actual labels for subject and structure
tallymatrix <- data.frame("text"=c(comments_rawVERB_R[1:9340,]$text),"subject_label"=c(comments_all_labels_subject), "subject_predicted"=c(comments_all_pred_subject),"structure_label"=c(comments_all_labels_structure), "structure_predicted"=c(comments_all_pred_structure))
#Creates a summary counts table that compares hand-coded counts to Bayes-estimated counts.
final.tally <- table("Hand Coded Subject"=tallymatrix$subject_label, "Hand Coded Structure"=tallymatrix$structure_label)
final.tally2 <- table("Bayes Predicted Subject"=tallymatrix$subject_predicted, "Bayes Predicted Structure"=tallymatrix$structure_predicted)
#Write out the final data from the run to a CSV file. For reference, include the randomizer value used in the name of the file.
write.csv(tallymatrix, file = "SandS_Matrix_Randomizer_123.csv")
#Print to console the hand coded versus estimated counts as a 2-way contingency table of subject vs. structure.
final.tally
prop_tally<-prop.table(final.tally)
print(prop_tally)
print(line)
print(line)
final.tally2
prop_tally2<-prop.table(final.tally2)
print(prop_tally2)
```
This block is an **alternative method** to get the same data tables out using ```tidy``` format. It seems much less direct, but may be over-written.
```{r tidy_prediction_table}
#Use broom::tidy to reorganize into a data frame.
tidy_tally1 <-tidy(final.tally)
tidy_tally2 <-tidy(final.tally2)
tidy_tally3 <-tidy(final.tally)
tidy_tally4 <-tidy(final.tally2)
#Use mutate to add a proportion of total column. And yes, it calculates based on subgroups rather than whole column.
prop_tally1<-mutate(tidy_tally1, prop = Freq/sum(Freq))
prop_tally1<-select(prop_tally1, 1,2,4)
prop_tally2<-mutate(tidy_tally2, prop = Freq/sum(Freq))
prop_tally2<-select(prop_tally2, 1,2,4)
prop_tally3<-mutate(tidy_tally3, prop = Freq/sum(Freq))
prop_tally3<-select(prop_tally3, 1,2,4)
prop_tally4<-mutate(tidy_tally4, prop = Freq/sum(Freq))
prop_tally4<-select(prop_tally4, 1,2,4)
#Spread tidy data into a readable table
spread(prop_tally1,key=Hand.Coded.Structure, value=prop)
spread(prop_tally2,key=Bayes.Predicted.Structure, value=prop)
spread(prop_tally3,key=Hand.Coded.Subject, value=prop)
spread(prop_tally4,key=Bayes.Predicted.Subject, value=prop)
#Calculate the sum of values for each group.
proptallysum1<-prop_tally1 %>% group_by(Hand.Coded.Subject) %>% summarize(total=sum(prop))
proptallysum1
proptallysum2<-prop_tally2 %>% group_by(Bayes.Predicted.Subject) %>% summarize(total=sum(prop))
proptallysum2
proptallysum3<-prop_tally3 %>% group_by(Hand.Coded.Structure) %>% summarize(total=sum(prop))
proptallysum3
proptallysum4<-prop_tally4 %>% group_by(Bayes.Predicted.Structure) %>% summarize(total=sum(prop))
proptallysum4
```
#Comparing Coded and Predicted Categories
These code blocks compare the hand-coded versus NB-predicted frequencies. There are separate comparisons for the Subject and Structure categories, and both tabular and graphical comparisons.
```{r}
colnames(proptallysum1)[colnames(proptallysum1)=="Hand.Coded.Subject"] <- "Subject"
colnames(proptallysum1)[colnames(proptallysum1)=="total"] <- "Hand"
colnames(proptallysum2)[colnames(proptallysum2)=="Bayes.Predicted.Subject"] <- "Subject"
colnames(proptallysum2)[colnames(proptallysum2)=="total"] <- "Bayes"
propmerge12<-merge.data.frame(proptallysum1,proptallysum2, all=TRUE)
propmerge12
```
```{r}
blendSubject = ggplot()+
geom_col(data=proptallysum1, aes(`Subject`, `Hand`, group=1), color="blue", fill="cornflowerblue", alpha=0.5, width=0.4) +
geom_col(data=proptallysum2, aes(`Subject`, `Bayes`, group=1), color="brown", fill="indianred4", alpha=0.5, width=0.4,position = position_nudge(x=0.12))+
xlab("Subject Groups")+
ylab("Fraction of All Comments")+
ggtitle("Comment Subject Frequencies: Coded (blue) vs Predicted (red)")
print(blendSubject)
```
```{r}
colnames(proptallysum3)[colnames(proptallysum3)=="Hand.Coded.Structure"] <- "Structure"
colnames(proptallysum3)[colnames(proptallysum3)=="total"] <- "Hand"
colnames(proptallysum4)[colnames(proptallysum4)=="Bayes.Predicted.Structure"] <- "Structure"
colnames(proptallysum4)[colnames(proptallysum4)=="total"] <- "Bayes"
propmerge34<-merge.data.frame(proptallysum3,proptallysum4, all=TRUE)
propmerge34
```
```{r}
blendStructure = ggplot()+
geom_col(data=proptallysum3, aes(`Structure`, `Hand`,group=1), color="springgreen4", fill="springgreen3", alpha=0.5, width=0.4) +
geom_col(data=proptallysum4, aes(`Structure`, `Bayes`, group=1), color="goldenrod3", fill="goldenrod2", alpha=0.5, width=0.4, position = position_nudge(x=0.12))+
xlab("Structure Groups")+
ylab("Fraction of All Comments")+
ggtitle("Comment Structure Frequencies: Coded (green) vs Predicted (orange)")
print(blendStructure)
```
####\