forked from WFU-TLC/default_website
-
Notifications
You must be signed in to change notification settings - Fork 0
/
dataimport.Rmd
201 lines (160 loc) · 8.82 KB
/
dataimport.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
---
title: "Data Import & Checking"
author: "Dan Johnson"
date: "2/22/2019"
output: html_document
bibliography: library.bib
biblio-style: apalike
---
```{r, child="_setup.Rmd"}
```
#Overview
This page describes the target dataset, data pre-checks and import process, post-import validation, and how selected elements are recoded for analysis.
The dataset being used to develop this project is an Excel table containing 10,808 TA comments extracted from student lab reports submitted during Spring 2018. Each TA comment is stored as a separate row of the table, and is a mixed alphanumeric string of one or more words, numbers, and punctuation. Other columns record unique report, student, and TA IDs; grade assigned to the report; other standardized information about the original report from which the comment was extracted; and the hand-coded subject and structure of each comment.
####\
#Pre-Check and Import
Prior to import, review the data table **in Excel** and check that:
1. Student and TA names have been recoded using pre-assigned anonymous IDs.
2. Vocabulary terms used in coding columns matches comment codebook criteria described at https://adanieljohnson.github.io/default_website/codebook.html.
3. Data table headers match those listed below.
Column Number|Column Name|Accepted Values or Format
------------|----------------------|-----------------------
1|unique.record|Sp18.00001 (semester and year, then order in dataset)
2|report.id|R_0fanqHFhaY2yLT7 or similar
3|sort|remnant of coding
4|report.title (empty)|string
5|student|Std_drVZ2x1W (format is Std\_nnn)
6|course|113, 114, or 214
7|ta |TA_bc8EVX09 (format is TA\_nnn)
8|lab|113: allocation, betta; 114: manduca, frog; 214: enzymes (old) photosynthesis, physarum, cyanophage (new)
9|tag|first, second
10|type.TA|submission, revision
11|grade.TA|A, B, C, F
12|grading.time|time in minutes
13|Rank|remnant of coding
14|hypothesis.ok|Yes, No
15|data.ok|Yes, No
16|citation.ok|Yes, No
17|interpretation.ok|Yes, No
18|organization.ok|Yes, No
19|techflaws.ok|Yes, No
20|writing.ok|Yes, No
21|comments.incorporated|Yes, Somewhat, No
22|ta.comment|text string
23|code.subject|values in Ref. Tables
24|code.structure|values in Ref. Tables
25|code.locus|values in Ref. Tables
26|code.scope|values in Ref. Tables
27|code.tone|values in Ref. Tables
28|code.notes|values in Ref. Tables
Data were anonymized in Excel, replacing names, email addresses, and other confidential information was replaced with unique randomized alphanumeric strings.
Once all data are in correct format, export Excel file to CSV with name formatted as **"coded_full_comments_dataset_SemYear.csv"**
####\
#Target Dataset
The development dataset is an Excel table of 10,808 rows. One column has "TA comments" extracted from student lab reports in Spring 2018. Each TA comment is stored as a separate row of the table, and is a mixed alphanumeric string of one or more words, numbers, and punctuation. Other columns record unique report, student, and TA IDs; grade assigned to the report; other standardized information about the original report from which the comment was extracted; and the hand-coded subject and structure of each comment.
Data were anonymized in Excel prior to conversion to CSV format. Names, email addresses, and other confidential information was replaced with unique randomized alphanumeric strings.
####\
#Pre-Checks Before Import
Prior to import, review the data table **in Excel** and check that:
1. Student and TA names have been recoded using pre-assigned anonymous IDs.
2. Vocabulary terms used in coding columns matches comment codebook criteria described at https://adanieljohnson.github.io/default_website/codebook.html.
3. Data table headers match those listed below.
Column Number|Column Name|Accepted Values or Format
------------|----------------------|-----------------------
1|unique.record|Sp18.00001 (semester and year, then order in dataset)
2|report.id|R_0fanqHFhaY2yLT7 or similar
3|sort|remnant of coding
4|report.title (empty)|string
5|student|Std_drVZ2x1W (format is Std\_nnn)
6|course|113, 114, or 214
7|ta |TA_bc8EVX09 (format is TA\_nnn)
8|lab|113: allocation, betta; 114: manduca, frog; 214: enzymes (old) photosynthesis, physarum, cyanophage (new)
9|tag|first, second
10|type.TA|submission, revision
11|grade.TA|A, B, C, F
12|grading.time|time in minutes
13|Rank|remnant of coding
14|hypothesis.ok|Yes, No
15|data.ok|Yes, No
16|citation.ok|Yes, No
17|interpretation.ok|Yes, No
18|organization.ok|Yes, No
19|techflaws.ok|Yes, No
20|writing.ok|Yes, No
21|comments.incorporated|Yes, Somewhat, No
22|ta.comment|text string
23|code.subject|values in Ref. Tables
24|code.structure|values in Ref. Tables
25|code.locus|values in Ref. Tables
26|code.scope|values in Ref. Tables
27|code.tone|values in Ref. Tables
28|code.notes|values in Ref. Tables
Once all data are in correct format, export Excel file to CSV with name formatted as **"coded_full_comments_dataset_SemYear.csv"**
####\
#Initial Data Import
```{r Libraries_1}
# Call in tidyverse, other required libraries.
library(tidyverse)
library(tidytext)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
```
```{r}
#This block reads in full CSV into a starting dataframe named "base_data".
#Subsequent analyses are performed on extracted subsets of "base_data".
base_data <- read_csv(file='data/coded_full_comments_dataset_Spring18anonVERB.csv')
```
If dataset imports correctly, there will be a dataframe of ~11,000 rows and 28 columns named as listed in the **Data Pre-Checks** table above.
####\
#Post-Import Checks
Given some data points may be entered incorrectly by instructors or students, these post-import data checks ensure data are properly coded and columns contain valid entries only. (This process can be automated using the ```janitor``` data cleaning package.)
```{r}
#"Course" column should only have entries that correspond to 113, 114, 214, NA.
course<-unique(base_data$course)
```
```{r}
#Entries in "TAs" column should match the anonymous IDs that correspond to the TAs assigned to the relevant courses in the semester being analyzed.
#If there are incorrect or extra anonymous IDs in this list, the mostly likely source of the error is improper de-identification of the Excel file prior to import.
ta<-unique(base_data$ta)
```
```{r}
#"Lab" should only have entries that match the allowed limited keywords in codebook.
lab<-unique(base_data$lab)
```
```{r}
#It is impractical to validate all randomized identifiers. Instead look at the NUMBER of unique anonymous student IDs. Quantity should be within 10% of (but not more than) the combined enrollment of the 3 target courses for the semester.
student<-unique(base_data$student)
```
```{r}
#Check that the list of "code.subject" topics here matches allowed terms in codebook.
subject<-unique(base_data$code.subject)
```
```{r}
#Check that the list of "code.structure" terms extracted from the dataset and displayed here matches allowed terms in codebook.
structure<-unique(base_data$code.structure)
```
####\
##Post-Import Processing
Comment labels used in the project codebook contain punctuation and spacing that complicates subsequent analysis. Rather than modify the project codebook or external processes, it is simpler to recode the problematic labels after import. For example, the following code extracts a subset of rows from the full dataset, reduces those to 3 columns, and renames the **comment.subject** column as **subject**. Four problematic code labels in the subject column are replaced with simpler terms.
```{r}
#Read in TA comments from CSV file.
base_data <- read_csv(file='data/coded_full_comments_dataset_Spring18anon.csv')
#Select rows representing the sub-groups to compare.
comments_subset <- filter(base_data,code.subject=="1. Basic Criteria"|code.subject=="2. Writing Quality"|code.subject=="3. Technical and Scientific"|code.subject=="4. Logic and Thinking")
#Reduce larger dataframe to 3 required columns of data (subject, structure, and text of comment), and put columns in order needed.
comments_raw <- comments_subset %>% select(23,24,22)
#Rename the columns.
names(comments_raw)[1] <- "subject"
names(comments_raw)[2] <- "structure"
names(comments_raw)[3] <- "text"
#Simplify coding terms used in "subject" column.
comments_raw[,1] <- ifelse(comments_raw[,1] == "1. Basic Criteria","1_basic", ifelse(comments_raw[,1] == "2. Writing Quality","2_writing", ifelse(comments_raw[,1] == "3. Technical and Scientific","3_technical", ifelse(comments_raw[,1] == "4. Logic and Thinking","4_logic",99))))
#Change "subject" element from character to a factor for analysis.
comments_raw$subject <- factor(comments_raw$subject)
str(comments_raw$subject)
table(comments_raw$subject)
```
Most analyses will use a similar extraction and post-processing code block. In practice, it is easier to modify the post-processing code than to maintain multiple separate copies of the original CSV dataset for different analyses.
####\