-
Notifications
You must be signed in to change notification settings - Fork 0
/
05_measurement.qmd
477 lines (368 loc) · 21.3 KB
/
05_measurement.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
---
title: "Measuring Workforce Outcomes"
author: "Corey Sparks and Benjamin Feder"
subtitle: "Module 2: Notebook 5"
output:
html_document:
css: "training.css"
toc: true
toc_depth: 3
toc_float:
collapsed: no
smooth_scroll: yes
number-sections: true
number-offset: 0
format:
html:
embed-resources: true
df-print: paged
---
```{css echo=FALSE}
#HIDE THIS CHUNK FROM KNITTED OUTPUT
h2 {margin: 2m 0 !important;}
details {
margin-left: 4em;
margin-bottom: .5rem;
}
summary {
margin-left: -2em;
}
```
```{r setup, include=FALSE}
#HIDE THIS CHUNK FROM KNITTED OUTPUT
knitr::opts_chunk$set(include=TRUE, echo=TRUE, eval = FALSE, warning = FALSE, fig.align = 'center') #results='hide') # needs to delete results='hide'
```
```{r, include=FALSE, echo=FALSE}
#HIDE THIS CHUNK FROM KNITTED OUTPUT
## Load libraries
library(RJDBC) # For connecting to the database
# For data manipulation/visualization
library(tidyverse)
# For faster date conversions
library(lubridate)
library(dbplyr)
library(odbc)
library(DBI)
library(RPostgreSQL)
source("P:/pr-ci-training/connection_setup//adrf_redshift.r")
con <- adrf_redshift(usertype = "CI")
```
# Introduction
Welcome to the fifth notebook for Module 2 of this course, covering the **construction of employment outcomes for a cohort**. Up to this point in the course, most of our work with the Arkansas data has been focused on project scoping and development, culminating in the development of our cohort analytic frame and an initial longitudinal analysis in the previous [notebook](P:/tr-enrollment-to-employment/ETA%20Class%201/Notebooks/04_longitudinal_analysis.html). In this notebook, we will extend our longitudinal analysis to track employment outcomes after introducing a new component of the class data model based on the Unemployment Insurance (UI) wage records. This analysis will help us address our same research topic consistent throughout the notebooks, which is aimed at identifying promising pathways for a specific set of TANF-enrolled individuals before COVID-imposed restrictions were enforced in the state.
As we've discussed in class, when we are analyzing administrative data not developed for research purposes - such as the UI wage records - it is important to create new measures that will help us answer our policy-relevant questions. When we say "measure", we usually mean a **person-level variable** that we can use to compare outcomes for individuals in our cohort. Creating measures at the person level allows us to compare outcomes for different subgroups of individuals based on their characteristics and experiences. In fact, we have already developed this workflow in the previous notebook when we measured TANF reenrollment, and then subsequently investigated its distribution by our simultaneous co-enrollment indicator!
Here, we will broaden our set of measures to focus on a suite of person-level outcomes that stem directly from the recent 2023 Fiscal Responsibility Act, which in part mandates the measurement of specific employment outcomes for TANF participants. These outcome measures are quite similar to the WIOA-based standards and will help describe our cohort's employment experience and workforce outcomes after exiting TANF. For reference, we will construct the circled outcomes in the diagram below for our cohort of interest:
1. Employment Rate 2nd Quarter After Exit
2. Employment Rate 2nd and 4th Quarter After Exit
3. Median Earnings in the 2nd Quarter After Exit
![](images/outcomes.png)
While your group may choose to generate different measures based on your research question, the code displayed here should provide a good starting place for thinking about how to best create and analyze workforce-focused, person-level outcome measures.
# Technical setup
As in previous notebooks, we will reintroduce the code required to set up our R environment to connect to the proper database and load certain packages. If you plan on running the SQL code separately, you can copy and paste the code from the SQL cells into your own .sql script in DBeaver. Instructions for creating a new .sql script are available in the `Technical Setup` section of the first Foundations Module [notebook](P:/tr-enrollment-to-employment/ETA%20Class%201/Foundations%20Module%20Materials/notebook_1.html#2_Technical_setup).
**If you would like to view the material to establish your own R environment for running the code displayed in this notebook, you can expand the following "R Environment Setup" section by clicking on its heading.**
::: {.callout collapse="true"}
## R Environment Setup
## Load Libraries {.unnumbered}
We will start by loading necessary packages not readily available in the base R setup.
> As a reminder, every time you create a new R file, you should copy and run the following code snippet.
```{r}
options(scipen = 999) # avoid scientific notation
library(RJDBC)
library(tidyverse)
library(lubridate) # working with dates
library(dbplyr)
```
## Establish Database Connection {.unnumbered}
Now, the following set of commands will set up a connection to the Redshift database:
```{r eval=FALSE}
dbusr=Sys.getenv("DBUSER")
dbpswd=Sys.getenv("DBPASSWD")
url <- "jdbc:redshift:iam://adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;loginToRp=urn:amazon:webservices:govcloud;ssl=true;AutoCreate=true;idp_host=adfs.adrf.net;idp_port=443;ssl_insecure=true;plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider"
driver <- JDBC(
"com.amazon.redshift.jdbc42.Driver",
classPath = "C:\\drivers\\redshift_withsdk\\redshift-jdbc42-2.1.0.12\\redshift-jdbc42-2.1.0.12.jar",
identifier.quote="`"
)
con <- dbConnect(driver, url, dbusr, dbpswd)
```
## `.Renviron` File {.unnumbered}
For this code to work, you need to have an `.Renviron` file in your user folder (i.e. `U:\\John.Doe.P00002`) that contains the following:
```
DBUSER='adrf\John.Doe.P00002'
DBPASSWD='xxxxxxxxxxxx'
```
where `John.Doe.P00002` is replaced with your username and `xxxxxxxxxx` is replaced with your password (both still in quotes!). `DBUSER` should now end with `.T00113`.
A detailed video from the Foundations Module, "Introduction to RStudio," demonstrating how to create an .Renviron file is available on the Resources page on class website in the subsection "Quick Links."
:::
# Data Model: Workforce Records
Until now, we have not encountered any tables in the dimensional data model containing references to workforce experiences. Thankfully, we will introduce our final fact table within the data model, `fact_person_ui_wage` also located in the `tr_e2e` schema, which captures wages reported for each person in each quarter they are employed. This fact table was created directly from Arkansas' UI wage records, aggregating person/employer/quarter data to a person/quarter combination. While the UI wage records are not a perfect population of everyone employed in Arkansas, it does capture roughly 95 percent of private non-farm wage and salary employment in the state. A refresher of the raw UI wage data is available in the [first class notebook](P:/tr-enrollment-to-employment/ETA%20Class%201/Notebooks/01_EDA.html#arkansas-ui-wage-data-ds_ar_dws.ui_wages_lehd).
The `tr_e2e.fact_person_ui_wage` table contains identical identifiers as the other program-focused fact tables. This will make it easy to link this to our cohort, which was initially built on the raw TANF data and then linked to the program fact tables. As visualized in the diagram below (again available by selecting the "ER Diagram" option after clicking on the specific table in the Database Navigator of DBeaver), this fact table links to two of the three dimension tables leveraged in previous notebooks:
- Person dimension, storing information on the unique collection of persons available in the data, merging person-level attributes from a variety of sources, resulting in a "golden record" with a higher completeness than is available in individual sources
- Time dimension, storing all possible values for a period of time (day, week, quarter, month, year) across a long period and allows for easy cross-referencing across different periods
You may notice the lack of `program_key` identifier to link to the `dim_program` table - this is by design, as this fact table does not contain any information on program participation.
![](images/wage_fact_er.png)
We can see a subset of this table and confirm that data is recorded at the person/quarter level with the following query:
::: panel-tabset
## SQL Query
```{sql}
SELECT *
FROM tr_e2e.fact_person_ui_wage
LIMIT 5
```
## `dbplyr` query
```{r}
con %>%
tbl(in_schema(schema = "tr_e2e",
table = "fact_person_ui_wage")) %>%
head(5)
```
:::
Additionally, we can validate the presence of the surrogate columns linking to two of our dimension tables: `person_key` and `year_quarter_key`. Beyond this information, the table just contains a unique row identifier for the table, `person_ui_wage_id`, which does not link to any other table in a meaningful fashion.
# Linking Cohort to Workforce Fact
Before we can begin to build out the aforementioned employment outcomes for our cohort, we need to link our cohort to this new fact table. Recall the structure and contents of our cohort table, which we saved in the `tr_e2e` schema as `nb_cohort`:
::: panel-tabset
## SQL Query
```{sql}
SELECT *
FROM tr_e2e.nb_cohort
LIMIT 5
```
## `dbplyr` query
```{r}
con %>%
tbl(in_schema(schema = "tr_e2e",
table = "nb_cohort")) %>%
head(5)
```
:::
Since we have already linked the original TANF data to the data model in developing our cohort, we have variables such as `person_key` and `exit_year_quarter_key`, which we can use to link directly to our workforce fact table. Since the workforce fact table only contains quarterly observations for those present in the UI wage records, we must use a **left join** if we want to preserve observations for individuals in our cohort who were never present in the UI wage records. Additionally, while we can look at workforce experiences over the entire range of available wage records, we will take a more limited view by only bringing in employment records within 5 quarters of exit. We will read this resulting table into R for future use:
> The code in the SQL query can be run in DBeaver by pasting just the code inside the quotations after `qry <-`.
::: panel-tabset
## SQL Query
```{r}
qry <- "
SELECT nc.*,
wage.year_quarter_key as wage_year_quarter_key,
wage.ui_quarterly_wages,
--CAN CREATE NEW VARIABLE FOR WAGE QUARTER RELATIVE TO TANF EXIT
wage.year_quarter_key - nc.exit_year_quarter_key AS relative_quarter
FROM tr_e2e.nb_cohort nc
LEFT JOIN tr_e2e.fact_person_ui_wage wage ON
--include ui_quarterly_wages > 0 in join clause to maintain structure of left join
(nc.person_key = wage.person_key AND wage.ui_quarterly_wages > 0 AND
--add additional clause to limit wage record focus for within 5 quarters of exit
nc.exit_year_quarter_key <= wage.year_quarter_key + 5 AND nc.exit_year_quarter_key >= wage.year_quarter_key - 5
)
ORDER BY nc.person_key, relative_quarter
"
cohort_emp <- dbGetQuery(con, qry)
head(cohort_emp)
```
## `dbplyr` query
```{r}
nc <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "nb_cohort"))
wage <- con %>%
tbl(in_schema(schema = "tr_e2e",
table = "fact_person_ui_wage")) %>%
filter(ui_quarterly_wages > 0) %>%
select(person_key, year_quarter_key, ui_quarterly_wages) %>%
rename(wage_year_quarter_key = year_quarter_key) %>%
# cannot join on computer variables, so need to create before join
mutate(
wage_year_quarter_key_high = wage_year_quarter_key + 5,
wage_year_quarter_key_low = wage_year_quarter_key - 5
)
cohort_emp <- nc %>%
left_join(
wage,
# join_by supports inequality conditions (ex. greater than or equal to)
join_by(person_key, exit_year_quarter_key <= wage_year_quarter_key_high, exit_year_quarter_key >= wage_year_quarter_key_low)
) %>%
mutate(
relative_quarter = wage_year_quarter_key - exit_year_quarter_key
) %>%
select(-c(wage_year_quarter_key_low, wage_year_quarter_key_high)) %>%
arrange(person_key, relative_quarter) %>%
collect()
head(cohort_emp)
```
:::
We can confirm the integrity of the left join by confirming that we still have the same amount of unique individuals as we had in our [cohort](P:/tr-enrollment-to-employment/ETA%20Class%201/Notebooks/04_longitudinal_analysis.html#defining-characteristics-applying-them-to-our-data):
```{r}
cohort_emp %>%
summarize(
n_rows = n(),
n_ppl = n_distinct(person_key),
)
```
We now have more rows than individuals, which should align with our understanding of the left join - since we matched to all quarterly employment observations in the fact table within 5 quarters of TANF exit, we might have more than one observation per individual if they appeared in the UI wage records with positive earnings in multiple quarters. We can look at this distribution by `relative_quarter`:
```{r, max.print = 15}
cohort_emp %>%
group_by(relative_quarter) %>%
summarize(
n_rows = n(),
n_ppl = n_distinct(person_key)
)
```
As intended, there should be one row per individual in a given quarter. There also appear to be `NA` `relative_quarter` values. Due to the logic of our left join, these observations will arise in situations where the individual in the cohort did not appear with positive earnings in the UI wage records in a single quarter within this 11 quarter time frame.
Visually, we can observe trends in employment over time with a line graph. For reference, we will add in a dotted red line denoting the quarter of TANF exit:
```{r}
cohort_emp %>%
group_by(relative_quarter) %>%
summarize(
n_ppl = n_distinct(person_key)
) %>%
ungroup() %>%
ggplot(aes(x = relative_quarter, y = n_ppl)) +
geom_line() +
# add vertical red dotted line at TANF exit (relative_quarter = 0)
geom_vline(
xintercept = 0,
linetype = "dotted",
color = "red"
)
```
::: callout-important
## Checkpoint
If you were to match your cohort, or analytic frame, to the available wage records, what would you expect to see? Note this in your project template and discuss with your group if you plan to leverage the employment data.
Keep in mind that the 5th quarter after exit for our cohort corresponds to 2020Q2, when COVID-imposed restrictions began affecting employment opportunities in Arkansas.
:::
# Employment Measures
With our linked cohort-employment data frame developed, we can further explore our cohort's patterns of work and earnings through our three measures.
> Note: Even though we will bring in our simultaneous co-enrollment indicator later, our cohort is defined by TANF exit date, not common exit. If you are interested in applying common exit in co-enrollment situations, you can do so by modifying the code in the cohort creation (longitudinal analysis) notebook.
## Employment Rate - 2nd Quarter After Exit
Since our data frame `cohort_emp` contains at least one record for everyone in our original cohort, we can isolate the denominator of our employment rate calculations - original cohort size - and bring it back in after finding the number of individuals employed in the 2nd quarter after TANF exit:
```{r}
# find denominator
total_cohort <- cohort_emp %>%
summarize(
n_ppl = n_distinct(person_key)
) %>%
pull(n_ppl)
# find numerator as n_distinct(person_key) and use to find employment rate
cohort_emp %>%
filter(relative_quarter == 2) %>%
summarize(
quarter_2_emp_rate = n_distinct(person_key)*100/total_cohort
)
```
## Employment Rate - 2nd and 4th Quarter After Exit
We can slightly modify our approach to find the percentage of individuals in our cohort work-eligible at exit in unsubsidized employment during both their 2nd and 4th quarters after exit, according to the UI wage records. To do so, after *filtering* for all observations in the 2nd and 4th quarters after TANF exit, we will isolate those who appear in both quarters, as indicated by the presence of multiple observations in the filtered data frame.
> Note: We have already accounted for the work-eligible restriction in our initial cohort construction.
```{r}
cohort_emp %>%
filter(relative_quarter %in% c(2, 4)) %>%
group_by(person_key) %>%
summarize(
n = n()
) %>%
ungroup() %>%
filter(n == 2) %>%
summarize(
quarter_2_and_4_emp_rate = 100*n_distinct(person_key)/total_cohort
)
```
As a reference point, it might be helpful to see the 4th quarter employment rate in isolation. This calculation is quite similar to that of the 2nd quarter employment rate:
```{r}
# find numerator as n_distinct(person_key) and use to find employment rate
cohort_emp %>%
filter(relative_quarter == 4) %>%
summarize(
quarter_4_emp_rate = n_distinct(person_key)*100/total_cohort
)
```
Interesting! So according to our outcome measures here, it appears as though there is some transition between those employed in their 2nd and 4th quarters after exit. Let's look further investigate this measure by our joint race/ethnicity variable, which requires a slightly different technique due to the groupings in our denominator:
```{r}
# recalculate denominator of number of people in cohort by race/ethnicity grouping
cohort_by_eth <- cohort_emp %>%
group_by(eth_recode_person) %>%
summarize(
n_total = n_distinct(person_key)
) %>%
ungroup()
cohort_emp %>%
filter(relative_quarter %in% c(2, 4)) %>%
# include eth_recode_person in group_by
group_by(person_key, eth_recode_person) %>%
summarize(
n = n()
) %>%
ungroup() %>%
filter(n == 2) %>%
group_by(eth_recode_person) %>%
summarize(
n_ppl = n_distinct(person_key)
) %>%
ungroup() %>%
# join to data frame with total counts to bring in denominaator
inner_join(cohort_by_eth, by = "eth_recode_person") %>%
mutate(
quarter_2_and_4_emp_rate = 100*n_ppl/n_total
)
```
While further investigation is required into the underlying dynamics that may be driving these outcomes, we can observe some interesting results here.
## Median Earnings - 2nd Quarter After Exit
We can also get a basic understanding of earnings with our final measure, which are tracked in `cohort_emp` as `ui_quarterly_wages`.
```{r}
cohort_emp %>%
mutate(
ui_quarterly_wages = as.numeric(ui_quarterly_wages)
) %>%
filter(relative_quarter == 2) %>%
summarize(
quarter_2_median_earnings = median(ui_quarterly_wages)
)
```
In interpreting this value, keep in mind that this median is calculated based on *only those who appear in the UI wage records*. Any individual in our cohort with missing or zero earnings according to the UI wage records in their second quarter after exit is not included in this calculation. We will briefly discuss techniques for dealing with missing data in the Missingness and Inference lecture. In the meantime, it is helpful to cross-reference this with our first measure tracking 2nd quarter post-exit employment rate.
We can extend this analysis by analyzing median earnings in the 2nd quarter after exit by our simultaneous co-enrollment indicator:
```{r}
cohort_emp %>%
mutate(
ui_quarterly_wages = as.numeric(ui_quarterly_wages)
) %>%
filter(relative_quarter == 2) %>%
group_by(co_enroll_ind) %>%
summarize(
quarter_2_median_earnings = median(ui_quarterly_wages)
)
```
As we just discussed, it may be helpful to supplement this measure with our first one looking at the employment rate at this time:
```{r}
# recalculate denominator of number of people in cohort by co-enrollment grouping
cohort_by_co_enroll <- cohort_emp %>%
group_by(co_enroll_ind) %>%
summarize(
n_total = n_distinct(person_key)
) %>%
ungroup()
cohort_emp %>%
mutate(
ui_quarterly_wages = as.numeric(ui_quarterly_wages)
) %>%
filter(relative_quarter == 2) %>%
group_by(co_enroll_ind) %>%
# need to find numerator of number of people first before bringing in denominator
summarize(
quarter_2_median_earnings = median(ui_quarterly_wages),
n_ppl = n_distinct(person_key)
) %>%
inner_join(cohort_by_co_enroll, by = "co_enroll_ind") %>%
mutate(
quarter_2_emp_rate = 100*n_ppl/n_total
) %>%
# ignore unnecessary columns
select(-c(n_ppl, n_total))
```
::: callout-important
## Checkpoint
Does anything surprise you about these results? Compare this to our [TANF re-enrollment outcomes](P:/tr-enrollment-to-employment/ETA%20Class%201/Notebooks/04_longitudinal_analysis.html#tanf-reenrollment) by `co_enroll_ind`. What further investigation would you like to carry out here?
Applying this to your own project, in building employment outcomes, would it be worthwhile including some sort of employment or job stability measure? Or perhaps something else? Jot these ideas down in your project template.
:::
# Next steps: Applying this notebook to your project
Hopefully, by this point in the notebook, you have been inspired to apply some of these measures to your own cohort and overall project. You are encouraged to adapt and apply the base code available in this notebook to your own work. In the realm of enrollment to employment trajectories, there is a wealth of potential measures that can be created by linking the program participation information and UI wage records, and we encourage you to think through the different ways you might be able to create new measures and proxies to help answer your primary research question.
# Citations
Wisconsin Applied Data Analytics Training Program - Notebook 4: Measurement (citation to be updated)
```{r, echo=FALSE, results='hide'}
dbDisconnect(con)
```