05_measurement.qmd

---
title: "Measuring Workforce Outcomes"
author: "Corey Sparks and Benjamin Feder"
subtitle: "Module 2: Notebook 5"
output:
  html_document:
    css: "training.css"
toc: true
toc_depth: 3
toc_float:
      collapsed: no
      smooth_scroll: yes
number-sections: true
number-offset: 0
format:
  html:
    embed-resources: true
    df-print: paged
---

```{css echo=FALSE}

#HIDE THIS CHUNK FROM KNITTED OUTPUT
h2 {margin: 2m 0 !important;} 

details {
  margin-left: 4em;
  margin-bottom: .5rem;
}

summary { 
  margin-left: -2em;
}
```

```{r setup, include=FALSE}
#HIDE THIS CHUNK FROM KNITTED OUTPUT

knitr::opts_chunk$set(include=TRUE, echo=TRUE, eval = FALSE,  warning = FALSE, fig.align = 'center')  #results='hide') # needs to delete results='hide'
```

```{r, include=FALSE, echo=FALSE}
#HIDE THIS CHUNK FROM KNITTED OUTPUT

## Load libraries
library(RJDBC)     # For connecting to the database
# For data manipulation/visualization
library(tidyverse)
# For faster date conversions
library(lubridate)
library(dbplyr)
library(odbc)
library(DBI)
library(RPostgreSQL)
source("P:/pr-ci-training/connection_setup//adrf_redshift.r")

con <- adrf_redshift(usertype = "CI")

```

# Introduction

Welcome to the fifth notebook for Module 2 of this course, covering the **construction of employment outcomes for a cohort**. Up to this point in the course, most of our work with the Arkansas data has been focused on project scoping and development, culminating in the development of our cohort analytic frame and an initial longitudinal analysis in the previous [notebook](P:/tr-enrollment-to-employment/ETA%20Class%201/Notebooks/04_longitudinal_analysis.html). In this notebook, we will extend our longitudinal analysis to track employment outcomes after introducing a new component of the class data model based on the Unemployment Insurance (UI) wage records. This analysis will help us address our same research topic consistent throughout the notebooks, which is aimed at identifying promising pathways for a specific set of TANF-enrolled individuals before COVID-imposed restrictions were enforced in the state.

As we've discussed in class, when we are analyzing administrative data not developed for research purposes - such as the UI wage records - it is important to create new measures that will help us answer our policy-relevant questions. When we say "measure", we usually mean a **person-level variable** that we can use to compare outcomes for individuals in our cohort. Creating measures at the person level allows us to compare outcomes for different subgroups of individuals based on their characteristics and experiences. In fact, we have already developed this workflow in the previous notebook when we measured TANF reenrollment, and then subsequently investigated its distribution by our simultaneous co-enrollment indicator!

Here, we will broaden our set of measures to focus on a suite of person-level outcomes that stem directly from the recent 2023 Fiscal Responsibility Act, which in part mandates the measurement of specific employment outcomes for TANF participants. These outcome measures are quite similar to the WIOA-based standards and will help describe our cohort's employment experience and workforce outcomes after exiting TANF. For reference, we will construct the circled outcomes in the diagram below for our cohort of interest:

1.  Employment Rate 2nd Quarter After Exit
2.  Employment Rate 2nd and 4th Quarter After Exit
3.  Median Earnings in the 2nd Quarter After Exit

![](images/outcomes.png)

While your group may choose to generate different measures based on your research question, the code displayed here should provide a good starting place for thinking about how to best create and analyze workforce-focused, person-level outcome measures.

# Technical setup

As in previous notebooks, we will reintroduce the code required to set up our R environment to connect to the proper database and load certain packages. If you plan on running the SQL code separately, you can copy and paste the code from the SQL cells into your own .sql script in DBeaver. Instructions for creating a new .sql script are available in the `Technical Setup` section of the first Foundations Module [notebook](P:/tr-enrollment-to-employment/ETA%20Class%201/Foundations%20Module%20Materials/notebook_1.html#2_Technical_setup).

**If you would like to view the material to establish your own R environment for running the code displayed in this notebook, you can expand the following "R Environment Setup" section by clicking on its heading.**

::: {.callout collapse="true"}
## R Environment Setup

## Load Libraries {.unnumbered}

We will start by loading necessary packages not readily available in the base R setup.

> As a reminder, every time you create a new R file, you should copy and run the following code snippet.

```{r}
options(scipen = 999) # avoid scientific notation
library(RJDBC)
library(tidyverse) 
library(lubridate) # working with dates
library(dbplyr)
```

## Establish Database Connection {.unnumbered}

Now, the following set of commands will set up a connection to the Redshift database:

```{r eval=FALSE}
dbusr=Sys.getenv("DBUSER")
dbpswd=Sys.getenv("DBPASSWD")

url <- "jdbc:redshift:iam://adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;loginToRp=urn:amazon:webservices:govcloud;ssl=true;AutoCreate=true;idp_host=adfs.adrf.net;idp_port=443;ssl_insecure=true;plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider"

driver <- JDBC(
  "com.amazon.redshift.jdbc42.Driver",
  classPath = "C:\\drivers\\redshift_withsdk\\redshift-jdbc42-2.1.0.12\\redshift-jdbc42-2.1.0.12.jar",
  identifier.quote="`"
)

con <- dbConnect(driver, url, dbusr, dbpswd)
```

## `.Renviron` File {.unnumbered}

For this code to work, you need to have an `.Renviron` file in your user folder (i.e. `U:\\John.Doe.P00002`) that contains the following:

```         
DBUSER='adrf\John.Doe.P00002'
DBPASSWD='xxxxxxxxxxxx'
```

where `John.Doe.P00002` is replaced with your username and `xxxxxxxxxx` is replaced with your password (both still in quotes!). `DBUSER` should now end with `.T00113`.

A detailed video from the Foundations Module, "Introduction to RStudio," demonstrating how to create an .Renviron file is available on the Resources page on class website in the subsection "Quick Links."
:::

# Data Model: Workforce Records

Until now, we have not encountered any tables in the dimensional data model containing references to workforce experiences. Thankfully, we will introduce our final fact table within the data model, `fact_person_ui_wage` also located in the `tr_e2e` schema, which captures wages reported for each person in each quarter they are employed. This fact table was created directly from Arkansas' UI wage records, aggregating person/employer/quarter data to a person/quarter combination. While the UI wage records are not a perfect population of everyone employed in Arkansas, it does capture roughly 95 percent of private non-farm wage and salary employment in the state. A refresher of the raw UI wage data is available in the [first class notebook](P:/tr-enrollment-to-employment/ETA%20Class%201/Notebooks/01_EDA.html#arkansas-ui-wage-data-ds_ar_dws.ui_wages_lehd).

The `tr_e2e.fact_person_ui_wage` table contains identical identifiers as the other program-focused fact tables. This will make it easy to link this to our cohort, which was initially built on the raw TANF data and then linked to the program fact tables. As visualized in the diagram below (again available by selecting the "ER Diagram" option after clicking on the specific table in the Database Navigator of DBeaver), this fact table links to two of the three dimension tables leveraged in previous notebooks:

-   Person dimension, storing information on the unique collection of persons available in the data, merging person-level attributes from a variety of sources, resulting in a "golden record" with a higher completeness than is available in individual sources
-   Time dimension, storing all possible values for a period of time (day, week, quarter, month, year) across a long period and allows for easy cross-referencing across different periods

You may notice the lack of `program_key` identifier to link to the `dim_program` table - this is by design, as this fact table does not contain any information on program participation.

![](images/wage_fact_er.png)

We can see a subset of this table and confirm that data is recorded at the person/quarter level with the following query:

::: panel-tabset
## SQL Query

```{sql}
SELECT *
FROM tr_e2e.fact_person_ui_wage
LIMIT 5
```

## `dbplyr` query

```{r}
con %>% 
  tbl(in_schema(schema = "tr_e2e",
                table = "fact_person_ui_wage")) %>%
  head(5)
```
:::

Additionally, we can validate the presence of the surrogate columns linking to two of our dimension tables: `person_key` and `year_quarter_key`. Beyond this information, the table just contains a unique row identifier for the table, `person_ui_wage_id`, which does not link to any other table in a meaningful fashion.

# Linking Cohort to Workforce Fact

Before we can begin to build out the aforementioned employment outcomes for our cohort, we need to link our cohort to this new fact table. Recall the structure and contents of our cohort table, which we saved in the `tr_e2e` schema as `nb_cohort`:

::: panel-tabset
## SQL Query

```{sql}
SELECT *
FROM tr_e2e.nb_cohort
LIMIT 5
```

## `dbplyr` query

```{r}
con %>% 
  tbl(in_schema(schema = "tr_e2e",
                table = "nb_cohort")) %>%
  head(5)
```
:::

Since we have already linked the original TANF data to the data model in developing our cohort, we have variables such as `person_key` and `exit_year_quarter_key`, which we can use to link directly to our workforce fact table. Since the workforce fact table only contains quarterly observations for those present in the UI wage records, we must use a **left join** if we want to preserve observations for individuals in our cohort who were never present in the UI wage records. Additionally, while we can look at workforce experiences over the entire range of available wage records, we will take a more limited view by only bringing in employment records within 5 quarters of exit. We will read this resulting table into R for future use:

> The code in the SQL query can be run in DBeaver by pasting just the code inside the quotations after `qry <-`.

::: panel-tabset
## SQL Query

```{r}
qry <- "
SELECT nc.*, 
	wage.year_quarter_key as wage_year_quarter_key,
	wage.ui_quarterly_wages,
		--CAN CREATE NEW VARIABLE FOR WAGE QUARTER RELATIVE TO TANF EXIT 
	wage.year_quarter_key - nc.exit_year_quarter_key AS relative_quarter
FROM tr_e2e.nb_cohort nc 
LEFT JOIN tr_e2e.fact_person_ui_wage wage ON
	--include ui_quarterly_wages > 0 in join clause to maintain structure of left join
	(nc.person_key = wage.person_key AND wage.ui_quarterly_wages > 0 AND
		--add additional clause to limit wage record focus for within 5 quarters of exit 
		nc.exit_year_quarter_key <= wage.year_quarter_key + 5 AND nc.exit_year_quarter_key >= wage.year_quarter_key - 5
	)
ORDER BY nc.person_key, relative_quarter
"

cohort_emp <- dbGetQuery(con, qry)

head(cohort_emp)
```

## `dbplyr` query

```{r}
nc <- con %>% 
  tbl(in_schema(schema = "tr_e2e",
                table = "nb_cohort"))

wage <- con %>% 
  tbl(in_schema(schema = "tr_e2e",
                table = "fact_person_ui_wage")) %>%
  filter(ui_quarterly_wages > 0) %>%
  select(person_key, year_quarter_key, ui_quarterly_wages) %>%
  rename(wage_year_quarter_key = year_quarter_key) %>%
  # cannot join on computer variables, so need to create before join
  mutate(
    wage_year_quarter_key_high = wage_year_quarter_key + 5,
    wage_year_quarter_key_low = wage_year_quarter_key - 5
  )

cohort_emp <- nc %>%
  left_join(
    wage, 
    # join_by supports inequality conditions (ex. greater than or equal to)
    join_by(person_key, exit_year_quarter_key <= wage_year_quarter_key_high, exit_year_quarter_key >= wage_year_quarter_key_low)
  ) %>%
  mutate(
    relative_quarter = wage_year_quarter_key - exit_year_quarter_key
  ) %>%
  select(-c(wage_year_quarter_key_low, wage_year_quarter_key_high)) %>%
  arrange(person_key, relative_quarter) %>%
  collect()

head(cohort_emp)
```
:::

We can confirm the integrity of the left join by confirming that we still have the same amount of unique individuals as we had in our [cohort](P:/tr-enrollment-to-employment/ETA%20Class%201/Notebooks/04_longitudinal_analysis.html#defining-characteristics-applying-them-to-our-data):

```{r}
cohort_emp %>%
  summarize(
    n_rows = n(),
    n_ppl = n_distinct(person_key),
  )
```

We now have more rows than individuals, which should align with our understanding of the left join - since we matched to all quarterly employment observations in the fact table within 5 quarters of TANF exit, we might have more than one observation per individual if they appeared in the UI wage records with positive earnings in multiple quarters. We can look at this distribution by `relative_quarter`:

```{r, max.print = 15}
cohort_emp %>% 
  group_by(relative_quarter) %>%
  summarize(
    n_rows = n(),
    n_ppl = n_distinct(person_key)
  )
```

As intended, there should be one row per individual in a given quarter. There also appear to be `NA` `relative_quarter` values. Due to the logic of our left join, these observations will arise in situations where the individual in the cohort did not appear with positive earnings in the UI wage records in a single quarter within this 11 quarter time frame.

Visually, we can observe trends in employment over time with a line graph. For reference, we will add in a dotted red line denoting the quarter of TANF exit:

```{r}
cohort_emp %>% 
  group_by(relative_quarter) %>%
  summarize(
    n_ppl = n_distinct(person_key)
  ) %>%
  ungroup() %>%
  ggplot(aes(x = relative_quarter, y = n_ppl)) +
  geom_line() + 
  # add vertical red dotted line at TANF exit (relative_quarter = 0)
  geom_vline(
    xintercept = 0,
    linetype = "dotted",
    color = "red"
  )
```

::: callout-important
## Checkpoint

If you were to match your cohort, or analytic frame, to the available wage records, what would you expect to see? Note this in your project template and discuss with your group if you plan to leverage the employment data.

Keep in mind that the 5th quarter after exit for our cohort corresponds to 2020Q2, when COVID-imposed restrictions began affecting employment opportunities in Arkansas.
:::

# Employment Measures

With our linked cohort-employment data frame developed, we can further explore our cohort's patterns of work and earnings through our three measures.

> Note: Even though we will bring in our simultaneous co-enrollment indicator later, our cohort is defined by TANF exit date, not common exit. If you are interested in applying common exit in co-enrollment situations, you can do so by modifying the code in the cohort creation (longitudinal analysis) notebook.

## Employment Rate - 2nd Quarter After Exit

Since our data frame `cohort_emp` contains at least one record for everyone in our original cohort, we can isolate the denominator of our employment rate calculations - original cohort size - and bring it back in after finding the number of individuals employed in the 2nd quarter after TANF exit:

```{r}
# find denominator
total_cohort <- cohort_emp %>%
  summarize(
    n_ppl = n_distinct(person_key)
  ) %>%
  pull(n_ppl)

# find numerator as n_distinct(person_key) and use to find employment rate
cohort_emp %>%
  filter(relative_quarter == 2) %>%
  summarize(
    quarter_2_emp_rate = n_distinct(person_key)*100/total_cohort
  )
```

## Employment Rate - 2nd and 4th Quarter After Exit

We can slightly modify our approach to find the percentage of individuals in our cohort work-eligible at exit in unsubsidized employment during both their 2nd and 4th quarters after exit, according to the UI wage records. To do so, after *filtering* for all observations in the 2nd and 4th quarters after TANF exit, we will isolate those who appear in both quarters, as indicated by the presence of multiple observations in the filtered data frame.

> Note: We have already accounted for the work-eligible restriction in our initial cohort construction.

```{r}
cohort_emp %>%
  filter(relative_quarter %in% c(2, 4)) %>%
  group_by(person_key) %>%
  summarize(
    n = n()
  ) %>%
  ungroup() %>%
  filter(n == 2) %>%
  summarize(
    quarter_2_and_4_emp_rate = 100*n_distinct(person_key)/total_cohort
  )
```

As a reference point, it might be helpful to see the 4th quarter employment rate in isolation. This calculation is quite similar to that of the 2nd quarter employment rate:

```{r}
# find numerator as n_distinct(person_key) and use to find employment rate
cohort_emp %>%
  filter(relative_quarter == 4) %>%
  summarize(
    quarter_4_emp_rate = n_distinct(person_key)*100/total_cohort
  )
```

Interesting! So according to our outcome measures here, it appears as though there is some transition between those employed in their 2nd and 4th quarters after exit. Let's look further investigate this measure by our joint race/ethnicity variable, which requires a slightly different technique due to the groupings in our denominator:

```{r}
# recalculate denominator of number of people in cohort by race/ethnicity grouping
cohort_by_eth <- cohort_emp %>%
  group_by(eth_recode_person) %>%
  summarize(
    n_total = n_distinct(person_key)
  ) %>% 
  ungroup()

cohort_emp %>%
  filter(relative_quarter %in% c(2, 4)) %>%
  # include eth_recode_person in group_by
  group_by(person_key, eth_recode_person) %>%
  summarize(
    n = n()
  ) %>%
  ungroup() %>%
  filter(n == 2) %>%
  group_by(eth_recode_person) %>%
  summarize(
    n_ppl = n_distinct(person_key)
  ) %>%
  ungroup() %>%
  # join to data frame with total counts to bring in denominaator
  inner_join(cohort_by_eth, by = "eth_recode_person") %>%
  mutate(
    quarter_2_and_4_emp_rate = 100*n_ppl/n_total
  )
```

While further investigation is required into the underlying dynamics that may be driving these outcomes, we can observe some interesting results here.

## Median Earnings - 2nd Quarter After Exit

We can also get a basic understanding of earnings with our final measure, which are tracked in `cohort_emp` as `ui_quarterly_wages`.

```{r}
cohort_emp %>%
  mutate(
    ui_quarterly_wages = as.numeric(ui_quarterly_wages)
  ) %>%
  filter(relative_quarter == 2) %>%
  summarize(
    quarter_2_median_earnings = median(ui_quarterly_wages)
  )
```

In interpreting this value, keep in mind that this median is calculated based on *only those who appear in the UI wage records*. Any individual in our cohort with missing or zero earnings according to the UI wage records in their second quarter after exit is not included in this calculation. We will briefly discuss techniques for dealing with missing data in the Missingness and Inference lecture. In the meantime, it is helpful to cross-reference this with our first measure tracking 2nd quarter post-exit employment rate.

We can extend this analysis by analyzing median earnings in the 2nd quarter after exit by our simultaneous co-enrollment indicator:

```{r}
cohort_emp %>%
  mutate(
    ui_quarterly_wages = as.numeric(ui_quarterly_wages)
  ) %>%
  filter(relative_quarter == 2) %>%
  group_by(co_enroll_ind) %>%
  summarize(
    quarter_2_median_earnings = median(ui_quarterly_wages)
  )
```

As we just discussed, it may be helpful to supplement this measure with our first one looking at the employment rate at this time:

```{r}
# recalculate denominator of number of people in cohort by co-enrollment grouping
cohort_by_co_enroll <- cohort_emp %>%
  group_by(co_enroll_ind) %>%
  summarize(
    n_total = n_distinct(person_key)
  ) %>% 
  ungroup()

cohort_emp %>%
  mutate(
    ui_quarterly_wages = as.numeric(ui_quarterly_wages)
  ) %>%
  filter(relative_quarter == 2) %>%
  group_by(co_enroll_ind) %>%
  # need to find numerator of number of people first before bringing in denominator
  summarize(
    quarter_2_median_earnings = median(ui_quarterly_wages),
    n_ppl = n_distinct(person_key)
  ) %>%
  inner_join(cohort_by_co_enroll, by = "co_enroll_ind") %>%
  mutate(
    quarter_2_emp_rate = 100*n_ppl/n_total
  ) %>%
  # ignore unnecessary columns
  select(-c(n_ppl, n_total))
```

::: callout-important
## Checkpoint

Does anything surprise you about these results? Compare this to our [TANF re-enrollment outcomes](P:/tr-enrollment-to-employment/ETA%20Class%201/Notebooks/04_longitudinal_analysis.html#tanf-reenrollment) by `co_enroll_ind`. What further investigation would you like to carry out here?

Applying this to your own project, in building employment outcomes, would it be worthwhile including some sort of employment or job stability measure? Or perhaps something else? Jot these ideas down in your project template.
:::

# Next steps: Applying this notebook to your project

Hopefully, by this point in the notebook, you have been inspired to apply some of these measures to your own cohort and overall project. You are encouraged to adapt and apply the base code available in this notebook to your own work. In the realm of enrollment to employment trajectories, there is a wealth of potential measures that can be created by linking the program participation information and UI wage records, and we encourage you to think through the different ways you might be able to create new measures and proxies to help answer your primary research question.

# Citations

Wisconsin Applied Data Analytics Training Program - Notebook 4: Measurement (citation to be updated)

```{r, echo=FALSE, results='hide'}
dbDisconnect(con)
```