---
title: "Event History Models for Microdemography"
---
```{r, include=FALSE, message=FALSE}
rm(list=ls())
library(tidyverse, quietly = T)
library(ggplot2, quietly = T)
library(kableExtra, quietly = T)
theme_set(theme_classic())
```
\newpage
# Micro-demography
Demographic studies began to focus on individual-level outcomes once individual-level census and social/health survey data became more widely available in the 1960s and 1970s. Along with this availability of individual-level data, demography began to be influenced by sociological thought, which brought a proper theoretical perspective to demographic studies; before this period, the field was dominated by methodological issues and descriptions of the macro-scale demographic processes we saw in the previous chapter.
This chapter focuses on how we model individual-level observational data, with a focus on data from demographic surveys. In doing this, I have two primary goals: first, to illustrate how to conduct commonly used regression analyses of demographic outcomes as measured in surveys, drawing heavily on data from the ******; and second, to describe the event-history framework that is used in much of social science when we analyze changes in outcomes between time points, or the time to some demographic event, such as having a child or dying.
### Individual-level focus
In contrast to the macro-demographic perspective described in the previous chapter, the micro-demographic perspective is focused on the **individual** rather than the aggregate. As Barbara Entwisle [-@entwisle_2007] explains, micro-demography focuses on how individuals are modeled within demographic studies, treating individual-level behaviors and outcomes as being of prime importance. Most of the time in an individual-level, or micro-demographic, focus, our hypotheses are related to how the individual outcome or behavior is influenced by characteristics of the individual, usually without concern or interest in the physical or social context in which the individual lives their life. For instance, in a micro-demographic study of mortality, the researcher may ask:
*Do individuals with a low level of education face a higher risk of death, compared to people with a college education?*
This in and of itself says nothing about the spatial context in which the person lives; it is only concerned with characteristics of the person.
### Individual level propositions
If we are concerned with how individual-level factors affect each person's outcome, then we are stating an individual-level, or micro, proposition. In this type of analysis, we have *y* and *x*, both measured on our individual-level units. We may also have a variable *Z* measured on a second-level unit, but we'll wait and discuss these in the next chapter. An example of this micro-level, or individual-level, proposition would be that we think a person's health is affected by their level of education.
![Micro-level proposition](./images/indi.png)
## Types of individual level outcomes
In the analysis of micro-demographic outcomes, we encounter a wide variety of outcomes. Some are relatively simple recodings of existing survey data; others depend on repeated measures of some behavior in a longitudinal framework. At the heart of these different outcomes are the different statistical distributions assumed to model them, and the choices we face about which distribution best suits a particular outcome. Building on the Generalized Linear Model framework introduced in the previous chapter, my goal in this chapter is to illustrate how to use this framework for analyzing a variety of different types of outcomes. I also show how to apply appropriate corrections for complex survey data in the context of the various outcome types, relying heavily on the `survey` and `svyVGAM` packages [@lumley2004; @lumleysvyvgam].
### Continuous outcomes
The workhorse for continuous outcomes in statistics is the linear model with a Gaussian, or Normal, distribution, which was reviewed in the previous chapter. The Normal distribution is incredibly flexible in its form, and once certain restrictions are removed via techniques such as generalized least squares, the model becomes even more flexible. I do not intend to go over the details of model estimation in this chapter, as this was covered previously. Instead, I will use a common demographic survey data source, the Demographic and Health Survey model data, to illustrate how to include survey design information and estimate the models. In addition to the Normal distribution, other continuous distributions, including the Student-t and Gamma distributions, are also relevant for modeling heteroskedastic and skewed continuous outcomes, as described in the previous chapters.
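As a preview of how survey design information enters these models, the sketch below sets up a design object and fits a survey-weighted Gaussian model with the `survey` package. The data frame `dhs` and the variable names used here (`v021` for the primary sampling unit, `v022` for strata, `v005` for the sampling weight, and `haz` and `educ` as a placeholder outcome and predictor) are assumed for illustration only and are not loaded by this document, so the chunk is not evaluated.
```{r, eval=FALSE}
# a minimal sketch, assuming a DHS-style recode file named `dhs`
library(survey)
dhs$wt <- dhs$v005 / 1e6              # DHS weights are stored multiplied by 1,000,000
des <- svydesign(ids = ~v021,         # primary sampling units
                 strata = ~v022,      # sampling strata
                 weights = ~wt,
                 data = dhs,
                 nest = TRUE)
fit <- svyglm(haz ~ educ, design = des, family = gaussian())
summary(fit)
```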
#### Data for examples
#### Discrete
#### Binary Logistic regression model
#### Marginal effects
#### Ordinal logistic regression model
#### Multinomial logistic regression model
#### Changes in an outcome over time
#### Creating an event history dataset
### Creating a longitudinal dataset
## Event history framework
### When to conduct an event history analysis?
* When your questions include
+ When or Whether
+ When > how long until an event occurs
+ Whether > does an event occur or not
* If your question does not include either of these ideas (or cannot be made to) then you do not need to do event history analysis
### Basic Propositions
* Since most of the methods we will discuss originate from studies of mortality, they have morbid names
+ Survival – This is related to how long a case lasts until it experiences the event of interest
+ How long does it take?
+ Risk – How likely is it that the case will experience the event
+ Will it happen or not?
### Focus on comparison
* Most of the methods we consider are comparative by their nature
* How long does a case with trait x survive, compared to a case with trait y?
* How likely is it for a person who is married to die of homicide relative to someone who is single?
* Generally we are examining relative risk and relative survival
### Some terminology
* **State** – a discrete condition an individual may occupy, occurring within a state space. Most survival analysis methods assume a single state-to-state transition
* **State space** – full set of state alternatives
* **Episodes/Events/Transitions** – a change in states
* **Durations** – length of an episode
* **Time axis** – Metric for measuring durations (days, months, years)
![State Space Illustration](./images/state_concepts.png)
### Issues in event history data
#### Censoring
* **Censoring** occurs when you do not actually observe the event of interest within the period of data collection
+ e.g. you know someone gets married, but you never observe them having a child
+ e.g. someone leaves alcohol treatment and is never observed drinking again
![Censoring](./images/censoring.png)
#### Non-informative censoring
* The individual is not observed because the observer ends the study period
* The censoring is not related to any trait or action of the case, but related to the observer
+ We want most of our censoring to be this kind
#### Informative censoring
* The individual is not observed because they represent a special case
* The censoring IS related to something about the individual, and these people differ inherently from uncensored cases
* People that are censored ARE likely to have experienced the event
#### Right censoring
* The event time is unknown because the event has not yet occurred by the end of observation.
+ This is easier to deal with
#### Left censoring
* The event is known to have occurred prior to the beginning of data collection, but it is unknown exactly when
+ This is difficult to deal with
#### Interval censoring
* The event time is known to have occurred within a period of time, but it is unknown exactly when
+ This can be dealt with
## Time Scales
* Continuous time
+ Time is measured in very precise, unique increments > miles until a tire blows out
+ Each observed duration is unique
* Discrete time
+ Time is measured in discrete lumps > semester a student leaves college
+ Each observed duration is not necessarily unique, and takes one of a set of discrete values
![Time Scales](./images/timescales.png)
#### Making continuous outcomes discrete
* Ideally you should measure the duration as finely as possible (see Freedman et al)
+ Often you may choose to discretize the data > take continuous time and break it into discrete chunks (see the sketch after this list)
* Problems
+ This removes possibly informative information on duration variability
+ Any discrete dividing point is arbitrary
+ You may arrive at different conclusions given the interval you choose
+ You lose information about late event occurrence
+ Lose all information on mean or average durations
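Below is a small sketch of what discretizing looks like in practice; the durations and breakpoints here are made up for illustration, and the arbitrariness of the cut points is exactly the problem noted above.
```{r}
# hypothetical continuous durations (in months) collapsed into yearly intervals
dur <- c(2.4, 11.7, 15.2, 30.1, 44.9)
cut(dur, breaks = c(0, 12, 24, 36, 48),
    labels = c("0-1 yr", "1-2 yr", "2-3 yr", "3-4 yr"))
```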
## Kinds of studies with event history data
* Cross sectional
+ Measured at one time point (no change observed)
+ Can measure lots of things at once
* Panel data
+ Multiple measurements at discrete time points on the same individuals
+ Can look at change over time
* Event history
+ Continuous measurement of units over a fixed period of time, focusing on change in states
+ Think clinical follow-ups
* Longitudinal designs
+ Prospective designs
+ Studies that identify a group (cohort) and follow them over time
+ Expensive and take a long time, but can lead to extremely valuable information on changes in behaviors
* Retrospective designs
+ Taken at a cross section
+ Ask respondents about events that have previously occurred.
+ Generate birth/migration/marital histories for individuals
+ Problems with recall bias
+ DHS includes a detailed history of births over the last 5 years
* Record linkage procedures
+ Begin with an event of interest (birth, marriage) and follow individuals using various record types
+ Birth > Census 1880 > Census 1890 > Marriage > Birth of children > Census 1900 > Tax records >Death certificate
+ Mostly used in historical studies
+ Modern studies link health surveys to National Death Index (NHANES, NHIS)
# Functions of Survival Time
## Some arrangements for event history data
### Counting process data
* This is what we are accustomed to in the life table
```{r}
# a small counting-process table: numbers failing and at risk in each interval
t1<-data.frame(Time_start=c(1,2,3,4),
               Time_end=c(2,3,4,5),
               Failing=c(25,15,12,20),
               At_Risk=c(100, 75, 60, 40))
t1%>%
kable()%>%
column_spec(1:4, border_left = T, border_right = T)%>%
kable_styling()
#knitr::kable(t1,format = "html", caption = "Counting Process data" ,align = "c", )
```
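A quick way to see the life-table flavor of this arrangement is to compute the interval-specific probability of failing, analogous to the $q(x, n)$ function discussed later in this chapter; the extra column here is only an illustration.
```{r}
# interval-specific probability of failure: events divided by the number at risk
t1 %>%
  mutate(q = Failing / At_Risk)
```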
### Case-duration, or person-level data
* This is the general form of continuous time survival data.
```{r}
# case-duration (person-level) data: one row per case, with the observed
# duration and an indicator of whether the event occurred (ID 3 is right censored)
t2<-data.frame(ID = c(1,2,3,4),
               Duration=c(5, 2, 9, 6),
               Event_Occurred=c("Yes (1)","Yes (1)","No (0)", "Yes (1)" ))
t2%>%
kable()%>%
column_spec(1:3, border_left = T, border_right = T)%>%
kable_styling(row_label_position = "c", position = "center" )
#knitr::kable(t2, format = "html", caption = "Case-duration data", align = "c")
```
This can be transformed into person-period data, or discrete time data.
### Person – Period data
* Express exposure as discrete periods
* Event occurrence is coded at each period
```{r}
# person-period version of t2: one row per ID for each period of exposure,
# with Event = 1 in the period when the event occurs (ID 3 is censored)
t3<-data.frame(ID=c(rep(1, 5), rep(2, 2), rep(3, 9), rep(4, 6)),
               Period = c(1:5, 1:2, 1:9, 1:6),
               Event=c(0,0,0,0,1, 0,1, rep(0,9), 0,0,0,0,0,1))
t3%>%
kable()%>%
column_spec(1:3, border_left = T, border_right = T)%>%
kable_styling(row_label_position = "c", position = "center" )
```
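Rather than typing the person-period table out by hand, we can derive it directly from the case-duration data. The sketch below uses `tidyr::uncount()` (loaded with the tidyverse) to expand `t2` into one row per period of exposure; the recoding of `Event_Occurred` into a numeric indicator is an illustrative step added here, not part of the tables above.
```{r}
t2 %>%
  mutate(event = ifelse(Event_Occurred == "No (0)", 0, 1)) %>%  # numeric event indicator
  uncount(Duration, .remove = FALSE) %>%                        # one row per period of exposure
  group_by(ID) %>%
  mutate(Period = row_number(),
         Event = ifelse(Period == Duration & event == 1, 1, 0)) %>%
  ungroup() %>%
  select(ID, Period, Event)
```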
## Homage to the life table
In life tables, we had lots of functions of the death process. Some of these were more interesting than others, with two being of special interest to us here. These are the $l(x)$ and $q(x, n)$ functions. If you recall, $l(x)$ represents the population size of the stationary population that is alive at age $x$, and the risk of dying between ages $x$ and $x+n$ is $q(x, n)$.
These are generalized further in the event history analysis literature, but we can still describe the distribution of survival time using three functions. These are the **Survival Function**, $S(t)$, the **probability density function**, $f(t)$, and the **hazard function**, $h(t)$. These three are related, and we can derive one from the others.
Now we must generalize these ideas to incorporate them into the broader event-history framework.
Survival/duration times measure the *time to a certain event.*
These times are subject to random variation, and are considered to be *iid* (independent and identically distributed) random variates from some distribution.
* The distribution of survival times is described by 3 functions
* The survivorship function, $S(t)$
* The probability density function, $f(t)$
* The hazard function, $h(t)$
![3 functions](./images/functions.png)
```{r}
# approximate the CDF F(t) of a log-normal survival time at single years of age;
# exp(4.317488) is about 75, so the median survival time is roughly 75 years
Ft<-cumsum(dlnorm(x = seq(0, 110, 1), meanlog = 4.317488, sdlog = 2.5))
ft<-diff(Ft)      # density f(t) as the first difference of F(t)
St<-1-Ft          # survival function S(t) = 1 - F(t)
ht<-ft/St[1:110]  # hazard h(t) = f(t) / S(t)
plot(Ft, ylim=c(0,1))
lines(St, col="red")
plot(ht, col="green")
```
* These three are mathematically related, and if given one, we can calculate the others
+ These 3 functions each represent a different aspect of the survival time distribution.
* The fundamental problem in survival analysis is coming up with a way to estimate these functions.
### Defining the functions
Let *T* denote the survival time; our goal is to characterize the distribution of *T* using these 3 functions.
Let *T* be a discrete (or continuous) *iid* random variable and let $t_i$ be an occurrence of that variable, such that $Pr(t_i)=Pr(T=t_i)$
### The distribution function, or *pdf*
Like any other random variate, survival times have a distribution function that gives the probability of observing a particular survival time within a finite interval.
The density function is defined as the limit of the probability that an individual fails (experiences the event) in a short interval $(t, t+\Delta t)$, per width of $\Delta t$, or simply the probability of failure in a small interval per unit time, $f(t_i) = Pr(T=t_i)$.
If $F(t)$ is the cumulative distribution function for *T*, given by:
$$ F(t) = \int_{0}^{t} f(u) du = Pr(T \leqslant t )$$
Which is the probability of observing a value of *T* prior to the current value, *t*.
The density function is then:
$$ f(t) = \frac{dF(t)}{dt} = F'(t)$$
or
$$ f(t) = \lim_{\Delta t \rightarrow 0} \frac{F(t+\Delta t) - F(t)}{\Delta t}$$
The density function gives the unconditional instantaneous failure rate in the (very small) interval between $t$ and $t + \Delta t$.
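A quick numerical check of this limit definition is easy with the log-normal parameters used in the example above: for a small $\Delta t$, the difference quotient of the CDF is close to the density returned by `dlnorm()`.
```{r}
# difference quotient of F(t) versus the density f(t) at t = 60,
# using the same log-normal parameters as the earlier chunk
t0 <- 60
delta <- 0.01
(plnorm(t0 + delta, meanlog = 4.317488, sdlog = 2.5) -
    plnorm(t0, meanlog = 4.317488, sdlog = 2.5)) / delta
dlnorm(t0, meanlog = 4.317488, sdlog = 2.5)
```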
### Survival Function
The survival function, *S(t)* is expressed:
$$ S(t) = 1 - F(t) = Pr (T \geqslant t)$$
Which is the probability that *T* takes a value larger than *t*. i.e. the event happens some time after the present time.
At $t = 0$, $S(t) = 1$, and at $t = \infty$, $S(t) = 0$.
As time passes, *S(t)* decreases, and is called a *strictly decreasing function of time*.
Empirically, *S(t)* takes the form of a step function:
```{r}
# empirical S(t) from the counting-process table t1: the cumulative product
# of the interval survival probabilities, starting from S(0) = 1
St<- c(1, cumprod(1-(t1$Failing/t1$At_Risk)))
plot(St, type="s", xlab = "Time", ylab="S(t)", ylim=c(0,1))
```
```{r, echo=FALSE, eval=FALSE}
ht<-t1$Failing/t1$At_Risk
plot(ht, type="s", xlab = "Time", ylab="h(t)", ylim=c(0,1))
```
### The hazard function
The hazard function relates death, *f(t)*, and survival, *S(t)*, to one another
$$h(t) = \frac{f(t)}{S(t)}$$
$$h(t) = \lim_{\Delta t \rightarrow 0} \frac{Pr(t \leqslant T \leqslant t + \Delta t | T \geqslant t)}{\Delta t}$$
Which is the failure rate per unit time in the interval *t*, $t+\Delta t$, the hazard may increase or decrease with time, or stay the same. This is really dependent on the distribution of failure times.
## Relationships among the three functions
If $f(t) = \frac{dF(t)}{dt}$ and $S(t) = 1- F(t)$ and $h(t) = \frac{f(t)}{S(t)}$, then we can write:
$$f(t) = \frac{-dS(t)}{dt}$$
and the hazard function as:
$$h(t) = \frac{-d \text{ log } S(t)}{dt}$$
If we integrate this and let $S(0)=1$, then
$$S(t) = \exp\left(-\int_{0}^{t} h(u) \, du\right) = e^{-H(t)}$$
where the quantity $H(t) = \int_{0}^{t} h(u) \, du$ is called the *cumulative hazard function*, and
$$H(t) = -\text{log } S(t)$$
The density can be written as:
$$f(t) = h(t) e ^{-H(t)}$$
and
$$h(t) = \frac{h(t) e^{-H(t)}}{e^{-H(t)}} = \frac{f(t)}{S(t)}$$
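Since the earlier log-normal chunk already produced discrete versions of $F(t)$, $f(t)$, and $h(t)$, we can roughly verify these identities numerically; the sums are only approximations to the continuous integrals, so agreement is approximate rather than exact.
```{r}
# rough numerical check of H(t) = -log S(t) and f(t) = h(t) exp(-H(t)),
# using Ft, ft, and ht from the log-normal example above
S_ln <- 1 - Ft                         # survival function on the same age grid
H_ln <- -log(S_ln)                     # cumulative hazard via H(t) = -log S(t)
head(cbind(diff(H_ln), ht))            # increments of H(t) roughly equal h(t)
head(cbind(ht * exp(-H_ln[-1]), ft))   # h(t) e^{-H(t)} roughly equals f(t)
```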
### More on the hazard function...
Unlike the *f(t)* or *S(t)*, *h(t)* describes the risk an individual faces of experiencing the event, given they have survived up to that time.
This kind of conditional probability is of special interest to us.
We can extend this framework to include effects of *individual characteristics on one's risk*, thus not only introducing dependence on time, but also on these characteristics (covariates).
We can re-express the hazard rate with both these conditions as:
$$h(t|x) = \lim_{\Delta t \rightarrow 0} \frac{Pr(t \leqslant T \leqslant t + \Delta t | T \geqslant t, x)}{\Delta t}$$
## Logistic regression
When a variable is coded as binary, meaning either 1 or 0, the Bernoulli distribution is used, as in the logistic regression model. When coded like this, the model tries to use the other measured information to predict the 1 value versus the 0 value. So in the basic sense, we want to construct a model like:
$$Pr(y=1) =\pi = \text{some function of predictors}$$
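To make this concrete, here is a minimal sketch of the model using simulated data, so the chunk is self-contained; the variables `educ` and `y` are made up for illustration and are not from any of the survey data discussed above.
```{r}
# simulate a binary outcome whose probability depends on education,
# then fit a logistic regression with glm() and a logit link
set.seed(1115)
dat <- data.frame(educ = sample(0:3, size = 500, replace = TRUE))
dat$y <- rbinom(n = 500, size = 1, prob = plogis(-1 + 0.4 * dat$educ))
fit <- glm(y ~ educ, data = dat, family = binomial(link = "logit"))
summary(fit)
```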