Skip to content

Commit

Permalink
Ex6 (#71)
Browse files Browse the repository at this point in the history
* add edited notebook 6

* add questions notebook

* organise materials

* expand checker range

* update readme

* remove answers from questions notebook
  • Loading branch information
aezarebski authored Dec 8, 2022
1 parent 8552a15 commit f57c16d
Show file tree
Hide file tree
Showing 30 changed files with 1,822 additions and 5 deletions.
5 changes: 1 addition & 4 deletions README.org
Original file line number Diff line number Diff line change
Expand Up @@ -8,10 +8,6 @@ notebooks: =example-<n>-questions.ipynb= and =example-<n>-answers.ipynb=. The
before looking at the answers as this is the best way to check that you really
understand them.

*Please note*: we are in the process of revising the tutorial notebooks on a
weekly basis based on the lectures and previous sessions. As such, you will need
to pull down changes from the main branch each week for the tutorial.

* Contents

** Notebooks
Expand All @@ -24,6 +20,7 @@ to pull down changes from the main branch each week for the tutorial.
- [[https://github.com/aezarebski/aas-extended-examples/tree/main/example-3][Example 3]] reviews multiple linear regression.
- [[https://github.com/aezarebski/aas-extended-examples/tree/main/example-4][Example 4]] reviews hypothesis tests for contingency tables.
- [[https://github.com/aezarebski/aas-extended-examples/tree/main/example-5][Example 5]] reviews logisic regression.
- [[https://github.com/aezarebski/aas-extended-examples/tree/main/example-6][Example 6]] reviews mult-level models.

** Miscellaneous

Expand Down
863 changes: 863 additions & 0 deletions example-6/example-6-answers.ipynb

Large diffs are not rendered by default.

328 changes: 328 additions & 0 deletions example-6/example-6-answers.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,328 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.14.1
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---

# Example 6

This notebook is available on github
[here](https://github.com/aezarebski/aas-extended-examples). If you find errors
or would like to suggest an improvement, feel free to create an issue.

As usual we will start by importing some useful libraries.

```python
%matplotlib inline
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.tools.tools as smt
import matplotlib.pyplot as plt
```

# Salaries for Professors

The 2008--09 nine-month academic salary for Assistant Professors, Associate
Professors and Professors in a college in the U.S. The data were collected as
part of the on-going effort of the college's administration to monitor salary
differences between male and female faculty members.

- `rank` a factor with levels `AssocProf`` `AsstProf`` `Prof`.
- `discipline` a factor with levels `A` ("theoretical" departments) or ``B`` ("applied" departments).
- `yrs.since.phd` years since PhD.
- `yrs.service` years of service.
- `sex` a factor with levels `Female` `Male`.
- `salary` nine-month salary, in dollars.

This data comes with the `statsmodels` package along with some metadata. Here we
will rename a couple of variables so they are a bit easier to access.

```python
salaries_dataset = sm.datasets.get_rdataset("Salaries", "carData")
#print(salaries_dataset.__doc__)
salaries_df = salaries_dataset.data
salaries_df = salaries_df.rename(columns={"yrs.since.phd": "years_post_phd", "yrs.service": "years_service", "rank": "job"})
```

We can use a scatter plot to get an idea for how the salary changes. There is a
general trend that the salary increases through time, but the variance appears
to change.

```python
plt.figure()
plt.scatter(salaries_df.years_post_phd, salaries_df.salary)
plt.xlabel("Years post PhD", fontsize="large")
plt.ylabel("Salary", fontsize="large")
plt.show()
```

If we colour these points based on the professors' ranks a very different
pattern emerges

```python
jobs = salaries_df["job"].unique()

plt.figure()
for job in jobs:
tmp = salaries_df[salaries_df["job"] == job]
plt.scatter(tmp.years_post_phd, tmp.salary, label = job)
plt.legend(title = "Rank", title_fontsize = "large")
plt.xlabel("Years post PhD", fontsize="large")
plt.ylabel("Salary", fontsize="large")
plt.show()
```

### Question

What assumption of the variance component model clearly does not hold for this
data set?


### Answer

The within group variance is differs wildly!

```python
for job in jobs:
print(job)
print("mean: " + str(salaries_df[salaries_df["job"] == job].salary.mean()))
print("scale: " + str(salaries_df[salaries_df["job"] == job].salary.std()))
```

# Simulation

Before we analyse this data we should familiarise ourselves with the
functionality provided by `statsmodels`. To have a data set where we know the
"true" values we will simulate a very similar dataset. Note that we have set the
mean values for each rank to the true values and set a constant scale across all
the jobs of $15000$ (note $15000^{2} = 225000000$). The "years post PhD" is
sampled randomly from a Poisson distribution with a relevant mean.

```python
demo_job = np.repeat(a=['Prof', 'AsstProf', 'AssocProf'], repeats=100)
demo_ypp = np.concatenate(
(stats.poisson.rvs(30, size = 100),
stats.poisson.rvs(5, size = 100),
stats.poisson.rvs(15, size = 100)))

demo_salary_means = [126772,80775,93876]
demo_salary_scale = 15000
demo_salary = stats.norm.rvs(loc = np.repeat(a=demo_salary_means, repeats=100), scale = demo_salary_scale, size = 300)

demo_df = pd.DataFrame({'job': demo_job,
'years_post_phd': demo_ypp,
'salary': demo_salary})
demo_df = smt.add_constant(demo_df)

jobs = demo_df["job"].unique()
```

We can use the same plotting code from before to confirm that this is a relevant
dataset to work with.

```python
plt.figure()
for job in jobs:
tmp = demo_df[demo_df["job"] == job]
plt.scatter(tmp.years_post_phd, tmp.salary, label = job)
plt.title("Simulated data")
plt.legend(title = "Rank", title_fontsize = "large")
plt.xlabel("Years post PhD", fontsize="large")
plt.ylabel("Salary", fontsize="large")
plt.show()
```

### Question

As a way to get a rough idea of the components of the variance, estimate the
variance among the known means and use it to compute the variance partition
coefficient (VPC) for the simulated dataset. Obviously since we are estimating
the variance based on a data set of size 3 we should not put too much faith in
the results.

### Answer

```python
tmp = np.array(demo_salary_means)
silly_group_var_est = 0.5 * np.power(tmp - tmp.mean(), 2).sum()
silly_vpc = silly_group_var_est / (demo_salary_scale**2 + silly_group_var_est)
print(f"Silly VPC: {silly_vpc:.2f}")
```

### Question

Write down a way to describe the salaries with a variance components model. What
parameters will be estimated?


### Answer

Let $s_{ij}$ be the salary of individual $i$ with job $j$ then

$$
s_{ij} = \beta_{0} + u_{j} + e_{ij}
$$

where $u_{j} \sim N(0, \sigma_{u}^{2})$ and $e_{ij} \sim N(0, \sigma_{e}^{2})$.
The parameters that will be estimated are $\beta_{0}$, and $\sigma_{u}^{2}$ and
$\sigma_{e}^{2}$.


### Question

The following cell fits this model to the data. What are the estimated parameter
values?

Hint: Due to an odd choice of names, the "Scale" parameter in the summary is the
individual level variance.

```python
mlm_0 = smf.mixedlm(formula = "salary ~ 1", data = demo_df, groups = demo_df.job).fit()
mlm_0.summary()
```

### Answer

Note that because the data here have been randomly generated you may get
different numbers...

- $\hat{\beta}_{0} = 1.0 \times 10^{5}$
- $\hat{\sigma}_{u}^{2} = 5.5 \times 10^{8}$
- $\hat{\sigma}_{e}^{2} = 2.3 \times 10^{8}$


### Question

Does this look reasonable?


### Answer

The value of $\beta_{0}$ (the global mean) looks reasonable given the values we
simulated with and the individual level variance is approximately correct as
shown in the following snippet. Given there are only three jobs (groups) it is
harder to investigate if we are getting the scale correct there.

```python
individual_var_est = mlm_0.scale
print(f"Estimated individual salary scale: {np.sqrt(individual_var_est):.2e}")
print(f"True individual salary scale (from simulation): {demo_salary_scale:.2e}")

print(f"\nEstimated average salary: {mlm_0.params.Intercept:.2e}")
print(f"True average salary (from simulation): {demo_df.salary.mean():.2e}")
```

### Question

Compute the VPC from the model fit. Does it agree with the previous estimate?


### Answer

Yes

```python
group_var = float(mlm_0.summary().tables[1]["Coef."]["Group Var"])
vpc = group_var / (group_var + individual_var_est)
print(f"Silly VPC: {silly_vpc:.2f}")
print(f"Model VPC: {vpc:.2f}")
```

### Question

Test whether including the effects of job is important in this model.


### Answer

The case of no job level effect is a nested model so we can use a chi-squared
test to reject the null hypothesis that there is no random effect at the job
level.

```python
lm_1 = smf.ols(formula = "salary ~ 1", data = demo_df).fit()
mlm_llhd = mlm_0.llf
print("MLM log-likelihood")
print(mlm_llhd)

lm_llhd = lm_1.llf
print("LM log-likelihood")
print(lm_llhd)

my_chi = -2 * (lm_llhd - mlm_llhd)
print("Chi-squared statistic")
print(my_chi)

print("p-value")
print(1 - stats.chi2.cdf(my_chi, df=1))
```

### Question

If we were to add `years_post_phd` as a covariate, what sort of model would this
be? What was the name given to this type of model in the lecture?


### Answer

It would be a random intercept model.


### Question

Fit the model including the `years_post_phd` as a covariate. Does this parameter
have a significant association?


### Answer

No, as expected since the salaries were simulated independent of this value
(given the job).

```python
mlm_1 = smf.mixedlm(formula = "salary ~ 1 + years_post_phd", groups = demo_df.job, data = demo_df).fit()
mlm_1.summary()
```

### Question

Apply the methodology above to establish if `years_post_phd` has a significant
association with a Professor's salary while adjusting for random job-specific
effects.

### Answer

This model does not find a significant correlation

```python
mlm_2 = smf.mixedlm(formula = "salary ~ 1 + years_post_phd", groups = salaries_df.job, data = salaries_df).fit()
mlm_2.summary()
```

### Question

How much of the variance is explained by the professors rank?

### Answer

```python
print(f"VPC: {596087722.791 / (559427686.2493 + 596087722.791):.3f}")
```

```python
group_var = float(mlm_2.summary().tables[1]["Coef."]["Group Var"])
ind_var = mlm_2.scale
print(f"VPC: {group_var / (group_var + ind_var):.3f}")
```
Loading

0 comments on commit f57c16d

Please sign in to comment.