Ex6 (#71)

* add edited notebook 6 * add questions notebook * organise materials * expand checker range * update readme * remove answers from questions notebook
aezarebski · Dec 8, 2022 · f57c16d · f57c16d
1 parent 8552a15
commit f57c16d
Show file tree

Hide file tree

Showing 30 changed files with 1,822 additions and 5 deletions.
diff --git a/README.org b/README.org
@@ -8,10 +8,6 @@ notebooks: =example-<n>-questions.ipynb= and =example-<n>-answers.ipynb=. The
 before looking at the answers as this is the best way to check that you really
 understand them.
 
-*Please note*: we are in the process of revising the tutorial notebooks on a
-weekly basis based on the lectures and previous sessions. As such, you will need
-to pull down changes from the main branch each week for the tutorial.
-
 * Contents
 
 ** Notebooks
@@ -24,6 +20,7 @@ to pull down changes from the main branch each week for the tutorial.
 - [[https://github.com/aezarebski/aas-extended-examples/tree/main/example-3][Example 3]] reviews multiple linear regression.
 - [[https://github.com/aezarebski/aas-extended-examples/tree/main/example-4][Example 4]] reviews hypothesis tests for contingency tables.
 - [[https://github.com/aezarebski/aas-extended-examples/tree/main/example-5][Example 5]] reviews logisic regression.
+- [[https://github.com/aezarebski/aas-extended-examples/tree/main/example-6][Example 6]] reviews mult-level models.
 
 ** Miscellaneous
 

diff --git a/example-6/example-6-answers.ipynb b/example-6/example-6-answers.ipynb
diff --git a/example-6/example-6-answers.md b/example-6/example-6-answers.md
@@ -0,0 +1,328 @@
+---
+jupyter:
+  jupytext:
+    formats: ipynb,md
+    text_representation:
+      extension: .md
+      format_name: markdown
+      format_version: '1.3'
+      jupytext_version: 1.14.1
+  kernelspec:
+    display_name: Python 3 (ipykernel)
+    language: python
+    name: python3
+---
+
+# Example 6
+
+This notebook is available on github
+[here](https://github.com/aezarebski/aas-extended-examples). If you find errors
+or would like to suggest an improvement, feel free to create an issue.
+
+As usual we will start by importing some useful libraries.
+
+```python
+%matplotlib inline
+import pandas as pd
+import numpy as np
+import scipy.stats as stats
+import statsmodels.api as sm
+import statsmodels.formula.api as smf
+import statsmodels.tools.tools as smt
+import matplotlib.pyplot as plt
+```
+
+# Salaries for Professors
+
+The 2008--09 nine-month academic salary for Assistant Professors, Associate
+Professors and Professors in a college in the U.S. The data were collected as
+part of the on-going effort of the college's administration to monitor salary
+differences between male and female faculty members.
+
+- `rank` a factor with levels `AssocProf`` `AsstProf`` `Prof`.
+- `discipline` a factor with levels `A` ("theoretical" departments) or ``B`` ("applied" departments).
+- `yrs.since.phd` years since PhD.
+- `yrs.service` years of service.
+- `sex` a factor with levels `Female` `Male`.
+- `salary` nine-month salary, in dollars.
+
+This data comes with the `statsmodels` package along with some metadata. Here we
+will rename a couple of variables so they are a bit easier to access.
+
+```python
+salaries_dataset = sm.datasets.get_rdataset("Salaries", "carData")
+#print(salaries_dataset.__doc__)
+salaries_df = salaries_dataset.data
+salaries_df = salaries_df.rename(columns={"yrs.since.phd": "years_post_phd", "yrs.service": "years_service", "rank": "job"})
+```
+
+We can use a scatter plot to get an idea for how the salary changes. There is a
+general trend that the salary increases through time, but the variance appears
+to change.
+
+```python
+plt.figure()
+plt.scatter(salaries_df.years_post_phd, salaries_df.salary)
+plt.xlabel("Years post PhD", fontsize="large")
+plt.ylabel("Salary", fontsize="large")
+plt.show()
+```
+
+If we colour these points based on the professors' ranks a very different
+pattern emerges
+
+```python
+jobs = salaries_df["job"].unique()
+
+plt.figure()
+for job in jobs:
+    tmp = salaries_df[salaries_df["job"] == job]
+    plt.scatter(tmp.years_post_phd, tmp.salary, label = job)
+plt.legend(title = "Rank", title_fontsize = "large")
+plt.xlabel("Years post PhD", fontsize="large")
+plt.ylabel("Salary", fontsize="large")
+plt.show()
+```
+
+### Question
+
+What assumption of the variance component model clearly does not hold for this
+data set?
+
+
+### Answer
+
+The within group variance is differs wildly!
+
+```python
+for job in jobs:
+    print(job)
+    print("mean: " + str(salaries_df[salaries_df["job"] == job].salary.mean()))
+    print("scale: " + str(salaries_df[salaries_df["job"] == job].salary.std()))
+```
+
+# Simulation
+
+Before we analyse this data we should familiarise ourselves with the
+functionality provided by `statsmodels`. To have a data set where we know the
+"true" values we will simulate a very similar dataset. Note that we have set the
+mean values for each rank to the true values and set a constant scale across all
+the jobs of $15000$ (note $15000^{2} = 225000000$). The "years post PhD" is
+sampled randomly from a Poisson distribution with a relevant mean.
+
+```python
+demo_job = np.repeat(a=['Prof', 'AsstProf', 'AssocProf'], repeats=100)
+demo_ypp = np.concatenate(
+    (stats.poisson.rvs(30, size = 100),
+    stats.poisson.rvs(5, size = 100),
+    stats.poisson.rvs(15, size = 100)))
+
+demo_salary_means = [126772,80775,93876]
+demo_salary_scale = 15000
+demo_salary = stats.norm.rvs(loc = np.repeat(a=demo_salary_means, repeats=100), scale = demo_salary_scale, size = 300)
+
+demo_df = pd.DataFrame({'job': demo_job,
+                      'years_post_phd': demo_ypp,
+                       'salary': demo_salary})
+demo_df = smt.add_constant(demo_df)
+
+jobs = demo_df["job"].unique()
+```
+
+We can use the same plotting code from before to confirm that this is a relevant
+dataset to work with.
+
+```python
+plt.figure()
+for job in jobs:
+    tmp = demo_df[demo_df["job"] == job]
+    plt.scatter(tmp.years_post_phd, tmp.salary, label = job)
+plt.title("Simulated data")
+plt.legend(title = "Rank", title_fontsize = "large")
+plt.xlabel("Years post PhD", fontsize="large")
+plt.ylabel("Salary", fontsize="large")
+plt.show()
+```
+
+### Question
+
+As a way to get a rough idea of the components of the variance, estimate the
+variance among the known means and use it to compute the variance partition
+coefficient (VPC) for the simulated dataset. Obviously since we are estimating
+the variance based on a data set of size 3 we should not put too much faith in
+the results.
+
+### Answer
+
+```python
+tmp = np.array(demo_salary_means)
+silly_group_var_est = 0.5 * np.power(tmp - tmp.mean(), 2).sum()
+silly_vpc = silly_group_var_est / (demo_salary_scale**2 + silly_group_var_est)
+print(f"Silly VPC: {silly_vpc:.2f}")
+```
+
+### Question
+
+Write down a way to describe the salaries with a variance components model. What
+parameters will be estimated?
+
+
+### Answer
+
+Let $s_{ij}$ be the salary of individual $i$ with job $j$ then
+
+$$
+s_{ij} = \beta_{0} + u_{j} + e_{ij}
+$$
+
+where $u_{j} \sim N(0, \sigma_{u}^{2})$ and $e_{ij} \sim N(0, \sigma_{e}^{2})$.
+The parameters that will be estimated are $\beta_{0}$, and $\sigma_{u}^{2}$ and
+$\sigma_{e}^{2}$.
+
+
+### Question
+
+The following cell fits this model to the data. What are the estimated parameter
+values?
+
+Hint: Due to an odd choice of names, the "Scale" parameter in the summary is the
+individual level variance.
+
+```python
+mlm_0 = smf.mixedlm(formula = "salary ~ 1", data = demo_df, groups = demo_df.job).fit()
+mlm_0.summary()
+```
+
+### Answer
+
+Note that because the data here have been randomly generated you may get
+different numbers...
+
+- $\hat{\beta}_{0} = 1.0 \times 10^{5}$
+- $\hat{\sigma}_{u}^{2} = 5.5 \times 10^{8}$
+- $\hat{\sigma}_{e}^{2} = 2.3 \times 10^{8}$
+
+
+### Question
+
+Does this look reasonable?
+
+
+### Answer
+
+The value of $\beta_{0}$ (the global mean) looks reasonable given the values we
+simulated with and the individual level variance is approximately correct as
+shown in the following snippet. Given there are only three jobs (groups) it is
+harder to investigate if we are getting the scale correct there.
+
+```python
+individual_var_est = mlm_0.scale
+print(f"Estimated individual salary scale: {np.sqrt(individual_var_est):.2e}")
+print(f"True individual salary scale (from simulation): {demo_salary_scale:.2e}")
+
+print(f"\nEstimated average salary: {mlm_0.params.Intercept:.2e}")
+print(f"True average salary (from simulation): {demo_df.salary.mean():.2e}")
+```
+
+### Question
+
+Compute the VPC from the model fit. Does it agree with the previous estimate?
+
+
+### Answer
+
+Yes
+
+```python
+group_var = float(mlm_0.summary().tables[1]["Coef."]["Group Var"])
+vpc = group_var / (group_var + individual_var_est)
+print(f"Silly VPC: {silly_vpc:.2f}")
+print(f"Model VPC: {vpc:.2f}")
+```
+
+### Question
+
+Test whether including the effects of job is important in this model.
+
+
+### Answer
+
+The case of no job level effect is a nested model so we can use a chi-squared
+test to reject the null hypothesis that there is no random effect at the job
+level.
+
+```python
+lm_1 = smf.ols(formula = "salary ~ 1", data = demo_df).fit()
+mlm_llhd = mlm_0.llf
+print("MLM log-likelihood")
+print(mlm_llhd)
+
+lm_llhd = lm_1.llf
+print("LM log-likelihood")
+print(lm_llhd)
+
+my_chi = -2 * (lm_llhd - mlm_llhd)
+print("Chi-squared statistic")
+print(my_chi)
+
+print("p-value")
+print(1 - stats.chi2.cdf(my_chi, df=1))
+```
+
+### Question
+
+If we were to add `years_post_phd` as a covariate, what sort of model would this
+be? What was the name given to this type of model in the lecture?
+
+
+### Answer
+
+It would be a random intercept model.
+
+
+### Question
+
+Fit the model including the `years_post_phd` as a covariate. Does this parameter
+have a significant association?
+
+
+### Answer
+
+No, as expected since the salaries were simulated independent of this value
+(given the job).
+
+```python
+mlm_1 = smf.mixedlm(formula = "salary ~ 1 + years_post_phd", groups = demo_df.job, data = demo_df).fit()
+mlm_1.summary()
+```
+
+### Question
+
+Apply the methodology above to establish if `years_post_phd` has a significant
+association with a Professor's salary while adjusting for random job-specific
+effects.
+
+### Answer
+
+This model does not find a significant correlation
+
+```python
+mlm_2 = smf.mixedlm(formula = "salary ~ 1 + years_post_phd", groups = salaries_df.job, data = salaries_df).fit()
+mlm_2.summary()
+```
+
+### Question
+
+How much of the variance is explained by the professors rank?
+
+### Answer
+
+```python
+print(f"VPC: {596087722.791 / (559427686.2493 + 596087722.791):.3f}")
+```
+
+```python
+group_var = float(mlm_2.summary().tables[1]["Coef."]["Group Var"])
+ind_var = mlm_2.scale
+print(f"VPC: {group_var / (group_var + ind_var):.3f}")
+```