Skip to content

Commit

Permalink
Ex5 (#69)
Browse files Browse the repository at this point in the history
* include ex5 answers files

* add questions notebook
  • Loading branch information
aezarebski authored Nov 30, 2022
1 parent e28b55a commit 8552a15
Show file tree
Hide file tree
Showing 9 changed files with 1,367 additions and 934 deletions.
6 changes: 6 additions & 0 deletions README.org
Original file line number Diff line number Diff line change
Expand Up @@ -14,13 +14,19 @@ to pull down changes from the main branch each week for the tutorial.

* Contents

** Notebooks

- [[https://github.com/aezarebski/aas-extended-examples/tree/main/example-0][Example 0]] contains an overview of the software packages we will be using and
basic python skills (primarily plotting with =matplotlib=).
- [[https://github.com/aezarebski/aas-extended-examples/tree/main/example-1][Example 1]] reviews the central limit theorem and looks at confidence intervals
and hypothesis tests.
- [[https://github.com/aezarebski/aas-extended-examples/tree/main/example-2][Example 2]] reviews simple linear regression.
- [[https://github.com/aezarebski/aas-extended-examples/tree/main/example-3][Example 3]] reviews multiple linear regression.
- [[https://github.com/aezarebski/aas-extended-examples/tree/main/example-4][Example 4]] reviews hypothesis tests for contingency tables.
- [[https://github.com/aezarebski/aas-extended-examples/tree/main/example-5][Example 5]] reviews logisic regression.

** Miscellaneous

- The [[https://github.com/aezarebski/aas-extended-examples/blob/main/setup.org][setup]] page gives some instructions for setting up the software and
notebooks.
- The [[https://github.com/aezarebski/aas-extended-examples/blob/main/additional-resources.org][additional resources]] page gives some pointers to additional resources
Expand Down
571 changes: 0 additions & 571 deletions example-2.75/example-2.7-answers.ipynb

This file was deleted.

363 changes: 0 additions & 363 deletions example-2.75/example-2.7-questions.ipynb

This file was deleted.

726 changes: 726 additions & 0 deletions example-5/example-5-answers.ipynb

Large diffs are not rendered by default.

272 changes: 272 additions & 0 deletions example-5/example-5-answers.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,272 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.14.1
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---

# Example 5

This notebook is available on github
[here](https://github.com/aezarebski/aas-extended-examples). If you find errors
or would like to suggest an improvement, feel free to create an issue.

As usual we will start by importing some useful libraries.

```python
%matplotlib inline
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
```

In this notebook we will look at data on the mode of travel chosen by people
travelling between Sydney and Melbourne (Australia). Relevant variables for us
in this dataset are the mode of travel chosen: air, bus, car or train, the
travel time, the generalised cost of the journey, the income of the household
travelling, and the number of people travelling together. The relevant dataset
is provided by the `statsmodels` package.

```python
travel_dataset = sm.datasets.get_rdataset("TravelMode", "AER")
#print(travel_dataset.__doc__)
travel_df = travel_dataset.data
```

### Question

What units is the variable `income` in?

### Answer

Although it doesn't state it explicitly in the meta-data, it appears this is a
measure of income in thousands of Australian dollars. This can be guessed
because is on the order of the national average, although as a dataset from
1997, it is getting a bit old.

## Data exploration

As a first pass we will look at modelling whether the selected mode was train or
car. The following snippet puts the data in a tidy format. Note we have added a
new variable, `is_car`, which is `1` if the travel was done by car and `0` if it
was done by train. We will also subset the data to only contain those records
where the travellers had a choice in the mode of transport.

```python
cb_df = travel_df[(travel_df["mode"] == "train") | (travel_df["mode"] == "car")]
cb_df = cb_df[cb_df["choice"] == "yes"]
cb_df = cb_df[["mode", "income", "size"]]
cb_df = cb_df.rename(columns={"mode": "vehicle", "size": "num_people"})

predictor_names = ["income", "num_people"]

cb_df["is_car"] = 0
cb_df.loc[cb_df["vehicle"] == "car","is_car"] = 1
print(cb_df.head(10))
```

### Question

Generate visualisations to see how the distribution of income among car and
train trips. What do you notice?

[hint](https://aezarebski.github.io/misc/matplotlib/gallery.html#fig-06)

### Answer

The boxplots below show journeys by car are more likely to have individuals from
households with a larger income and transport more people.

```python tags=[]
print(cb_df.shape)
#cb_df.head(20)
unique_vehicles = cb_df.vehicle.unique()
grouped_values = {p : [cb_df[cb_df.vehicle == v][p] for v in unique_vehicles] for p in predictor_names}
```

```python
plt.figure()
plt.boxplot(x = grouped_values["income"],
labels = unique_vehicles)
plt.title("Traveller income")
plt.show()
```

```python
plt.figure()
plt.boxplot(x = grouped_values["num_people"],
labels = unique_vehicles)
plt.title("Number of people")
plt.show()
```

## Logistic model

### Question

Fit a logistic regression model to this data. Do the estimated coefficients make
sense?

[hint](https://www.statsmodels.org/stable/generated/statsmodels.formula.api.logit.html#statsmodels.formula.api.logit)

### Answer

We can use the `logit` function from `smf` to fit the logistic regression.

```python
logistic_model = smf.logit(formula="is_car ~ income + num_people", data = cb_df).fit()
```

```python
logistic_model.summary()
```

### Question

Write a function called `logit` which computes the logit function and
`inv_logit` which computes its inverse.

[hint](https://en.wikipedia.org/wiki/Logit#Definition)

### Answer

```python
def logit(p):
return np.log(p / (1 - p))

def inv_logit(a):
return np.exp(a) / (np.exp(a) + 1)
```

The following snippet demonstrates one way to visualise the results of the model
fit. If you have defined `logit` and `inv_logit` above this should make a
sensible figure.

```python
prob_is_car = logistic_model.predict()
log_odds_is_car = logit(prob_is_car)

plt.figure()
plt.scatter(log_odds_is_car, prob_is_car, label="Logistic regression")
plt.scatter(log_odds_is_car, cb_df["is_car"], label = "Data")
plt.legend()
plt.show()
```

### Question

For a fixed income, what change does the model predict for each additional
person on the journey? What happens to the log-odds for each additional person?
What happens to the odds?

### Answer

$$
\ln\left(\frac{p}{1-p}\right) = -2.2 + 0.06 \times \text{income} + 0.19 \times\text{num_people}
$$

For each additional person the *log-odds* of going by car increases by 0.19.

$$
\frac{p}{1-p} = e^{-2.2} (e^{0.06})^{\text{income}} + (e^{0.19})^{\text{num_people}}
$$

For each additional person the *odds* of going by car increase but the change is
not as simple as in the case of the log-odds.

### Question

For journeys with 2 people, plot the probability of going by car (as opposed to
train) as a function of household income. Determine the level of income at which
it becomes more likely they will travel by car than train.

### Answer

We can evaluate the predicted probability at a range of income levels and then
read off the plot that this changes at about 32.7 thousand, or we can just solve
for this value using the expression in the snippet.

```python
income_change_point = (2.23 - 0.189 * 2) / 0.0566
income_vals = np.linspace(10, 70, 100)

theta_int = logistic_model.params.Intercept
theta_income = logistic_model.params.income
theta_num = logistic_model.params.num_people

prob_car = inv_logit(theta_int + theta_income * income_vals + theta_num * 2)

plt.figure()
plt.plot(income_vals, prob_car)
plt.axhline(0.5, color='r')
plt.axvline(income_change_point, color='r')
plt.annotate("{x:.2f}".format(x=income_change_point), (income_change_point + 1, 0.2), color='r')
plt.show()
```

### Question

Plot the probability of going by car for 1, 2, and 3 people as a function of
income. What do you notice about the change in the probability as a function of
income?


### Answer

The change in probability is non-linear and depends upon the other variables,
but it is increasing with the number of people.

```python
income_vals = np.linspace(10, 70, 100)

theta_int = logistic_model.params.Intercept
theta_income = logistic_model.params.income
theta_num = logistic_model.params.num_people

prob_car_fn = lambda n : inv_logit(theta_int + theta_income * income_vals + theta_num * n)

plt.figure()
plt.plot(income_vals, prob_car_fn(1), color='r', label='One person')
plt.plot(income_vals, prob_car_fn(2), color='g', label='Two people')
plt.plot(income_vals, prob_car_fn(3), color='b', label='Three people')
plt.legend(loc='upper left')
plt.show()
```

## Bonus Example: Multinomial logistic regression

### Question

Set up a multinomial logistic regression model to predict the mode of transport
used based on all of the data.

### Answer

```python
all_df = travel_df[travel_df["choice"] == "yes"]
all_df = all_df[["mode", "income", "size"]]
all_df = all_df.rename(columns={"mode": "vehicle", "size": "num_people"})

all_df["vehicle_int"] = 0
all_df.loc[all_df["vehicle"] == "air","vehicle_int"] = 1
all_df.loc[all_df["vehicle"] == "bus","vehicle_int"] = 2
all_df.loc[all_df["vehicle"] == "car","vehicle_int"] = 3
all_df.loc[all_df["vehicle"] == "train","vehicle_int"] = 4
all_df.vehicle.value_counts()
```

```python
multi_logistic = smf.mnlogit(formula = "vehicle_int ~ income + num_people", data = all_df).fit()
multi_logistic.summary()
```
Loading

0 comments on commit 8552a15

Please sign in to comment.