update
dorisziye committed Nov 21, 2024
1 parent e0bd418 commit 046fd62
Showing 6 changed files with 101,408 additions and 152 deletions.
50,001 changes: 50,001 additions & 0 deletions data/SAR/sar_sample_code.csv

Large diffs are not rendered by default.

50,001 changes: 50,001 additions & 0 deletions data/SAR/sar_sample_label.csv

Large diffs are not rendered by default.

296 changes: 181 additions & 115 deletions docs/labs/03.QualitativeVariable.html

Large diffs are not rendered by default.

1,174 changes: 1,174 additions & 0 deletions docs/labs/04.LogisticRegression.html

Large diffs are not rendered by default.

53 changes: 31 additions & 22 deletions labs/03.QualitativeVariable.qmd
@@ -151,14 +151,14 @@ Use the codes of Chi-test and Cramer's V to answer this question by completing T

**Table 1 Person-level correlations with health status**

| | | | |
|------------------|------------------|-------------------|-------------------|
| **Covariates** | | **Correlation Coefficient** | **Statistical Significance** |
| | | *Cramer’s V* | *p-value* |
| *health* | *age_group* | | |
| *Health* | *highest_qual* | | |
| *health* | *marital_status* | | |
| *Health* | *nssec* | | |
| | | | |
|------------------|------------------|------------------|------------------|
| **Covariates** | | **Correlation Coefficient** | **Statistical Significance** |
| | | *Cramer’s V* | *p-value* |
| *health* | *age_group* | | |
| *health* | *highest_qual* | | |
| *health* | *marital_status* | | |
| *health* | *nssec* | | |

## **Implementing a linear regression model with a qualitative independent variable**

@@ -302,19 +302,19 @@ Last but not least, the **Measure of Model Fit**. The model output suggests the

Now, complete the following table.

| Region names | Higher or lower than London | Whether the difference is statistically significant (Yes or No) |
|:----------------:|:----------------:|:---------------------------------:|
| East Midlands | | |
| East of England | | |
| North East | | |
| North West | | |
| South East | | |
| South West | | |
| West Midlands | | |
| Yorkshire and The Humber | | |
| Wales | | |
| Scotland | | |
| Northern Ireland | | |
| Region names | Higher or lower than London | Whether the difference is statistically significant (Yes or No) |
|:-----------------:|:-----------------:|:---------------------------------:|
| East Midlands | | |
| East of England | | |
| North East | | |
| North West | | |
| South East | | |
| South West | | |
| West Midlands | | |
| Yorkshire and The Humber | | |
| Wales | | |
| Scotland | | |
| Northern Ireland | | |

### **Change the baseline category**

@@ -494,4 +494,13 @@ the model of Wales will be: *pct_Long_term_ill* (%) *= 47.218+ (-0.834)\* 49 + 0

the model of London will be: *pct_Long_term_ill* (%) *= 47.218+ (-0.834)\* 49 + 0.472 \* 23 + 1.072\*0+ 4.345\* 0 = 17.208*
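
The London calculation above can be checked directly in R. This is only an arithmetic sketch using the coefficients quoted in the equation; the object names are made up for illustration and are not part of the lab.

```r
# Coefficients and inputs copied from the London equation above.
# The object names are illustrative, not from the lab.
intercept <- 47.218
london <- intercept + (-0.834) * 49 + 0.472 * 23 + 1.072 * 0 + 4.345 * 0
round(london, 3)  # 17.208
```

The two region terms are multiplied by 0 because London is the baseline category, so only the intercept and the two percentage terms contribute.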

**Therefore, the percentage of persons with long-term illness in Wales and London be 21.533% and 17.208% separately. If you got the right answers, then congratulations you can now use regression model to make prediction.**
You can also create a new data frame for each case and pass it to `predict()`:

```{r}
# Hypothetical areas with identical covariate values, differing only in region
obj_London <- data.frame(pct_Males = 49, pct_No_qualifications = 23,
                         pct_Higher_manager_prof = 11, New_region_label = "London")
obj_Wales <- data.frame(pct_Males = 49, pct_No_qualifications = 23,
                        pct_Higher_manager_prof = 11, New_region_label = "Wales")
# Predicted pct_Long_term_ill for each area
predict(model2, obj_London)
predict(model2, obj_Wales)
```
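
To see that `predict()` does the same job as plugging values into the fitted equation, here is a minimal sketch on R's built-in `mtcars` data. The model `toy_model` and the column `cyl_f` are assumptions for illustration only; they are not the lab's `model2`.

```r
# Toy data: treat the number of cylinders as a qualitative variable
toy <- transform(mtcars, cyl_f = factor(cyl))
toy_model <- lm(mpg ~ wt + cyl_f, data = toy)

# A new case: weight 3 (1000 lbs), 6 cylinders
new_car <- data.frame(wt = 3, cyl_f = "6")
by_predict <- unname(predict(toy_model, new_car))

# The same prediction by hand: the dummy for "6" is 1, the dummy for "8" is 0,
# and "4" is the baseline category absorbed into the intercept
b <- coef(toy_model)
by_hand <- b[["(Intercept)"]] + b[["wt"]] * 3 + b[["cyl_f6"]] * 1 + b[["cyl_f8"]] * 0

all.equal(by_predict, by_hand)
```

Passing the category as a character value in the new data frame works because `predict()` matches it against the factor levels stored in the fitted model.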

**Therefore, the predicted percentages of persons with long-term illness in Wales and London are 21.5% and 17.2% respectively. If you got the right answers, congratulations: you can now use a regression model to make predictions.**
35 changes: 20 additions & 15 deletions labs/04.LogisticRegression.qmd
@@ -27,34 +27,40 @@ The practical is split into two main parts. The first focuses on implementing a

## Preparing the input variables

Prepare the data for implementing a logistic regression model. The data set used in this practical is the “SAR.csv”.
Prepare the data for implementing a logistic regression model. This practical uses two files, “sar_sample_label.csv” and “sar_sample_code.csv”. They contain the same data: one stores each value as its text label, the other as its numeric code. We first read in both, because the labels are easier to read in a data overview, and then use “sar_sample_code.csv” for the regression model because numeric codes are easier to work with in code.

```{r, warning=FALSE}
library(tidyverse)
library(broom)
```

```{r}
#sar <- read.csv("../data/FamilyResourceSurvey/FRS16-17.csv")
sar <- haven::read_sav("../../Week 11/SAR.sav")
```{r,results='hide'}
sar_label <- read_csv("../data/SAR/sar_sample_label.csv")
sar_code <- read_csv("../data/SAR/sar_sample_code.csv")
```

```{r}
glimpse(sar)
```{r,results='hide'}
glimpse(sar_label)
glimpse(sar_code)
```

```{r}
summary(sar)
summary(sar_label)
```

The outcome variable is a person’s commuting distance captured by the variable “work_distance”.

```{r}
table(sar$work_distance)
table(sar_label$work_distance)
```

```{r}
table(sar_code$work_distance)
```
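
One way to see how the codes line up with the labels is to cross-tabulate the two versions of the same variable. The sketch below uses small made-up vectors rather than the SAR sample, and it assumes the rows of the two files are in the same order.

```r
# Toy stand-ins for the two files (values are illustrative, not from the SAR sample)
toy_code  <- data.frame(work_distance = c(1, 2, 2, 1, 2))
toy_label <- data.frame(work_distance = c("Less than 2 km", "2 to <5 km",
                                          "2 to <5 km", "Less than 2 km",
                                          "2 to <5 km"))

# Rows are codes, columns are labels
xtab <- table(code = toy_code$work_distance, label = toy_label$work_distance)
print(xtab)

# If codes and labels match one-to-one, each row has exactly one non-zero cell
one_to_one <- all(rowSums(xtab > 0) == 1)
one_to_one
```

With the real data, the same call would be `table(sar_code$work_distance, sar_label$work_distance)`.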

The variable has a variety of categories. However, we are only interested in commuting distance, and therefore in people who reported their commuting distance. Thus, we will explore the numeric codes of the variable ranging from 1 to 8.

| Code for Work_distance | Cateogories |
| Code for Work_distance | Categories |
|------------------------|----------------------------------------------|
| 1 | Less than 2 km |
| 2 | 2 to \<5 km |
@@ -72,7 +78,8 @@ There are a variety of categories in the variable, however, we are only interest
As we are also interested in exploring whether people with different socio-economic statuses (or occupations) tend to be associated with varying probabilities of commuting over long distances, we further filter or select cases.

```{r}
table(sar$nssec)
table(sar_label$nssec)
table(sar_code$nssec)
```

Using `nssec`, we select people who reported an occupation, and delete cases with numeric codes from 9 to 12, who are *unemployed*, *full-time students*, *children* and *not classifiable*.
@@ -89,13 +96,13 @@ Using `nssec`, we select people who reported an occupation, and delete cases wit
| 8 | Routine occupations |
| 9 | Never worked or long-term unemployed |
| 10 | Full-time student |
| 11 | Not classificable |
| 11 | Not classifiable |
| 12 | Child aged 0-15 |

Now, similar to next week, we use the `filter()` to prepare our dataframe today.
Now, as we did last week, we use `filter()` to prepare today's dataframe. You may already have realised that filtering is easier with `sar_code`.

```{r}
sar_df <- sar %>% filter(work_distance<=8 & nssec <=8 )
sar_df <- sar_code %>% filter(work_distance<=8 & nssec <=8 )
```
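
It is worth checking what such a filter keeps and drops. The sketch below applies the same condition to a small made-up data frame; `toy` and its values are assumptions for illustration. It uses base R's `subset()` so that it runs without extra packages.

```r
# Toy data: the codes follow the tables above, the rows are made up
toy <- data.frame(
  work_distance = c(1, 5, 8, 9, 3, NA),
  nssec         = c(2, 8, 4, 1, 11, 3)
)

# Keep rows where both codes are 8 or below
toy_df <- subset(toy, work_distance <= 8 & nssec <= 8)

nrow(toy_df)                 # 3 rows remain
range(toy_df$work_distance)  # all codes now between 1 and 8
range(toy_df$nssec)
```

Rows with a missing value in either variable are dropped as well, which is also how dplyr's `filter()` treats a condition that evaluates to `NA`.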

::: {style="background-color: #FFFBCC; padding: 10px; border-radius: 5px; border: 1px solid #E1C948;"}
@@ -180,8 +187,6 @@ library(pscl)
# Pseudo R-squared
tidy(pR2(m.glm))
AIC(m.glm)
```
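
To make the two fit measures less of a black box, the sketch below computes McFadden's pseudo R-squared and the AIC by hand for a toy logistic model on R's built-in `mtcars` data. The models `toy_glm` and `toy_null` are assumptions for illustration and are not the lab's `m.glm`. McFadden's measure is one of several pseudo R-squared values and should correspond to the `McFadden` entry in the `pR2()` output.

```r
# Toy logistic model and its intercept-only counterpart
toy_glm  <- glm(am ~ wt, family = binomial, data = mtcars)
toy_null <- glm(am ~ 1,  family = binomial, data = mtcars)

# McFadden's pseudo R-squared: 1 - logLik(model) / logLik(intercept-only model)
mcfadden <- 1 - as.numeric(logLik(toy_glm)) / as.numeric(logLik(toy_null))

# AIC = -2 * log-likelihood + 2 * number of estimated coefficients
aic_by_hand <- -2 * as.numeric(logLik(toy_glm)) + 2 * length(coef(toy_glm))

mcfadden
all.equal(aic_by_hand, AIC(toy_glm))
```

A value of `mcfadden` closer to 1 means the model improves more on the intercept-only model, while a lower AIC indicates a better trade-off between fit and the number of parameters.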

### **Interpreting estimated regression coefficients**
