update

GDSL-UL · Nov 21, 2024 · 046fd62
1 parent e0bd418
commit 046fd62

Showing 6 changed files with 101,408 additions and 152 deletions.
diff --git a/data/SAR/sar_sample_code.csv b/data/SAR/sar_sample_code.csv
diff --git a/data/SAR/sar_sample_label.csv b/data/SAR/sar_sample_label.csv
diff --git a/docs/labs/03.QualitativeVariable.html b/docs/labs/03.QualitativeVariable.html
diff --git a/docs/labs/04.LogisticRegression.html b/docs/labs/04.LogisticRegression.html
diff --git a/labs/03.QualitativeVariable.qmd b/labs/03.QualitativeVariable.qmd
@@ -151,14 +151,14 @@ Use the codes of Chi-test and Cramer's V to answer this question by completing T
 
 **Table 1 Person-level correlations with health status**
 
-|                |                  |                             |                              |
-|------------------|------------------|-------------------|-------------------|
-| **Covariates** |                  | **Correlation Coefficient** | **Statistical Significance** |
-|                |                  | *Cramer’s V*                | *p-value*                    |
-| *health*       | *age_group*      |                             |                              |
-| *Health*       | *highest_qual*   |                             |                              |
-| *health*       | *marital_status* |                             |                              |
-| *Health*       | *nssec*          |                             |                              |
+|  |  |  |  |
+|------------------|------------------|------------------|------------------|
+| **Covariates** |  | **Correlation Coefficient** | **Statistical Significance** |
+|  |  | *Cramer’s V* | *p-value* |
+| *health* | *age_group* |  |  |
+| *health* | *highest_qual* |  |  |
+| *health* | *marital_status* |  |  |
+| *health* | *nssec* |  |  |
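
Note on Table 1: Cramér's V can be derived from the chi-squared statistic as `sqrt(chi2 / (n * (min(rows, cols) - 1)))`. The sketch below uses a made-up 2×2 table of counts rather than the lab data, and `tab`, `chi`, and `cramers_v` are illustrative names, not objects from the lab.

```r
# Toy 2x2 contingency table (made-up counts, not the lab data)
tab <- matrix(c(30, 10, 10, 30), nrow = 2,
              dimnames = list(health = c("good", "bad"),
                              group  = c("a", "b")))

# Chi-squared test of independence, without continuity correction
chi <- chisq.test(tab, correct = FALSE)

n <- sum(tab)            # total number of observations
k <- min(dim(tab)) - 1   # min(rows, cols) - 1
cramers_v <- sqrt(unname(chi$statistic) / (n * k))

cramers_v     # 0.5 for these counts
chi$p.value   # p-value to report alongside V
```

The same two numbers (V and the p-value) are what each row of the table asks for, with the toy table replaced by a cross-tabulation of `health` and the covariate.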
 
 ## **Implementing a linear regression model with a qualitative independent variable**
 
@@ -302,19 +302,19 @@ Last but not least, the **Measure of Model Fit**. The model output suggests the
 
 Now, complete the following table.
 
-|       Region names       | Higher or lower than London | Whether the difference is statistically significant (Yes or No) |
-|:----------------:|:----------------:|:---------------------------------:|
-|      East Midlands       |                             |                                                                 |
-|     East of England      |                             |                                                                 |
-|        North East        |                             |                                                                 |
-|        North West        |                             |                                                                 |
-|        South East        |                             |                                                                 |
-|        South West        |                             |                                                                 |
-|      West Midlands       |                             |                                                                 |
-| Yorkshire and The Humber |                             |                                                                 |
-|          Wales           |                             |                                                                 |
-|         Scotland         |                             |                                                                 |
-|     Northern Ireland     |                             |                                                                 |
+| Region names | Higher or lower than London | Whether the difference is statistically significant (Yes or No) |
+|:-----------------:|:-----------------:|:---------------------------------:|
+| East Midlands |  |  |
+| East of England |  |  |
+| North East |  |  |
+| North West |  |  |
+| South East |  |  |
+| South West |  |  |
+| West Midlands |  |  |
+| Yorkshire and The Humber |  |  |
+| Wales |  |  |
+| Scotland |  |  |
+| Northern Ireland |  |  |
 
 ### **Change the baseline category**
 
@@ -494,4 +494,13 @@ the model of Wales will be: *pct_Long_term_ill* (%) *= 47.218+ (-0.834)\* 49 + 0
 
 the model of London will be: *pct_Long_term_ill* (%) *= 47.218+ (-0.834)\* 49 + 0.472 \* 23 + 1.072\*0+ 4.345\* 0 = 17.208*
 
-**Therefore, the percentage of persons with long-term illness in Wales and London be 21.533% and 17.208% separately. If you got the right answers, then congratulations you can now use regression model to make prediction.**
+You can also build a new data frame for each region and pass it to `predict()`:
+
+```{r}
+obj_London <- data.frame(pct_Males = 49, pct_No_qualifications = 23, pct_Higher_manager_prof = 11, New_region_label = "London")
+obj_Wales <- data.frame(pct_Males = 49, pct_No_qualifications = 23, pct_Higher_manager_prof = 11, New_region_label = "Wales")
+predict(model2, newdata = obj_London)
+predict(model2, newdata = obj_Wales)
+```
+
+**Therefore, the predicted percentages of persons with long-term illness in Wales and London are 21.5% and 17.2%, respectively. If you got the right answers, congratulations: you can now use a regression model to make predictions.**
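
Note on the London equation in this hunk: the hand calculation can be checked in R by multiplying each coefficient by its plugged-in value and summing. This is only a sketch that copies the numbers as they appear in the equation; `coefs`, `values`, and `pred_london` are illustrative names, not objects from the lab.

```r
# Coefficients as written in the London equation
# (intercept first, then the four slope terms in the order shown)
coefs <- c(47.218, -0.834, 0.472, 1.072, 4.345)

# Matching values: 1 for the intercept, then the plugged-in values
values <- c(1, 49, 23, 0, 0)

# Linear prediction = sum of coefficient * value
pred_london <- sum(coefs * values)
round(pred_london, 3)   # 17.208
```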
diff --git a/labs/04.LogisticRegression.qmd b/labs/04.LogisticRegression.qmd
@@ -27,34 +27,40 @@ The practical is split into two main parts. The first focuses on implementing a
 
 ## Preparing the input variables
 
-Prepare the data for implementing a logistic regression model. The data set used in this practical is the “SAR.csv”.
+Prepare the data for implementing a logistic regression model. The data sets used in this practical are "sar_sample_label.csv" and "sar_sample_code.csv". They contain the same data: one stores each value as its text label, the other as its numeric code. We first read in both, because the labels are easier to read in the data overview, and then use "sar_sample_code.csv" in the regression model, because numeric codes are easier to work with in code.
 
 ```{r, warning=FALSE}
 library(tidyverse)
+library(broom)
 ```
 
-```{r}
-#sar <- read.csv("../data/FamilyResourceSurvey/FRS16-17.csv")
-sar <- haven::read_sav("../../Week 11/SAR.sav")
+```{r,results='hide'}
+sar_label <- read_csv("../data/SAR/sar_sample_label.csv")
+sar_code <- read_csv("../data/SAR/sar_sample_code.csv")
 ```
 
-```{r}
-glimpse(sar)
+```{r,results='hide'}
+glimpse(sar_label)
+glimpse(sar_code)
 ```
 
 ```{r}
-summary(sar)
+summary(sar_label)
 ```
 
 The outcome variable is a person’s commuting distance captured by the variable “work_distance”.
 
 ```{r}
-table(sar$work_distance)
+table(sar_label$work_distance)
+```
+
+```{r}
+table(sar_code$work_distance)
 ```
 
 The variable has a variety of categories; however, we are only interested in people who reported a commuting distance. We therefore focus on the numeric codes of the variable ranging from 1 to 8.
 
-| Code for Work_distance | Cateogories                                  |
+| Code for Work_distance | Categories                                   |
 |------------------------|----------------------------------------------|
 | 1                      | Less than 2 km                               |
 | 2                      | 2 to \<5 km                                  |
@@ -72,7 +78,8 @@ There are a variety of categories in the variable, however, we are only interest
 As we are also interested in exploring whether people with different socio-economic statuses (or occupations) tend to be associated with varying probabilities of commuting over long distances, we further filter or select cases.
 
 ```{r}
-table(sar$nssec)
+table(sar_label$nssec)
+table(sar_code$nssec)
 ```
 
 Using `nssec`, we select people who reported an occupation, and delete cases with numeric codes from 9 to 12, who are *unemployed*, *full-time students*, *children* and *not classifiable*.
@@ -89,13 +96,13 @@ Using `nssec`, we select people who reported an occupation, and delete cases wit
 | 8              | Routine occupations                           |
 | 9              | Never worked or long-term unemployed          |
 | 10             | Full-time student                             |
-| 11             | Not classificable                             |
+| 11             | Not classifiable                              |
 | 12             | Child aged 0-15                               |
 
-Now, similar to next week, we use the `filter()` to prepare our dataframe today.
+Now, as we did last week, we use `filter()` to prepare our dataframe. You may already have noticed that filtering is easier with `sar_code`, because its values are numeric codes.
 
 ```{r}
-sar_df <- sar %>% filter(work_distance<=8 & nssec <=8 )
+sar_df <- sar_code %>% filter(work_distance <= 8 & nssec <= 8)
 ```
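
Note on the filter above: its effect can be sanity-checked on a small made-up data frame before running it on the full data. The sketch assumes only that both columns hold numeric codes; `toy` and `toy_df` are hypothetical names and the rows are invented, not taken from `sar_code`.

```r
library(dplyr)

# Toy stand-in for sar_code (made-up rows, not the real data)
toy <- data.frame(
  work_distance = c(1, 8, 9, 3, 12),
  nssec         = c(2, 8, 4, 10, 1)
)

# Keep rows where both codes are at most 8
toy_df <- toy %>% filter(work_distance <= 8 & nssec <= 8)

nrow(toy_df)   # 2 rows remain: (1, 2) and (8, 8)
```

Comparing `nrow()` before and after the filter on the real data is a quick way to see how many cases were dropped.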
 
 ::: {style="background-color: #FFFBCC; padding: 10px; border-radius: 5px; border: 1px solid #E1C948;"}
@@ -180,8 +187,6 @@ library(pscl)
 
 # Pseudo R-squared
 tidy(pR2(m.glm))
-
-AIC(m.glm)
 ```
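
Note on the pseudo R-squared chunk: one of the measures `pR2()` reports is McFadden's pseudo R-squared, which is one minus the ratio of the fitted model's log-likelihood to that of an intercept-only model. The sketch below computes it with base R on the built-in `mtcars` data as a stand-in, since `m.glm` is defined elsewhere in the lab; `fit`, `fit0`, and `mcfadden` are illustrative names.

```r
# Logistic regression on a built-in data set (stand-in for m.glm)
fit  <- glm(am ~ wt, data = mtcars, family = binomial)

# Intercept-only (null) model on the same outcome
fit0 <- glm(am ~ 1, data = mtcars, family = binomial)

# McFadden's pseudo R-squared: 1 - logLik(model) / logLik(null)
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(fit0))
mcfadden

# For a binary (0/1) outcome this equals 1 - deviance / null deviance
1 - fit$deviance / fit$null.deviance
```

Values closer to 1 indicate a larger improvement over the null model; the measure is not the share of variance explained, unlike R-squared in linear regression.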
 
 ### **Interpreting estimated regression coefficients**