if(!require("pscl"))
@@ -709,14 +709,14 @@
Unlike multiple linear regression, where R-squared indicates the percentage of variance in the dependent variable explained by the independent variables, R-squared is not directly applicable to a logistic regression model. Instead, we use pseudo R-squared measures, such as McFadden’s pseudo R-squared or the Cox & Snell pseudo R-squared, to provide an indication of model fit. For an individual-level dataset like the SAR, a value around 0.3 is considered indicative of a good fit.
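As a quick sanity check, McFadden’s pseudo R-squared can be computed directly from the two log-likelihoods that `pR2()` reports; the values below are taken from the model output shown in this lab:

```r
# McFadden's pseudo R-squared = 1 - llh / llhNull, computed here from
# the log-likelihoods reported by pscl::pR2(m.glm) earlier in this lab.
llh     <- -8983.928   # log-likelihood of the fitted model
llhNull <- -10220.37   # log-likelihood of the null (intercept-only) model
mcfadden <- 1 - llh / llhNull
round(mcfadden, 3)  # 0.121, matching the pR2() output
```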
-
-4.1.3 Statistical significance of regression coefficients or covariate effects
+
+4.2.2 Statistical significance of regression coefficients or covariate effects
Similar to statistical inference in a linear regression context, the p-values of the regression coefficients are used to assess their significance; for instance, by comparing p-values to the conventional significance level of 0.05:
· If the p-value of a coefficient is smaller than 0.05, the coefficient is statistically significant. In this case, you can say that the relationship between the independent variable and the outcome variable is statistically significant.
· If the p-value of a coefficient is larger than 0.05, the coefficient is statistically insignificant. In this case, you can conclude that there is no statistically significant association between the independent variable and the outcome variable.
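A minimal sketch of how these p-values can be pulled out of a fitted model programmatically. It uses R’s built-in `mtcars` data so it runs on its own; the last two lines work unchanged on `m.glm` from this lab:

```r
# Fit a small illustrative logistic model on built-in data, then flag
# which coefficients are significant at the conventional 0.05 level.
m <- glm(am ~ wt, data = mtcars, family = "binomial")
p_values <- summary(m)$coefficients[, "Pr(>|z|)"]
p_values < 0.05   # TRUE = statistically significant at the 5% level
```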
-
-4.1.4 Interpreting estimated regression coefficients
+
+4.2.3 Interpreting estimated regression coefficients
The interpretation of coefficients (B) and odds ratios (Exp(B)) for the independent variables differs from that in a linear regression setting.
Interpreting the regression coefficients.
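The arithmetic behind these odds-ratio interpretations can be sketched directly; the coefficient value below is the `sex2` (female) estimate from the model summary in this lab:

```r
# Convert a log-odds coefficient into an odds ratio and a percentage
# change in odds, using the sex2 estimate (-0.36678) from this lab's model.
b_female   <- -0.36678
odds_ratio <- exp(b_female)            # about 0.693
pct_lower  <- (1 - odds_ratio) * 100   # odds about 30.7% lower for females
round(c(odds_ratio = odds_ratio, pct_lower = pct_lower), 3)
```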
@@ -735,8 +735,8 @@
-4.1.5 Prediction using fitted regression model
+
+4.2.4 Prediction using fitted regression model
Relating to this week’s lecture, the log odds of a person commuting over a long distance are given by the coefficient estimates (B) from the model summary, not the odds ratios (Exp(B)):
Log odds of long-distance commuting = -1.673 - 0.367*sexFemale - 0.129*nssec1 - 0.388*nssec3 - 1.031*nssec4 + 1.226*nssec5 - 1.390*nssec6 - 1.439*nssec7 - 1.485*nssec8
Using R, you can create the objects you would like to predict for. Here we create three people; see whether you can interpret their gender and socio-economic classification.
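The first prediction can be verified by hand on the log-odds scale, using the coefficient estimates from the model summary (intercept plus the `nssec7` term for a male in a semi-routine occupation):

```r
# Manual prediction for person 1 (male, nssec = 7): sum the intercept
# and the nssec7 coefficient, then map log odds to a probability.
log_odds <- -1.67337 + (-1.43909)
plogis(log_odds)   # about 0.0426, matching predict(m.glm, type = "response")
```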
@@ -754,8 +754,8 @@
-4.2 Extension activities
+
+4.3 Extension activities
The extension activities are designed to help you prepare for Assignment 2. For this week, see whether you can:
Select a regression strategy and explain why a linear or logistic model is appropriate
diff --git a/docs/search.json b/docs/search.json
index c9ade58..b0270ca 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -14,7 +14,7 @@
"href": "labs/04.LogisticRegression.html#preparing-the-input-variables",
"title": "4 Lab: LogisticRegression",
"section": "",
- "text": "Code for Work_distance\nCategories\n\n\n\n\n1\nLess than 2 km\n\n\n2\n2 to <5 km\n\n\n3\n5 to <10 km\n\n\n4\n10 to <20 km\n\n\n5\n20 to <40 km\n\n\n6\n40 to <60 km\n\n\n7\n60km or more\n\n\n8\nAt home\n\n\n9\nNo fixed place\n\n\n10\nWork outside England and Wales but within UK\n\n\n11\nWork outside UK\n\n\n12\nWorks at offshore installation (within UK)\n\n\n\n\n\n\n\n\n\nCode for nssec\nCategory labels\n\n\n\n\n1\nLarge employers and higher managers\n\n\n2\nHigher professional occupations\n\n\n3\nLower managerial and professional occupations\n\n\n4\nIntermediate occupations\n\n\n5\nSmall employers and own account workers\n\n\n6\nLower supervisory and technical occupations\n\n\n7\nSemi-routine occupations\n\n\n8\nRoutine occupations\n\n\n9\nNever worked or long-term employed\n\n\n10\nFull-time student\n\n\n11\nNot classifiable\n\n\n12\nChild aged 0-15\n\n\n\n\n\n\nQ1. Summarise the frequencies of the two variables “work_distance” and “nssec” with the new data.\n\n\n\n\n\nQ2. Check the new sar_df dataframe with new column named New_work_distance by using the codes you have learnt.\n\n\n\n\n\n\n4.1.1 Implementing a logistic regression model\nThe binary dependent variable is long-distance commuting, variable name New_work_distance.\nThe independent variables are gender and socio-economic status.\nFor gender, we use male as the basline.\n\nsar_df$sex <- relevel(as.factor(sar_df$sex),ref=\"1\")\n\nFor socio-economic status, we use code 5 (Small employers and Own account workers) as the baseline category to explore whether people work as independent employers show lower probability of commuting longer than 60km compared with other occupations.\n\n#create the model\nm.glm = glm(New_work_distance~sex + nssec, \n data = sar_df, \n family= \"binomial\")\n# inspect the results\nsummary(m.glm) \n\n\nCall:\nglm(formula = New_work_distance ~ sex + nssec, family = \"binomial\", \n data = sar_df)\n\nCoefficients:\n Estimate Std. 
Error z value Pr(>|z|) \n(Intercept) -1.67337 0.05329 -31.401 < 2e-16 ***\nsex2 -0.36678 0.04196 -8.742 < 2e-16 ***\nnssec1 -0.12881 0.11306 -1.139 0.255 \nnssec3 -0.38761 0.06467 -5.994 2.05e-09 ***\nnssec4 -1.03079 0.08439 -12.214 < 2e-16 ***\nnssec5 1.22639 0.06489 18.898 < 2e-16 ***\nnssec6 -1.38992 0.10919 -12.730 < 2e-16 ***\nnssec7 -1.43909 0.09002 -15.986 < 2e-16 ***\nnssec8 -1.48534 0.09646 -15.398 < 2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 20441 on 33025 degrees of freedom\nResidual deviance: 17968 on 33017 degrees of freedom\nAIC: 17986\n\nNumber of Fisher Scoring iterations: 6\n\n\n\n# odds ratios\nexp(coef(m.glm)) \n\n(Intercept) sex2 nssec1 nssec3 nssec4 nssec5 \n 0.1876138 0.6929649 0.8791416 0.6786766 0.3567267 3.4088847 \n nssec6 nssec7 nssec8 \n 0.2490946 0.2371432 0.2264258 \n\n\n\n# confidence intervals\nexp(confint(m.glm, level = 0.95)) \n\nWaiting for profiling to be done...\n\n\n 2.5 % 97.5 %\n(Intercept) 0.1688060 0.2080319\nsex2 0.6381810 0.7522773\nnssec1 0.7017990 1.0935602\nnssec3 0.5981911 0.7708192\nnssec4 0.3020431 0.4205270\nnssec5 3.0037298 3.8739884\nnssec6 0.2002766 0.3073830\nnssec7 0.1984396 0.2824629\nnssec8 0.1869397 0.2729172\n\n\n\nQ3. 
If we want to explore whether people with occupation being “Large employers and higher managers”, “Higher professional occupations” and “Routine occupations” are associated with higher probability of commuting over long distance when comparing to people in other occupation, how will we prepare the input independent variables and what will be the specified regression model?\n\nHint: use mutate() to create a new column, set the value of “Large employers and higher managers”, “Higher professional occupations” and “Routine occupations” as original, while the rest as “Other occupations” (recall in Lab 3 what we did for assigning the regions not within “London”, “Wales”, “Scotland” and “Northern Ireland” as “Other Regions in England”). Here by using the SAR in code format, we can make this more easier by using:\n\nsar_df <- sar_df %>% mutate(New_nssec = fct_other(\n nssec,\n keep = c(\"1\", \"2\", \"8\"),\n other_level = \"0\"\n))\n\nOr by using if_else and %in% in R, we can achieve the same result. %in% is an operator used to test if elements of one vector are present in another. It returns TRUE for elements found and FALSE otherwise.\n\nsar_df <- sar_df %>% mutate(New_nssec = if_else(!nssec %in% c(1,2,8), \"0\" ,nssec))\n\nUse “Other occupations” (code: 0) as the reference category by relevel(as.factor()) and then create the regression model: glm(New_work_distance~sex + New_nssec, data = sar_df, family= \"binomial\"). Can you now run the model by yourself? 
Find the answer at the end of the practical.\n\n\n4.1.2 Model fit\nWe include the R library pscl for calculate the measures of fit.\n\nif(!require(\"pscl\"))\n install.packages(\"pscl\")\n\nLoading required package: pscl\n\n\nWarning: package 'pscl' was built under R version 4.3.3\n\n\nClasses and Methods for R originally developed in the\nPolitical Science Computational Laboratory\nDepartment of Political Science\nStanford University (2002-2015),\nby and under the direction of Simon Jackman.\nhurdle and zeroinfl functions by Achim Zeileis.\n\nlibrary(pscl)\n\nRelating back to this week’s lecture notes, what is the Pseudo R2 of the fitted logistic model (from the Model Summary table below)?\n\n# Pseudo R-squared\npR2(m.glm)\n\nfitting null model for pseudo-r2\n\n\n llh llhNull G2 McFadden r2ML \n-8.983928e+03 -1.022037e+04 2.472890e+03 1.209785e-01 7.214246e-02 \n r2CU \n 1.563288e-01 \n\n# or in better format\npR2(m.glm) %>% round(4) %>% tidy()\n\nfitting null model for pseudo-r2\n\n\n# A tibble: 6 × 2\n names x\n <chr> <dbl>\n1 llh -8984. \n2 llhNull -10220. \n3 G2 2473. \n4 McFadden 0.121 \n5 r2ML 0.0721\n6 r2CU 0.156 \n\n\n\nllh: The log-likelihood of the fitted model.\nllhNull: The log-likelihood of the null model (without predictors).\nG2: The likelihood ratio statistic, showing the model’s improvement over the null model.\nMcFadden: McFadden’s pseudo R-squared (a common measure of model fit).\nr2ML: Maximum likelihood pseudo R-squared.\nr2CU: Cox & Snell pseudo R-squared.\n\nDifferent from the multiple linear regression, whose R-squared indicates % of the variance in the dependent variables that is explained by the independent variable. In logistic regression model, R-squared is not directly applicable. Instead, we use pseudo R-squared measures, such as McFadden’s pseudo R-squared, or Cox & Snell pseudo R-squared to provide an indication of model fit. 
For the individual level dataset like SAR, value around 0.3 is considered good for well-fitting.\n\n\n4.1.3 Statistical significance of regression coefficients or covariate effects\nSimilar to the statistical inference in a linear regression model context, p-values of regression coefficients are used to assess significances of coefficients; for instance, by comparing p-values to the conventional level of significance of 0.05:\n· If the p-value of a coefficient is smaller than 0.05, the coefficient is statistically significant. In this case, you can say that the relationship between an independent variable and the outcome variable is statistically significant.\n· If the p-value of a coefficient is larger than 0.05, the coefficient is statistically insignificant. In this case, you can say or conclude that there is no statistically significant association or relationship between an independent variable and the outcome variable.\n\n\n4.1.4 Interpreting estimated regression coefficients\n\nThe interpretation of coefficients (B) and odds ratios (Exp(B)) for the independent variables differs from that in a linear regression setting.\nInterpreting the regression coefficients.\n\no For the variable sex, a negative sign and the odds ratio estimate indicate that the probability of commuting over long distances for female is 0.693 times less likely than male (the reference group), with the confidence intervals (CI) or likely range between 0.6 to 0.7, holding all other variables constant (the socio-economic classification variable). 
Put it differently, being females reduces the probability of long-distance commuting by 30.7% (1-0.693).\no For variable nssec, a positive significant and the odds ratio estimate indicate that the probability of long-distance commuting for those whose socio-economic classification as:\n\nsmall employers and own account workers (nssec=5) are 3.409 times more likely than the higher prof occupations, holding all other variables constant (the Sex variable), with a likely range (CI) of between 3.0 to 3.8.\nthe p-value of Large employers and higher managers (nssec=1) is > 0.05, so thre is no statistically significant relationship between large employers and higher managers and long-distance commuting.\nRoutine occupations (nssec=8) are 0.226 times (or 22.6%) less likely than the higher professional occupations, with the CI between 0.18 to 0.27. when other variable constant. Or, we can see being routine occupations decreases the probability of long-distance commuting by 77.4% (1-0.226).\n\n\nQ4. Interpret the regression coefficients (i.e. Exp(B)) of variables “nssec=Lower managerial and professional occupations” and “nssec=Semi-routine occupation”.\n\n\nQ5. Could you identify significant factors of commuting over long distances?\n\n\n\n4.1.5 Prediction using fitted regression model\nRelating to this week’s lecture, the log odds of the person who is will to long-distance commuting is equal to:\nLog odds of long-distance commuting = 0.188 + 0.693 * sexFemale + 0.679 * nssec3 + 0.357*nssec4 + 3.409*nssec5 + 0.249*nssec6 + 0.237*nssec7 + 0.226*nssec8\nBy using R, you can create the object you would like to predict. Here we created three person, see whether you can interpret their gender and socio-economic classification?\n\nobjs <- data.frame(sex=c(\"1\",\"2\",\"1\"),nssec=c(\"7\",\"3\",\"5\"))\n\nThen we can predict by using our model m.glm:\n\npredict(m.glm, objs,type = \"response\")\n\n 1 2 3 \n0.04259618 0.08108050 0.39007797 \n\n\nSo let us look at these three people. 
The first one, for a male who classified as Semi-routine occupation in NSSEC, the probability of he travel over 60km to work is only 4.26%. For the second one, a female who is in Lower managerial and professional occupation, the probability of long-distance commuting is 8.11%. Now you know the prediction outcomes for our last person.",
+ "text": "Code for Work_distance\nCategories\n\n\n\n\n1\nLess than 2 km\n\n\n2\n2 to <5 km\n\n\n3\n5 to <10 km\n\n\n4\n10 to <20 km\n\n\n5\n20 to <40 km\n\n\n6\n40 to <60 km\n\n\n7\n60km or more\n\n\n8\nAt home\n\n\n9\nNo fixed place\n\n\n10\nWork outside England and Wales but within UK\n\n\n11\nWork outside UK\n\n\n12\nWorks at offshore installation (within UK)\n\n\n\n\n\n\n\n\n\nCode for nssec\nCategory labels\n\n\n\n\n1\nLarge employers and higher managers\n\n\n2\nHigher professional occupations\n\n\n3\nLower managerial and professional occupations\n\n\n4\nIntermediate occupations\n\n\n5\nSmall employers and own account workers\n\n\n6\nLower supervisory and technical occupations\n\n\n7\nSemi-routine occupations\n\n\n8\nRoutine occupations\n\n\n9\nNever worked or long-term employed\n\n\n10\nFull-time student\n\n\n11\nNot classifiable\n\n\n12\nChild aged 0-15\n\n\n\n\n\n\nQ1. Summarise the frequencies of the two variables “work_distance” and “nssec” with the new data.\n\n\n\n\n\nQ2. Check the new sar_df dataframe with new column named New_work_distance by using the codes you have learnt.",
"crumbs": [
"4 Lab: LogisticRegression"
]
@@ -23,8 +23,8 @@
"objectID": "labs/04.LogisticRegression.html#extension-activities",
"href": "labs/04.LogisticRegression.html#extension-activities",
"title": "4 Lab: LogisticRegression",
- "section": "4.2 Extension activities",
- "text": "4.2 Extension activities\nThe extension activities are designed to get yourself prepared for the Assignment 2 in progress. For this week, try whether you can:\n\nSelect a regression strategy and explain why a linear or logistic model is appropriate\nPerform one or a series of regression models, including different combinations of your chosen independent variables to explain and/or predict your dependent variable\n\nAnswer for the model in Q3\nIn Q3, we we want to explore whether people with occupation being “Large employers and higher managers”, “Higher professional occupations” and “Routine occupations” are associated with higher probability of commuting over long distance when comparing to people in other occupation. So we create the variable New_nssec with 0 “Other occupations”, but still keep “1”, “2” and “8” still as original categories.\nSo we can first have a check of our new variable New_nssec:\n\ntable(sar_df$New_nssec)\n\n\n 0 1 2 8 \n24615 887 3055 4469 \n\n\nThen we set the reference categories: sex as 1 (male) and New_nssec as 0, which is “Other occupations”:\n\nsar_df$sex <- relevel(as.factor(sar_df$sex),ref=\"1\")\nsar_df$New_nssec <- relevel(as.factor(sar_df$New_nssec),ref=\"0\")\n\nNow, we build the logistic regression model and check out the outcomes:\n\nmodel_new = glm(New_work_distance~sex + New_nssec, data = sar_df, family= \"binomial\")\n\nsummary(model_new)\n\n\nCall:\nglm(formula = New_work_distance ~ sex + New_nssec, family = \"binomial\", \n data = sar_df)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -1.92955 0.02786 -69.253 < 2e-16 ***\nsex2 -0.61757 0.03936 -15.688 < 2e-16 ***\nNew_nssec1 0.19183 0.10336 1.856 0.0634 . \nNew_nssec2 0.32582 0.05678 5.738 9.58e-09 ***\nNew_nssec8 -1.14082 0.08434 -13.526 < 2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 20441 on 33025 degrees of freedom\nResidual deviance: 19868 on 33021 degrees of freedom\nAIC: 19878\n\nNumber of Fisher Scoring iterations: 6\n\n\nFor the model interpretation, we need:\n\n# odds ratios\nexp(coef(model_new)) \n\n(Intercept) sex2 New_nssec1 New_nssec2 New_nssec8 \n 0.1452137 0.5392528 1.2114691 1.3851650 0.3195562 \n\n# confidence intervals\nexp(confint(model_new, level = 0.95)) \n\nWaiting for profiling to be done...\n\n\n 2.5 % 97.5 %\n(Intercept) 0.1374508 0.1533142\nsex2 0.4991249 0.5824112\nNew_nssec1 0.9846735 1.4770819\nNew_nssec2 1.2380128 1.5467297\nNew_nssec8 0.2698515 0.3756702\n\n# model fit\npR2(model_new) %>% round(4) %>% tidy()\n\nfitting null model for pseudo-r2\n\n\nWarning: 'tidy.numeric' is deprecated.\nSee help(\"Deprecated\")\n\n\n# A tibble: 6 × 2\n names x\n <chr> <dbl>\n1 llh -9934. \n2 llhNull -10220. \n3 G2 573. \n4 McFadden 0.028 \n5 r2ML 0.0172\n6 r2CU 0.0373",
+ "section": "4.3 Extension activities",
+ "text": "4.3 Extension activities\nThe extension activities are designed to get yourself prepared for the Assignment 2 in progress. For this week, try whether you can:\n\nSelect a regression strategy and explain why a linear or logistic model is appropriate\nPerform one or a series of regression models, including different combinations of your chosen independent variables to explain and/or predict your dependent variable\n\nAnswer for the model in Q3\nIn Q3, we we want to explore whether people with occupation being “Large employers and higher managers”, “Higher professional occupations” and “Routine occupations” are associated with higher probability of commuting over long distance when comparing to people in other occupation. So we create the variable New_nssec with 0 “Other occupations”, but still keep “1”, “2” and “8” still as original categories.\nSo we can first have a check of our new variable New_nssec:\n\ntable(sar_df$New_nssec)\n\n\n 0 1 2 8 \n24615 887 3055 4469 \n\n\nThen we set the reference categories: sex as 1 (male) and New_nssec as 0, which is “Other occupations”:\n\nsar_df$sex <- relevel(as.factor(sar_df$sex),ref=\"1\")\nsar_df$New_nssec <- relevel(as.factor(sar_df$New_nssec),ref=\"0\")\n\nNow, we build the logistic regression model and check out the outcomes:\n\nmodel_new = glm(New_work_distance~sex + New_nssec, data = sar_df, family= \"binomial\")\n\nsummary(model_new)\n\n\nCall:\nglm(formula = New_work_distance ~ sex + New_nssec, family = \"binomial\", \n data = sar_df)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -1.92955 0.02786 -69.253 < 2e-16 ***\nsex2 -0.61757 0.03936 -15.688 < 2e-16 ***\nNew_nssec1 0.19183 0.10336 1.856 0.0634 . \nNew_nssec2 0.32582 0.05678 5.738 9.58e-09 ***\nNew_nssec8 -1.14082 0.08434 -13.526 < 2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 20441 on 33025 degrees of freedom\nResidual deviance: 19868 on 33021 degrees of freedom\nAIC: 19878\n\nNumber of Fisher Scoring iterations: 6\n\n\nFor the model interpretation, we need:\n\n# odds ratios\nexp(coef(model_new)) \n\n(Intercept) sex2 New_nssec1 New_nssec2 New_nssec8 \n 0.1452137 0.5392528 1.2114691 1.3851650 0.3195562 \n\n# confidence intervals\nexp(confint(model_new, level = 0.95)) \n\nWaiting for profiling to be done...\n\n\n 2.5 % 97.5 %\n(Intercept) 0.1374508 0.1533142\nsex2 0.4991249 0.5824112\nNew_nssec1 0.9846735 1.4770819\nNew_nssec2 1.2380128 1.5467297\nNew_nssec8 0.2698515 0.3756702\n\n# model fit\npR2(model_new) %>% round(4) %>% tidy()\n\nfitting null model for pseudo-r2\n\n\nWarning: 'tidy.numeric' is deprecated.\nSee help(\"Deprecated\")\n\n\n# A tibble: 6 × 2\n names x\n <chr> <dbl>\n1 llh -9934. \n2 llhNull -10220. \n3 G2 573. \n4 McFadden 0.028 \n5 r2ML 0.0172\n6 r2CU 0.0373",
"crumbs": [
"4 Lab: LogisticRegression"
]
@@ -168,5 +168,15 @@
"crumbs": [
"2 Lab: Correlation, Single, and Multiple Linear Regression"
]
+ },
+ {
+ "objectID": "labs/04.LogisticRegression.html#implementing-a-logistic-regression-model",
+ "href": "labs/04.LogisticRegression.html#implementing-a-logistic-regression-model",
+ "title": "4 Lab: LogisticRegression",
+ "section": "4.2 Implementing a logistic regression model",
+ "text": "4.2 Implementing a logistic regression model\nThe binary dependent variable is long-distance commuting, variable name New_work_distance.\nThe independent variables are gender and socio-economic status.\nFor gender, we use male as the basline.\n\nsar_df$sex <- relevel(as.factor(sar_df$sex),ref=\"1\")\n\nFor socio-economic status, we use code 5 (Small employers and Own account workers) as the baseline category to explore whether people work as independent employers show lower probability of commuting longer than 60km compared with other occupations.\n\n#create the model\nm.glm = glm(New_work_distance~sex + nssec, \n data = sar_df, \n family= \"binomial\")\n# inspect the results\nsummary(m.glm) \n\n\nCall:\nglm(formula = New_work_distance ~ sex + nssec, family = \"binomial\", \n data = sar_df)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -1.67337 0.05329 -31.401 < 2e-16 ***\nsex2 -0.36678 0.04196 -8.742 < 2e-16 ***\nnssec1 -0.12881 0.11306 -1.139 0.255 \nnssec3 -0.38761 0.06467 -5.994 2.05e-09 ***\nnssec4 -1.03079 0.08439 -12.214 < 2e-16 ***\nnssec5 1.22639 0.06489 18.898 < 2e-16 ***\nnssec6 -1.38992 0.10919 -12.730 < 2e-16 ***\nnssec7 -1.43909 0.09002 -15.986 < 2e-16 ***\nnssec8 -1.48534 0.09646 -15.398 < 2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 20441 on 33025 degrees of freedom\nResidual deviance: 17968 on 33017 degrees of freedom\nAIC: 17986\n\nNumber of Fisher Scoring iterations: 6\n\n\n\n# odds ratios\nexp(coef(m.glm)) \n\n(Intercept) sex2 nssec1 nssec3 nssec4 nssec5 \n 0.1876138 0.6929649 0.8791416 0.6786766 0.3567267 3.4088847 \n nssec6 nssec7 nssec8 \n 0.2490946 0.2371432 0.2264258 \n\n\n\n# confidence intervals\nexp(confint(m.glm, level = 0.95)) \n\nWaiting for profiling to be done...\n\n\n 2.5 % 97.5 %\n(Intercept) 0.1688060 0.2080319\nsex2 0.6381810 0.7522773\nnssec1 0.7017990 1.0935602\nnssec3 0.5981911 0.7708192\nnssec4 0.3020431 0.4205270\nnssec5 3.0037298 3.8739884\nnssec6 0.2002766 0.3073830\nnssec7 0.1984396 0.2824629\nnssec8 0.1869397 0.2729172\n\n\n\nQ3. If we want to explore whether people with occupation being “Large employers and higher managers”, “Higher professional occupations” and “Routine occupations” are associated with higher probability of commuting over long distance when comparing to people in other occupation, how will we prepare the input independent variables and what will be the specified regression model?\n\nHint: use mutate() to create a new column, set the value of “Large employers and higher managers”, “Higher professional occupations” and “Routine occupations” as original, while the rest as “Other occupations” (recall in Lab 3 what we did for assigning the regions not within “London”, “Wales”, “Scotland” and “Northern Ireland” as “Other Regions in England”). Here by using the SAR in code format, we can make this more easier by using:\n\nsar_df <- sar_df %>% mutate(New_nssec = fct_other(\n nssec,\n keep = c(\"1\", \"2\", \"8\"),\n other_level = \"0\"\n))\n\nOr by using if_else and %in% in R, we can achieve the same result. %in% is an operator used to test if elements of one vector are present in another. 
It returns TRUE for elements found and FALSE otherwise.\n\nsar_df <- sar_df %>% mutate(New_nssec = if_else(!nssec %in% c(1,2,8), \"0\" ,nssec))\n\nUse “Other occupations” (code: 0) as the reference category by relevel(as.factor()) and then create the regression model: glm(New_work_distance~sex + New_nssec, data = sar_df, family= \"binomial\"). Can you now run the model by yourself? Find the answer at the end of the practical.\n\n4.2.1 Model fit\nWe include the R library pscl for calculate the measures of fit.\n\nif(!require(\"pscl\"))\n install.packages(\"pscl\")\n\nLoading required package: pscl\n\n\nWarning: package 'pscl' was built under R version 4.3.3\n\n\nClasses and Methods for R originally developed in the\nPolitical Science Computational Laboratory\nDepartment of Political Science\nStanford University (2002-2015),\nby and under the direction of Simon Jackman.\nhurdle and zeroinfl functions by Achim Zeileis.\n\nlibrary(pscl)\n\nRelating back to this week’s lecture notes, what is the Pseudo R2 of the fitted logistic model (from the Model Summary table below)?\n\n# Pseudo R-squared\npR2(m.glm)\n\nfitting null model for pseudo-r2\n\n\n llh llhNull G2 McFadden r2ML \n-8.983928e+03 -1.022037e+04 2.472890e+03 1.209785e-01 7.214246e-02 \n r2CU \n 1.563288e-01 \n\n# or in better format\npR2(m.glm) %>% round(4) %>% tidy()\n\nfitting null model for pseudo-r2\n\n\n# A tibble: 6 × 2\n names x\n <chr> <dbl>\n1 llh -8984. \n2 llhNull -10220. \n3 G2 2473. 
\n4 McFadden 0.121 \n5 r2ML 0.0721\n6 r2CU 0.156 \n\n\n\nllh: The log-likelihood of the fitted model.\nllhNull: The log-likelihood of the null model (without predictors).\nG2: The likelihood ratio statistic, showing the model’s improvement over the null model.\nMcFadden: McFadden’s pseudo R-squared (a common measure of model fit).\nr2ML: Maximum likelihood pseudo R-squared.\nr2CU: Cox & Snell pseudo R-squared.\n\nDifferent from the multiple linear regression, whose R-squared indicates % of the variance in the dependent variables that is explained by the independent variable. In logistic regression model, R-squared is not directly applicable. Instead, we use pseudo R-squared measures, such as McFadden’s pseudo R-squared, or Cox & Snell pseudo R-squared to provide an indication of model fit. For the individual level dataset like SAR, value around 0.3 is considered good for well-fitting.\n\n\n4.2.2 Statistical significance of regression coefficients or covariate effects\nSimilar to the statistical inference in a linear regression model context, p-values of regression coefficients are used to assess significances of coefficients; for instance, by comparing p-values to the conventional level of significance of 0.05:\n· If the p-value of a coefficient is smaller than 0.05, the coefficient is statistically significant. In this case, you can say that the relationship between an independent variable and the outcome variable is statistically significant.\n· If the p-value of a coefficient is larger than 0.05, the coefficient is statistically insignificant. 
In this case, you can say or conclude that there is no statistically significant association or relationship between an independent variable and the outcome variable.\n\n\n4.2.3 Interpreting estimated regression coefficients\n\nThe interpretation of coefficients (B) and odds ratios (Exp(B)) for the independent variables differs from that in a linear regression setting.\nInterpreting the regression coefficients.\n\no For the variable sex, a negative sign and the odds ratio estimate indicate that the probability of commuting over long distances for female is 0.693 times less likely than male (the reference group), with the confidence intervals (CI) or likely range between 0.6 to 0.7, holding all other variables constant (the socio-economic classification variable). Put it differently, being females reduces the probability of long-distance commuting by 30.7% (1-0.693).\no For variable nssec, a positive significant and the odds ratio estimate indicate that the probability of long-distance commuting for those whose socio-economic classification as:\n\nsmall employers and own account workers (nssec=5) are 3.409 times more likely than the higher prof occupations, holding all other variables constant (the Sex variable), with a likely range (CI) of between 3.0 to 3.8.\nthe p-value of Large employers and higher managers (nssec=1) is > 0.05, so thre is no statistically significant relationship between large employers and higher managers and long-distance commuting.\nRoutine occupations (nssec=8) are 0.226 times (or 22.6%) less likely than the higher professional occupations, with the CI between 0.18 to 0.27. when other variable constant. Or, we can see being routine occupations decreases the probability of long-distance commuting by 77.4% (1-0.226).\n\n\nQ4. Interpret the regression coefficients (i.e. Exp(B)) of variables “nssec=Lower managerial and professional occupations” and “nssec=Semi-routine occupation”.\n\n\nQ5. 
Could you identify significant factors of commuting over long distances?\n\n\n\n4.2.4 Prediction using fitted regression model\nRelating to this week’s lecture, the log odds of the person who is will to long-distance commuting is equal to:\nLog odds of long-distance commuting = 0.188 + 0.693 * sexFemale + 0.679 * nssec3 + 0.357*nssec4 + 3.409*nssec5 + 0.249*nssec6 + 0.237*nssec7 + 0.226*nssec8\nBy using R, you can create the object you would like to predict. Here we created three person, see whether you can interpret their gender and socio-economic classification?\n\nobjs <- data.frame(sex=c(\"1\",\"2\",\"1\"),nssec=c(\"7\",\"3\",\"5\"))\n\nThen we can predict by using our model m.glm:\n\npredict(m.glm, objs,type = \"response\")\n\n 1 2 3 \n0.04259618 0.08108050 0.39007797 \n\n\nSo let us look at these three people. The first one, for a male who classified as Semi-routine occupation in NSSEC, the probability of he travel over 60km to work is only 4.26%. For the second one, a female who is in Lower managerial and professional occupation, the probability of long-distance commuting is 8.11%. Now you know the prediction outcomes for our last person.",
+ "crumbs": [
+ "4 Lab: LogisticRegression"
+ ]
}
]
\ No newline at end of file
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index d6b040e..2c889d0 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -2,7 +2,7 @@
https://gdsl-ul.github.io/stats/labs/04.LogisticRegression.html
- 2024-11-29T12:32:48.033Z
+ 2024-11-29T12:37:00.488Z
https://gdsl-ul.github.io/stats/general/assessment.html
diff --git a/labs/04.LogisticRegression.qmd b/labs/04.LogisticRegression.qmd
index 05fda1f..e00e625 100644
--- a/labs/04.LogisticRegression.qmd
+++ b/labs/04.LogisticRegression.qmd
@@ -146,7 +146,7 @@ We are interested in whether people with occupations being "Higher professional
sar_df$nssec <- relevel(as.factor(sar_df$nssec), ref = "2")
```
-### **Implementing a logistic regression model**
+## **Implementing a logistic regression model**
The binary dependent variable is long-distance commuting, variable name `New_work_distance`.