diff --git a/docs/Exploring-the-Social-World---Quantitative-Block--Statistics.pdf b/docs/Exploring-the-Social-World---Quantitative-Block--Statistics.pdf
deleted file mode 100644
index af186f9..0000000
Binary files a/docs/Exploring-the-Social-World---Quantitative-Block--Statistics.pdf and /dev/null differ
diff --git a/docs/img/codeChunk.png b/docs/img/codeChunk.png
deleted file mode 100644
index 4199572..0000000
Binary files a/docs/img/codeChunk.png and /dev/null differ
diff --git a/docs/img/codeChunk_visual.png b/docs/img/codeChunk_visual.png
deleted file mode 100644
index 34f7e25..0000000
Binary files a/docs/img/codeChunk_visual.png and /dev/null differ
diff --git a/docs/img/distributions.png b/docs/img/distributions.png
deleted file mode 100644
index 77728d2..0000000
Binary files a/docs/img/distributions.png and /dev/null differ
diff --git a/docs/img/environ_R_1.png b/docs/img/environ_R_1.png
deleted file mode 100644
index 9c3c19d..0000000
Binary files a/docs/img/environ_R_1.png and /dev/null differ
diff --git a/docs/img/quartoHeader.png b/docs/img/quartoHeader.png
deleted file mode 100644
index 77a4650..0000000
Binary files a/docs/img/quartoHeader.png and /dev/null differ
diff --git a/docs/img/runChunk.png b/docs/img/runChunk.png
deleted file mode 100644
index 6d07305..0000000
Binary files a/docs/img/runChunk.png and /dev/null differ
diff --git a/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-37-1.png b/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-37-1.png
deleted file mode 100644
index 0e1fcee..0000000
Binary files a/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-37-1.png and /dev/null differ
diff --git a/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-7-1.png b/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-7-1.png
deleted file mode 100644
index cf8a61b..0000000
Binary files a/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-7-1.png and /dev/null differ
diff --git a/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-8-1.png b/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-8-1.png
deleted file mode 100644
index b8a5c6f..0000000
Binary files a/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-8-1.png and /dev/null differ
diff --git a/docs/robots.txt b/docs/robots.txt
deleted file mode 100644
index 20f90c3..0000000
--- a/docs/robots.txt
+++ /dev/null
@@ -1 +0,0 @@
-Sitemap: https://gdsl-ul.github.io/stats/sitemap.xml
diff --git a/docs/search.json b/docs/search.json
deleted file mode 100644
index c8da2f9..0000000
--- a/docs/search.json
+++ /dev/null
@@ -1,272 +0,0 @@
-[
- {
- "objectID": "index.html",
- "href": "index.html",
- "title": "Exploring the Social World - Quantitative Block: Statistics",
- "section": "",
- "text": "Welcome\nThis is the website for “Exploring the Social World - Quantitative Block: Statistics” (module ENVS225) at the University of Liverpool. This block of the module is designed and delivered by Dr. Gabriele Filomena and Dr. Zi Ye from the Geographic Data Science Lab at the University of Liverpool. The module seeks to provide hands-on experience and training in introductory statistics for human geographers.\nThe website is free to use and is licensed under the Attribution-NonCommercial-NoDerivatives 4.0 International. A compilation of this web course is hosted as a GitHub repository that you can access:",
- "crumbs": [
- "Welcome"
- ]
- },
- {
- "objectID": "index.html#contact",
- "href": "index.html#contact",
- "title": "Exploring the Social World - Quantitative Block: Statistics",
- "section": "Contact",
- "text": "Contact\n\nGabriele Filomena - gfilo [at] liverpool.ac.uk Lecturer in Geographic Data Science Office 1xx, Roxby Building, University of Liverpool - 74 Bedford St S, Liverpool, L69 7ZT, United Kingdom.\n\n\nZi Ye - zi.ye [at] liverpool.ac.uk Lecturer in Geographic Information Science Office 107, Roxby Building, University of Liverpool - 74 Bedford St S, Liverpool, L69 7ZT, United Kingdom.",
- "crumbs": [
- "Welcome"
- ]
- },
- {
- "objectID": "general/overview.html",
- "href": "general/overview.html",
- "title": "Overview",
- "section": "",
- "text": "Aim and Learning Objectives\nThis sub-module aims to provide training and skills on a set of basic quantitative research methods for data collection, analysis, and interpretation. You will learn how to define coherent, relevant research questions, utilise various research quantitative methods, and identify appropriate methodologies to tackle your research questions. This block serves as the foundation for the dissertation and fieldwork modules.\nBackground\nData and research are key pillars of the global economy and society today. We need rigorous approaches to collecting and analysing both the statistics that can tell us ‘how much’ and if there are observable relationships between phenomena; and the information gives us a nuanced understanding of cultural contexts and human dynamics. Quantitative skills enable us to explore and measure socio-economic activities and processes at large scales, while qualitative skills enable understanding of social, cultural, and political contexts and diverse lived experiences. Rather than being in opposition, qualitative and quantitative research can complement one another in the investigation of today’s pressing research questions.\nTo these ends, this block will help you develop your quantitative (statistical) skills, as critical tools. This course will help you understand what quantitative statistical researchers use and develop a set of research techniques that can be used in your field classes and dissertations.\nLearning objectives:",
- "crumbs": [
- "Overview"
- ]
- },
- {
- "objectID": "general/overview.html#aim-and-learning-objectives",
- "href": "general/overview.html#aim-and-learning-objectives",
- "title": "Overview",
- "section": "",
- "text": "Understand how to explore a dataset, containing a number of observations described by a set of variables.\nDemonstrate an understanding in the application and interpretation of commonly used quantitative research methods.\nDemonstrate an understanding of how to work with quantitative data to address real-world research questions.",
- "crumbs": [
- "Overview"
- ]
- },
- {
- "objectID": "general/overview.html#module-structure",
- "href": "general/overview.html#module-structure",
- "title": "Overview",
- "section": "Module Structure",
- "text": "Module Structure\nStaff: Dr Zi Ye and Dr Gabriele Filomena\nWhere and When\nQuantitative Block (Weeks 7-12):\n\nLecture: 10 am – 10.45 am Fridays\nPC Practical sessions: 11am – 1 pm, following the Lecture\n\nWeek 7: Central Teaching Hub: PC Teaching Centre BLUE+GREEN+ORANGE ZONES\nWeek 8 -12: Central Teaching Hub, PCTC\nLectures will introduce and explain the fundamentals of quantitative methods, with the opportunity to apply the method introduced in the labs later in the week.\nThe computer practical sessions, will give you the chance to use and apply quantitative methods to real-world data. These are primarily self-directed sessions, but with support on hand if you get stuck. Support and training in R will be provided through these sessions. Weekly sessions will be driven by empirical research questions.\n\n\n\n\n\n\n\n\n\nWeek\nTopic\nFormat\nStaff\n\n\n\n\n7\nIntroduction & Review\nLecture and Computer Lab Practical\nGF\n\n\n8\nSingle & Multiple Linear Regression\nLecture and Computer Lab Practical\nGF\n\n\n9\nMultiple Linear Regression with Categorical Variables\nLecture and Computer Lab Practical\nZY\n\n\n10\nLogistic Regression\nLecture and Computer Lab Practical\nZY\n\n\n11\nData Visualisation\nLecture and Computer Lab Practical\nGF\n\n\n12\nSummary and Assessment Support\nLecture and Computer Lab Practical\nZY",
- "crumbs": [
- "Overview"
- ]
- },
- {
- "objectID": "general/overview.html#software-and-data",
- "href": "general/overview.html#software-and-data",
- "title": "Overview",
- "section": "Software and Data",
- "text": "Software and Data\nFor quantitative training sessions, ensure you have installed and/or have access to RStudio. To run the analysis and reproduce the code in R, you need the following software installed on your machine:\n\nR-4.2.2\nRStudio 2022.12.0-353\n\nTo install and update:\n\nR, download the appropriate version from The Comprehensive R Archive Network (CRAN).\nRStudio, download the appropriate version from here.\n\nThis software is already installed on University Machines. But you will need it to run the analysis on your personal devices.\nData\nExample datasets could be accessed through Canvas or the GitHub Repository of the module. These include:\n\n2021 UK Census Data.\n2021 Annulation Population Survey.\n2016 Family Resource Survey.\n\nNote: The Annual Population Survey requires the completion of a form prior to its usage, as it is licensed.",
- "crumbs": [
- "Overview"
- ]
- },
- {
- "objectID": "general/assessment.html",
- "href": "general/assessment.html",
- "title": "Assessment",
- "section": "",
- "text": "Required Report Structure\nFollow this structure and include ALL these points, do not make your life harder.",
- "crumbs": [
- "Assessment"
- ]
- },
- {
- "objectID": "general/assessment.html#required-report-structure",
- "href": "general/assessment.html#required-report-structure",
- "title": "Assessment",
- "section": "",
- "text": "Introduction\n\nContext: Why is the topic relevant or worth being investigated?\nBrief discussion of existing literature.\nKnowledge gap and Aim.\nResearch questions.\n\nLiterature review\n\nMore detailed Literature review, i.e. what do we already know about this subject\nRationale for including certain predictor variables in the model.\nWhat knowledge gap remains that this article will address? (includes “not studied before in this area”). Note: there is no expectation on totally original research. The focus is on a clean, sensible, data analysis situated in existing ideas.\n\nMethodology:\n\nA brief introduction to the dataset being analysed (who collected it? When? How many responses? etc.)\nA description of the variables chosen to be analysed.\nA description of any transformation made to the original data, i.e. turning a continuous variable of income into intervals, or reducing the number of age groups from 11 to 3.\nA description and justification of the statistical techniques used in the subsequent analysis (i.e. the Multivariate regression model: Multiple or Logistic Linear Regression).\n\nResults and Discussion\n\nDescriptive statistics and summary of the variables employed.\nCorrect interpretation of correlation coefficients.\nUsage and results of an appropriate multivariate regression model.\nInterpretation of the results, including links and contrasts to existing literature.\nSelective illustrations (graphs and tables) to make your findings as clear as possible.\n\nConclusion\n\nSummary of main findings.\nLimitations of study (self-critique).\n\n\n\nHighlight any implications derived from the study.",
- "crumbs": [
- "Assessment"
- ]
- },
- {
- "objectID": "general/assessment.html#how-to-get-there",
- "href": "general/assessment.html#how-to-get-there",
- "title": "Assessment",
- "section": "How to get there?",
- "text": "How to get there?\nThe first stage is to identify ONE a relevant research question to be addressed. Based on the chosen question, you will need to identify a dependent (or outcome) variable which you want to explain, and at least two relevant independent variables that you can use to explain the chosen dependent variable. The selection of variables should be informed by the literature and empirical evidence.\nTo detail in the Methods Section: Once the variables have been chosen, you will need to describe the data and appropriate type of regression to be used for the analysis. You need to explain any transformation done to the original data source, such as reclassifying variables, or changing variables from continuous to nominal scales. You also need to briefly describe the data use: source of data, year of data collection, indicate the number of records used, state if you are using individual records or geographical units, explain if you are selecting a sample, and any relevant details. You also need to identify type of regression to be used and why.\nTo detail in the Results and Discussion Section: Firstly, you need to provide two types of analyses. First, you need to provide a descriptive analysis of the data. Here you could use tables and/or plots reporting relevant descriptive statistics, such as the mean, median and standard deviation; variable distributions using histograms; and relationships between variables using correlation matrices or scatter plots. Secondly, you need to present an estimated regression model or models and the interpretation of the estimated coefficients. You need a careful and critical analysis of the regression estimates. You should think that you intend to use your regression models to advice your boss who is expecting to make some decisions based on the information you will provide. As part of this process, you need to discuss the model assessment results for the overall model and regression coefficients. Remember to substantiate your arguments using relevant literature and evidence, and present results clearly in tables and graphs.",
- "crumbs": [
- "Assessment"
- ]
- },
- {
- "objectID": "general/assessment.html#how-to-submit",
- "href": "general/assessment.html#how-to-submit",
- "title": "Assessment",
- "section": "How to submit",
- "text": "How to submit\nYou should submit a .pdf file, that is a rendered version of a Quarto Markdown file (qmd file). This will allow you to write a research paper that also includes your working code, without the need of including the data (rendered .qmd files are executed before being converted to R).\nHow to get a PDF?\n\nInstall Quarto: Make sure you have Quarto installed. You can download it from quarto.org.\nLaTeX Installation: For PDF output, you’ll need a LaTeX distribution like TinyTeX from R, by executing this in the R console:\n\ninstall.packages(\"tinytex\")\ntinytex::install_tinytex()\n\nOpen the Quarto File: Open your .qmd file in RStudio.\nSet Output Format: In the YAML header at the top of your Quarto file, specify pdf under format:\n\n\n\n\n\n\n\n\n\n\n title: \"Your Document Title\"\n author: \"Anonymous\" # do not change\n format: pdf\n\nClick the Render button in the RStudio toolbar (next to the Knit button).",
- "crumbs": [
- "Assessment"
- ]
- },
- {
- "objectID": "general/assessment.html#how-is-it-graded",
- "href": "general/assessment.html#how-is-it-graded",
- "title": "Assessment",
- "section": "How is it graded?",
- "text": "How is it graded?\n\n\n\n\n\n\n\n\n\n\nGrade\nScore Range\nUG\nDescriptor\nAssignment Expectations\n\n\n\n\nFail\n0-34%\nFail\nInadequate\nLiterature Review: Lacks relevance and fails to justify variable choice. Evidence is irrelevant or missing, providing no support to the research question. Methods: Data is not described, and the regression model is entirely missing. No appropriate statistical method is applied. Results and Discussion: No descriptive statistics, graphs, or tables are provided. Model results and interpretation are absent. Structure and References: Report is disorganized with significant referencing and citation errors throughout.\n\n\nNarrow Fail\n35-39%\nFail\nHighly Deficient\nLiterature Review: Review is present but lacks coherence and fails to justify variable choice. Evidence is poorly aligned with the research question and mostly irrelevant. Methods: Minimal data description; the regression model is missing but some statistical methods are mentioned. Results and Discussion: Few or no descriptive statistics or visuals are present. Statistical methods are unclear or incorrectly applied. Results are vague and lack meaningful interpretation. Structure and References: Report structure is poor, with referencing errors in multiple sections.\n\n\nThird / Fail\n40-49%\nThird (UG)\nDeficient\nLiterature Review: Relevant literature is partially addressed but lacks depth, with limited justification for variable choice. Evidence is minimally aligned with the research question. Methods: A very basic data description is provided, but the selected regression model is deeply inadequate or incorrect (e.g., multiple linear regression for a categorical outcome; logistic regression for a continuous outcome). Results and Discussion: Descriptive statistics or visuals may be present but insufficient. Model results are presented with little to no interpretation. Structure and References: Report structure is present but lacks clarity, with inconsistencies in citations and citation style.\n\n\n2.2 / Pass\n50-59%\n2.2 (UG)\nAdequate\nLiterature Review: Addresses relevant literature but with limited justification of variable choices. Evidence generally supports the research question but lacks detail. Methods: Data description is present but brief; a regression model is included but applied illogically or incorrectly (e.g., multiple linear regression for a categorical outcome; logistic regression for a continuous outcome) and with little explanation. Results and Discussion: Basic descriptive statistics, graphs, or tables are presented; the regression model is applied with some inaccuracies and/or interpretation is minimal. Structure and References: Report is mostly organized, though with referencing inconsistencies.\n\n\n2.1 / Merit\n60-69%\n2.1 (UG)\nGood\nLiterature Review: Relevant literature is discussed, with some justification for variable choice. Evidence supports the research question well. Methods: Data is described with some detail, though potential data transformations are under-explored. The regression model is appropriate for the selected variable types. Results and Discussion: Descriptive statistics and visuals are provided. Model results are discussed, though interpretation lacks depth. Findings are compared to existing literature. 
Structure and References: Report is logically structured and clear, with mostly correct citations.\n\n\nFirst / Distinction\n70-79%\nFirst (UG)\nVery Good\nLiterature Review: Strong grasp of relevant literature, with well-justified variable selection. Evidence aligns well with the research question. Methods: Data is comprehensively described with consideration of relevant transformations. The regression model is appropriate and well-justified. Results and Discussion: Descriptive statistics and clear visuals support findings. Model results are accurately interpreted with strong connections to existing literature. Structure and References: Report has a coherent, professional structure with only minor referencing errors.\n\n\nHigh First / High Distinction\n80-100%\nHigh First (UG)\nExcellent to Outstanding\nLiterature Review: Critical and thorough literature review with strong, well-justified variable selection. Evidence fully supports the research question with insightful connections. Methods: Detailed data description and transformation steps are clearly articulated. Regression model is expertly applied and justified. Results and Discussion: Comprehensive descriptive statistics, graphs, and tables are provided. Model results are innovatively interpreted with strong links to existing research. Structure and References: Report is professionally structured, with flawless citations and a high standard of organization.\n\n\n\n\nIn summary:\n\nIntroduction: Should establish the topic’s relevance, present a concise literature overview, identify a knowledge gap, and outline research questions.\nLiterature Review: Requires an in-depth review of relevant studies, justification for chosen independent variables, and identification of a potential knowledge gap or unexplored area aligned with the chosen research question.\nMethods and Data: Should describe the dataset, variable transformations, and justify the regression technique. Key transformations, such as reclassifying variables, should be explained with clarity and relevance.\nResults and Discussion: Involves presenting descriptive statistics, followed by a clear regression analysis. Discussion should interpret results, compare findings with existing literature, and include meaningful tables and graphs.\nConclusion: Summarize findings, discuss limitations, and suggest future directions.\nReferencing: Requires correct and consistent citations and a well-structured reference list.\n\nEmploying a novel dataset, i.e. not employed during the practical sessions, for the assignment will be rewarded with a higher grade.",
- "crumbs": [
- "Assessment"
- ]
- },
- {
- "objectID": "labs/01.introR.html",
- "href": "labs/01.introR.html",
- "title": "1 Lab: Introduction to R for Statistics",
- "section": "",
- "text": "1.1 R?\nR is an open-source program that is commonly used in Statistics. It runs on almost every platform and is completely free and is available at www.r-project.org. Most of the cutting-edge statistical research is first available on R.\nR is a script based language, so there is no point and click interface. While the initial learning curve will be steeper, understanding how to write scripts will be valuable because it leaves a clear description of what steps you performed in your data analysis. Typically you will want to write a script in a separate file and then run individual lines. This saves you from having to retype a bunch of commands and speeds up the debugging process.",
- "crumbs": [
- "1 Lab: Introduction to R for Statistics"
- ]
- },
- {
- "objectID": "labs/01.introR.html#rstudio-basics",
- "href": "labs/01.introR.html#rstudio-basics",
- "title": "1 Lab: Introduction to R for Statistics",
- "section": "1.2 R(Studio) Basics",
- "text": "1.2 R(Studio) Basics\nWe will be running R through the program RStudio which is located at rstudio.com. When you first open up RStudio the console window gives you some information about the version of R you are running and then it gives the prompt >. This prompt is waiting for you to input a command. The prompt + tells you that the current command is spanning multiple lines. In a script file you might have typed something like this:\nfor( i in 1:5 ){\n print(i)\n}\nFinding help about a certain function is very easy. At the prompt, just type help(function.name) or ?function.name. If you don’t know the name of the function, your best bet is to go the the web page www.rseek.org which will search various R resources for your keyword(s). Another great resource is the coding question and answer site stackoverflow.\n\n1.2.1 Starting a session in RStudio\nUpon startup, RStudio will look something like this.\n\n\n\n\n\n\n\n\n\nNote: the Pane Layout and Appearance settings can be altered:\n\non Windows by clicking RStudio>Tools>Global Options>Appearance or Pane Layout\non Mac OS by clicking RStudio>Preferences>Appearance or Pane Layout.\n\nYou will also have a standard white background; but you can choose specific themes.\nSource Panel (Top-Left)\nThis is where you write, edit, and view scripts, R Markdown/Quarto documents, or R scripts. It allows:\n\nEditing Scripts: Write and edit R scripts or documents (.R, .Rmd, .qmd).\nExecuting the Code: Run lines, blocks, or the entire script directly from the editor.\n\nConsole Panel (Bottom-Left)\nThe Console is the main place to run R commands interactively. It allows:\n\nExecuting the Code: Type and run R commands directly.\nViewing outputs, warnings, and errors for immediate feedback.\nBrowsing and reusing past commands (History Tab).\nToggling between the R Console, and the Terminal (yuo don’t really need the latter).\n\nEnvironment Panel (Top-Right)\nThis panel helps track variables, functions, and the history of commands used. It contains:\n\nEnvironment Tab: Shows all current variables, datasets, and objects in your session, including their structure and values.\nHistory Tab: Provides a record of past commands. You can re-run or move commands to the console or script.\n\nFiles / Plots / Packages / Help Panel (Bottom-Right)\nThis multifunctional panel is for file navigation, plotting, managing packages, viewing help, and managing jobs. It contains:\n\nFiles Tab: Navigate, open, and manage files and directories within your project.\nPlots Tab: Displays plots generated in your session. You can export or navigate through multiple plots here.\nPackages Tab: Lists installed packages and allows you to install, load, and update packages.\nHelp Tab: Displays help documentation for R functions, packages, and other resources. You can search for documentation by typing a function or package name.\n\nImportant: Unless you are working with a script, you will be likely writing code on the console.\nAt the start of a session, it’s good practice clearing your R environment (console):\n\nrm(list = ls())\n\nIn R, we are going to be working with relative paths. With the command getwd(), you can see where your working directory is currently set.\n\ngetwd() \n\nFor ENVS225, download the material of the module an unzip it whever you like.\nThe folder structure should look like:\nstats/\n├── data/\n├── labs_img/\n└── labs/\nYou can delete other sub-folders (e.g. 
docs).\nThis should be on your personal computer or if on a local machine, I suggest using the directory M: to store the folder, it can be accessed from every computer.\nThen, in R Studio - on Windows by clicking RStudio>Tools>Global Options>General.. - on Mac OS by clicking RStudio>Preferences>Appearance or Pane Layout…\nbrowse and set the folder you just creted as your working directory.\nCheck if that has been applied.\n\ngetwd() \n\nFile paths in R work like this:\n\n\n\n\n\n\n\nFile Path\nDescription\n\n\n\n\nMyFile.csv\nLook in the working directory for MyFile.csv.\n\n\nMyFolder/MyFile.csv\nIn the working directory, there is a subdirectory called MyFolder and inside that folder is MyFile.csv.\n\n\n\nYou do not need to set your working directory if you are using an R-markdown or Quarto document and you have it saved in the right location. The pathway will start from where your document is saved.\n\n\n1.2.2 Using the console\nTry to use the console to perform a few operations. For example type in:\n\n1+1\n\n[1] 2\n\n\nSlightly more complicated:\n\nprint(\"hello world\")\n\n[1] \"hello world\"\n\n\nIf you are unsure about what a command does, use the “Help” panel in your Files pane or type ?function in the console. For example, to see how the dplyr::rename() function works, type in ?dplyr::rename. When you see the double colon syntax like in the previous command, it’s a call to a package without loading its library.\n\n\n1.2.3 R as a simple calculator\nYou can use R as a simple calculator. At the prompt, type 2+3 and hit enter. What you should see is the following\n\n# Some simple addition\n2+3\n\n[1] 5\n\n\nIn this fashion you can use R as a very capable calculator.\n\n6*8\n\n[1] 48\n\n4^3\n\n[1] 64\n\nexp(1) # exp() is the exponential function\n\n[1] 2.718282\n\n\nR has most constants and common mathematical functions you could ever want. For example, the absolute value of a number is given by abs(), and round() will round a value to the nearest integer.\n\npi # the constant 3.14159265...\n\n[1] 3.141593\n\nabs(1.77) \n\n[1] 1.77\n\n\nWhenever you call a function, there will be some arguments that are mandatory, and some that are optional and the arguments are separated by a comma. In the above statements the function abs() requires at least one argument, and that is the number you want the absolute value of.\nWhen functions require more than one argument, arguments can be specified via the order in which they are passed or by naming the arguments. So for the log() function, for example, which calculates the logarithm of a number, one can specify the arguments using the named values; the order woudn’t matter:\n\n# Demonstrating order does not matter if you specify\n# which argument is which\nlog(x=5, base=10) \n\n[1] 0.69897\n\nlog(base=10, x=5)\n\n[1] 0.69897\n\n\nWhen we don’t specify which argument is which, R will decide that x is the first argument, and base is the second.\n\n# If not specified, R will assume the second value is the base...\nlog(5, 10)\n\n[1] 0.69897\n\nlog(10, 5)\n\n[1] 1.430677\n\n\nWhen we want to specify the arguments, we can do so using the name=value notation.\n\n\n1.2.4 Variables Assignment\nWe need to be able to assign a value to a variable to be able to use it later. R does this by using an arrow <- or an equal sign =. 
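For instance (a minimal sketch; a and b are arbitrary names, and both assignment styles below behave identically):\n\na <- 3\nb = 4\na + b\n\n[1] 7\n\n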
While R supports either, for readability, I suggest people pick one assignment operator and stick with it.\nVariable names cannot start with a number, may not include spaces, and are case sensitive.\n\nvar <- 2*7.5 # create two variables\nanother_var = 5 # notice they show up in 'Environment' tab in RStudio!\nvar \n\n[1] 15\n\nvar * another_var \n\n[1] 75\n\n\nAs your analysis gets more complicated, you’ll want to save the results to a variable so that you can access the results later. If you don’t assign the result to a variable, you have no way of accessing the result.\n\n\n1.2.5 Working with Scripts\nR Scripts (.R files)\nTraditional script files look like this:\n\n# Problem 1 \n# Calculate the log of a couple of values and make a plot\n# of the log function from 0 to 3\nlog(0)\nlog(1)\nlog(2)\nx <- seq(.1,3, length=1000)\nplot(x, log(x))\n\n# Problem 2\n# Calculate the exponential function of a couple of values\n# and make a plot of the function from -2 to 2\nexp(-2)\nexp(0)\nexp(2)\nx <- seq(-2, 2, length=1000)\nplot(x, exp(x))\n\nIn RStudio you can create a new script by going to File -> New File -> R Script. This opens a new window in RStudio where you can type commands and functions as in a common text editor.\nThis looks perfectly acceptable as a way of documenting what one does, but this script file doesn’t contain the actual results of commands you ran, nor does it show you the plots. Also, anytime you want to comment on some output, it needs to be offset with the commenting character #. It would be nice to have both the commands and the results merged into one document. This is what the R Markdown file does for us.\nR Markdown (.Rmd and .qmd files)\nR Markdown is an implementation of the Markdown syntax that makes it extremely easy to write webpages or scientific documents that include code. This syntax was extended to allow users to embed R code directly into more complex documents. Perhaps the easiest way to understand the syntax is to look at an example at the RMarkdown website.\nThe R code in an R Markdown document (.rmd file extension) can be nicely separated from regular text using the three backticks (3 times `, see below) and an instruction that it is R code that needs to be evaluated. A code chunk will look like:\n\n for (i in 1:5) {print(i)}\n\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n[1] 5\n\n\nIn ENVS225: In this module we will be using .qmd files, a more flexible development of .rmd files.\nMarkdown files present several advantages compared to writing your code in the console or just using scripts. You’ll save yourself a huge amount of work by embracing Markdown files from the beginning; you will keep track of your code and your steps, be able to document and present how you did your analysis (helpful when writing the methods section of a paper), and it will make it easier to re-run an analysis after a change in the data (such as additional data values, transformed data, or removal of outliers) or once you spot an error. Finally, it makes the script more readable.\n\n\n1.2.6 R Packages\nOne of the greatest strengths of R is that so many people have developed add-on packages that provide additional functionality. To download and install a package from the Comprehensive R Archive Network (CRAN), you just need to ask RStudio to install it via the menu Tools -> Install Packages.... 
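Alternatively (a sketch; dplyr is just an example package name), you can install a package directly from the console:\n\ninstall.packages(\"dplyr\")\n\n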
Once in the Install Packages menu, you just need to give the name of the package and RStudio will download and install the package on your computer.\nOnce a package is downloaded and installed on your computer, it is available, but it is not loaded into your current R session by default. To improve overall performance, only a few packages are loaded by default, and you must explicitly load packages whenever you want to use them. You only need to load them once per session/script.\n\nlibrary(dplyr) # load the dplyr library, will be useful later",
- "crumbs": [
- "1 Lab: Introduction to R for Statistics"
- ]
- },
- {
- "objectID": "labs/01.introR.html#practice-dataset-and-dataframes",
- "href": "labs/01.introR.html#practice-dataset-and-dataframes",
- "title": "1 Lab: Introduction to R for Statistics",
- "section": "1.3 Practice: Dataset and Dataframes",
- "text": "1.3 Practice: Dataset and Dataframes\n\nFirst of all, create a new Markdown document. We use the File -> New File -> Quarto Document.. dropdown option, and a menu will appear asking you for the document title, author, and preferred output type. You can select HTML, but you will need your assignment to be submitted in PDF; more on that later.\nFollow the practical below. You can describe what you are doing in normal text. See here for how to format normal text in Markdown documents\nRemember, when you want to write code in a markdown document you have to enclose it like this:\n\n\n\n\n\n\n\n\n\n\n\nor you can insert it manually:\n\n\n\n\n\n\n\n\n\n\nWithin this module we will be working with data stored in so-called datasets. A dataset is a structured collection of data points that represent various measurements or observations, often organized in a tabular format with rows and columns. A dataset might contain information about different locations, such as neighborhoods or cities, with each row representing a place and each column detailing characteristics like population density, average income, or number of green parks. For example, a dataset could be compiled to study patterns in urban mobility, where the data includes the number of daily commuters, the distance they travel, and the mode of transport they use. Datasets provide the essential building blocks for statistical analysis; they enable exploring relationships, identifying patterns, and drawing conclusions about certatin phenomena.\nExamples of everyday datasets:\n\nPremier League Standings: Each row represents a team, with columns for points, games played, wins, draws, and losses.\nMovie Dataset: Each row represents a movie, with columns showing its title, genre, release year, director, and rating.\nWeather Dataset: Each row shows a day’s weather in a city, with columns for temperature, humidity, wind speed, and precipitation.\n\nUsually, data is organized in\n\nColumns of data representing some trait or variable that we might be interested in. In general, we might wish to investigate the relationship between variables.\nRows represent a single object on which the column traits are measured.\n\nFor example, in a grade book for recording students scores throughout the semester, their is one row for every student and columns for each assignment. A greenhouse experiment dataset will have a row for every plant and columns for treatment type and biomass.\n\n1.3.1 Datasets in R\nIn R, we want a way of storing data where it feels just as if we had an Excel Spreadsheet where each row represents an observation and each column represents some information about that observation. We will call this object a data.frame, an R represention of a data set. The easiest way to understand data frames is to create one.\n\nTask: Copy the code below in your markdown. Create a data.frame that represents an instructor’s grade book, where each row is a student, and each column represents some sort of assessment.\n\n\nGrades <- data.frame(\n Name = c('Bob','Jeff','Mary','Valerie'), \n Exam.1 = c(90, 75, 92, 85),\n Exam.2 = c(87, 71, 95, 81)\n)\n# Show the data.frame \n# View(Grades) # show the data in an Excel-like tab. Doesn't work when knitting \nGrades # show the output in the console. 
This works when knitting\n\n     Name Exam.1 Exam.2\n1     Bob     90     87\n2    Jeff     75     71\n3    Mary     92     95\n4 Valerie     85     81\n\n\nTo execute just one chunk of code, press the green arrow top-right of the chunk:\n\n\n\n\n\n\n\n\n\nR allows two different ways to access elements of the data.frame. The first is a matrix-like notation for accessing particular values.\n\n\n\nFormat\nResult\n\n\n\n\n[a,b]\nElement in row a and column b\n\n\n[a,]\nAll of row a\n\n\n[,b]\nAll of column b\n\n\n\nBecause the columns have meaning and we have given them column names, it is desirable to access an element by the name of the column as opposed to the column number.\n\nTask: Copy and Run:\n\n\nGrades[, 2] # print out all of column 2 \n\n[1] 90 75 92 85\n\nGrades$Name # The $-sign means to reference a column by its label\n\n[1] \"Bob\"     \"Jeff\"    \"Mary\"    \"Valerie\"\n\n\n\n\n1.3.2 Importing Data in R\nFrom: https://raw.githubusercontent.com/dereksonderegger/570L/master/07_DataImport.Rmd\nUsually we won’t type the data in by hand, but rather load the data from some file. Reading data from external sources is a necessary skill.\nComma Separated Values Data\nTo consider how data might be stored, we first consider the simplest file format: the comma separated values file (.csv). In this file type, each of the “cells” of data is separated by a comma. For example, the data file storing scores for three students might be as follows:\nAble, Dave, 98, 92, 94\nBowles, Jason, 85, 89, 91\nCarr, Jasmine, 81, 96, 97\nTypically, when you open up such a file on a computer with MS Excel installed, Excel will open up the file assuming it is a spreadsheet and put each element in its own cell. However, if you open the file using a more primitive program (say Notepad in Windows, TextEdit on a Mac), you’ll see the raw form of the data.\nHaving just the raw data without any sort of column header is problematic (which of the three exams was the final?). Ideally we would have column headers that store the name of the column.\nLastName, FirstName, Exam1, Exam2, FinalExam\nAble, Dave, 98, 92, 94\nBowles, Jason, 85, 89, 91\nCarr, Jasmine, 81, 96, 97\nReading (.csv) files\nTo make R read in the data arranged in this format, we need to tell R three things:\n\nWhere does the data live? Often this will be the name of a file on your computer, but the file could just as easily live on the internet (provided your computer has internet access).\nIs the first row data or is it the column names?\nWhat character separates the data? Some programs store data using tabs to distinguish between elements, some others use white space. R’s mechanism for reading in data is flexible enough to allow you to specify what the separator is.\n\nThe primary function that we’ll use to read data from a file and into R is the function read.csv(). This function has many optional arguments but the most commonly used ones are outlined in the table below.\n\n\n\n\n\n\n\n\nArgument\nDefault\nDescription\n\n\n\n\nfile\nRequired\nA character string denoting the file location.\n\n\nheader\nTRUE\nSpecifies whether the first line contains column headers.\n\n\nsep\n\",\"\nSpecifies the character that separates columns. 
For read.csv(), this is usually a comma.\n\n\nskip\n0\nThe number of lines to skip before reading data; useful for files with descriptive text before the actual data.\n\n\nna.strings\n\"NA\"\nValues that represent missing data; multiple values can be specified, e.g., c(\"NA\", \"-9999\").\n\n\nquote\n\"\nSpecifies the character used to quote character strings, typically \" or '.\n\n\nstringsAsFactors\nFALSE\nControls whether character strings are converted to factors; FALSE means they remain as character data.\n\n\nrow.names\nNULL\nAllows specifying a column as row names, or assigning NULL to use default indexing for rows.\n\n\ncolClasses\nNULL\nSpecifies the data type for each column to speed up reading for large files, e.g., c(\"character\", \"numeric\").\n\n\nencoding\n\"unknown\"\nSets the text encoding of the file, which can be useful for files with special or international characters.\n\n\n\nMost of the time you just need to specify the file.\n\nTask: Let’s read in a dataset of terrorist attacks that have taken place in the UK:\n\n\nattacks <- read.csv(file = '../data/attacksUK.csv') # where the data lives \nView(attacks)",
- "crumbs": [
- "1 Lab: Introduction to R for Statistics"
- ]
- },
- {
- "objectID": "labs/01.introR.html#practice-descriptive-statistics",
- "href": "labs/01.introR.html#practice-descriptive-statistics",
- "title": "1 Lab: Introduction to R for Statistics",
- "section": "1.4 Practice: Descriptive Statistics",
- "text": "1.4 Practice: Descriptive Statistics\n\n1.4.1 Summarizing Data\nIt is very important to be able to take a data set and produce summary statistics such as the mean and standard deviation of a column. For this sort of manipulation, we use the package dplyr. This package allows chaining together many common actions to form a particular task.\nThe foundational operations to perform on a data set are:\n\nSubsetting - Returns a with only particular columns or rows\n– select - Selecting a subset of columns by name or column number.\n– filter - Selecting a subset of rows from a data frame based on logical expressions.\n– slice - Selecting a subset of rows by row number.\narrange - Re-ordering the rows of a data frame.\nmutate - Add a new column that is some function of other columns.\nsummarise - calculate some summary statistic of a column of data. This collapses a set of rows into a single row.\n\nEach of these operations is a function in the package dplyr. These functions all have a similar calling syntax,: - The first argument is a data set;. - Subsequent arguments describe what to do with the input data frame and you can refer to the columns without using the df$column notation.\nAll of these functions will return a data set.\nThe dplyr package also includes a function that “pipes” commands together. The idea is that the %>% operator works by translating the command a %>% f(b) to the expression f(a,b). This operator works on any function f. The beauty of this comes when you have a suite of functions that takes input arguments of the same type as their output. For example if we wanted to start with x, and first apply function f(), then g(), and then h(), the usual R command would be h(g(f(x))) which is hard to read because you have to start reading at the innermost set of parentheses. Using the pipe command %>%, this sequence of operations becomes x %>% f() %>% g() %>% h(). For example:\n\nGrades # Recall the Grades data \n\n Name Exam.1 Exam.2\n1 Bob 90 87\n2 Jeff 75 71\n3 Mary 92 95\n4 Valerie 85 81\n\n# The following code takes the Grades data.frame and calculates \n# a column for the average exam score, and then sorts the data \n# according to the that average score\nGrades %>%\n mutate( Avg.Score = (Exam.1 + Exam.2) / 2 ) %>%\n arrange( Avg.Score )\n\n Name Exam.1 Exam.2 Avg.Score\n1 Jeff 75 71 73.0\n2 Valerie 85 81 83.0\n3 Bob 90 87 88.5\n4 Mary 92 95 93.5\n\n\nKeep it in mind, it is not necessary to memorise this.\nLet’s consider the summarize function to calculate the mean score for Exam.1. Notice that this takes a data frame of four rows, and summarizes it down to just one row that represents the summarized data for all four students.\n\nlibrary(dplyr) # load the library\nGrades %>%\n summarize( Exam.1.mean = mean( Exam.1 ) )\n\n Exam.1.mean\n1 85.5\n\n\nSimilarly you could calculate the standard deviation for the exam as well.\n\nGrades %>%\n summarize( Exam.1.mean = mean( Exam.1 ),\n Exam.1.sd = sd( Exam.1 ) )\n\n Exam.1.mean Exam.1.sd\n1 85.5 7.593857\n\n\n\nTask: Write the code above in your markdown file and run it. Do not to copy it this time.\n\nLet’s go back to the terrorist attacks. There are attacks perpetrated by several different groups. Each record is a single attack and contains information about who perpetrated the attack, what year, how many were killed and how many were wounded. 
You can get a glimpse of the dataframe with the function head:\n\nhead(attacks, n = 10)\n\n   nrKilled nrWound year        country                        group\n1         0       0 2005 United Kingdom   Abu Hafs al-Masri Brigades\n2         0       0 2005 United Kingdom   Abu Hafs al-Masri Brigades\n3         0       0 2005 United Kingdom   Abu Hafs al-Masri Brigades\n4         0       0 2005 United Kingdom   Abu Hafs al-Masri Brigades\n5         0       1 1982 United Kingdom Abu Nidal Organization (ANO)\n6         0       0 2014 United Kingdom                   Anarchists\n7         0       0 2014 United Kingdom                   Anarchists\n8         0       0 2014 United Kingdom                   Anarchists\n9         0       0 2014 United Kingdom                   Anarchists\n10        0       0 2014 United Kingdom                   Anarchists\n                           attack                       target\n1               Bombing/Explosion               Transportation\n2               Bombing/Explosion               Transportation\n3               Bombing/Explosion               Transportation\n4               Bombing/Explosion               Transportation\n5                   Assassination      Government (Diplomatic)\n6   Facility/Infrastructure Attack                     Business\n7   Facility/Infrastructure Attack                     Business\n8   Facility/Infrastructure Attack                     Business\n9   Facility/Infrastructure Attack Private Citizens & Property\n10  Facility/Infrastructure Attack                       Police\n                      weapon\n1  Explosives/Bombs/Dynamite\n2  Explosives/Bombs/Dynamite\n3  Explosives/Bombs/Dynamite\n4  Explosives/Bombs/Dynamite\n5                   Firearms\n6                 Incendiary\n7                 Incendiary\n8                 Incendiary\n9                 Incendiary\n10                Incendiary\n\n\nWe might want to compare different actors and see the mean and standard deviation of the number of people wounded by each group’s attacks across time. To do this, we are still going to use summarize, but we will precede that with group_by(group) to tell the subsequent dplyr functions to perform the actions separately for each group.\n\nattacks %>%\n  group_by(group) %>%\n  summarise( Mean = mean(nrWound), \n             Std.Dev = sd(nrWound))\n\nNote that inside a grouped summarise you refer to the column simply as nrWound; writing attacks$nrWound would bypass the grouping and return the same overall mean and standard deviation for every group.\n\n\nTask: Write the code above in your markdown file and run it. Try out another categorical variable instead of group (e.g. year) and nrKilled instead of nrWound.\n\nLet’s now move to another dataset to address a research question. For illustration purposes, we will use the Family Resources Survey (FRS). The FRS is an annual survey conducted by the UK government that collects detailed information about the income, living conditions, and resources of private households across the United Kingdom. 
Managed by the Department for Work and Pensions (DWP), the FRS provides data that is essential for understanding the economic and social conditions of households and informing public policy.\nConsider questions such as:\n\nHow many respondents (persons) are there in the 2016-17 FRS?\nHow many variables (population attributes) are there?\nWhat types of variables are present in the FRS?\nWhat is the most detailed geography available in the FRS?\n\n\nTask: To answer these questions, load and inspect the dataset.\n\n\n# the FRS dataset should be already loaded, otherwise\nfrs_data <- read.csv(\"../data/FamilyResourceSurvey/FRS16-17_labels.csv\") \n\n# Display basic structure \nglimpse(frs_data)\n\nRows: 44,145\nColumns: 45\n$ household <int> 6087, 6101, 6103, 6122, 6134, 6136, 6138, 6140, 6143,…\n$ family <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…\n$ person <int> 5, 3, 3, 3, 2, 4, 4, 3, 3, 4, 4, 3, 4, 2, 5, 3, 4, 3,…\n$ country <chr> \"England\", \"England\", \"England\", \"Northern Ireland\", …\n$ region <chr> \"London\", \"South East\", \"Yorks and the Humber\", \"Nort…\n$ age_group <chr> \"05-10\", \"05-10\", \"05-10\", \"05-10\", \"05-10\", \"05-10\",…\n$ sex <chr> \"Female\", \"Male\", \"Male\", \"Female\", \"Female\", \"Female…\n$ marital_status <chr> \"Single\", \"Single\", \"Single\", \"Single\", \"Single\", \"Si…\n$ ethnicity <chr> \"Mixed / multiple ethnic groups\", \"White\", \"White\", \"…\n$ hrp <chr> \"Not HRP\", \"Not HRP\", \"Not HRP\", \"Not HRP\", \"Not HRP\"…\n$ rel_to_hrp <chr> \"Son/daughter (incl. adopted)\", \"Son/daughter (incl. …\n$ lifestage <chr> \"Child (0-17)\", \"Child (0-17)\", \"Child (0-17)\", \"Chil…\n$ dependent <chr> \"Dependent\", \"Dependent\", \"Dependent\", \"Dependent\", \"…\n$ arrival_year <chr> \"UK Born\", \"UK Born\", \"UK Born\", \"UK Born\", \"UK Born\"…\n$ birth_country <chr> \"Dependent child\", \"Dependent child\", \"Dependent chil…\n$ care_hours <chr> \"0 hours per week\", \"0 hours per week\", \"0 hours per …\n$ educ_age <chr> \"Dependent child\", \"Dependent child\", \"Dependent chil…\n$ educ_type <chr> \"School (full-time)\", \"School (full-time)\", \"School (…\n$ fam_youngest <chr> \"7\", \"4\", \"0\", \"7\", \"0\", \"9\", \"10\", \"0\", \"3\", \"10\", \"…\n$ fam_toddlers <int> 0, 1, 1, 0, 2, 0, 0, 2, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,…\n$ fam_size <int> 4, 4, 4, 3, 4, 4, 3, 5, 4, 4, 3, 4, 4, 3, 5, 4, 4, 4,…\n$ happy <chr> \"Dependent child\", \"Dependent child\", \"Dependent chil…\n$ health <chr> \"Not known\", \"Not known\", \"Not known\", \"Not known\", \"…\n$ hh_accom_type <chr> \"Terraced house/bungalow\", \"Detached house/bungalow\",…\n$ hh_benefits <int> 10868, 0, 1768, 8632, 8372, 1768, 1768, 1768, 0, 0, 1…\n$ hh_composition <chr> \"Three or more adults, 1+ children\", \"One adult femal…\n$ hh_ctax_band <chr> \"Band D\", \"Band F\", \"Band A\", \"Band B\", \"Band A\", \"Ba…\n$ hh_housing_costs <chr> \"4316\", \"10296\", \"5408\", \"Northern Ireland\", \"5720\", …\n$ hh_income_gross <int> 54236, 180804, 26936, 19968, 17992, 76596, 31564, 366…\n$ hh_income_net <int> 44668, 120640, 23556, 19968, 17992, 62868, 29744, 287…\n$ hh_size <int> 5, 4, 4, 3, 4, 4, 4, 5, 4, 4, 4, 4, 4, 5, 5, 4, 4, 4,…\n$ hh_tenure <chr> \"Mortgaged (including part rent / part own)\", \"Mortga…\n$ highest_qual <chr> \"Dependent child\", \"Dependent child\", \"Dependent chil…\n$ income_gross <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…\n$ income_net <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…\n$ jobs <chr> \"Dependent 
child\", \"Dependent child\", \"Dependent chil…\n$ life_satisf <chr> \"Dependent child\", \"Dependent child\", \"Dependent chil…\n$ nssec <chr> \"Dependent child\", \"Dependent child\", \"Dependent chil…\n$ sic_chapter <chr> \"Dependent child\", \"Dependent child\", \"Dependent chil…\n$ sic_division <chr> \"Dependent child\", \"Dependent child\", \"Dependent chil…\n$ soc2010 <chr> \"Dependent child\", \"Dependent child\", \"Dependent chil…\n$ work_hours <chr> \"Dependent child\", \"Dependent child\", \"Dependent chil…\n$ workstatus <chr> \"Dependent Child\", \"Dependent Child\", \"Dependent Chil…\n$ years_ft_work <chr> \"Dependent child\", \"Dependent child\", \"Dependent chil…\n$ survey_weight <int> 2315, 1317, 2449, 427, 1017, 1753, 1363, 1344, 828, 1…\n\n\nand summary:\n\nsummary(frs_data)\n\n household family person country \n Min. : 1 Min. :1.000 Min. :1.00 Length:44145 \n 1st Qu.: 4816 1st Qu.:1.000 1st Qu.:1.00 Class :character \n Median : 9673 Median :1.000 Median :2.00 Mode :character \n Mean : 9677 Mean :1.106 Mean :1.98 \n 3rd Qu.:14553 3rd Qu.:1.000 3rd Qu.:3.00 \n Max. :19380 Max. :6.000 Max. :9.00 \n region age_group sex marital_status \n Length:44145 Length:44145 Length:44145 Length:44145 \n Class :character Class :character Class :character Class :character \n Mode :character Mode :character Mode :character Mode :character \n \n \n \n ethnicity hrp rel_to_hrp lifestage \n Length:44145 Length:44145 Length:44145 Length:44145 \n Class :character Class :character Class :character Class :character \n Mode :character Mode :character Mode :character Mode :character \n \n \n \n dependent arrival_year birth_country care_hours \n Length:44145 Length:44145 Length:44145 Length:44145 \n Class :character Class :character Class :character Class :character \n Mode :character Mode :character Mode :character Mode :character \n \n \n \n educ_age educ_type fam_youngest fam_toddlers \n Length:44145 Length:44145 Length:44145 Min. :0.0000 \n Class :character Class :character Class :character 1st Qu.:0.0000 \n Mode :character Mode :character Mode :character Median :0.0000 \n Mean :0.2557 \n 3rd Qu.:0.0000 \n Max. :4.0000 \n fam_size happy health hh_accom_type \n Min. :1.000 Length:44145 Length:44145 Length:44145 \n 1st Qu.:2.000 Class :character Class :character Class :character \n Median :2.000 Mode :character Mode :character Mode :character \n Mean :2.599 \n 3rd Qu.:4.000 \n Max. :9.000 \n hh_benefits hh_composition hh_ctax_band hh_housing_costs \n Min. : 0 Length:44145 Length:44145 Length:44145 \n 1st Qu.: 0 Class :character Class :character Class :character \n Median : 1768 Mode :character Mode :character Mode :character \n Mean : 5670 \n 3rd Qu.:10192 \n Max. :54080 \n hh_income_gross hh_income_net hh_size hh_tenure \n Min. :-326092 Min. :-334776 Min. :1.00 Length:44145 \n 1st Qu.: 22256 1st Qu.: 20748 1st Qu.:2.00 Class :character \n Median : 35984 Median : 31512 Median :3.00 Mode :character \n Mean : 46076 Mean : 37447 Mean :2.96 \n 3rd Qu.: 57252 3rd Qu.: 47008 3rd Qu.:4.00 \n Max. :1165216 Max. :1116596 Max. :9.00 \n highest_qual income_gross income_net jobs \n Length:44145 Min. :-354848 Min. :-358592 Length:44145 \n Class :character 1st Qu.: 52 1st Qu.: 0 Class :character \n Mode :character Median : 12740 Median : 12012 Mode :character \n Mean : 17305 Mean : 14204 \n 3rd Qu.: 23712 3rd Qu.: 20384 \n Max. :1127360 Max. 
:1110928  \n life_satisf            nssec           sic_chapter        sic_division      \n Length:44145       Length:44145       Length:44145       Length:44145      \n Class :character   Class :character   Class :character   Class :character  \n Mode  :character   Mode  :character   Mode  :character   Mode  :character  \n                                                                            \n                                                                            \n                                                                            \n   soc2010           work_hours         workstatus        years_ft_work     \n Length:44145       Length:44145       Length:44145       Length:44145      \n Class :character   Class :character   Class :character   Class :character  \n Mode  :character   Mode  :character   Mode  :character   Mode  :character  \n                                                                            \n                                                                            \n                                                                            \n survey_weight  \n Min.   :  221  \n 1st Qu.: 1097  \n Median : 1380  \n Mean   : 1459  \n 3rd Qu.: 1742  \n Max.   :39675  \n\n\n\n\n1.4.2 Understanding the Structure of the FRS Datafile\nIn the FRS data structure, each row represents a person, but:\n\nEach person is nested within a family.\nEach family is nested within a household.\n\nBelow is an example dataset structure:\n\n\n\n\n\n\n\n\n\n\n\n\n\nhousehold\nfamily\nperson\nregion\nage_group\nsex\nmarital_status\nrel_to_hrp\n\n\n\n\n1\n1\n1\nLondon\n40-44\nFemale\nMarried/Civil partnership\nSpouse\n\n\n1\n1\n2\nLondon\n40-44\nMale\nMarried/Civil partnership\nHousehold Representative\n\n\n1\n1\n3\nLondon\n5-10\nMale\nSingle\nSon/daughter (incl. adopted)\n\n\n1\n1\n4\nLondon\n5-10\nFemale\nSingle\nSon/daughter (incl. adopted)\n\n\n1\n1\n5\nLondon\n16-19\nMale\nSingle\nStep-son/daughter\n\n\n2\n1\n1\nScotland\n35-39\nMale\nSingle\nHousehold Representative\n\n\n3\n1\n1\nYorks and the Humber\n35-39\nFemale\nMarried/Civil partnership\nHousehold Representative\n\n\n3\n1\n2\nYorks and the Humber\n35-39\nMale\nMarried/Civil partnership\nSpouse\n\n\n3\n1\n3\nYorks and the Humber\n5-10\nMale\nSingle\nStep-son/daughter\n\n\n4\n1\n1\nWales\n0-4\nMale\nSingle\nSon/daughter (incl. adopted)\n\n\n4\n1\n2\nWales\n60-64\nMale\nMarried/Civil partnership\nHousehold Representative\n\n\n4\n1\n3\nWales\n55-59\nFemale\nMarried/Civil partnership\nSpouse\n\n\n4\n2\n3\nWales\n30-34\nFemale\nSingle\nSon/daughter (incl. adopted)\n\n\n\nThe first five people in the FRS all belong to the same household (household 1); they also all belong to the same family. This family comprises a married middle-aged couple plus their three children, one of whom is a stepson.\nThe second household (household 2) comprises only one person – a single middle-aged male. The third household comprises another married couple, this time with two children.\nSuperficially the fourth household looks similar to households 1 and 3: a married couple plus their daughter. The difference is that this particular married couple is nearing retirement age, and their daughter is middle-aged. Consequently, despite being a child of the married couple, the middle-aged daughter is treated as a separate ‘family’ (family 2 in the household). This is because the FRS (and Census) define a ‘family’ as a couple plus any ‘dependent’ children. A dependent child is defined as a child who is either aged 0-15 or aged 16-19, unmarried and in full-time education. All children aged 16-19 who are married or no longer in full-time education are regarded as ‘independent’ adults who form their own family unit, as are all children aged 20+.\nThe inclusion of all persons in a household allows us more flexibility in the types of research question we can answer. For example, we could explore how the likelihood of a woman being in paid employment (workstatus) is influenced by the age of the youngest child still living in her family, if any (fam_youngest).\nIn the FRS (and Census), a “family” is defined as a couple and any “dependent” children. 
\n\n\n1.4.3 Explore the Distribution of Your Outcome Variable\nBefore starting your analysis, it is critical to know the type of scale used to measure your outcome variable: is it categorical or continuous? Here we will start off by exploring a continuous variable, which we will then turn into a categorical variable (e.g. top earners: yes or no). We explore the income distribution in the UK by first looking at the low and high ends of the distribution, i.e. what sorts of people have high (or low) incomes?\nIn the FRS each person’s annual income is recorded, both gross (pre-tax) and net (post-tax). This income includes all income sources: earnings, profits, investment returns, state benefits, occupational pensions, etc. As it is possible to make a loss on some of these activities, it is also possible (although unusual) for someone’s gross or net annual income in a given year to be negative (representing an overall loss).\n\nTask: Load the FRS dataset into your R environment, if it’s not already loaded, and inspect the data.\n\n\n# Load the dataset (replace 'frs_data.csv' with the actual file path)\nfrs_data <- read.csv(\"../data/FamilyResourceSurvey/FRS16-17_labels.csv\") \n\nOpen the dataset in RStudio’s Data Viewer to explore its structure, including the income_gross and income_net variables.\n\n # Open the data in the RStudio Viewer\n View(frs_data)\n\nIn the Data Viewer tab, scroll horizontally to locate the income_gross and income_net columns. If columns are listed alphabetically, they will appear near other attributes that start with “income.”\nYou should notice two things:\n\nIncomes are recorded to the nearest £, NOT in income bands.\nDependent children almost all have a recorded income of £0.\n\nThis second observation highlights the somewhat loose wording of our question above (What sorts of people have high (or low) incomes?). To avoid reaching the somewhat banal conclusion that those with the lowest of all incomes are almost all children, we should re-frame the question more precisely as What sorts of people (excluding dependent children) have low incomes?\n\nTask: Determine the Scale of the Outcome Variable.\n\n\n# Summarize income variables\nsummary(frs_data$income_gross)\n\n Min. 1st Qu. Median Mean 3rd Qu. Max. \n-354848 52 12740 17305 23712 1127360 \n\n\n\n# Summarize income variables\nsummary(frs_data$income_net)\n\n Min. 1st Qu. Median Mean 3rd Qu. Max. \n-358592 0 12012 14204 20384 1110928 \n\n\n\nTask: Exclude Dependent Children.\n\nYou need to select all cases (persons) that are independent, that is, where the variable dependent is not equal to (!=) “Dependent” or, equivalently, is equal to (==) “Independent”.\n\n# Filter to include only independent persons\nfrs_independent <- frs_data %>% filter(dependent != \"Dependent\")\n\n\nTask: Create a basic histogram (a visualisation lecture is scheduled later on).\n\nThe income variables in the FRS are all scale variables, so a good starting point is to examine their distribution by looking at a histogram of income_gross.\n\nlibrary(ggplot2)\n \n ggplot(frs_independent, aes(x = income_gross)) +\n geom_histogram(binwidth = 5000, fill = \"blue\", color = \"black\") +\n labs(\n title = \"Distribution of Gross Income\",\n x = \"Gross Income (£)\",\n y = \"Frequency\"\n ) +\n xlim(0, 90000) +\n theme_minimal()\n\n\n\n\n\n\n\n\nYou should see the histogram below. It reveals that the income distribution is very skewed, with few people earning high salaries and the majority earning less than about £35,000 annually.
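\n\nFor a numerical view of the same skew – a small optional sketch using the data just filtered – you can inspect the deciles of the distribution:\n\n# Deciles of gross income among independent adults (a numerical view of the skew)\nquantile(frs_independent$income_gross, probs = seq(0.1, 0.9, by = 0.1), na.rm = TRUE)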
\n\nTask: Adopt a regrouping strategy.\n\nYou can also cross-tabulate gross (or net) income with any of the other variables in the FRS to your heart’s content – or can you?\nAgain, it is important to recall here that the income variables in the FRS are all ‘scale’ variables; in other words, they are precise measures rather than broad categories. Consequently, every single person in the FRS potentially has their own unique income value. That could make for a table c. 44,000 rows long (one row per person) if each person has their own unique value. The solution is to create a categorical version of the original income variable by assigning each person to one of a set of income categories (income bands). Having done this, cross-tabulation then becomes possible.\nBut which strategy to use? Equal intervals, percentiles or ‘ad hoc’? Here I would suggest that ‘ad hoc’ is best: all you want to do is to allocate each independent adult to one of three arbitrarily defined groups: ‘low’, ‘middle’ and ‘high’ income.\nDefine Low and High Income Thresholds\nDefine thresholds for income categories:\n\nLow-income threshold: £________\nHigh-income threshold: £_______\n\n\nTask: Create a New Variable Based on Regrouping of Original Variable.\n\nRecode income_gross into categories based on the chosen thresholds.\n\n# Define thresholds for income categories \nLOW_THRESHOLD <- 10000 # Replace with the upper limit for low income \nHIGH_THRESHOLD <- 50000 # Replace with the lower limit for high income \n\n# Define income categories based on thresholds \nfrs_independent <- frs_independent %>% \n mutate(income_category = case_when( \n income_gross <= LOW_THRESHOLD ~ \"Low\", \n income_gross >= HIGH_THRESHOLD ~ \"High\", \n TRUE ~ \"Middle\" ))\n\nThe mutate() function in R, from the dplyr package, is used to add or modify columns in a data frame. It allows you to create new variables or transform existing ones by applying calculations or conditional statements directly within the function.\nExplanation of the code\n\nfrs_independent %>%: The pipe operator %>% sends frs_independent into mutate(), allowing us to apply transformations without reassigning it repeatedly.\nmutate(): Starts the transformation process by defining new or modified columns.\nincome_category = case_when(...):\n\nThis creates a new column named income_category.\nThe case_when() function defines conditions for assigning values to this new column.\n\ncase_when():\n\ncase_when() is used here to assign categorical labels based on conditions.\nincome_gross <= LOW_THRESHOLD ~ \"Low\": If income_gross is less than or equal to LOW_THRESHOLD, income_category will be labeled “Low.”\nincome_gross >= HIGH_THRESHOLD ~ \"High\": If income_gross is greater than or equal to HIGH_THRESHOLD, income_category will be labeled “High.”\nTRUE ~ \"Middle\": Any values not meeting the previous conditions are labeled “Middle.”\n\n\n\nTask: Add some Metadata.\n\nDefine metadata for the new variable by labeling income categories.\n\n# Add metadata by converting to a factor and defining labels\n\nfrs_independent$income_category <- factor(frs_independent$income_category,\n levels = c(\"Low\", \"Middle\", \"High\"), labels = c(\"<= £10,000\", \"£10,001 - £49,999\", \">= £50,000\"))\n\n\nTask: Check your work.\n\nExamine the frequency distribution of the variable you have just created.
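\nA quick way to compare the old and new variables – a minimal sketch – is to count missing values in each; note that the catch-all TRUE ~ \"Middle\" above would silently turn any missing incomes into “Middle”, which is the first situation described below:\n\n# Count missing values in the original and the recoded variable\nsum(is.na(frs_independent$income_gross))\nsum(is.na(frs_independent$income_category))\n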
Both variables should have the same number of missing cases, unless:\n\nMissing cases in the old variable have been intentionally converted into valid cases in the new variable.\nYou forgot to allocate a new value to one of the old variable categories, in which case the new variable will have more missing cases than the old variable.\n\n\n# Frequency distribution of income categories\ntable(frs_independent$income_category)\n\n\n <= £10,000 £10,001 - £49,999 >= £50,000 \n 8584 22981 2271 \n\n\nAfter preparing the data, use cross-tabulations to compare income levels across demographic groups.\n\n# Cross-tabulate income category by age group, nationality, etc.\ntable(frs_independent$income_category, frs_independent$age_group)\n\n \n 16-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64\n <= £10,000 373 680 492 558 474 511 554 652 781 826\n £10,001 - £49,999 263 1241 1802 2056 2052 1948 1995 1967 1749 1772\n >= £50,000 1 8 59 186 314 331 334 356 237 177\n \n 65-69 70-74 75+\n <= £10,000 773 744 1166\n £10,001 - £49,999 2073 1554 2509\n >= £50,000 144 56 68\n\n\nExplore income distribution across different regions.\n\n# Cross-tabulate income category by region\ntable(frs_independent$income_category, frs_independent$region) \n\n \n East Midlands East of England London North East North West\n <= £10,000 562 665 740 357 878\n £10,001 - £49,999 1550 1855 1850 979 2347\n >= £50,000 135 245 367 48 174\n \n Northern Ireland Scotland South East South West Wales\n <= £10,000 874 1212 895 588 399\n £10,001 - £49,999 2305 3234 2563 1707 971\n >= £50,000 123 322 367 149 63\n \n West Midlands Yorks and the Humber\n <= £10,000 744 670\n £10,001 - £49,999 1892 1728\n >= £50,000 164 114\n\n\nTips for Cross-Tabulation\n\nPlace the income variable in the columns.\nAdd multiple variables in the rows to create simultaneous cross-tabulations.",
- "crumbs": [
- "1 Lab: Introduction to R for Statistics"
- ]
- },
- {
- "objectID": "labs/02.MultipleLinear.html",
- "href": "labs/02.MultipleLinear.html",
- "title": "2 Lab: Correlation, Single, and Multiple Linear Regression",
- "section": "",
- "text": "2.1 Part I. Correlation",
- "crumbs": [
- "2 Lab: Correlation, Single, and Multiple Linear Regression"
- ]
- },
- {
- "objectID": "labs/02.MultipleLinear.html#part-i.-correlation",
- "href": "labs/02.MultipleLinear.html#part-i.-correlation",
- "title": "2 Lab: Correlation, Single, and Multiple Linear Regression",
- "section": "",
- "text": "2.1.1 Data Overview: Descriptive Statistics\nLet’s start by picking one dataset derived from the England and Wales 2021 Census data. You can choose one dataset that aggregates data either at a) county, b) district, or c) ward level. Lower Tier Local Authority-, Region-, and Country-level data is also available in the data folder.\nSee also: https://canvas.liverpool.ac.uk/courses/77895/pages/census-data-2021\n\n# Load necessary libraries \nlibrary(ggplot2) \n\nWarning: package 'ggplot2' was built under R version 4.3.2\n\nlibrary(dplyr) \n\nWarning: package 'dplyr' was built under R version 4.3.2\n\n\n\nAttaching package: 'dplyr'\n\n\nThe following objects are masked from 'package:stats':\n\n filter, lag\n\n\nThe following objects are masked from 'package:base':\n\n intersect, setdiff, setequal, union\n\noptions(scipen = 999, digits = 4) # Avoid scientific notation and round to 4 decimals globally\n\n# load data\ncensus <- read.csv(\"../data/Census2021/EW_DistrictPercentages.csv\") # District level\n\nWe’re using a (district/ward/etc.)-level census dataset that includes:\n\n% of population with poor health (variable name: pct_Very_bad_health).\n% of population with no qualifications (pct_No_qualifications).\n% of male population (pct_Males).\n% of population in a higher managerial/professional occupation (pct_Higher_manager_prof).\n\nFirst, let’s get some descriptive statistics that help identify general trends and distributions in the data.\n\n# Summary statistics\nsummary_data <- census %>%\n select(pct_Very_bad_health, pct_No_qualifications, pct_Males, pct_Higher_manager_prof) %>%\n summarise_all(list(mean = mean, sd = sd))\nsummary_data\n\n pct_Very_bad_health_mean pct_No_qualifications_mean pct_Males_mean\n1 1.173 17.9 48.97\n pct_Higher_manager_prof_mean pct_Very_bad_health_sd pct_No_qualifications_sd\n1 13.22 0.3402 3.959\n pct_Males_sd pct_Higher_manager_prof_sd\n1 0.6603 4.73\n\n\nQ1. Complete the table below by specifying each variable type (continuous or categorical) and reporting its mean and standard deviation.\n\n\n\n\n\n\n\n\n\nVariable Name\nType (Continuous or Categorical)\nMean\nStandard Deviation\n\n\n\n\npct_Very_bad_health\n\n\n\n\n\npct_No_qualifications\n\n\n\n\n\npct_Males\n\n\n\n\n\npct_Higher_manager_prof\n\n\n\n\n
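To help fill in the table, the same statistics can also be computed one variable at a time in base R – a small optional sketch, equivalent to the dplyr code above:\n\n# Optional: the same summary statistics, one row per variable, in base R\nvars <- c(\"pct_Very_bad_health\", \"pct_No_qualifications\", \"pct_Males\", \"pct_Higher_manager_prof\")\ndata.frame(mean = sapply(census[vars], mean), sd = sapply(census[vars], sd))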
\n\n\n2.1.2 Simple visualisation for continuous data\nYou can visualise the relationship between two continuous variables using a scatter plot. Using the chosen census dataset, visualise the association between the % of population with bad health (pct_Very_bad_health) and each of the following:\n\nthe % of population with no qualifications (pct_No_qualifications);\nthe % of population aged 65 to 84 (pct_Age_65_to_84);\nthe % of population in a married couple (pct_Married_couple);\nthe % of population in a Higher Managerial or Professional occupation (pct_Higher_manager_prof).\n\n\n# Variables to plot against pct_Very_bad_health \nvariables <- c(\"pct_No_qualifications\", \"pct_Age_65_to_84\", \"pct_Married_couple\", \"pct_Higher_manager_prof\")\n\n# Loop to create a scatterplot for each variable \n# (print() is needed for a ggplot created inside a loop to be displayed)\nfor (var in variables) { \n print(\n ggplot(census, aes_string(x = var, y = \"pct_Very_bad_health\")) +\n geom_point() + \n labs(title = paste(\"Scatterplot of pct_Very_bad_health vs\", var), \n x = var, y = \"pct_Very_bad_health\") +\n theme_minimal()\n )\n}\n\nWarning: `aes_string()` was deprecated in ggplot2 3.0.0.\nℹ Please use tidy evaluation idioms with `aes()`.\nℹ See also `vignette(\"ggplot2-in-packages\")` for more information.\n\n\nQ2. Which of the associations do you think is strongest, and which one is the weakest?\nAs noted before, an observed association between two variables is no guarantee of causation. It could be that the observed association is:\n\nsimply a chance one due to sampling uncertainty;\ncaused by some third underlying variable which explains the spatial variation of both of the variables in the scatterplot;\ndue to the inherent arbitrariness of the boundaries used to define the areas being analysed (the ‘Modifiable Area Unit Problem’).\n\nQ3. Setting these caveats to one side, are the associations observed in the scatter-plots suggestive of any causative mechanisms of bad health?\nRather than relying upon an impressionistic view of the strength of the association between two variables, we can measure that association by calculating the relevant correlation coefficient. The table below identifies the statistically appropriate measure of correlation to use between two continuous variables.\n\n\n\n\n\n\n\n\nVariable Data Type\nMeasure of Correlation\nRange\n\n\n\n\nBoth symmetrically distributed\nPearson’s\n-1 to +1\n\n\nOne or both with a skewed distribution\nSpearman’s Rank\n-1 to +1\n\n\n\nDifferent Calculation Methods: Pearson’s correlation assumes linear relationships and is suitable for symmetrically distributed (normally distributed) variables, measuring the strength of the linear relationship. Spearman’s rank correlation, however, works on ranked data, so it’s more suitable for skewed data or variables with non-linear relationships, measuring the strength and direction of a monotonic relationship.\nWhen calculating correlation for a single pair of variables, select the method that best fits their data distribution:\n- Use **Pearson’s** if both variables are symmetrically distributed.\n- Use **Spearman’s** if one or both variables are skewed.\n\n\n\n\n\n\n\n\n\nYou can check the distribution of a variable (e.g. 
pct_No_qualifications) like this:\n\n# Plot histogram with density overlay for a chosen variable (e.g., 'pct_No_qualifications')\nggplot(census, aes(x = pct_No_qualifications)) + \n geom_histogram(aes(y = after_stat(density)), bins = 30, color = \"black\", fill = \"skyblue\", alpha = 0.7) +\n geom_density(color = \"darkblue\", linewidth = 1) +\n labs(title = \"Distribution of pct_No_qualifications\", x = \"Value\", y = \"Density\") +\n theme_minimal()\n\n\n\n\n\n\n\n\nWhen analysing multiple pairs of variables, using different measures (Pearson for some pairs, Spearman for others) creates inconsistencies, since Pearson and Spearman values aren’t directly comparable in size due to their different calculation methods. To maintain consistency across comparisons, calculate both Pearson’s and Spearman’s correlations for each pair and check whether the trends align (both showing strong, weak, or moderate correlation in the same direction). This consistency check can give confidence that the relationships observed are not dependent on the correlation method chosen. While in a report you’d typically include only one set of correlations (usually Pearson’s if the relationships appear linear), calculating both can validate that your observations aren’t an artifact of the correlation method.\n\nResearch Question 1: Which of our selected variables are most strongly correlated with % of population with bad health?\n\nTo answer this question, complete the table below by editing/running this code.\nPearson correlations:\n\npearson_correlation <- cor(census$pct_Very_bad_health,\n census$pct_No_qualifications, use = \"complete.obs\", method = \"pearson\")\n \n# Display the results\ncat(\"Pearson Correlation:\", pearson_correlation, \"\\n\")\n\nPearson Correlation: 0.7621 \n\n\nSpearman correlations:\n\nspearman_correlation <- cor(census$pct_Very_bad_health,\n census$pct_No_qualifications, use = \"complete.obs\", method = \"spearman\")\n\ncat(\"Spearman Correlation:\", spearman_correlation, \"\\n\")\n\nSpearman Correlation: 0.7785 \n\n\n\n\n\nCovariates\nPearson\nSpearman\n\n\n\n\npct_Very_bad_health - pct_No_qualifications\n\n\n\n\npct_Very_bad_health - pct_Age_65_to_84\n\n\n\n\npct_Very_bad_health - pct_Married_couple\n\n\n\n\npct_Very_bad_health - pct_Higher_manager_prof\n\n\n\n\n\nWhat can you make of these numbers?\nIf you think you have found a correlation between two variables in our dataset, this doesn’t mean that an association exists between these two variables in the population at large. The uncertainty arises because, by chance, the random sample included in our dataset might not be fully representative of the wider population.\nFor this reason, we need to verify whether the correlation is statistically significant:\n\n# significance test for Pearson, for example\npearson_test <- cor.test(census$pct_Very_bad_health,\n census$pct_No_qualifications, method = \"pearson\", use = \"complete.obs\")\npearson_test\n\n\n Pearson's product-moment correlation\n\ndata: census$pct_Very_bad_health and census$pct_No_qualifications\nt = 21, df = 329, p-value <0.0000000000000002\nalternative hypothesis: true correlation is not equal to 0\n95 percent confidence interval:\n 0.7128 0.8038\nsample estimates:\n cor \n0.7621 
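\n\ncor.test() returns a list, so – a small optional sketch – you can extract just the pieces you need for your table:\n\n# Pull individual components out of the test object\npearson_test$estimate # the correlation coefficient alone\npearson_test$p.value # the p-value alone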
\n\nLook at https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/cor.test for details about the function. But in general, when calculating the correlation between two variables, a p-value accompanies the correlation coefficient to indicate the statistical significance of the observed association. This p-value tests the null hypothesis that there is no association between the two variables (i.e., that the correlation is zero).\nWhen interpreting p-values, certain thresholds denote different levels of confidence. A p-value less than 0.05 is generally considered statistically significant at the 95% confidence level, suggesting that we can be 95% confident there is an association between the variables in the broader population. When the p-value is below 0.01, the result is significant at the 99% confidence level, meaning we have even greater confidence (99%) that an association exists. Sometimes, in research papers or tables, significance levels are denoted with asterisks: one asterisk (*) typically indicates significance at the 95% level (p < 0.05), two asterisks (**) significance at the 99% level (p < 0.01), and three asterisks (***) significance at the 99.9% level (p < 0.001).\nTypically, p-values are reported under labels such as “Sig (2-tailed),” where “2-tailed” refers to the fact that the test considers both directions (positive and negative correlations). Reporting the exact p-value (e.g., p = 0.002) is more informative than using thresholds alone, as it gives a clearer picture of how strongly the data contradicts the null hypothesis of no association.\nIn a nutshell, lower p-values suggest a stronger statistical basis for believing that an observed correlation is not due to random chance. A statistically significant p-value reinforces confidence that an association is likely to exist in the wider population, though it does not imply causation.\n\n\n2.1.3 Part 2: Implementing a Linear Regression Model\nA key goal of data analysis is to explore the potential factors of health at the local district level. So far, we have used cross-tabulations and various bivariate correlation analysis methods to explore the relationships between variables. One key limitation of standard correlation analysis is that it remains hard to look at the associations of an outcome/dependent variable with multiple independent/explanatory variables at the same time. Regression analysis provides a very useful and flexible methodological framework for such a purpose. 
Therefore, we will investigate how various local factors impact residents’ health by building a multiple linear regression model in R.\nWe use pct_Very_bad_health as a proxy for residents’ health.\n\nResearch Question 2: How do local factors affect residents’ health?\n\nDependent (or Response) Variable:\n\n% of population with bad health (pct_Very_bad_health).\n\nIndependent (or Explanatory) Variables:\n\n% of population with no qualifications (pct_No_qualifications).\n% of male population (pct_Males).\n% of population in a higher managerial/professional occupation (pct_Higher_manager_prof).\n\nLoad some other libraries:\n\nlibrary(tidyverse)\n\nWarning: package 'tidyverse' was built under R version 4.3.2\n\n\nWarning: package 'tibble' was built under R version 4.3.2\n\n\nWarning: package 'tidyr' was built under R version 4.3.2\n\n\nWarning: package 'readr' was built under R version 4.3.2\n\n\nWarning: package 'purrr' was built under R version 4.3.2\n\n\nWarning: package 'stringr' was built under R version 4.3.2\n\n\nWarning: package 'forcats' was built under R version 4.3.2\n\n\nWarning: package 'lubridate' was built under R version 4.3.2\n\n\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ lubridate 1.9.3 ✔ tibble 3.2.1\n✔ purrr 1.0.2 ✔ tidyr 1.3.1\n✔ readr 2.1.5 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n\nlibrary(broom)\n\nWarning: package 'broom' was built under R version 4.3.2\n\n\nand the data (if not loaded):\n\n# Load dataset\ncensus <- read.csv(\"../data/Census2021/EW_DistrictPercentages.csv\")\n\nRegression models are the standard method for constructing predictive and explanatory models. They tell us how changes in the outcome variable (the target or dependent variable, \(Y\)) are associated with changes in the explanatory variables, or independent variables, \(X_1, X_2, X_3\) (\(X_n\)), etc. Classic linear regression is referred to as Ordinary Least Squares (OLS) regression because it estimates the relationship between one or more independent variables and a dependent variable \(Y\) using a hyperplane (i.e. a multi-dimensional line) that minimises the sum of the squared differences between the observed values of \(Y\) and the values predicted by the model (denoted as \(\hat{Y}\), \(Y\)-hat).
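 In symbols, OLS chooses the coefficients \(β_0, β_1, \dots, β_n\) that minimise the residual sum of squares \(\sum_i (Y_i - \hat{Y}_i)^2\), where \(\hat{Y}_i = β_0 + β_1 X_{1i} + β_2 X_{2i} + \dots + β_n X_{ni}\).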
\nHaving seen Single Linear Regression in class – where the relationship between one independent variable and a dependent variable is modeled – we can extend this concept to situations where more than one explanatory variable might influence the outcome. While single linear regression helps us understand the effect of ONE variable in isolation, real-world phenomena are often influenced by multiple factors simultaneously. Multiple linear regression addresses this complexity by allowing us to model the relationship between a dependent variable and multiple independent variables, providing a more comprehensive view of how various explanatory variables contribute to changes in the outcome.\nHere, regression allows us to examine the relationship between people’s health rates and multiple independent variables.\nBefore starting, we define two hypotheses:\n\nNull hypothesis (\(H_0\)): For each variable \(X_n\), there is no effect of \(X_n\) on \(Y\).\nAlternative hypothesis (\(H_1\)): There is an effect of \(X_n\) on \(Y\).\n\nWe will test if we can reject the null hypothesis.\n\n\n2.1.4 Model fit\n\n# Linear regression model\nmodel <- lm(pct_Very_bad_health ~ pct_No_qualifications + pct_Males + pct_Higher_manager_prof, data = census)\nsummary(model)\n\n\nCall:\nlm(formula = pct_Very_bad_health ~ pct_No_qualifications + pct_Males + \n pct_Higher_manager_prof, data = census)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.4903 -0.1369 -0.0352 0.0983 0.7658 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 4.01799 0.88004 4.57 0.0000071 ***\npct_No_qualifications 0.05296 0.00591 8.96 < 0.0000000000000002 ***\npct_Males -0.07392 0.01785 -4.14 0.0000440 ***\npct_Higher_manager_prof -0.01309 0.00494 -2.65 0.0084 ** \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.213 on 327 degrees of freedom\nMultiple R-squared: 0.61, Adjusted R-squared: 0.607 \nF-statistic: 171 on 3 and 327 DF, p-value: <0.0000000000000002\n\n\nCode explanation\nlm() Function:\n\nlm() stands for “linear model” and is used to fit a linear regression model in R.\nThe formula syntax pct_Very_bad_health ~ pct_No_qualifications + pct_Males + pct_Higher_manager_prof specifies a relationship between:\n\nDependent Variable: pct_Very_bad_health.\nIndependent Variables: pct_No_qualifications, pct_Males, and pct_Higher_manager_prof. 
The model is fitted to the census dataset (data = census).\n\n\nStoring the Model: The model <- syntax stores the fitted model in an object called model.\nsummary(model) provides a detailed output of the model’s results, including:\n\nCoefficients: Estimates of the regression slopes (i.e., how each independent variable affects pct_Very_bad_health).\nStandard Errors: The variability of each coefficient estimate.\nt-values and p-values: Indicate the statistical significance of the effect of each independent (explanatory) variable.\n\nR-squared and Adjusted R-squared: Show how well the independent variables explain the variance in the dependent variable.\nF-statistic: Tests the overall significance of the model.\n\nWe can focus only on certain output metrics:\n\n# Regression coefficients\ncoefficients <- tidy(model)\ncoefficients\n\n# A tibble: 4 × 5\n term estimate std.error statistic p.value\n <chr> <dbl> <dbl> <dbl> <dbl>\n1 (Intercept) 4.02 0.880 4.57 7.06e- 6\n2 pct_No_qualifications 0.0530 0.00591 8.96 2.54e-17\n3 pct_Males -0.0739 0.0179 -4.14 4.40e- 5\n4 pct_Higher_manager_prof -0.0131 0.00494 -2.65 8.43e- 3\n\n\nThese are:\n\nRegression Coefficient Estimates.\nP-values.\nAdjusted R-squared.\n\n\n\n2.1.5 How to interpret the output metrics\n\n2.1.5.1 Regression Coefficient Estimates\nThe Estimate column in the output table tells us the rate of change between each independent variable \(X_n\) and \(Y\).\nIntercept: In the regression equation, this is \(β_0\) and it indicates the value of \(Y\) when all \(X_n\) are equal to zero.\nSlopes: These are the other regression coefficients, one per independent variable, e.g. \(β_1\), i.e. estimated average changes in \(Y\) for a one unit change in an independent variable, e.g. \(X_1\), when all other independent (explanatory) variables are held constant.\nThere are two key points worth mentioning:\n\nThe unit of \(X\) and \(Y\): you need to know what the units are of the independent and dependent variables. For instance, one unit could be one year if you have an age variable, or one percentage point if the variable is measured in percentages (all the variables in this week’s practical).\nAll the other explanatory variables are held constant. It means that the coefficient of an explanatory variable \(X_1\) (e.g. \(β_1\)) should be interpreted as: a one unit change in \(X_1\) is associated with \(β_1\) units change in \(Y\), keeping the values of the other explanatory variables (e.g. \(X_2\), \(X_3\)) constant – for instance, \(X_2\)= 0.1 or \(X_3\)= 0.4.\n\nFor each independent variable \(X\), we can derive how a change of 1 unit is associated with changes in pct_Very_bad_health, for example:\n\nThe association of pct_No_qualifications is positive and strong: each increase in 1% of pct_No_qualifications is associated with an increase of 0.05% of the very bad health rate.\nThe association of pct_Males is negative and strong: each decrease in 1% of pct_Males is associated with an increase of 0.07% of pct_Very_bad_health in the population in England and Wales.\nThe association of pct_Higher_manager_prof is negative but weak: each decrease in 1% of pct_Higher_manager_prof is associated with an increase of 0.013% of pct_Very_bad_health.
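\n\nAs a quick sanity check – a minimal sketch using only numbers already shown above – an OLS plane passes through the means of the data, so plugging the mean of each predictor (from the descriptive statistics in 2.1.1) into the fitted equation should return approximately the mean of pct_Very_bad_health (1.173):\n\n# Fitted equation evaluated at the predictor means\n4.01799 + 0.05296 * 17.9 - 0.07392 * 48.97 - 0.01309 * 13.22\n# approximately 1.173, the mean of pct_Very_bad_health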
\n\n\n2.1.5.2 P-values and Significance\nThe t tests of regression coefficients are used to judge the statistical inferences on regression coefficients, i.e. associations between independent variables and the outcome variable. For the t-statistic of an independent variable, there is a corresponding p-value that indicates different levels of significance in the column Pr(>|t|) and the asterisks ∗.\n\n*** indicates “changes in \(X_n\) are significantly associated with changes in \(Y\) at the <0.001 level”.\n** suggests that “changes in \(X_n\) are significantly associated with changes in \(Y\) between the 0.001 and (<) 0.01 levels”.\nNow you should know what * means: The significance is between the 0.01 and 0.05 levels, which means that we observe a less significant (but still significant) relationship between the variables.\n\nThe p-value provides a measure of how significant the relationship is; it is an indication of whether the relationship between \(X_n\) and \(Y\) found in this data could have been found by chance. Very small p-values suggest that the observed association is very unlikely to have arisen by chance in a random sample of data.\nIn this case, we can say:\n\nGiven that the p-value is indicated by ***, changes in pct_No_qualifications and pct_Males are significantly associated with changes in pct_Very_bad_health at the <0.001 level; the association is highly statistically significant; we can be confident that the observed relationship between these variables and pct_Very_bad_health is not due to chance.\nGiven that the p-value is indicated by **, changes in pct_Higher_manager_prof are significantly associated with changes in pct_Very_bad_health at the 0.01 level. This means that the association between the independent and dependent variable is one that would be found by chance in fewer than 1% of random samples.\n\nIn both cases we can then confidently reject the null hypothesis (\(H_0\): no association between dependent and independent variables exists).\nRemember, if the p-value of a coefficient is smaller than 0.05, that coefficient is statistically significant. In this case, you can say that the relationship between this independent variable and the outcome variable is statistically significant. Conversely, if the p-value of a coefficient is larger than 0.05 you can conclude that there is no evidence of an association or relationship between the independent variable and the outcome variable.\n\n\n2.1.5.3 R-squared and Adjusted R-squared\nThese provide a measure of model fit. They are based on the differences between the actual values of \(Y\) and the values predicted by the model. The R-squared and Adjusted R-squared values are statistical measures that indicate how well the independent variables in your model explain the variability of the dependent variable. Both R-squared and Adjusted R-squared help us understand how closely the model’s predictions align with the actual data. An R-squared of 0.6, for example, indicates that 60% of the variability in \(Y\) is explained by the independent variables in the model. The remaining 40% is due to other factors not captured by the model.\nAdjusted R-squared also measures the goodness of fit, but it adjusts for the number of independent variables in the model, accounting for the fact that adding more variables can artificially inflate R-squared without genuinely improving the model. This is especially useful when comparing models with different numbers of independent variables. 
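For reference, the adjustment is \(R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}\), where \(n\) is the number of observations and \(p\) the number of independent variables; with the model above (\(R^2 = 0.61\), \(n = 331\), \(p = 3\)) this gives approximately 0.607, matching the output. 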
If Adjusted R-squared is close to or above 0.6, as in our example, it implies that the model has strong explanatory power while not being overfit with unnecessary explanatory variables.\nA high R-squared and Adjusted R-squared indicate that the model captures much of the variation in the data, making it more reliable for predictions or for understanding the relationship between \(Y\) and the explanatory variables. However, low R-squared values (e.g. 0.15) suggest that the model might be missing important explanatory variables or that the relationship between \(Y\) and the selected explanatory variables is not well-captured by a linear approach.\nAn R-squared and Adjusted R-squared over 0.6 are generally seen as signs of a well-fitting model in many fields, though the ideal values can depend on the context and the complexity of the data.\n\n\n\n2.1.6 Interpreting the Results\n\ncoefficients\n\n# A tibble: 4 × 5\n term estimate std.error statistic p.value\n <chr> <dbl> <dbl> <dbl> <dbl>\n1 (Intercept) 4.02 0.880 4.57 7.06e- 6\n2 pct_No_qualifications 0.0530 0.00591 8.96 2.54e-17\n3 pct_Males -0.0739 0.0179 -4.14 4.40e- 5\n4 pct_Higher_manager_prof -0.0131 0.00494 -2.65 8.43e- 3\n\n\nQ4. Complete the table below by filling in the coefficients, t-values, p-values, and indicating if each variable is statistically significant.\n\n\n\n\n\n\n\n\n\n\nVariable Name\nCoefficients\nt-values\np-values\nSignificant?\n\n\n\n\npct_No_qualifications\n\n\n\n\n\n\npct_Males\n\n\n\n\n\n\npct_Higher_manager_prof\n\n\n\n\n\n\n\nFrom the lecture notes, you know that the Intercept or Constant represents the estimated average value of the outcome variable when the values of all independent variables are equal to zero.\nQ5. When the values of pct_Males, pct_No_qualifications and pct_Higher_manager_prof are all zero, what is the % of population with very bad health? Is the intercept term meaningful? Are there any districts (or zones, depending on the dataset you chose) with zero percentages of persons with no qualification in your data set?\nQ6. Interpret the regression coefficients of pct_Males, pct_No_qualifications and pct_Higher_manager_prof. Do they make sense?\n\n\n2.1.7 Identify factors of % bad health\nNow combine the above two sections and identify factors affecting the percentage of population with very bad health. Fill in each row with the direction (positive or negative) and significance level of each variable.\n\n\n\n\n\n\n\n\nVariable Name\nPositive or Negative\nStatistical Significance\n\n\n\n\npct_No_qualifications\n\n\n\n\npct_Higher_manager_prof\n\n\n\n\npct_Males\n\n\n\n\n\nQ7. Think about the potential conclusions that can be drawn from the above analyses. Try to answer the research question of this practical: How do local factors affect residents’ health? Think about causation vs association and consider potential confounders when interpreting the results. How could these findings influence local health policies?",
- "crumbs": [
- "2 Lab: Correlation, Single, and Multiple Linear Regression"
- ]
- },
- {
- "objectID": "labs/02.MultipleLinear.html#part-c-practice-and-extension",
- "href": "labs/02.MultipleLinear.html#part-c-practice-and-extension",
- "title": "2 Lab: Correlation, Single, and Multiple Linear Regression",
- "section": "2.2 Part C: Practice and Extension",
- "text": "2.2 Part C: Practice and Extension\nIf you haven’t understood something, if you have doubts, even if they seem silly, ask.\n\nFinish working through the practical.\nRevise the material.\nExtension activities (optional): Think about other potential factors of very bad health and test your ideas with new linear regression models.",
- "crumbs": [
- "2 Lab: Correlation, Single, and Multiple Linear Regression"
- ]
- },
- {
- "objectID": "labs/03.QualitativeVariable.html",
- "href": "labs/03.QualitativeVariable.html",
- "title": "3 Lab: Correlation and Multiple Linear Regression with Qualitative Variables",
- "section": "",
- "text": "3.1 Analysing categorical variables\nRecall that in Week 7 you got familiar with R by using the Family Resource Survey data. Today we will keep exploring the data using its categorical variables. As usual, we first load the necessary libraries.\nSome tips to avoid R returning ‘can’t find data’ errors:\nCheck your working directory by\ngetwd()\nCheck the relative path of your data folder on your PC/laptop; make sure you know the relative path of your data from your working directory, returned by getwd().\nLibraries used today:\nA useful shortcut to format your code: select all your code lines and use Ctrl+Shift+A to automatically format them in a tidy way.",
- "crumbs": [
- "3 Correlation and Multiple Linear Regression with Qualitative Variables"
- ]
- },
- {
- "objectID": "labs/03.QualitativeVariable.html#analysis-categorical-variables",
- "href": "labs/03.QualitativeVariable.html#analysis-categorical-variables",
- "title": "3 Lab: Correlation and Multiple Linear Regression with Qualitative Variables",
- "section": "",
- "text": "dplyr: a core library providing a suite of functions for data manipulation.\nggplot2: a widely-used data visualisation library to help you create nice plots through layered plotting.\ntidyverse: a collection of R packages designed for data science, offering a cohesive framework for data manipulation, visualization, and analysis. It contains dplyr, ggplot2 and other core libraries.\nbroom: part of the tidyverse, designed to convert statistical analysis results into tidy data frames.\nforcats: designed to work with factors, which are used to represent categorical data. It simplifies the process of creating, modifying, and ordering factors.\nvcd: visualise and analyse categorical data.\n\n\n\n3.1.1 Data overview\n\nif(!require(\"dplyr\"))\n install.packages(\"dplyr\",dependencies = T)\n\nLoading required package: dplyr\n\n\n\nAttaching package: 'dplyr'\n\n\nThe following objects are masked from 'package:stats':\n\n filter, lag\n\n\nThe following objects are masked from 'package:base':\n\n intersect, setdiff, setequal, union\n\n# Load necessary libraries \nif(!require(\"ggplot2\"))\n install.packages(\"ggplot2\",dependencies = T)\n\nLoading required package: ggplot2\n\nif(!require(\"broom\"))\n install.packages(\"broom\",dependencies = T)\n\nLoading required package: broom\n\nlibrary(dplyr) \nlibrary(ggplot2)\nlibrary(broom)\n\nAlternatively, we can use the tidyverse library, which already includes ggplot2, dplyr, broom and other fundamental libraries together; remember that you first need to install the package if you haven’t, using install.packages(\"tidyverse\").\n\nif(!require(\"tidyverse\"))\n install.packages(\"tidyverse\",dependencies = T)\n\nLoading required package: tidyverse\n\n\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ lubridate 1.9.3 ✔ tibble 3.2.1\n✔ purrr 1.0.2 ✔ tidyr 1.3.1\n✔ readr 2.1.5 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n\nlibrary(tidyverse)\n\nWe will also use the forcats library, so\n\nif(!require(\"forcats\"))\n install.packages(\"forcats\")\n\nlibrary(forcats)\n\nExactly as you did in previous weeks, we first load in the dataset:\n\nfrs_data <- read.csv(\"../data/FamilyResourceSurvey/FRS16-17_labels.csv\")\n\nRecall that in previous weeks we used the following code to overview the dataset. Familiarise yourself again by using it:\n\nView(frs_data)\nglimpse(frs_data)\n\nand also summary() to produce summaries of each variable:\n\nsummary(frs_data)\n\nYou may notice that for the numeric variables such as hh_income_gross (household gross income) and work_hours (hours worked per week), summary() offers useful descriptive statistics. For qualitative information, however, such as age_group (age group), highest_qual (highest educational qualification), marital_status (marital status) and nssec (socio-economic status), the summary() function is not that useful, as mean or median values are not meaningful for categories.\nWhen performing descriptive analysis for categorical (qualitative) variables, we focus on summarising the frequency and distribution of categories within the variable. 
This analysis helps understand the composition and diversity of categories in the data, which is especially useful for identifying patterns, common categories, or potential data imbalances.\n\n# Frequency count\ntable(frs_data$age_group)\n\n\n 0-4 05-10 11-15 16-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 \n 2914 3575 2599 1858 1929 2353 2800 2840 2790 2883 2975 2767 2775 \n65-69 70-74 75+ \n 2990 2354 3743 \n\ntable(frs_data$highest_qual)\n\n\nA-level or equivalent Degree or above Dependent child \n 5260 9156 10298 \n GCSE or equivalent Not known Other \n 9729 6820 2882 \n\ntable(frs_data$marital_status)\n\n\n Cohabiting Divorced/civil partnership dissolved \n 4015 2199 \n Married/Civil partnership Separated \n 18195 747 \n Single Widowed \n 16663 2326 \n\ntable(frs_data$nssec)\n\n\n Dependent child \n 10299 \n Full-time student \n 963 \n Higher professional occupations \n 3004 \n Intermediate occupations \n 4372 \n Large employers and higher managers \n 1025 \nLower managerial and professional occupations \n 8129 \n Lower supervisory and technical occupations \n 2400 \n Never worked or long-term unemployed \n 1516 \n Not classifiable \n 107 \n Routine occupations \n 4205 \n Semi-routine occupations \n 5226 \n Small employers and own account workers \n 2899 \n\n\nBy using ggplot2, it is easy to create some nice descriptive charts for the categorical variables, much like what you did for the continuous variables last week.\n\nggplot(frs_data, aes(x = highest_qual)) +\n geom_bar(fill=\"brown\",width=0.5) +\n labs(title = \"Bar Chart of Highest Qualification in FRS\", x = \"Highest Qualification\", y = \"Count\")+ # set text info\n theme_classic() # choose theme type; try theme_bw() or theme_minimal() to see the differences\n\n\n\n\n\n\n\n\n\nggplot(frs_data, aes(x = health)) +\n geom_bar(fill=\"skyblue\") +\n geom_text(stat = \"count\", aes(label = after_stat(count)), vjust = -0.3, colour = \"grey\")+ # add text\n labs(title = \"Bar Chart of Health in FRS\", x = \"Health\", y = \"Count\")+ # set text info\n theme_minimal()\n\n\n\n\n\n\n\n\n\nggplot(frs_data, aes(x = nssec)) + \n geom_bar(fill = \"yellow4\") + \n labs(title = \"Bar Chart of NSSEC in FRS\", x = \"NSSEC\", y = \"Count\") +\n coord_flip()+ # Flip the axes; add a # in front of this line to comment it out, and you will see why it is better to flip the axes here\n theme_bw() \n\n\n\n\n\n\n\n\nIf we want to reorder the Y axis from highest to lowest, we use functions from the forcats library: fct_infreq() orders the levels of nssec by their frequency; fct_rev() reverses that order so it runs from highest to lowest.\n\nggplot(frs_data, aes(x = fct_rev(fct_infreq(nssec)))) + \n geom_bar(fill = \"yellow4\") + \n labs(title = \"Bar Chart of NSSEC in FRS\", x = \"NSSEC\", y = \"Count\") +\n coord_flip()+ # Flip the axes; add a # in front of this line to comment it out, and you will see why it is better to flip the axes here\n theme_bw() \n\n\n\n\n\n\n\n\nYou can change the variables in ggplot() to make your own bar chart for the variables you are interested in. You will learn more visualisation methods in Week 11’s practical.\n\n\n3.1.2 Correlation\n
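\nBefore judging strength by eye, it helps to look at a pairwise cross-tabulation of two categorical variables – a minimal sketch; any pair of the variables above can be substituted:\n\n# Eyeball a pairwise association between two categorical variables first\ntable(frs_data$health, frs_data$happy)\n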
\nQ1. Which of the associations do you think is strongest? Which is the weakest?\n\nAs before, rather than relying upon an impressionistic view of the strength of the association between two variables, we can measure that association by calculating the relevant correlation coefficient.\nTo calculate the correlation between categorical variables, we first use the Chi-squared test to assess the independence between pairs of categorical variables; then we use Cramér’s V to measure the strength of the association – the equivalent of the correlation coefficients seen before.\nPearson’s chi-squared test (χ2) is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance. If the p-value is low (typically < 0.05), it suggests a significant association between the two variables.\n\nchisq.test(frs_data$health,frs_data$happy) \n\nWarning in chisq.test(frs_data$health, frs_data$happy): Chi-squared\napproximation may be incorrect\n\n\n\n Pearson's Chi-squared test\n\ndata: frs_data$health and frs_data$happy\nX-squared = 45594, df = 60, p-value < 2.2e-16\n\n\nIf you see a warning message of “Chi-squared approximation may be incorrect”, this is because some expected frequencies in one or more cells of the cross-table (health * happy) are too low. The df means degrees of freedom, and it relates to the size of the table and the number of categories in each variable. The most important message from the output is the estimated p-value, which shows as p-value < 2.2e-16 (2.2 with the decimal point moved 16 places to the left – a very small number, so it is written in scientific notation). The p-value of the chi-squared test is far smaller than 0.05, so we can say the association is statistically significant.\nCramér’s V is a measure of association for categorical (nominal or ordinal) data. It ranges from 0 (no association) to 1 (strong association). The main downside of using Cramér’s V is that no information is provided on whether the correlation is positive or negative. This is not a problem if the variable pair includes a nominal variable, but it represents an information loss if both variables being correlated are ordinal.\n\n# Install the 'vcd' package if not installed \nif(!require(\"vcd\")) \ninstall.packages(\"vcd\", repos = \"https://cran.r-project.org\", dependencies = T)\n\nLoading required package: vcd\n\n\nWarning: package 'vcd' was built under R version 4.3.3\n\n\nLoading required package: grid\n\nlibrary(vcd) \n\n# create the cross-table \ncrosstab <- table(frs_data$health, frs_data$happy)\n\n# Calculate Cramér's V \nassocstats(crosstab)\n\n X^2 df P(> X^2)\nLikelihood Ratio 54036 60 0\nPearson 45594 60 0\n\nPhi-Coefficient : NA \nContingency Coeff.: 0.713 \nCramer's V : 0.454 \n\n# you can also directly calculate the association between variables\nassocstats(table(frs_data$health, frs_data$age_group))\n\n X^2 df P(> X^2)\nLikelihood Ratio 26557 75 0\nPearson 23854 75 0\n\nPhi-Coefficient : NA \nContingency Coeff.: 0.592 \nCramer's V : 0.329 \n\n\n\nResearch Question 1. Which of our selected person-level variables is most strongly correlated with an individual’s health status?\n\nUse the Chi-squared test and Cramér’s V code to answer this question by completing Table 1.\nTable 1 Person-level correlations with health status\n\n\n\n\n\n\n\n\n\nCovariates\n\nCorrelation Coefficient\nStatistical Significance\n\n\n\n\nCramér’s V\np-value\n\n\nhealth\nage_group\n\n\n\n\nhealth\nhighest_qual\n\n\n\n\nhealth\nmarital_status\n\n\n\n\nhealth\nnssec",
- "crumbs": [
- "3 Correlation and Multiple Linear Regression with Qualitative Variables"
- ]
- },
- {
- "objectID": "labs/03.QualitativeVariable.html#implementing-a-linear-regression-model-with-a-qualitative-independent-variable",
- "href": "labs/03.QualitativeVariable.html#implementing-a-linear-regression-model-with-a-qualitative-independent-variable",
- "title": "3 Lab: Correlation and Multiple Linear Regression with Qualitative Variables",
- "section": "3.2 Implementing a linear regression model with a qualitative independent variable",
- "text": "3.2 Implementing a linear regression model with a qualitative independent variable\n\nResearch Question 2: How does health vary across regions in the UK?\n\nThe practical is split into two main parts. The first focuses on implementing a linear regression model with a qualitative independent variable. Note that you first need to set the reference category (baseline), as the outcomes of the model reflect the differences between the categories and the baseline. The second part focuses on prediction based on the estimated linear regression model.\nFirst we load the UK district-level census dataset.\n\n# load data\nLAcensus <- read.csv(\"../data/Census2011/UK_DistrictPercentages.csv\") # Local authority level\n\nUsing the district-level census dataset “UK_DistrictPercentages.csv”, the variable “Region” (labelled as Government Office Region) is used to explore regional inequality in health.\nFamiliarise yourself with the dataset by using the same code as last week:\n\n#view the data \nView(LAcensus) \nglimpse(LAcensus)\n\nThe names() function returns all the column names.\n\nnames(LAcensus)\n\nThe dim() function simply returns the number of rows and number of columns.\n\ndim(LAcensus) \n\n[1] 406 128\n\n\nThere are 406 rows and 128 columns in the dataset. It would be very hard to scan through the data if we use so many variables altogether. Therefore, we can select several columns to tailor the data for this practical. You can of course also include other variables you are interested in, by their names:\n\ndf <- LAcensus %>% select(c(\"pct_Long_term_ill\",\n \"pct_No_qualifications\",\n \"pct_Males\",\n \"pct_Higher_manager_prof\",\n \"Region\"))\n\nSimple descriptive statistics of this new data:\n\nsummary(df)\n\n pct_Long_term_ill pct_No_qualifications pct_Males \n Min. :11.20 Min. : 6.721 Min. :47.49 \n 1st Qu.:15.57 1st Qu.:19.406 1st Qu.:48.67 \n Median :18.41 Median :23.056 Median :49.09 \n Mean :18.27 Mean :23.257 Mean :49.10 \n 3rd Qu.:20.72 3rd Qu.:26.993 3rd Qu.:49.48 \n Max. :27.97 Max. :40.522 Max. :55.47 \n pct_Higher_manager_prof Region \n Min. : 4.006 Min. : 1.000 \n 1st Qu.: 7.664 1st Qu.: 3.000 \n Median : 9.969 Median : 6.000 \n Mean :10.747 Mean : 6.034 \n 3rd Qu.:12.986 3rd Qu.: 8.000 \n Max. :37.022 Max. :12.000 \n\n\nNow we can retrieve the “Region” column from the data frame by simply using df$Region. But what if we want to understand the data better, as in the following questions?\n\nQ2. How many categories does the variable “Region” entail? How many local authority districts does each region include?\n\nSimply using the function table() will return the answer.\n\ntable(df$Region) \n\n\n 1 2 3 4 5 6 7 8 9 10 11 12 \n40 47 33 12 39 67 37 30 21 22 32 26 \n\n\nThe numbers in the Region column indicate different regions in the UK - 1: East Midlands; 2: East of England; 3: London; 4: North East; 5: North West; 6: South East; 7: South West; 8: West Midlands; 9: Yorkshire and the Humber; 10: Wales; 11: Scotland; and 12: Northern Ireland.\nThe table() function tells us that this data frame contains 12 regions, and the number of LAs belonging to each region.\nNow, for better interpretation of our regions with their real names rather than codes, we can create a new column named “Region_label” by using the following code. 
Note: R can only include categorical variables as factors, so we create the new column Region_label with factor().\n\ndf$Region_label <- factor(df$Region,c(1:12),labels=c(\"East Midlands\",\n \"East of England\",\n \"London\",\n \"North East\",\n \"North West\",\n \"South East\",\n \"South West\",\n \"West Midlands\",\n \"Yorkshire and the Humber\",\n \"Wales\",\n \"Scotland\",\n \"Northern Ireland\")) \n\nIf you re-run the table() function, the output is now more readable:\n\ntable(df$Region_label)\n\n\n East Midlands East of England London \n 40 47 33 \n North East North West South East \n 12 39 67 \n South West West Midlands Yorkshire and the Humber \n 37 30 21 \n Wales Scotland Northern Ireland \n 22 32 26 \n\n\n\n3.2.1 Include categorical variables in a regression model\nWe will continue with a regression model very similar to the one fitted last week, which relates Percentage long-term illness (pct_Long_term_ill) to Percentage no qualifications (pct_No_qualifications), Percentage Males (pct_Males) and Percentage Higher Managerial or Professional occupation (pct_Higher_manager_prof).\nDecide which region to set as the baseline category. The principle is that if you want to compare the (average) long term illness outcome of Region A to those of other regions, Region A should be chosen as the baseline category. For example, if you want to compare the (average) long term illness outcome of London to the rest of the regions in the UK, London should be selected as the baseline category.\nImplement the regression model with the newly created categorical variable – Region_label in our case. R will automatically handle the qualitative variable as dummy variables, so you don’t need to worry about any of that. But you need to let R know which category of your qualitative variable is your reference category, or baseline. Here we will use London for our first go. Note: We choose London as the baseline category, so the London region will be excluded from the independent variable list.\nTherefore, first, we set London as the reference:\n\ndf$Region_label <- fct_relevel(df$Region_label, \"London\")\n\nSimilar to last week, we build our linear regression model, but also include the Region_label variable in the model.\n\nmodel <- lm(pct_Long_term_ill ~ pct_Males + pct_No_qualifications + pct_Higher_manager_prof + Region_label, data = df)\n\nsummary(model)\n\n\nCall:\nlm(formula = pct_Long_term_ill ~ pct_Males + pct_No_qualifications + \n pct_Higher_manager_prof + Region_label, data = df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-3.2963 -0.9090 -0.1266 0.8168 5.2821 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 41.54134 5.22181 7.955 1.95e-14 ***\npct_Males -0.75756 0.10094 -7.505 4.18e-13 ***\npct_No_qualifications 0.50573 0.03062 16.515 < 2e-16 ***\npct_Higher_manager_prof 0.08910 0.03674 2.426 0.01574 * \nRegion_labelEast Midlands 1.14167 0.35015 3.260 0.00121 ** \nRegion_labelEast of England -0.01113 0.33140 -0.034 0.97322 \nRegion_labelNorth East 2.70447 0.49879 5.422 1.03e-07 ***\nRegion_labelNorth West 2.64240 0.35468 7.450 6.03e-13 ***\nRegion_labelSouth East 0.48327 0.30181 1.601 0.11013 \nRegion_labelSouth West 2.62729 0.34572 7.600 2.22e-13 ***\nRegion_labelWest Midlands 0.91064 0.37958 2.399 0.01690 * \nRegion_labelYorkshire and the Humber 1.03930 0.41050 2.532 0.01174 * \nRegion_labelWales 4.63424 0.41368 11.202 < 2e-16 ***\nRegion_labelScotland 0.46291 0.38916 1.189 0.23497 \nRegion_labelNorthern Ireland 0.55722 0.42215 1.320 0.18762 \n---\nSignif. 
codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 1.394 on 391 degrees of freedom\nMultiple R-squared: 0.8298, Adjusted R-squared: 0.8237 \nF-statistic: 136.2 on 14 and 391 DF, p-value: < 2.2e-16\n\n\nYou have already learnt how to interpret the output of a regression model last week: Significance (p-value), Coefficient Estimates, and Model fit (R-squared and Adjusted R-squared).\n\nQ3. Relating back to this week’s lecture notes, indicate which regions have statistically significant differences in the percentage of long-term illness compared to London.\n\nFirst, the Significance and the Coefficient Estimates. By examining the p-value, which is the last column in the output table, we can see that most of the independent variables are significant predictors of pct_Long_term_ill.\n\nSimilarly to last week, we learn that the changes in pct_No_qualifications and pct_Males are significantly associated with changes in pct_Long_term_ill at the <0.001 level (with the three asterisks ***), which is an indicator of a highly statistically significant association, while we are less confident that the observed relationship between pct_Higher_manager_prof and pct_Long_term_ill is statistically significant (with one asterisk *). Through their coefficient estimates, we learn that:\n\nThe association of pct_Males is negative and strong: each decrease in 1% of pct_Males is associated with an increase of 0.76% of the long term illness rate in the population of the UK.\nThe association of pct_No_qualifications is positive and strong: each increase in 1% of pct_No_qualifications is associated with an increase of 0.5% of the long term illness rate.\nThe association of pct_Higher_manager_prof is positive but weak: each increase in 1% of pct_Higher_manager_prof is associated with an increase of 0.09% of pct_Long_term_ill.\n\nNow we come to the dummy variables (all the items starting with Region_label) created by R for our qualitative variable Region_label: Region_labelNorth East, Region_labelNorth West, Region_labelSouth West and Region_labelWales are also statistically significant at the <0.001 level. The changes in Region_labelEast Midlands are significantly associated with changes in pct_Long_term_ill at the 0.01 level (**), while the changes in Region_labelWest Midlands and Region_labelYorkshire and the Humber are significantly associated with changes in pct_Long_term_ill at the 0.05 level (*). The 0.05 level suggests a milder likelihood that the relationship between these independent variables and the dependent variable is not due to random chance; they are just mildly statistically significant.\nThe coefficient estimates of the dummy variables need to be interpreted by comparing to the reference category London. The Estimate column tells us: the North East region is associated with a 2.7% higher rate of long term illness than London when the other predictors remain the same. Similarly, Wales is associated with a 4.6% higher rate of long term illness than London when the other predictors remain the same. You can draw conclusions for the other regions in the same way by using their coefficient estimate values.\nReminder: You cannot draw conclusions comparing North East and Wales, nor comparisons between any pair of regions that does not include London. This is because the regression model is built to compare each region to your reference category, London. 
If we want to compare the North East with Wales, we need to set one of them as the reference category, using df$Region_label <- fct_relevel(df$Region_label, \"North East\") or df$Region_label <- fct_relevel(df$Region_label, \"Wales\").\nRegion_labelEast of England, Region_labelSouth East, Region_labelScotland and Region_labelNorthern Ireland were not found to be significantly associated with pct_Long_term_ill.\n\nLast but not least, the measure of model fit. The R-squared and Adjusted R-squared are both greater than 0.8, indicating a reasonably well-fitting model: the model explains 83.0% of the variance in the dependent variable and, after adjusting for the number of independent variables, 82.4% of the variance. Together these suggest a strong fit.\nNow, complete the following table.\n\n\n\n\n\n\n\n\nRegion names\nHigher or lower than London\nWhether the difference is statistically significant (Yes or No)\n\n\n\n\nEast Midlands\n\n\n\n\nEast of England\n\n\n\n\nNorth East\n\n\n\n\nNorth West\n\n\n\n\nSouth East\n\n\n\n\nSouth West\n\n\n\n\nWest Midlands\n\n\n\n\nYorkshire and The Humber\n\n\n\n\nWales\n\n\n\n\nScotland\n\n\n\n\nNorthern Ireland\n\n\n\n\n\n\n\n3.2.2 Change the baseline category\nIf you would like to learn about differences in long-term illness between the North East and the other regions of the UK, you need to change the baseline category from London to the North East region.\n\ndf$Region_label <- fct_relevel(df$Region_label, \"North East\")\n\nThe regression model is specified again as follows:\n\nmodel1 <- lm(\n pct_Long_term_ill ~ pct_Males + pct_No_qualifications + pct_Higher_manager_prof + Region_label,\n data = df\n)\n\nsummary(model1)\n\n\nCall:\nlm(formula = pct_Long_term_ill ~ pct_Males + pct_No_qualifications + \n pct_Higher_manager_prof + Region_label, data = df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-3.2963 -0.9090 -0.1266 0.8168 5.2821 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 44.24582 5.20125 8.507 3.85e-16 ***\npct_Males -0.75756 0.10094 -7.505 4.18e-13 ***\npct_No_qualifications 0.50573 0.03062 16.515 < 2e-16 ***\npct_Higher_manager_prof 0.08910 0.03674 2.426 0.015738 * \nRegion_labelLondon -2.70447 0.49879 -5.422 1.03e-07 ***\nRegion_labelEast Midlands -1.56281 0.46292 -3.376 0.000809 ***\nRegion_labelEast of England -2.71561 0.45836 -5.925 6.87e-09 ***\nRegion_labelNorth West -0.06208 0.46209 -0.134 0.893206 \nRegion_labelSouth East -2.22120 0.45667 -4.864 1.67e-06 ***\nRegion_labelSouth West -0.07718 0.47482 -0.163 0.870957 \nRegion_labelWest Midlands -1.79384 0.48230 -3.719 0.000229 ***\nRegion_labelYorkshire and the Humber -1.66517 0.50695 -3.285 0.001113 ** \nRegion_labelWales 1.92976 0.50111 3.851 0.000137 ***\nRegion_labelScotland -2.24157 0.47299 -4.739 3.01e-06 ***\nRegion_labelNorthern Ireland -2.14725 0.49296 -4.356 1.70e-05 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 1.394 on 391 degrees of freedom\nMultiple R-squared: 0.8298, Adjusted R-squared: 0.8237 \nF-statistic: 136.2 on 14 and 391 DF, p-value: < 2.2e-16\n\n\nNow it is easy to use the fitted model to estimate the outcome Y (the dependent variable) for any chosen values of the independent variables X.\n\nobj_London <- data.frame(\n pct_Males = 49.7,\n pct_No_qualifications = 24.3,\n pct_Higher_manager_prof = 14.7,\n Region_label = \"London\"\n) \nobj_NW <- data.frame(\n pct_Males = 49.8,\n pct_No_qualifications = 23.3,\n pct_Higher_manager_prof = 11.2,\n Region_label = \"North West\"\n) \nobj_NE <- data.frame(\n pct_Males = 49.8,\n pct_No_qualifications = 23.3,\n pct_Higher_manager_prof = 11.2,\n Region_label = \"North East\"\n) \n\npredict(model1, obj_London) \n\n 1 \n17.48952 \n\npredict(model1, obj_NW) \n\n 1 \n19.23858 \n\npredict(model1, obj_NE)\n\n 1 \n19.30065 \n\n\n\n3.2.3 Recode the Region variable and explore regional inequality in health\nIn many real-world studies, we might not be interested in health inequality across all regions. For example, in this case study we are interested in health inequality between London, the other regions in England, Wales, Scotland and Northern Ireland. We can achieve this by re-grouping regions in the UK based on the variable “Region”. That is, we need a new grouping of regions as follows:\n\n\n\nOriginal region labels\nNew region labels\n\n\nEast Midlands\nOther regions in England\n\n\nEast of England\nOther regions in England\n\n\nLondon\nLondon\n\n\nNorth East\nOther regions in England\n\n\nNorth West\nOther regions in England\n\n\nSouth East\nOther regions in England\n\n\nSouth West\nOther regions in England\n\n\nWest Midlands\nOther regions in England\n\n\nYorkshire and The Humber\nOther regions in England\n\n\nWales\nWales\n\n\nScotland\nScotland\n\n\nNorthern Ireland\nNorthern Ireland\n\n\n\nHere we use the mutate() function in R to make this happen:\n\ndf <- df %>% mutate(New_region_label = fct_other(\n Region_label,\n keep = c(\"London\", \"Wales\", \"Scotland\", \"Northern Ireland\"),\n other_level = \"Other regions in England\"\n))\n\nThis code may look a bit complex. You can simply type ?mutate in your console: the Help window in RStudio will then show the documentation for the function, a common way of using RStudio to learn what a function does. mutate() creates new columns that are functions of existing variables. Therefore, df %>% mutate() adds a new column to the current dataframe df; New_region_label inside mutate() is the name of this new column; and the right-hand side of New_region_label = gives the value assigned to New_region_label in each row.\nThat right-hand side is\nfct_other(Region_label, keep=c(\"London\",\"Wales\",\"Scotland\",\"Northern Ireland\"), other_level=\"Other regions in England\")\nHere the fct_other() function checks whether each value in the Region_label column is one of the keep regions: “London”, “Wales”, “Scotland” or “Northern Ireland”. If a region is not in this list, its value is replaced with the label “Other regions in England”; if it is one of these four, the original value in Region_label is kept.
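\nIf fct_other() feels opaque, the same regrouping can be written more explicitly with if_else(); the following is an equivalent sketch, assuming dplyr and forcats are loaded via tidyverse, and using the illustrative column name New_region_label2:\n\ndf <- df %>% mutate(New_region_label2 = as.factor(if_else(\n Region_label %in% c(\"London\", \"Wales\", \"Scotland\", \"Northern Ireland\"),\n as.character(Region_label), # keep the original label for these four\n \"Other regions in England\" # regroup everything else\n)))\n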
Either way, the process categorizes regions outside the four specified ones into a new group labelled “Other regions in England”, while preserving the original labels for those four.\nNow we examine our new column New_region_label in the same way as before:\n\ntable(df$New_region_label)\n\n\n London Wales Scotland \n 33 22 32 \n Northern Ireland Other regions in England \n 26 293 \n\n\nComparing it with Region_label, we can now see that the mutate() worked:\n\ndf[,c(\"Region_label\",\"New_region_label\")]\n\nYou now have a new qualitative variable named New_region_label in which the UK is divided into five regions: London, Other regions in England, Wales, Scotland and Northern Ireland.\nBased on the newly generated qualitative variable New_region_label, we can now build our new linear regression model. Don’t forget:\n(1) R needs categorical variables in a regression model to be of factor type;\n\nclass(df$New_region_label)\n\n[1] \"factor\"\n\n\nclass() returns the type of a variable. New_region_label is already a factor; if it were not, we would convert it with as.factor(), as we did above.\n\ndf$New_region_label = as.factor(df$New_region_label)\n\n(2) Let R know which region you want to use as the baseline category. Here I will use London again, but of course you can choose another region.\n\ndf$New_region_label <- fct_relevel(df$New_region_label, \"London\")\n\nThe linear regression model is set up below. This time we include New_region_label rather than Region_label as the region variable:\n\nmodel2 <- lm(\n pct_Long_term_ill ~ pct_Males + pct_No_qualifications + pct_Higher_manager_prof + New_region_label,\n data = df\n)\n\nsummary(model2)\n\n\nCall:\nlm(formula = pct_Long_term_ill ~ pct_Males + pct_No_qualifications + \n pct_Higher_manager_prof + New_region_label, data = df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-4.6719 -1.1252 -0.0556 0.9564 7.3768 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|)\n(Intercept) 47.217525 5.795439 8.147 4.86e-15\npct_Males -0.834471 0.113398 -7.359 1.07e-12\npct_No_qualifications 0.472354 0.032764 14.417 < 2e-16\npct_Higher_manager_prof 0.000497 0.040851 0.012 0.990300\nNew_region_labelWales 4.345262 0.471088 9.224 < 2e-16\nNew_region_labelScotland 0.143672 0.442989 0.324 0.745863\nNew_region_labelNorthern Ireland 0.264425 0.474074 0.558 0.577314\nNew_region_labelOther regions in England 1.071719 0.312746 3.427 0.000674\n \n(Intercept) ***\npct_Males ***\npct_No_qualifications ***\npct_Higher_manager_prof \nNew_region_labelWales ***\nNew_region_labelScotland \nNew_region_labelNorthern Ireland \nNew_region_labelOther regions in England ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 1.609 on 398 degrees of freedom\nMultiple R-squared: 0.7693, Adjusted R-squared: 0.7653 \nF-statistic: 189.6 on 7 and 398 DF, p-value: < 2.2e-16\n\n\n\nQ4. Are there statistically significant differences in the percentage of people with long-term illness between London and Scotland, and between London and Wales, controlling for other variables? What conclusions could be drawn in terms of regional differences in health outcomes?
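\nHint: before reading the coefficients, it can help to eyeball the raw group means of the outcome; a quick descriptive sketch, assuming dplyr is loaded:\n\ndf %>% group_by(New_region_label) %>% summarise(mean_ill = mean(pct_Long_term_ill))\n\nKeep in mind that these raw means do not control for the other predictors, whereas the regression coefficients do.",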
- "crumbs": [
- "3 Correlation and Multiple Linear Regression with Qualitative Variables"
- ]
- },
- {
- "objectID": "labs/03.QualitativeVariable.html#predictions-using-fitted-regression-model",
- "href": "labs/03.QualitativeVariable.html#predictions-using-fitted-regression-model",
- "title": "3 Lab: Correlation and Multiple Linear Regression with Qualitative Variables",
- "section": "3.3 Predictions using fitted regression model",
- "text": "3.3 Predictions using fitted regression model\n\n3.3.1 Write down the % illness regression model with the new region label categorical variables\nRelating to this week’s lecture, the % pct_Long_term_ill is equal to:\n[write down the model]\n\nQ5. Now imagine that the values of variables pct_Males, pct_No_qualifications, and pct_Higher_manager_prof are 49, 23 and 11, respectively, what would the percentage of persons with long-term illness in Wales and London be?\n\nCheck the answer at the end of this practical page.",
- "crumbs": [
- "3 Correlation and Multiple Linear Regression with Qualitative Variables"
- ]
- },
- {
- "objectID": "labs/03.QualitativeVariable.html#income-inequality-with-respect-to-gender-and-health-status",
- "href": "labs/03.QualitativeVariable.html#income-inequality-with-respect-to-gender-and-health-status",
- "title": "3 Lab: Correlation and Multiple Linear Regression with Qualitative Variables",
- "section": "3.4 Income inequality with respect to gender and health status",
- "text": "3.4 Income inequality with respect to gender and health status\nIn this section, we will work with individual-level data (“FRS 2016-17_label.csv”) again to explore income inequality with respect to gender and health status.\nTo explore income inequality, we need to work with a data set excluding dependent children. In addition, we look at individuals who are the representative persons of households. Therefore, we will select cases (or samples) that meet both conditions.\nWe want R to select persons only if they are the representative persons of households and they are not dependent children. The involved variables are hrp and Dependent for the categories “Household Reference Person” and “independent”, you can select the appropriate cases. We also want to exclude the health variable reported as “Not known”.\n\nfrs_df <- frs_data %>% filter(hrp == \"HRP\" &\n dependent == \"Independent\" & health != \"Not known\") \n\nThen, we create a new numeric variable Net_inc_perc indicate net income per capita as our dependent variable:\n\nfrs_df$Net_inc_pp = frs_df$hh_income_net / frs_df$hh_size\n\nsummary(frs_df$Net_inc_pp)\n\n Min. 1st Qu. Median Mean 3rd Qu. Max. \n-238160 9074 13347 15834 19136 864812 \n\n\nThe distribution of the net household income per capita can be visualised by using ggplot()\n\nggplot(frs_df, aes(x = Net_inc_pp)) +\n geom_histogram(\n bins = 1000,\n color = \"black\",\n fill = \"skyblue\",\n alpha = 0.7\n ) +\n labs(title = \"Distribution of Net_inc_pp\", x = \"Value\", y = \"Frequency\") +\n scale_x_continuous(labels = scales::label_comma()) + # Prevent scientific notation on the x axis\n theme_minimal()\n\n\n\n\n\n\n\n\nOur two qualitative independent variables “sex” and “health”. Let’s first know what they look like:\n\ntable(frs_df$sex)\n\n\nFemale Male \n 7647 9180 \n\ntable(frs_df$health)\n\n\n Bad Fair Good Very Bad Very Good \n 1472 4253 6277 426 4399 \n\n\nRemember what we did in the Region long-term illness practical previously before we put the qualitative variable into the regression model? Yes. First, make sure they are in factor type and Second, decide the reference category. Here, I will use Female and Very Bad health status as my base categories. You can decide what you wish to use. This time, I use the following codes to combine these two steps in one line.\n\nfrs_df$sex <- fct_relevel(as.factor(frs_df$sex), \"Female\")\nfrs_df$health <- fct_relevel(as.factor(frs_df$health), \"Very Bad\")\n\nImplement the regression model with the two qualitative independent variables.\n\nmodel_frs <- lm(Net_inc_pp ~ sex + health, data = frs_df)\nsummary(model_frs)\n\n\nCall:\nlm(formula = Net_inc_pp ~ sex + health, data = frs_df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-255133 -6547 -2213 3515 845673 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 12115.5 762.9 15.881 < 2e-16 ***\nsexMale 2091.2 240.6 8.691 < 2e-16 ***\nhealthBad -102.8 854.3 -0.120 0.904205 \nhealthFair 1051.3 789.0 1.332 0.182751 \nhealthGood 2766.0 777.4 3.558 0.000375 ***\nhealthVery Good 4931.8 787.8 6.260 3.95e-10 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 15530 on 16821 degrees of freedom\nMultiple R-squared: 0.01646, Adjusted R-squared: 0.01616 \nF-statistic: 56.29 on 5 and 16821 DF, p-value: < 2.2e-16\n\n\nThe result can be formatted by:\n\nlibrary(broom)\ntidy(model_frs)\n\n# A tibble: 6 × 5\n term estimate std.error statistic p.value\n <chr> <dbl> <dbl> <dbl> <dbl>\n1 (Intercept) 12116. 763. 
15.9 2.21e-56\n2 sexMale 2091. 241. 8.69 3.90e-18\n3 healthBad -103. 854. -0.120 9.04e- 1\n4 healthFair 1051. 789. 1.33 1.83e- 1\n5 healthGood 2766. 777. 3.56 3.75e- 4\n6 healthVery Good 4932. 788. 6.26 3.95e-10\n\n\n\nQ6. What conclusions could be drawn in terms of income inequalities with respect to gender and health status? Also think about the statistical significance of these differences.",
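\nTo complement the regression output when answering Q6, a simple descriptive cut of the data can also help; a quick sketch, assuming frs_df as prepared above:\n\nfrs_df %>%\n group_by(sex, health) %>%\n summarise(mean_inc = mean(Net_inc_pp), n = n(), .groups = \"drop\")\n\nUnlike the model coefficients, these raw group means do not adjust for the other variable.",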
- "crumbs": [
- "3 Correlation and Multiple Linear Regression with Qualitative Variables"
- ]
- },
- {
- "objectID": "labs/03.QualitativeVariable.html#extension-activities",
- "href": "labs/03.QualitativeVariable.html#extension-activities",
- "title": "3 Lab: Correlation and Multiple Linear Regression with Qualitative Variables",
- "section": "3.5 Extension activities",
- "text": "3.5 Extension activities\nThe extension activities are designed to get yourself prepared for the Assignment 2 in progress. For this week, try whether you can:\n\nPresent descriptive statistics for independent variable and the dependent variable: counts, percentages, a centrality measure, a spread measure, histograms or any relevant statistic\nReport the observed association between the dependent and independent variables: correlation plus a graphic or tabular visualisation\nBriefly describe and critically discuss the results\nThink about other potential factors of long-term illness and income, and then test your ideas with linear regression models\nSummaries your model outputs and interpret the results.\n\nAnswer of the written down model and Q5\nThe model of the new region label is: pct_Long_term_ill (%) = 47.218+ (-0.834)* pct_Males (%) + 0.472 * pct_No_qualifications (%) + 1.072*Other Regions in England + 4.345* Wales\nSo if the values of variables pct_Males, pct_No_qualifications, and pct_Higher_manager_prof are 49, 23 and 11,\nthe model of Wales will be: pct_Long_term_ill (%) = 47.218+ (-0.834)* 49 + 0.472 * 23 + 1.072*0+ 4.345* 1 = 21.553 (you can direct paste the number sentence into your R studio Console and the result will be returned)\nthe model of London will be: pct_Long_term_ill (%) = 47.218+ (-0.834)* 49 + 0.472 * 23 + 1.072*0+ 4.345* 0 = 17.208\nYou can also make a new object like\n\nobj_London <- data.frame(\n pct_Males = 49,\n pct_No_qualifications = 23,\n pct_Higher_manager_prof = 11,\n New_region_label = \"London\"\n)\nobj_Wales <- data.frame(\n pct_Males = 49,\n pct_No_qualifications = 23,\n pct_Higher_manager_prof = 11,\n New_region_label = \"Wales\"\n)\npredict(model2, obj_London)\n\n 1 \n17.19808 \n\npredict(model2, obj_Wales)\n\n 1 \n21.54334 \n\n\nTherefore, the percentage of persons with long-term illness in Wales and London be 21.5% and 17.2% separately. If you got the right answers, then congratulations you can now use regression model to make prediction.",
- "crumbs": [
- "3 Correlation and Multiple Linear Regression with Qualitative Variables"
- ]
- },
- {
- "objectID": "labs/04.LogisticRegression.html",
- "href": "labs/04.LogisticRegression.html",
- "title": "4 LogisticRegression",
- "section": "",
- "text": "4.1 Preparing the input variables\nPrepare the data for implementing a logistic regression model. The data set used in this practical is the “sar_sample_label.csv” and “sar_sample_code.csv”. They are actually the same dataframe, only one uses the label as the value but the other uses the code. We will first read in both for the data overview the labels are more friendly, and then we focus on using “sar_sample_code.csv” in the regression model as it is easier for coding.\nlibrary(tidyverse)\n\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.0 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n\nlibrary(broom)\nsar_label <- read_csv(\"../data/SAR/sar_sample_label.csv\")\n\nRows: 50000 Columns: 39\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (39): country, region, county, la_group, age_group, sex, marital_status,...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n\nsar_code <- read_csv(\"../data/SAR/sar_sample_code.csv\")\n\nRows: 50000 Columns: 39\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (39): country, region, county, la_group, age_group, sex, marital_status,...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\nglimpse(sar_label)\nglimpse(sar_code)\nsummary(sar_label)\n\n country region county la_group \n Length:50000 Length:50000 Length:50000 Length:50000 \n Class :character Class :character Class :character Class :character \n Mode :character Mode :character Mode :character Mode :character \n age_group sex marital_status ethnicity \n Length:50000 Length:50000 Length:50000 Length:50000 \n Class :character Class :character Class :character Class :character \n Mode :character Mode :character Mode :character Mode :character \n accom_type arrival_year birth_country care_hours \n Length:50000 Length:50000 Length:50000 Length:50000 \n Class :character Class :character Class :character Class :character \n Mode :character Mode :character Mode :character Mode :character \n cars census_return central_heating crowding \n Length:50000 Length:50000 Length:50000 Length:50000 \n Class :character Class :character Class :character Class :character \n Mode :character Mode :character Mode :character Mode :character \n dependent_children english_ability health hh_composition \n Length:50000 Length:50000 Length:50000 Length:50000 \n Class :character Class :character Class :character Class :character \n Mode :character Mode :character Mode :character Mode :character \n hh_size highest_qual industry last_worked \n Length:50000 Length:50000 Length:50000 Length:50000 \n Class :character Class :character Class :character Class :character \n Mode :character Mode :character Mode :character Mode :character \n long_term_ill main_language move_distance move_origin \n Length:50000 Length:50000 Length:50000 Length:50000 \n Class :character Class :character Class :character Class :character \n Mode 
:character Mode :character Mode :character Mode :character \n national_identity nssec ons_supergroup relation_to_hrp \n Length:50000 Length:50000 Length:50000 Length:50000 \n Class :character Class :character Class :character Class :character \n Mode :character Mode :character Mode :character Mode :character \n religion rural_urban tenure work_distance \n Length:50000 Length:50000 Length:50000 Length:50000 \n Class :character Class :character Class :character Class :character \n Mode :character Mode :character Mode :character Mode :character \n work_hours work_status work_transport \n Length:50000 Length:50000 Length:50000 \n Class :character Class :character Class :character \n Mode :character Mode :character Mode :character\nThe outcome variable is a person’s commuting distance captured by the variable “work_distance”.\ntable(sar_label$work_distance)\n\n\n 10 to <20 km \n 3650 \n 2 to <5 km \n 4414 \n 20 to <40 km \n 2014 \n 40 to <60km \n 572 \n 5 to <10 km \n 4190 \n 60km or more \n 703 \n Age<16 or not working \n 25975 \n At home \n 2427 \n Less than 2 km \n 4028 \n No fixed place \n 1943 \nWork outside England and Wales but within UK \n 29 \n Work outside UK \n 21 \n Works at offshore installation (within UK) \n 34\ntable(sar_code$work_distance)\n\n\n -9 1 2 3 4 5 6 7 8 9 10 11 12 \n25975 4028 4414 4190 3650 2014 572 703 2427 1943 29 21 34\nThere are a variety of categories in the variable, however, we are only interested in commuting distance and therefore in people reporting their commuting distance. Thus, we will explore the numeric codes of the variable ranging from 1 to 8.\nAs we are also interested in exploring whether people with different socio-economic statuses (or occupations) tend to be associated with varying probabilities of commuting over long distances, we further filter or select cases.\ntable(sar_label$nssec)\n\n\n Child aged 0-15 \n 9698 \n Full-time student \n 3041 \n Higher professional occupations \n 3162 \n Intermediate occupations \n 5288 \n Large employers and higher managers \n 909 \nLower managerial and professional occupations \n 8345 \n Lower supervisory and technical occupations \n 2924 \n Never worked or long-term unemployed \n 2261 \n Routine occupations \n 4660 \n Semi-routine occupations \n 5893 \n Small employers and own account workers \n 3819 \n\ntable(sar_code$nssec)\n\n\n 1 2 3 4 5 6 7 8 9 10 12 \n 909 3162 8345 5288 3819 2924 5893 4660 2261 3041 9698\nUsing nssec, we select people who reported an occupation, and delete cases with numeric codes from 9 to 12, who are unemployed, full-time students, children and not classifiable.\nNow, similar to next week, we use the filter() to prepare our dataframe today. You may already realise that using sar_code is easier to do the filtering.\nsar_df <- sar_code %>% filter(work_distance<=8 & nssec <=8 )\nRecode the “work_distance” variable into a binary dependent variable\nA simple way to create a binary dependent variable representing long-distance commuting is to use the mutate() function as discussed in last week’s practical session. Before creating the binary variables from the “work_distance” variable, we need to define what counts as a long-distance commuting move. Such definition can vary. Here we define a long-distance commuting move as any commuting move over a distance above 60km (the category of “60km or more”).\nsar_df <- sar_df %>% mutate(\n New_work_distance = if_else(work_distance >6, 1,0))\nPrepare your “nssec” variable before the regression model\nThe nssec is a categorical variable. 
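\nSince we read in the coded file, R currently stores it as a number; a quick check:\n\nclass(sar_df$nssec)  # \"numeric\" until we convert it below\n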
Therefore, as we learnt last week, before adding a categorical variable to the regression model, we first need to make it a factor and then identify the reference category.\nWe are interested in whether people whose occupation is “Small employers and own account workers” are associated with a lower probability of commuting over long distances compared to people in other occupations.\nsar_df$nssec <- relevel(as.factor(sar_df$nssec), ref = \"5\")
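\nBefore moving on, it is worth sanity-checking both prepared variables; a minimal sketch:\n\ntable(sar_df$New_work_distance)  # 0 = shorter moves, 1 = long-distance\nlevels(sar_df$nssec)[1]  # \"5\" should now be the reference category",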
- "crumbs": [
- "4 LogisticRegression"
- ]
- },
- {
- "objectID": "labs/04.LogisticRegression.html#preparing-the-input-variables",
- "href": "labs/04.LogisticRegression.html#preparing-the-input-variables",
- "title": "4 LogisticRegression",
- "section": "",
- "text": "Code for Work_distance\nCategories\n\n\n\n\n1\nLess than 2 km\n\n\n2\n2 to <5 km\n\n\n3\n5 to <10 km\n\n\n4\n10 to <20 km\n\n\n5\n20 to <40 km\n\n\n6\n40 to <60 km\n\n\n7\n60km or more\n\n\n8\nAt home\n\n\n9\nNo fixed place\n\n\n10\nWork outside England and Wales but within UK\n\n\n11\nWork outside UK\n\n\n12\nWorks at offshore installation (within UK)\n\n\n\n\n\n\n\n\n\nCode for nssec\nCategory labels\n\n\n\n\n1\nLarge employers and higher managers\n\n\n2\nHigher professional occupations\n\n\n3\nLower managerial and professional occupations\n\n\n4\nIntermediate occupations\n\n\n5\nSmall employers and own account workers\n\n\n6\nLower supervisory and technical occupations\n\n\n7\nSemi-routine occupations\n\n\n8\nRoutine occupations\n\n\n9\nNever worked or long-term employed\n\n\n10\nFull-time student\n\n\n11\nNot classifiable\n\n\n12\nChild aged 0-15\n\n\n\n\n\n\nQ1. Summarise the frequencies of the two variables “work_distance” and “nssec” with the new data.\n\n\n\n\n\nQ2. Check the new sar_df dataframe with new column named New_work_distance by using the codes you have learnt.\n\n\n\n\n\n\n4.1.1 Implementing a logistic regression model\nThe binary dependent variable is long-distance commuting, variable name New_work_distance.\nThe independent variables are gender and socio-economic status.\nFor gender, we use male as the basline.\n\nsar_df$sex <- relevel(as.factor(sar_df$sex),ref=\"1\")\n\nFor socio-economic status, we use code 5 (Small employers and Own account workers) as the baseline category to explore whether people work as independent employers show lower probability of commuting longer than 60km compared with other occupations.\n\n#create the model\nm.glm = glm(New_work_distance~sex + nssec, \n data = sar_df, \n family= \"binomial\")\n# inspect the results\nsummary(m.glm) \n\n\nCall:\nglm(formula = New_work_distance ~ sex + nssec, family = \"binomial\", \n data = sar_df)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -0.44698 0.04138 -10.801 <2e-16 ***\nsex2 -0.36678 0.04196 -8.742 <2e-16 ***\nnssec1 -1.35519 0.10782 -12.569 <2e-16 ***\nnssec2 -1.22639 0.06489 -18.898 <2e-16 ***\nnssec3 -1.61400 0.05482 -29.442 <2e-16 ***\nnssec4 -2.25717 0.07696 -29.329 <2e-16 ***\nnssec6 -2.61631 0.10377 -25.212 <2e-16 ***\nnssec7 -2.66548 0.08317 -32.047 <2e-16 ***\nnssec8 -2.71172 0.09021 -30.059 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 20441 on 33025 degrees of freedom\nResidual deviance: 17968 on 33017 degrees of freedom\nAIC: 17986\n\nNumber of Fisher Scoring iterations: 6\n\n\n\n# odds ratios\nexp(coef(m.glm)) \n\n(Intercept) sex2 nssec1 nssec2 nssec3 nssec4 \n 0.63955383 0.69296493 0.25789714 0.29335107 0.19909052 0.10464615 \n nssec6 nssec7 nssec8 \n 0.07307217 0.06956621 0.06642224 \n\n\n\n# confidence intervals\nexp(confint(m.glm, level = 0.95)) \n\nWaiting for profiling to be done...\n\n\n 2.5 % 97.5 %\n(Intercept) 0.58959085 0.69343772\nsex2 0.63818099 0.75227729\nnssec1 0.20787318 0.31733101\nnssec2 0.25813190 0.33291942\nnssec3 0.17876868 0.22162844\nnssec4 0.08984430 0.12149367\nnssec6 0.05933796 0.08915692\nnssec7 0.05895858 0.08169756\nnssec8 0.05547661 0.07902917\n\n\nQ3. 
If we want to explore whether people with occupation being “Large employers and higher managers”, “Higher professional occupations” and “Routine occupations” are associated with higher probability of commuting over long distance when comparing to people in other occupation, how will we prepare the input independent variables and what will be the specified regression model?\nHint: use mutate() to create a new column, set the value of “Large employers and higher managers”, “Higher professional occupations” and “Routine occupations” as original, while the rest as “Other occupations” ().\n\nsar_df <- sar_df %>% mutate(New_nssec = if_else(!nssec %in% c(1,2,8), \"0\" ,nssec))\n\nUse “Other occupations” (code: 0) as the reference category by relevel(as.factor()) and then create the regression model: glm(New_work_distance~sex + New_nssec, data = sar_df, family= \"binomial\")\n\n\n4.1.2 Model fit\nRelating back to this week’s lecture notes, what is the Pseudo R2 of the fitted logistic model (from the Model Summary table below)?\n\nif(!require(\"pscl\"))\n install.packages(\"pscl\")\n\nLoading required package: pscl\n\n\nWarning: package 'pscl' was built under R version 4.3.3\n\n\nClasses and Methods for R originally developed in the\nPolitical Science Computational Laboratory\nDepartment of Political Science\nStanford University (2002-2015),\nby and under the direction of Simon Jackman.\nhurdle and zeroinfl functions by Achim Zeileis.\n\nlibrary(pscl)\n\n# Pseudo R-squared\ntidy(pR2(m.glm))\n\nfitting null model for pseudo-r2\n\n\nWarning: 'tidy.numeric' is deprecated.\nSee help(\"Deprecated\")\n\n\n# A tibble: 6 × 2\n names x\n <chr> <dbl>\n1 llh -8984. \n2 llhNull -10220. \n3 G2 2473. \n4 McFadden 0.121 \n5 r2ML 0.0721\n6 r2CU 0.156 \n\n\n\n\n4.1.3 Interpreting estimated regression coefficients\n\nThe interpretation of coefficients (B) and odds ratios (Exp(B)) for the independent variables differs from that in a linear regression setting.\nInterpreting the regression coefficients.\n\no For the variable Sex, a negative sign and the odds ratio estimate indicate that being female decreases the odds of commuting over long distances by a factor of 0.462, holding all other variables constant. Put it differently, the odds of commuting over long distances for females are 53.8% (i.e., 1 – 0.462, presented as percentages) smaller than that for males, holding all other variables constant.\no For variable “nssec=Higher professional occupations”, a positive sign and the odds ratio estimate indicate that being employed in a higher professional occupation increases the odds of commuting over long distances by a factor of 1.873 comparing to being employed in other occupations (the baseline categories), holding all other variables constant (the Sex variable).\n\nQ4. Interpret the regression coefficients (i.e. Exp(B)) of variables “nssec=Large employers and higher managers” and “nssec=Routine occupations”.\n\n\n\n4.1.4 Statistical significance of regression coefficients or covariate effects\nSimilar to the statistical inference in a linear regression model context, p-values of regression coefficients are used to assess significances of coefficients; for instance, by comparing p-values to the conventional level of significance of 0.05:\n· If the p-value of a coefficient is smaller than 0.05, the coefficient is statistically significant. 
In this case, you can say that the relationship between an independent variable and the outcome variable is statistically significant.\n· If the p-value of a coefficient is larger than 0.05, the coefficient is statistically insignificant. In this case, you can say or conclude that there is no statistically significant association or relationship between an independent variable and the outcome variable.\n\nQ5. Could you identify significant factors of commuting over long distances?",
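\nA compact way to collect the odds ratios, confidence intervals and p-values in one table is broom’s tidy(), which was loaded earlier; a sketch:\n\ntidy(m.glm, exponentiate = TRUE, conf.int = TRUE)  # estimates reported as odds ratios",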
- "crumbs": [
- "4 LogisticRegression"
- ]
- },
- {
- "objectID": "labs/04.LogisticRegression.html#extension-activities",
- "href": "labs/04.LogisticRegression.html#extension-activities",
- "title": "4 LogisticRegression",
- "section": "4.2 Extension activities",
- "text": "4.2 Extension activities\nThe extension activities are designed to get yourself prepared for the Assignment 2 in progress. For this week, try whether you can:\n\nSelect a regression strategy and explain why a linear or logit model is appropriate\nPerform one or a series of regression models, including different combinations of your chosen independent variables to explain and/or predict your dependent variable",
- "crumbs": [
- "4 LogisticRegression"
- ]
- }
-]
\ No newline at end of file
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
deleted file mode 100644
index 6caf0f4..0000000
--- a/docs/sitemap.xml
+++ /dev/null
@@ -1,35 +0,0 @@
-
-