diff --git a/docs/labs/03.QualitativeVariable.html b/docs/labs/03.QualitativeVariable.html index 16a7a3e..9e26146 100644 --- a/docs/labs/03.QualitativeVariable.html +++ b/docs/labs/03.QualitativeVariable.html @@ -253,8 +253,7 @@

3  3 

3.1 Analysis categorical variables

Recall in Week 7, you get familiar to R by using the Family Resource Survey data. Today we will keep explore the data by using its categorical variables. As usual we first load the necessary libraries.

+

Some tips to avoid R returning can’t find data errors:

+

Check your working directory by

+
+
getwd()
+
+
[1] "C:/Users/ziye/Documents/GitHub/stats/labs"
+
+
+

Check the relative path of your data folder on your PC/laptop, make sure you know the relative path of your data from your workding directory, returned by getwd().

+

Library knowledge used in today:

+

dplyr: a basic library provides a suite of functions for data manipulation

+

ggplot2: a widely-used data visualisation library to help you create nice plots through layered plotting.

+

tidyverse: a collection of R packages designed for data science, offering a cohesive framework for data manipulation, visualization, and analysis. Containing dyplyr, ggplot2 and other basic libraries.

+

broom: a part of the tidyverse and is designed to convert statistical analysis results into tidy data frames.

3.1.1 Data overview

-
# Load necessary libraries 
-library(ggplot2) 
+
if(!require("dplyr"))
+  install.packages("dplyr")
-
Warning: package 'ggplot2' was built under R version 4.3.2
+
Warning: package 'dplyr' was built under R version 4.3.2
-
library(dplyr) 
+
# Load necessary libraries 
+if(!require("ggplot2"))
+  install.packages("ggplot2")
-
Warning: package 'dplyr' was built under R version 4.3.2
+
Warning: package 'ggplot2' was built under R version 4.3.2
+
library(dplyr) 
+library(ggplot2) 
-

#or use tidyverse which includes ggplot2, dplyr and other foundamental libraries, remember you need first install the package if you haven’t by using install.packages(“tidyverse”)

+

Or we can use library tidyverse which complies ggplot2, dplyr and other foundamental libraries together already, remember you need first install the package if you haven’t by using install.packages("tidyverse").

-
library(tidyverse)
+
if(!require("tidyverse"))
+  install.packages("tidyverse")
Warning: package 'tidyverse' was built under R version 4.3.2
@@ -315,25 +333,26 @@

Warning: package 'lubridate' was built under R version 4.3.2

+
library(tidyverse)

Exactly as you did in previous weeks, we first load in the dataset:

-
frs_data <- read.csv("../data/FamilyResourceSurvey/FRS16-17_labels.csv")
+
frs_data <- read.csv("../data/FamilyResourceSurvey/FRS16-17_labels.csv")

Recall in previous weeks, we used the following code to overview the dataset. Familiar yourself again by using them:

-
View(frs_data)
-glimpse(frs_data)
+
View(frs_data)
+glimpse(frs_data)

and also summary() to produce summaries of each variable

-
summary(frs_data)
+
summary(frs_data)

You may notice that for the numeric variables such as hh_income_gross (household gross income) and work_hours(worked hours per week), the summary() offers useful descriptive statistics. While for the qualitative information, such as age_group (age group), highest_qual (Highest educational qualification), marital_status (Marital status) and nssec (Socio-economic status), the summary() function is not that useful by providing mean or median values.

Performing descriptive analysis for categorical variables or qualitative variables, we focus on summarising the frequency and distribution of categories within the variable. This analysis helps understand the composition and diversity of categories in the data, which is especially useful for identifying patterns, common categories, or potential data imbalances.

-
# Frequency count
-table(frs_data$age_group)
+
# Frequency count
+table(frs_data$age_group)

   0-4 05-10 11-15 16-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 
@@ -341,7 +360,7 @@ 

-
table(frs_data$highest_qual)
+
table(frs_data$highest_qual)

 A-level or equivalent       Degree or above       Dependent child 
@@ -349,7 +368,7 @@ 

-
table(frs_data$marital_status)
+
table(frs_data$marital_status)

                           Cohabiting Divorced/civil partnership dissolved 
@@ -359,7 +378,7 @@ 

-
table(frs_data$nssec)
+
table(frs_data$nssec)

                               Dependent child 
@@ -390,42 +409,42 @@ 

By using ggplot2, it is easy to create some nice descriptive charts for the categorical variables, such like what you did for the continuous variables last week.

-
ggplot(frs_data, aes(x = highest_qual)) +
-  geom_bar(fill="brown",width=0.5) +
-  labs(title = "Histogram of Highest Qualification in FRS", x = "Highest Qualification", y = "Count")+#set text info
-  theme_classic()#choose theme type, try theme_bw(), theme_minimal() see differences
+
ggplot(frs_data, aes(x = highest_qual)) +
+  geom_bar(fill="brown",width=0.5) +
+  labs(title = "Histogram of Highest Qualification in FRS", x = "Highest Qualification", y = "Count")+#set text info
+  theme_classic()#choose theme type, try theme_bw(), theme_minimal() see differences
-

+

-
ggplot(frs_data, aes(x = health)) +
-  geom_bar(fill="skyblue") +
-  geom_text(stat = "count", aes(label = ..count..),vjust = -0.3,colour = "grey")+ #add text
-  labs(title = "Histogram of Health in FRS", x = "Health", y = "Count")+#set text info
-  theme_minimal()
+
ggplot(frs_data, aes(x = health)) +
+  geom_bar(fill="skyblue") +
+  geom_text(stat = "count", aes(label = ..count..),vjust = -0.3,colour = "grey")+ #add text
+  labs(title = "Histogram of Health in FRS", x = "Health", y = "Count")+#set text info
+  theme_minimal()
-

+

-
ggplot(frs_data, aes(x = nssec)) + 
-  geom_bar(fill = "yellow4") + 
-  labs(title = "Histogram of NSSEC in FRS", x = "NSSEC", y = "Count") +
-  coord_flip()+ #Flip the Axes, add a # in front of this line, to make the code in gray and you will see why we would better flip the axes at here
-  theme_bw() 
+
ggplot(frs_data, aes(x = nssec)) + 
+  geom_bar(fill = "yellow4") + 
+  labs(title = "Histogram of NSSEC in FRS", x = "NSSEC", y = "Count") +
+  coord_flip()+ #Flip the Axes, add a # in front of this line, to make the code in gray and you will see why we would better flip the axes at here
+  theme_bw() 
-

+

@@ -441,7 +460,7 @@

To calculate the correlation between categorical data, we first use Chi-squared test to assess the independence between pairs of categorical variables, then we use Cramer’s V to measures the strength of association - the correlation coefficents in R.

Pearson’s chi-squared test (χ2) is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance. If the p-value is low (typically < 0.05), it suggests a significant association between the two variables.

-
chisq.test(frs_data$health,frs_data$happy) 
+
chisq.test(frs_data$health,frs_data$happy) 
Warning in chisq.test(frs_data$health, frs_data$happy): Chi-squared
 approximation may be incorrect
@@ -457,9 +476,9 @@

If you see a warning message of Chi-squared approximation may be incorrect. This is because some expected frequencies in one or more cells of the cross-tabular (health * happy) are too low. The df means degrees of freedom and it related to the size of the table and the number of categories in each variable. The most important message from the output is the estimated p-value, which shows as p-value < 2.2e-16 (2.2 with 16 decimals move to the left, it is a very small number so written in scientific notation). P-value of the chi-squared test is far smaller than 0.05, so we can say the correlation is statistically significant.

Cramér’s V is a measure of association for categorical (nominal or ordinal) data. It ranges from 0 (no association) to 1 (strong association). The main downside of using Cramer’s V is that no information is provided on whether the correlation is positive or negative. This is not a problem if the variable pair includes a nominal variable but represents an information loss if the both variables being correlated are ordinal.

-
# Install the 'vcd' package if not installed 
-if(!require("vcd"))   
-install.packages("vcd", repos = "https://cran.r-project.org")
+
# Install the 'vcd' package if not installed 
+if(!require("vcd"))   
+install.packages("vcd", repos = "https://cran.r-project.org")
Loading required package: vcd
@@ -469,10 +488,10 @@

Loading required package: grid

-
library(vcd)  
-
-# Calculate Cramér's V 
-assocstats(table(frs_data$health, frs_data$happy))
+
library(vcd)  
+
+# Calculate Cramér's V 
+assocstats(table(frs_data$health, frs_data$happy))
                   X^2 df P(> X^2)
 Likelihood Ratio 54036 60        0
@@ -544,38 +563,38 @@ 

-
# load data
-LAcensus <- read.csv("../data/Census2011/UK_DistrictPercentages.csv") # Local authority level
+
# load data
+LAcensus <- read.csv("../data/Census2011/UK_DistrictPercentages.csv") # Local authority level

Using the district-level census dataset “UK_DistrictPercentages.csv”. the variable “Region” (labelled as Government Office Region) is used to explore regional inequality in health.

Familiar yourself with the dataset by using the same codes as last week:

-
#view the data 
-View(LAcensus)  
-glimpse(LAcensus)
+
#view the data 
+View(LAcensus)  
+glimpse(LAcensus)

The names() function returns all the column names.

-
names(df)
+
names(df)

The dim() function can merely returns the number of rows and number of columns.

-
dim(LAcensus) 
+
dim(LAcensus) 
[1] 406 128

There are 406 rows and 130 columns in the dataset. It would be very hard to scan throught the data if we use so many variables altogether. Therefore, we can select several columns to tailor for this practical. You can of course include other variables you are interested in also by their names:

-
df <- LAcensus %>% select(c("pct_Long_term_ill",
-                            "pct_No_qualifications",
-                            "pct_Males",
-                            "pct_Higher_manager_prof",
-                            "Region"))
+
df <- LAcensus %>% select(c("pct_Long_term_ill",
+                            "pct_No_qualifications",
+                            "pct_Males",
+                            "pct_Higher_manager_prof",
+                            "Region"))

Simply descriptive of this new data

-
summary(df)
+
summary(df)
 pct_Long_term_ill pct_No_qualifications   pct_Males    
  Min.   :11.20     Min.   : 6.721        Min.   :47.49  
@@ -599,7 +618,7 @@ 

-
table(df$Region) 
+
table(df$Region) 

  1  2  3  4  5  6  7  8  9 10 11 12 
@@ -610,22 +629,22 @@ 

-
df$Region_label <- factor(df$Region,c(1:12),labels=c("East Midlands",
-                                                     "East of England",
-                                                     "London",
-                                                     "North East",
-                                                     "North West",
-                                                     "South East",
-                                                     "South West",
-                                                     "West Midlands",
-                                                     "Yorkshire and the Humber",
-                                                     "Wales",
-                                                     "Scotland",
-                                                     "Northern Ireland")) 
+
df$Region_label <- factor(df$Region,c(1:12),labels=c("East Midlands",
+                                                     "East of England",
+                                                     "London",
+                                                     "North East",
+                                                     "North West",
+                                                     "South East",
+                                                     "South West",
+                                                     "West Midlands",
+                                                     "Yorkshire and the Humber",
+                                                     "Wales",
+                                                     "Scotland",
+                                                     "Northern Ireland")) 

If you re-run the table() function, the output is now more readable:

-
table(df$Region_label)
+
table(df$Region_label)

            East Midlands          East of England                   London 
@@ -645,13 +664,13 @@ 

-
df$Region_label <- relevel(df$Region_label, ref = "London")
+
df$Region_label <- relevel(df$Region_label, ref = "London")

Similar to last week, we build our linear regression model, but also include the Region_label variable into the model.

-
model <- lm(pct_Long_term_ill ~ pct_Males + pct_No_qualifications + pct_Higher_manager_prof + Region_label, data = df)
-
-summary(model)
+
model <- lm(pct_Long_term_ill ~ pct_Males + pct_No_qualifications + pct_Higher_manager_prof + Region_label, data = df)
+
+summary(model)

 Call:
@@ -782,13 +801,13 @@ 

3.2.2 Change the baseline category

If you would like to learn about differences in long-term illness between East of England and other regions in the UK, you need to change the baseline category (from London) to the East of England region (with variable name “Region_2”).

-
df$Region_label <- relevel(df$Region_label, ref = "East of England")
+
df$Region_label <- relevel(df$Region_label, ref = "East of England")

The regression model is specified again as follows:

-
model1 <- lm(pct_Long_term_ill ~ pct_Males + pct_No_qualifications + pct_Higher_manager_prof + Region_label, data = df)
-
-summary(model1)
+
model1 <- lm(pct_Long_term_ill ~ pct_Males + pct_No_qualifications + pct_Higher_manager_prof + Region_label, data = df)
+
+summary(model1)

 Call:
@@ -886,7 +905,7 @@ 

-
df <- df %>% mutate(New_region_label = if_else(!Region_label %in% c("London","Wales","Scotland","Northern Ireland"), "Other regions in England",Region_label))
+
df <- df %>% mutate(New_region_label = if_else(!Region_label %in% c("London","Wales","Scotland","Northern Ireland"), "Other regions in England",Region_label))

This code may looks a bit complex. You can simply type ?mutate in your console. Now in your right hand Help window, the R studio offers your the explanation of the mutate function. This is a common way you can use R studio to help you learn what the function caate() creates new columns that are functions of existing variables. Therefore, the df %>% mutate() means add a new column into the current dataframe df; the New_region_label in the mutate() function indicates the name of this new column is New_region_label. The right side of the New_region_label = indicates the value we want to assign to the New_region_label in each row.

The right side of New_region_label is

@@ -894,7 +913,7 @@

-
table(df$New_region_label)
+
table(df$New_region_label)

                   London         Northern Ireland Other regions in England 
@@ -905,23 +924,23 @@ 

-
df[,c("Region_label","New_region_label")]
+
df[,c("Region_label","New_region_label")]

Now you will have a new qualitative variable named New_region_label in which the UK is divided into five regions: London, Other regions in England, Wales, Scotland and Northern Ireland.

Based on the newly generated qualitative variable New_region_label, we can now build our new linear regression model. Don’t forget:

(1) R need to deal with the categorical variables in regression model in the factor type;

-
df$New_region_label = as.factor(df$New_region_label)
+
df$New_region_label = as.factor(df$New_region_label)

2) Let R know which region you want to use as the baseline category. Here I will use London again, but of course you can choose other regions.

-
df$New_region_label <- relevel(df$New_region_label, ref = "London")
+
df$New_region_label <- relevel(df$New_region_label, ref = "London")

The linear regression window is set up below. This time we include New_region_label rather than Region_label as the region variable:

-
model2 <- lm(pct_Long_term_ill ~ pct_Males + pct_No_qualifications + pct_Higher_manager_prof + New_region_label, data = df)
-
-summary(model2)
+
model2 <- lm(pct_Long_term_ill ~ pct_Males + pct_No_qualifications + pct_Higher_manager_prof + New_region_label, data = df)
+
+summary(model2)

 Call:
@@ -982,21 +1001,21 @@ 

-
frs_df <- frs_data %>% filter(hrp=="HRP" & dependent=="Independent" & health!="Not known") 
+
frs_df <- frs_data %>% filter(hrp=="HRP" & dependent=="Independent" & health!="Not known") 

Then, we create a new numeric variable Net_inc_perc indicate net income per capita as our dependent variable:

-
frs_df$Net_inc_perc = frs_df$hh_income_net / frs_df$hh_size
+
frs_df$Net_inc_perc = frs_df$hh_income_net / frs_df$hh_size

Our two qualitative independent variables “sex” and “health”. Let’s first know what they look like:

-
table(frs_df$sex)
+
table(frs_df$sex)

 Female   Male 
   7647   9180 
-
table(frs_df$health)
+
table(frs_df$health)

       Bad      Fair      Good  Very Bad Very Good 
@@ -1005,13 +1024,13 @@ 

-
frs_df$sex <- relevel(as.factor(frs_df$sex), ref = "Male")
-frs_df$health <- relevel(as.factor(frs_df$health), ref = "Fair")
+
frs_df$sex <- relevel(as.factor(frs_df$sex), ref = "Male")
+frs_df$health <- relevel(as.factor(frs_df$health), ref = "Fair")

Implement the regression model with the two qualitative independent variables.

-
model_frs <- lm(Net_inc_perc ~ sex + health, data = frs_df)
-summary(model_frs)
+
model_frs <- lm(Net_inc_perc ~ sex + health, data = frs_df)
+summary(model_frs)

 Call:
@@ -1039,8 +1058,8 @@ 

-
library(broom)
-tidy(model_frs)
+
library(broom)
+tidy(model_frs)
# A tibble: 6 × 5
   term            estimate std.error statistic  p.value
@@ -1074,14 +1093,14 @@ 

the model of London will be: pct_Long_term_ill (%) = 47.218+ (-0.834)* 49 + 0.472 * 23 + 1.072*0+ 4.345* 0 = 17.208

You can also make a new object like

-
obj_London <- data.frame(pct_Males = 49, pct_No_qualifications=23,pct_Higher_manager_prof =11, New_region_label ="London")
-obj_Wales <- data.frame(pct_Males = 49, pct_No_qualifications=23,pct_Higher_manager_prof =11, New_region_label ="Wales")
-predict(model2,obj_London)
+
obj_London <- data.frame(pct_Males = 49, pct_No_qualifications=23,pct_Higher_manager_prof =11, New_region_label ="London")
+obj_Wales <- data.frame(pct_Males = 49, pct_No_qualifications=23,pct_Higher_manager_prof =11, New_region_label ="Wales")
+predict(model2,obj_London)
       1 
 17.19808 
-
predict(model2,obj_Wales)
+
predict(model2,obj_Wales)
       1 
 21.54334 
diff --git a/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-10-1.png b/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-10-1.png new file mode 100644 index 0000000..e3cb00f Binary files /dev/null and b/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-10-1.png differ diff --git a/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-8-1.png b/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-8-1.png index b8a5c6f..cf8a61b 100644 Binary files a/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-8-1.png and b/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-9-1.png b/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-9-1.png index e3cb00f..b8a5c6f 100644 Binary files a/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-9-1.png and b/docs/labs/03.QualitativeVariable_files/figure-html/unnamed-chunk-9-1.png differ