Added lab for PCA and clean slides on regression.
astamm committed Jan 5, 2025
1 parent b1237c1 commit 7234691
Showing 7 changed files with 115 additions and 48 deletions.
104 changes: 59 additions & 45 deletions 11_Linear_Modeling/11-Linear-Modeling-Slides.qmd
@@ -278,34 +278,34 @@ first_model |>
```
:::

## Reminder: model assumptions (1/3)

::: {.columns}

::: {.column}

::: {.callout-important title="Assumptions to check"}
1. **Linearity.** The relationship between the predictor (x) and the outcome (y) is assumed to be linear.
2. **Normality.** The error terms are assumed to be normally distributed.
3. **Homogeneity of variance.** The error terms are assumed to have a constant variance (**homoscedasticity**).
4. **Independence.** The error terms are assumed to be independent.
:::

:::

::: {.column}

::: {.callout-important title="Potential problems to check"}
1. **Non-linearity** of the outcome-predictor relationships.
2. **Heteroscedasticity**: non-constant variance of error terms.
3. **Presence of influential and potential outlier values** in the data:
   - outliers: typically large standardized residuals;
   - high-leverage points: typically large leverage values.
:::

All these assumptions and potential problems can be checked by producing some diagnostic plots visualizing the residual errors.
:::

Source: <http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/>

:::
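The residual diagnostics just described can be sketched in a few lines of base R. This is a minimal illustration on the built-in `mtcars` data set, not part of the original slides:

```r
# Minimal sketch on the built-in mtcars data (not the course data)
fit <- lm(mpg ~ wt, data = mtcars)

# Outliers: standardized residuals beyond |2| are worth a look
std_res <- rstandard(fit)
which(abs(std_res) > 2)

# High-leverage points: hat values above 2p/n are a common flag
p <- length(coef(fit))
n <- nrow(mtcars)
which(hatvalues(fit) > 2 * p / n)

# The four standard diagnostic plots (residuals vs fitted, normal Q-Q,
# scale-location, residuals vs leverage) come from plot() on the fit
par(mfrow = c(2, 2))
plot(fit)
```

The |2| and 2p/n cutoffs are common rules of thumb, not hard thresholds.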

## First data set: Residuals vs fitted values
@@ -451,32 +451,46 @@ There are two possible software suites:
- either [`jtools`](https://jtools.jacob-long.com/index.html) and [`interactions`](https://interactions.jacob-long.com) packages;
- or [`ggeffects`](https://strengejacke.github.io/ggeffects/index.html) and [`sjPlot`](https://strengejacke.github.io/sjPlot/) packages.

:::

```{r}
#| echo: true
# install.packages(c("jtools", "interactions"))
# install.packages(c("ggeffects", "sjPlot"))
fit1 <- lm(
  metascore ~ imdb_rating + log(us_gross) + genre5,
  data = jtools::movies
)
fit2 <- lm(
  metascore ~ imdb_rating + log(us_gross) + log(budget) + genre5,
  data = jtools::movies
)
```

They both provide tools for summarizing and visualising models, marginal effects, interactions and model predictions.

## Tabular summary (1/2) {.smaller}

```{r}
#| echo: true
jtools::summ(fit1)
```

## Tabular summary (2/2) {.smaller}

```{r}
#| echo: true
jtools::export_summs(
  fit1, fit2,
  error_format = "[{conf.low}, {conf.high}]", error_pos = "right"
)
```

## Visual summary

```{r}
#| echo: true
jtools::plot_summs(fit1, fit2, inner_ci_level = .9)
```

## Effect plot - Continuous predictor

@@ -547,7 +561,7 @@ $$

where $R_i^2$ is the multiple correlation coefficient associated with the regression of $X_i$ on all remaining independent variables.
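The definition can be checked by hand: regress each predictor on the others and plug the resulting $R_i^2$ into $\text{VIF}_i = 1/(1 - R_i^2)$. A minimal sketch on the built-in `mtcars` data (an illustration, not taken from the slides):

```r
# Hand-computed VIFs: VIF_i = 1 / (1 - R_i^2)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
X <- model.matrix(fit)[, -1]  # predictor matrix without the intercept

vif_manual <- sapply(seq_len(ncol(X)), function(i) {
  # R_i^2 from regressing predictor i on the remaining predictors
  r2 <- summary(lm(X[, i] ~ X[, -i]))$r.squared
  1 / (1 - r2)
})
names(vif_manual) <- colnames(X)
round(vif_manual, 2)
```

These values should match what `jtools::summ(fit, vifs = TRUE)` reports.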

## Variance Inflation Factor (2/3) {.smaller}

You can add VIFs to the summary of a model as follows:

@@ -558,22 +572,22 @@ jtools::summ(fit1, vifs = TRUE)

## Variance Inflation Factor (3/3)

- VIFs are always at least equal to 1.
- In some domains, a VIF over 2 is worthy of suspicion. Others set the bar higher, at 5 or 10. Others still will say you should not pay attention to these at all.
- Small effects are more likely to be *drowned out* by higher VIFs, but this may just be a natural, unavoidable fact with your model (e.g., there is no problem with high VIFs when you have an interaction effect).

## Model selection {.smaller}

::: {.callout-note title="Forward selection"}
- Start with no variables in the model.
- Test the addition of each variable using a chosen model fit criterion[^1], adding the variable (if any) whose inclusion gives the most statistically significant improvement of the fit.
- Repeat until none improves the model to a statistically significant extent.
:::

::: {.callout-note title="Backward elimination"}
- Start with all candidate variables.
- Test the deletion of each variable using a chosen model fit criterion, deleting the variable (if any) whose loss gives the most statistically insignificant deterioration of the model fit.
- Repeat until no further variables can be deleted without a statistically significant loss of fit.
:::

[^1]: A list of possible and often used model fit criteria is available on the [Wikipedia Model Selection](https://en.wikipedia.org/wiki/Model_selection) page. In R, the basic `stats::step()` function uses Akaike's information criterion (AIC) and allows one to perform forward selection (`direction = "forward"`) or backward elimination (`direction = "backward"`).
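The `stats::step()` function mentioned in the footnote can drive both strategies. A short sketch on the built-in `mtcars` data (an illustration, not the course data):

```r
# Backward elimination: start from the full model and repeatedly drop
# the term whose removal most lowers the AIC
full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
backward <- step(full, direction = "backward", trace = 0)

# Forward selection: start from the intercept-only model and add terms
# within the scope of the full model
null <- lm(mpg ~ 1, data = mtcars)
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

formula(backward)
formula(forward)
```

Note that the two strategies need not select the same model.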
Binary file added 12_PCA/12-PCA-Data.zip
51 changes: 51 additions & 0 deletions 12_PCA/12-PCA-Exercises.qmd
@@ -0,0 +1,51 @@
---
title: "PCA - Exercises"
---

```{r}
#| label: setup
#| include: false
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(comment = NA)
library(tidyverse)
library(DT)
library(FactoMineR)
library(factoextra)
library(corrplot)
library(viridis)
html_table <- function(x, offset = 5) {
  DT::datatable(
    data = x,
    rownames = TRUE,
    options = list(
      scrollX = TRUE,
      searching = FALSE,
      lengthMenu = c(0, 5, 10, 15) + offset
    )
  )
}
```

## The `temperature.csv` data set

This data set contains:

- the temperature records of European capitals from January to December;
- the GPS coordinates of each city;
- the thermal amplitude: the difference between the maximum and minimum temperatures;
- the annual mean temperature;
- a qualitative variable: the direction (S, N, W, E).

Run a PCA to identify typical temperature profiles and which cities follow them.

## The `chicken.csv` data set

- Description: 43 chickens subjected to 6 diets: normal diet (N), fasting for 16h (F16), fasting for 16h followed by 5h of refeeding (F16R5), (F16R16), (F48), (F48R24).
- Variables: after the diet, a gene expression analysis was performed using a DNA microarray: 7407 gene expressions.
- Objective: see whether genes are expressed differently depending on the stress level. How long does it take a chicken to return to the normal situation?

## The `orange.csv` data set

Six orange juices from different manufacturers were evaluated.
Are all the variables indispensable?
Do some juices stand out as particularly good? Or bad?
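As a starting point for these exercises, here is a minimal base-R PCA sketch. It uses the built-in `iris` data as a stand-in, since the exact column layout of the CSV files above is left to the reader; the course's setup chunk also loads `FactoMineR`/`factoextra`, whose `PCA()` and `fviz_pca_*()` functions offer richer output:

```r
# Active quantitative variables are scaled before extracting components;
# a qualitative variable (such as `direction`) would be kept aside as
# supplementary, not used to build the components
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Share of variance carried by each principal component
round(summary(pca)$importance["Proportion of Variance", ], 3)

# Coordinates of the individuals on the first two components:
# rows with similar profiles (cities, chickens, juices) end up close
head(pca$x[, 1:2])
```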
Binary file added 12_PCA/pca-principle.gif
2 changes: 1 addition & 1 deletion README.md
@@ -49,7 +49,7 @@ written in [Quarto](https://quarto.org).
|------|-------|--------|-----------|-----------------|
| 1 | Hypothesis Testing | [Quarto](10_Hypothesis_Testing/10-Hypothesis-Testing-Slides.qmd) | [Quarto](10_Hypothesis_Testing/10-Hypothesis-Testing-Exercises.qmd) | |
| 2 | Linear Regression | [Quarto](11_Linear_Modeling/11-Linear-Modeling-Slides.qmd) | [Quarto](11_Linear_Modeling/11-Linear-Modeling-Exercises.qmd) | [ZIP](11_Linear_Modeling/11-Linear-Modeling-Data.zip) |
| 3 | Principal Component Analysis | [Quarto](12_PCA/12-PCA-Slides.qmd) | [Quarto](12_PCA/12-PCA-Exercises.qmd) | [ZIP](12_PCA/12-PCA-Data.zip) |
| 4 | Clustering | [Quarto](13_Clustering/13-Clustering-Slides.qmd) | [Quarto](13_Clustering/13-Clustering-Exercises.qmd) | |

### Requirements
4 changes: 3 additions & 1 deletion _quarto.yml
@@ -8,7 +8,7 @@ website:
  repo-actions: [edit, source, issue]
  page-footer:
    background: light
    left: Copyright 2025, Aymeric Stamm
    right: This website is built with [Quarto](https://quarto.org/)
  navbar:
    left:
@@ -60,6 +60,8 @@ website:
          href: 10_Hypothesis_Testing/10-Hypothesis-Testing-Exercises.qmd
        - text: "2 Linear Modeling"
          href: 11_Linear_Modeling/11-Linear-Modeling-Exercises.qmd
        - text: "3 Principal Component Analysis"
          href: 12_PCA/12-PCA-Exercises.qmd
        - text: "Homework Assignment"
          href: project.qmd
        - text: "First Exam"
2 changes: 1 addition & 1 deletion index.qmd
@@ -61,7 +61,7 @@ written in [Quarto](https://quarto.org).
|------|-------|--------|-----------|-----------------|
| 1 | Hypothesis Testing | [Quarto](10_Hypothesis_Testing/10-Hypothesis-Testing-Slides.qmd) | [Quarto](10_Hypothesis_Testing/10-Hypothesis-Testing-Exercises.qmd) | |
| 2 | Linear Regression | [Quarto](11_Linear_Modeling/11-Linear-Modeling-Slides.qmd) | [Quarto](11_Linear_Modeling/11-Linear-Modeling-Exercises.qmd) | [ZIP](11_Linear_Modeling/11-Linear-Modeling-Data.zip) |
| 3 | Principal Component Analysis | [Quarto](12_PCA/12-PCA-Slides.qmd) | [Quarto](12_PCA/12-PCA-Exercises.qmd) | [ZIP](12_PCA/12-PCA-Data.zip) |
| 4 | Clustering | [Quarto](13_Clustering/13-Clustering-Slides.qmd) | [Quarto](13_Clustering/13-Clustering-Exercises.qmd) | |

## Requirements
