Added lab for PCA and clean slides on regression.
astamm committed Jan 5, 2025
1 parent b1237c1 commit 7234691
Showing 7 changed files with 115 additions and 48 deletions.
104 changes: 59 additions & 45 deletions 11_Linear_Modeling/11-Linear-Modeling-Slides.qmd
@@ -278,34 +278,34 @@ first_model |>
```
:::

## Reminder: model assumptions (1/3)

::: {.columns}

::: {.column}

::: {.callout-important title="Assumptions to check"}
1. **Linearity.** The relationship between the predictor (x) and the outcome (y) is assumed to be linear.
2. **Normality.** The error terms are assumed to be normally distributed.
3. **Homogeneity of variance.** The error terms are assumed to have a constant variance (**homoscedasticity**).
4. **Independence.** The error terms are assumed to be independent.
:::

:::

::: {.column}

::: {.callout-important title="Potential problems to check"}
1. **Non-linearity** of the outcome-predictor relationships.
2. **Heteroscedasticity**: non-constant variance of error terms.
3. **Presence of influential and potential outlier values** in the data:
   - outliers: typically large standardized residuals;
   - high-leverage points: typically large leverage values.
:::

All these assumptions and potential problems can be checked by producing some diagnostic plots visualizing the residual errors.
:::

Source: <http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/>

:::
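The residual diagnostics just described can be sketched in a few lines of base R. This is a minimal illustration on the built-in `mtcars` data set, not part of the original slides:

```r
# Minimal sketch on the built-in mtcars data (not the course data)
fit <- lm(mpg ~ wt, data = mtcars)

# Outliers: standardized residuals beyond |2| are worth a look
std_res <- rstandard(fit)
which(abs(std_res) > 2)

# High-leverage points: hat values above 2p/n are a common flag
p <- length(coef(fit))
n <- nrow(mtcars)
which(hatvalues(fit) > 2 * p / n)

# The four standard diagnostic plots (residuals vs fitted, normal Q-Q,
# scale-location, residuals vs leverage) come from plot() on the fit
par(mfrow = c(2, 2))
plot(fit)
```

The |2| and 2p/n cutoffs are common rules of thumb, not hard thresholds.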

## First data set: Residuals vs fitted values
@@ -451,32 +451,46 @@ There are two possible software suites:
- either [`jtools`](https://jtools.jacob-long.com/index.html) and [`interactions`](https://interactions.jacob-long.com) packages;
- or [`ggeffects`](https://strengejacke.github.io/ggeffects/index.html) and [`sjPlot`](https://strengejacke.github.io/sjPlot/) packages.

:::

```{r}
#| echo: true
# install.packages(c("jtools", "interactions"))
# install.packages(c("ggeffects", "sjPlot"))
fit1 <- lm(
  metascore ~ imdb_rating + log(us_gross) + genre5,
  data = jtools::movies
)
fit2 <- lm(
  metascore ~ imdb_rating + log(us_gross) + log(budget) + genre5,
  data = jtools::movies
)
```

They both provide tools for summarizing and visualising models, marginal effects, interactions and model predictions.

## Tabular summary (1/2) {.smaller}

```{r}
#| echo: true
jtools::summ(fit1)
```

## Tabular summary (2/2) {.smaller}

```{r}
#| echo: true
jtools::export_summs(
  fit1, fit2,
  error_format = "[{conf.low}, {conf.high}]", error_pos = "right"
)
```

## Visual summary

```{r}
#| echo: true
jtools::plot_summs(fit1, fit2, inner_ci_level = .9)
```

## Effect plot - Continuous predictor

@@ -547,7 +561,7 @@ $$

where $R_i^2$ is the multiple correlation coefficient associated with the regression of $X_i$ on all remaining independent variables.
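The definition can be checked by hand: regress each predictor on the others and plug the resulting $R_i^2$ into $\text{VIF}_i = 1/(1 - R_i^2)$. A minimal sketch on the built-in `mtcars` data (an illustration, not taken from the slides):

```r
# Hand-computed VIFs: VIF_i = 1 / (1 - R_i^2)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
X <- model.matrix(fit)[, -1]  # predictor matrix without the intercept

vif_manual <- sapply(seq_len(ncol(X)), function(i) {
  # R_i^2 from regressing predictor i on the remaining predictors
  r2 <- summary(lm(X[, i] ~ X[, -i]))$r.squared
  1 / (1 - r2)
})
names(vif_manual) <- colnames(X)
round(vif_manual, 2)
```

These values should match what `jtools::summ(fit, vifs = TRUE)` reports.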

## Variance Inflation Factor (2/3) {.smaller}

You can add VIFs to the summary of a model as follows:

@@ -558,22 +572,22 @@ jtools::summ(fit1, vifs = TRUE)

## Variance Inflation Factor (3/3)

- VIFs are always at least equal to 1.
- In some domains, a VIF over 2 is worthy of suspicion. Others set the bar higher, at 5 or 10. Others still will say you should not pay attention to these at all.
- Small effects are more likely to be *drowned out* by higher VIFs, but this may just be a natural, unavoidable fact with your model (e.g., there is no problem with high VIFs when you have an interaction effect).

## Model selection {.smaller}

::: {.callout-note title="Forward selection"}
- Start with no variables in the model.
- Test the addition of each variable using a chosen model fit criterion[^1], adding the variable (if any) whose inclusion gives the most statistically significant improvement of the fit.
- Repeat until none improves the model to a statistically significant extent.
:::

::: {.callout-note title="Backward elimination"}
- Start with all candidate variables.
- Test the deletion of each variable using a chosen model fit criterion, deleting the variable (if any) whose loss gives the most statistically insignificant deterioration of the model fit.
- Repeat until no further variables can be deleted without a statistically significant loss of fit.
:::

[^1]: A list of possible and often used model fit criteria is available on the [Wikipedia Model Selection](https://en.wikipedia.org/wiki/Model_selection) page. In R, the basic `stats::step()` function uses Akaike's information criterion (AIC) and allows one to perform forward selection (`direction = "forward"`) or backward elimination (`direction = "backward"`).
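The `stats::step()` function mentioned in the footnote can drive both strategies. A short sketch on the built-in `mtcars` data (an illustration, not the course data):

```r
# Backward elimination: start from the full model and repeatedly drop
# the term whose removal most lowers the AIC
full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
backward <- step(full, direction = "backward", trace = 0)

# Forward selection: start from the intercept-only model and add terms
# within the scope of the full model
null <- lm(mpg ~ 1, data = mtcars)
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

formula(backward)
formula(forward)
```

Note that the two strategies need not select the same model.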
Binary file added 12_PCA/12-PCA-Data.zip
51 changes: 51 additions & 0 deletions 12_PCA/12-PCA-Exercises.qmd
@@ -0,0 +1,51 @@
---
title: "PCA - Exercises"
---

```{r}
#| label: setup
#| include: false
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(comment = NA)
library(tidyverse)
library(DT)
library(FactoMineR)
library(factoextra)
library(corrplot)
library(viridis)
html_table <- function(x, offset = 5) {
  DT::datatable(
    data = x,
    rownames = TRUE,
    options = list(
      scrollX = TRUE,
      searching = FALSE,
      lengthMenu = c(0, 5, 10, 15) + offset
    )
  )
}
```

## The `temperature.csv` data set

This data set contains:

- the temperature records of European capitals from January to December;
- the GPS coordinates of each city;
- the thermal amplitude: the difference between the maximum and minimum temperatures;
- the annual mean temperature;
- a qualitative variable: the direction (S, N, W, E).

Run a PCA to identify typical temperature profiles and which cities follow them.

## The `chicken.csv` data set

- Description: 43 chickens subjected to 6 diets: normal diet (N), fasting for 16h (F16), fasting for 16h followed by 5h of refeeding (F16R5), (F16R16), (F48), (F48R24).
- Variables: after the diet, a gene expression analysis was performed using a DNA microarray: 7407 gene expressions.
- Objective: see whether genes are expressed differently depending on the stress level. How long does it take a chicken to return to the normal situation?

## The `orange.csv` data set

Six orange juices from different manufacturers were evaluated.
Are all the variables indispensable?
Do some juices stand out as particularly good? Or bad?
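As a starting point for these exercises, here is a minimal base-R PCA sketch. It uses the built-in `iris` data as a stand-in, since the exact column layout of the CSV files above is left to the reader; the course's setup chunk also loads `FactoMineR`/`factoextra`, whose `PCA()` and `fviz_pca_*()` functions offer richer output:

```r
# Active quantitative variables are scaled before extracting components;
# a qualitative variable (such as `direction`) would be kept aside as
# supplementary, not used to build the components
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Share of variance carried by each principal component
round(summary(pca)$importance["Proportion of Variance", ], 3)

# Coordinates of the individuals on the first two components:
# rows with similar profiles (cities, chickens, juices) end up close
head(pca$x[, 1:2])
```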
Binary file added 12_PCA/pca-principle.gif
2 changes: 1 addition & 1 deletion README.md
@@ -49,7 +49,7 @@ written in [Quarto](https://quarto.org).
|------|-------|--------|-----------|-----------------|
| 1 | Hypothesis Testing | [Quarto](10_Hypothesis_Testing/10-Hypothesis-Testing-Slides.qmd) | [Quarto](10_Hypothesis_Testing/10-Hypothesis-Testing-Exercises.qmd) | |
| 2 | Linear Regression | [Quarto](11_Linear_Modeling/11-Linear-Modeling-Slides.qmd) | [Quarto](11_Linear_Modeling/11-Linear-Modeling-Exercises.qmd) | [ZIP](11_Linear_Modeling/11-Linear-Modeling-Data.zip) |
| 3 | Principal Component Analysis | [Quarto](12_PCA/12-PCA-Slides.qmd) | [Quarto](12_PCA/12-PCA-Exercises.qmd) | [ZIP](12_PCA/12-PCA-Data.zip) |
| 4 | Clustering | [Quarto](13_Clustering/13-Clustering-Slides.qmd) | [Quarto](13_Clustering/13-Clustering-Exercises.qmd) | |

### Requirements
4 changes: 3 additions & 1 deletion _quarto.yml
@@ -8,7 +8,7 @@ website:
  repo-actions: [edit, source, issue]
  page-footer:
    background: light
    left: Copyright 2025, Aymeric Stamm
    right: This website is built with [Quarto](https://quarto.org/)
  navbar:
    left:
@@ -60,6 +60,8 @@ website:
          href: 10_Hypothesis_Testing/10-Hypothesis-Testing-Exercises.qmd
        - text: "2 Linear Modeling"
          href: 11_Linear_Modeling/11-Linear-Modeling-Exercises.qmd
        - text: "3 Principal Component Analysis"
          href: 12_PCA/12-PCA-Exercises.qmd
        - text: "Homework Assignment"
          href: project.qmd
        - text: "First Exam"
2 changes: 1 addition & 1 deletion index.qmd
@@ -61,7 +61,7 @@ written in [Quarto](https://quarto.org).
|------|-------|--------|-----------|-----------------|
| 1 | Hypothesis Testing | [Quarto](10_Hypothesis_Testing/10-Hypothesis-Testing-Slides.qmd) | [Quarto](10_Hypothesis_Testing/10-Hypothesis-Testing-Exercises.qmd) | |
| 2 | Linear Regression | [Quarto](11_Linear_Modeling/11-Linear-Modeling-Slides.qmd) | [Quarto](11_Linear_Modeling/11-Linear-Modeling-Exercises.qmd) | [ZIP](11_Linear_Modeling/11-Linear-Modeling-Data.zip) |
| 3 | Principal Component Analysis | [Quarto](12_PCA/12-PCA-Slides.qmd) | [Quarto](12_PCA/12-PCA-Exercises.qmd) | [ZIP](12_PCA/12-PCA-Data.zip) |
| 4 | Clustering | [Quarto](13_Clustering/13-Clustering-Slides.qmd) | [Quarto](13_Clustering/13-Clustering-Exercises.qmd) | |

## Requirements
