# Importing the necessary packages
import numpy as np # "Scientific computing"
import scipy.stats as stats # Statistical tests
import pandas as pd # Data Frame
from pandas.api.types import CategoricalDtype
import random
import math
import matplotlib.pyplot as plt # Basic visualisation
from statsmodels.graphics.mosaicplot import mosaic # Mosaic diagram
import seaborn as sns # Advanced data visualisation
from sklearn.linear_model import LinearRegression
import altair as alt # Alternative visualisation system
Student
Import scipy.stats
For a
Function | Purpose |
---|---|
stats.t.pdf(x, df=d) | Probability density for |
stats.t.cdf(x, df=d) | Left-tail probability 𝑃(𝑋 < x) |
stats.t.sf(x, df=d) | Right-tail probability 𝑃(𝑋 > x) |
stats.t.isf(1-p, df=d) | p% of observations are expected to be lower than this value |
Normal distribution in Python Python functions
Import scipy.stats
For a normal distribution with mean m and standard deviation s:
Function | Purpose |
---|---|
stats.norm.pdf(x, loc=m, scale=s) | Probability density at |
stats.norm.cdf(x, loc=m, scale=s) | Left-tail probability 𝑃(𝑋 < x) |
stats.norm.sf(x, loc=m, scale=s) | Right-tail probability 𝑃(𝑋 > x) |
stats.norm.isf(1-p, loc=m, scale=s) | p% of observations are expected to be lower than result |
graph LR
A[Data Characteristics] -- Sample Size < 30 --> B[t-test]
A -- Sample Size >= 30 --> C[z-test]
A -- Sample Size Unknown --> C
B -- Population Distribution Unknown --> C
B -- Population Distribution Known and Normally Distributed --> C
C -- Variances Equal and Known --> D[z-test]
C -- Variances Unequal or Unknown --> B
Requirements z-test:
- Random sample
- Sample groot genoeg (n >= 30)
- als normaal verdeeld is is sample size niet relevant
- normaal verdeeld
- populatie standaard deviatie is gekend
indien 1 van deze niet voldaan is gebruik je de t-test en deze normaal verdeeld is
H0 -> er is geen verband tussen de 2 variabelen H1 -> er is een verband tussen de 2 variabelen
De Chi-kwadraattoets wordt gebruikt om associaties tussen categorische variabelen te beoordelen
Cramér's V meet de sterkte van deze associatie
en de goodness-of-fit test controleert of de waargenomen frequenties overeenkomen met de verwachte theoretische verdeling.
Use the t-test for independence when comparing the means of two independent groups or conditions.
- 2 onafhankelijke groepen
- vergelijken van het gemiddelde van 2 groepen (niet perse even groot)
- gemiddelde van 2 verschillende groepen
- Groep met placebo en groep met medicijn
Use the paired t-test when analyzing paired or matched observations to compare means within the same group under different conditions or time points.
- 2 afhankelijke groepen
- zelfde test subjecten
- Voorbeeld zelfde auto met verschillende soorten benzine
Use Cohen's d as a measure of effect size to interpret the practical significance of the observed difference and compare effect sizes across different studies or conditions.
- Effectgrote -> hoe groot is het verschil tussen de 2 groepen
dependend variable -> y independend variable -> x
Use regression analysis when you want to understand the relationship between a dependent variable and independent variables, and predict the value of the dependent variable.
Use the correlation coefficient when you want to measure the strength and direction of the linear relationship between two variables.
- r -> 0 -> geen correlatie -> alle punten liggen verspreid
- r -> 1 -> positieve correlatie -> alle punten liggen op 1 lijn stijgend
- r -> -1 -> negatieve correlatie -> alle punten liggen op 1 lijn dalend
Use the coefficient of determination (R-squared) to assess the model fit, compare models, and interpret the proportion of variance explained by the independent variables.
- r² -> 0 -> zwakke correlatie
- r² -> 1 -> sterke correlatie
- moving averages
- simple moving average
- weighted moving average
- exponential moving average
- exponential smoothing
- single exponential smoothing -> exponential smoothing
- geen trend of seasonality
- double exponential smoothing -> Holt's method
- trend
- triple exponential smoothing -> Holt-Winters method
- trend en seasonality
- single exponential smoothing -> exponential smoothing