💬 add chapters 3 and 4
sarahzeller committed Aug 1, 2024
1 parent d12d6bb commit 71282d0
Showing 2 changed files with 136 additions and 16 deletions.
87 changes: 79 additions & 8 deletions 03_describing-variables.Rmd
@@ -2,23 +2,94 @@

**Learning objectives:**

- THESE ARE NICE TO HAVE BUT NOT ABSOLUTELY NECESSARY
- Reviewing variable types
- How are variables distributed, and how can we show that?
- How do we summarize variables?

## SLIDE 1 {-}
## Overview

- ADD SLIDES AS SECTIONS (`##`).
- TRY TO KEEP THEM RELATIVELY SLIDE-LIKE; THESE ARE NOTES, NOT THE BOOK ITSELF.
- «empirical research questions really come down entirely to describing the density distributions of statistical variables.»
- A variable, in the context of empirical research, is a bunch of observations of the same measurement.
- Successfully describing a variable means being able to take those observations and clearly explain what was observed without making someone look through all 744 neuroticism scores themselves.

## Meeting Videos {-}
## Variable types

### Cohort 1 {-}
- Continuous
- Count (often treated as continuous)
- Ordinal
- Categorical, with binary as a special case (a categorical variable can be turned into a set of binary variables)
- Qualitative (e.g. text data)

## Distribution

- A variable’s distribution is a description of how often different values occur.
- Categorical/ordinal:
    - **Frequency table** (percentage of observations per value), or the same information drawn as a bar chart
- Continuous:
    - **Histogram** (like a bar chart): It’s the exact same thing as the frequency table or graph we used for the categorical variable, except that the categories are ranges of the variable rather than the full list of values it could take.
    - **Density plot**: like a histogram with infinitely narrow ranges. We can describe the probability of being in a given range of the variable by seeing how large the area underneath the distribution is.
- ![](images/describingvariables-shadeddensity-1.png)
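
To make the table/histogram idea concrete, here is a minimal sketch in plain Python. The data (`degrees`, `ages`) and the helper names are invented for illustration and are not from the book. It builds a frequency table for a categorical variable and equal-width bin counts for a continuous one.

```python
from collections import Counter

# Hypothetical data, invented for illustration only.
degrees = ["BA", "BA", "MA", "PhD", "BA", "MA", "BA", "MA"]
ages = [21, 22, 22, 25, 27, 31, 34, 38, 41, 45]


def frequency_table(values):
    """Share of observations taking each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def histogram_counts(values, start, width, n_bins):
    """Count observations in n_bins equal-width ranges [lo, lo + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside the bins are ignored
            counts[index] += 1
    return counts


freq = frequency_table(degrees)
hist = histogram_counts(ages, start=20, width=10, n_bins=3)

for value, share in freq.items():
    print(f"{value}: {share:.1%}")
for i, count in enumerate(hist):
    lo = 20 + 10 * i
    print(f"[{lo}, {lo + 10}): {count}")
```

Dividing each bin count by the number of observations gives the share of the data in that range, which is what the shaded area under a density plot represents.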

## Summarizing the distribution

- Whole distribution is too much information to take in

- So our goal is to pick ways to take the entire distribution and produce a few numbers that describe that distribution pretty well.

## Mean, Percentiles, IQR

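- Mean: a measure of central tendency, i.e. a representative value
- Percentiles: the Xth percentile is the value with X% of the observations below it (shading in just a bit of the distribution)
    - Median: the 50th percentile, right in the middle; central tendency as a representative observation
    - Minimum: the 0th percentile
    - Maximum: the 100th percentile; max – min = range
    - Inter-quartile range (**IQR**): 75th percentile minus 25th percentile, covering exactly the middle half of the distribution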
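
A short sketch of these summaries using Python's standard `statistics` module. The `scores` list is made up, and `method="inclusive"` is just one of several percentile conventions, so other software may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical scores with one very large observation.
scores = [2, 4, 4, 5, 7, 9, 11, 12, 15, 40]

mean = statistics.mean(scores)
median = statistics.median(scores)

# Quartiles = 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1
value_range = max(scores) - min(scores)

print(f"mean={mean}, median={median}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, range={value_range}")
```

In this made-up example the single large score pulls the mean well above the median, while the IQR is far smaller than the range.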

## Variation

![](images/describingvariables-amountofvariation-1.png)

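- Variation: how wide a distribution is
- e.g. number of children (varies a lot) vs. number of eyes (barely varies)
- Measure: variance, roughly the average squared deviation from the mean
- Variance is in squared units, so take its square root to get the standard deviation, which is in the same unit as the variable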
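
As a rough illustration of the children-versus-eyes comparison, the sketch below computes the sample variance by hand (dividing by n - 1, the usual sample convention) and takes its square root. Both lists are invented.

```python
import math
import statistics

children = [0, 1, 2, 3, 4]  # hypothetical: varies a lot
eyes = [2, 2, 2, 2, 2]      # hypothetical: no variation at all


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


var_children = sample_variance(children)
sd_children = math.sqrt(var_children)  # back in the original unit
var_eyes = sample_variance(eyes)

print(var_children, sd_children, var_eyes)
```

The hand-written function should agree with `statistics.variance` and `statistics.stdev`, which use the same n - 1 convention.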


## Skew

![](images/describingvariables-income-1.png)

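- Skew describes how the distribution leans to one side or the other; the opposite is a symmetric distribution
- Handling: transform the variable to shrink the impact of huge observations
    - e.g. log (only for strictly positive values, so no 0s)
    - asinh (handles 0s; mathematically it is also defined for negative values, with asinh(−x) = −asinh(x))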
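
A small sketch of the two transformations on a made-up, right-skewed `incomes` list. The `safe_log` helper is hypothetical; it only exists to show where the log is undefined.

```python
import math

# Hypothetical right-skewed data including a zero.
incomes = [0, 20_000, 35_000, 50_000, 1_000_000]


def safe_log(x):
    """Natural log, or None where it is undefined (x <= 0)."""
    return math.log(x) if x > 0 else None


logged = [safe_log(x) for x in incomes]
asinh_values = [math.asinh(x) for x in incomes]

for raw, lg, ah in zip(incomes, logged, asinh_values):
    print(raw, lg, round(ah, 3))
```

For large values asinh(x) is very close to log(2x), so both transformations pull the huge observation much closer to the rest of the data; only asinh gives a value for the zero.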

## Theoretical Distributions

- We have reality, but we want the truth from that (inference)
- Notation
    - English/Latin letters: data
    - Modifications of English/Latin letters: calculations with real data
    - Greek letters: truth
    - Modifications of Greek letters: our estimate of the truth, most commonly a hat. So $\hat{\mu}$ is “my estimate of what I think 𝜇 is.”
- The theoretical distribution is what generated the data.
- We don’t care about the mean in our data for its own sake, but about the mean of the true distribution for everyone.
- And the bigger our number of observations gets, the gooder-enougher we become.

![](images/describingvariables-approachlimit-1.png)

- Hypothesis testing: we can check whether our data are consistent with a hypothesized value of $\mu$: use the sample standard deviation to get the standard error of the sample mean, then run a two-sided t-test.
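
The two ideas in this section can be sketched with the standard library alone. Everything here is an assumption made for illustration: the "true" distribution (normal with mean 5 and standard deviation 2), the seed, the small `sample`, and the hypothesized value `mu_0`. The sketch stops at the t-statistic, because turning it into a p-value needs the t distribution, which the standard library does not provide (a library such as SciPy does).

```python
import math
import random
import statistics

# Part 1: sample means get closer to the true mean as n grows.
TRUE_MU, TRUE_SIGMA = 5.0, 2.0   # hypothetical "truth"
rng = random.Random(42)          # fixed seed so the run is repeatable

sample_means = {}
for n in (10, 1_000, 100_000):
    draws = [rng.gauss(TRUE_MU, TRUE_SIGMA) for _ in range(n)]
    sample_means[n] = statistics.fmean(draws)
    print(n, round(sample_means[n], 3))

# Part 2: t-statistic for the hypothesis that the true mean is mu_0.
sample = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0, 5.2, 5.1]  # hypothetical data
mu_0 = 5.0

mu_hat = statistics.fmean(sample)  # our estimate of mu
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
t_stat = (mu_hat - mu_0) / standard_error

print(f"mu_hat={mu_hat:.3f}, se={standard_error:.4f}, t={t_stat:.3f}")
```

A two-sided test would compare the absolute value of `t_stat` with a critical value from the t distribution with n - 1 degrees of freedom.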

## Meeting Videos {.unnumbered}

### Cohort 1 {.unnumbered}

`r knitr::include_url("https://www.youtube.com/embed/URL")`

<details>
<summary>Meeting chat log</summary>

```
LOG
```

</details>
65 changes: 57 additions & 8 deletions 04_describing-relationships.Rmd
@@ -2,23 +2,72 @@

**Learning objectives:**

- THESE ARE NICE TO HAVE BUT NOT ABSOLUTELY NECESSARY
- Review: what's a relationship?
- Conditional distributions and means
- Why do we use regressions?
- Controls

## SLIDE 1 {-}
## Relationships

- ADD SLIDES AS SECTIONS (`##`).
- TRY TO KEEP THEM RELATIVELY SLIDE-LIKE; THESE ARE NOTES, NOT THE BOOK ITSELF.
The relationship between two variables shows you what learning about one variable tells you about the other.

## Meeting Videos {-}
- E.g. height/age for children
- Positive/negative/null/combination

### Cohort 1 {-}
A scatterplot shows the relationship:

- A good place to start
- Shows ALL the information about the relationship, like a density plot does for a single variable

## Conditional distributions

A conditional distribution is the distribution of one variable given the value of another variable.

- Learning about one variable changes the distribution of the other variable
- Conditional probability: e.g. probability of being a woman for someone who is called Sarah
- Conditional distributions: same, but for a distribution
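
A minimal sketch of the name example, using an invented list of (name, gender) pairs. The helper names are hypothetical. It compares the overall distribution of gender with the distribution conditional on a given name.

```python
from collections import Counter

# Hypothetical (name, gender) observations, invented for illustration.
people = [
    ("Sarah", "woman"), ("Sarah", "woman"), ("Sarah", "woman"),
    ("Alex", "woman"), ("Alex", "man"), ("Alex", "man"),
    ("Tom", "man"), ("Tom", "man"),
]


def distribution(values):
    """Share of observations taking each value."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def conditional_distribution(pairs, given_name):
    """Distribution of gender among observations with the given name."""
    genders = [gender for name, gender in pairs if name == given_name]
    return distribution(genders)


overall = distribution([gender for _, gender in people])
given_sarah = conditional_distribution(people, "Sarah")
given_alex = conditional_distribution(people, "Alex")

print(overall)
print(given_sarah)
print(given_alex)
```

Learning the name changes the distribution: in this toy data the share of women is one half overall, but it is different once we condition on "Sarah" or on "Alex".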

## Conditional means

Since we have the conditional distribution, we can get any conditional feature of that distribution.

- Work with the mean, since it is handy
- Discrete conditioning variable: take the mean within each value
- Alternative: LOESS (locally estimated scatterplot smoothing), a non-parametric version for continuous variables

![](images/describingrelationships-loess-1.png)
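
A sketch of conditional means on invented child age/height data. `conditional_means` handles the discrete case. `local_mean` is a deliberately crude stand-in for the local idea behind LOESS: it just averages nearby observations, whereas real LOESS fits weighted local regressions. Both function names are hypothetical.

```python
from collections import defaultdict
import statistics

# Hypothetical data: age in years, height in cm.
ages = [2, 2, 4, 4, 6, 6]
heights = [85, 89, 101, 105, 114, 118]


def conditional_means(xs, ys):
    """Mean of y within each distinct value of x."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return {x: statistics.fmean(group) for x, group in groups.items()}


def local_mean(xs, ys, x0, bandwidth):
    """Mean of y over observations with x within bandwidth of x0.

    Raises statistics.StatisticsError if no observation is nearby.
    """
    nearby = [y for x, y in zip(xs, ys) if abs(x - x0) <= bandwidth]
    return statistics.fmean(nearby)


by_age = conditional_means(ages, heights)
near_three = local_mean(ages, heights, x0=3, bandwidth=1)

print(by_age)
print(near_three)
```

The local version gives an answer at age 3 even though nobody in the data is exactly 3, by borrowing from the neighbouring ages.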

## Line-fitting/regression

> Instead of thinking locally and producing estimates of the mean of 𝑌 conditional on values of 𝑋, we can assume that the underlying relationship between 𝑌 and 𝑋 can be represented by some sort of shape. In basic forms of regression, that shape is a straight line.

- Can describe the relationship even at values of 𝑋 we have no data for
- Clear direction: positive/negative
- Results are more precise, since all the data are used
- Downside: we have to pick the shape of the line
    - OLS: linear/squared/log terms, as long as the model is linear in its coefficients
    - Not OLS: a different function (a shape that is not linear in its coefficients)
- Other option: Pearson correlation coefficient
    - Nice to interpret: always between -1 and 1
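
For a single predictor, the straight-line fit and the Pearson correlation can both be written out by hand. The sketch below uses invented `xs` and `ys` and the textbook formulas (slope = S_xy / S_xx, r = S_xy / sqrt(S_xx · S_yy)). It is an illustration, not the book's code.

```python
import math
import statistics

# Hypothetical data, roughly on a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

slope = s_xy / s_xx                        # OLS slope
intercept = y_bar - slope * x_bar          # OLS intercept
pearson_r = s_xy / math.sqrt(s_xx * s_yy)  # always between -1 and 1

# The fitted line also gives a prediction where we have no data.
prediction_at_6 = intercept + slope * 6

print(f"y = {intercept:.2f} + {slope:.2f} * x, r = {pearson_r:.4f}")
print(f"prediction at x = 6: {prediction_at_6:.2f}")
```

The slope is in units of 𝑌 per unit of 𝑋, while the correlation is unit-free, which is why it is easy to compare across variables.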

## Conditional Conditional Means (not a typo) AKA using controls

> If we really want to know if ice cream-eating affects shorts-wearing, we would want to know how much of a relationship is there between ice cream and shorts that isn’t explained by temperature? So we would get the mean of ice cream conditional on temperature, and then take the residual, getting only the variation in ice cream that has nothing to do with temperature. Then we would take the mean of shorts-wearing conditional on temperature, and take the residual, getting only the variation in shorts-wearing that has nothing to do with temperature. Finally, we get the mean of the shorts-wearing residual conditional on the ice cream residual. If the shorts mean doesn’t change much conditional on different values of ice cream eating, then the entire relationship was just explained by heat! If there’s still a strong relationship there, maybe we do have something.

![](images/describingrelationships-control-2-1.png)
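
The residual-on-residual procedure in the quote can be sketched as follows. Two assumptions to flag: each conditional mean is approximated by a straight-line fit, and the toy data are constructed so that temperature drives both variables and the leftover noise in the two is unrelated. The helper names `fit_line` and `residuals` are hypothetical.

```python
import statistics


def fit_line(xs, ys):
    """Least-squares (intercept, slope) for ys regressed on xs."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def residuals(xs, ys):
    """Part of ys that the straight-line fit on xs does not explain."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Constructed toy data: both outcomes depend on temperature only.
temperature = [10, 15, 20, 25, 30, 35]
ice_noise = [1, -1, 0, 0, -1, 1]
shorts_noise = [1, 1, -2, -2, 1, 1]
ice_cream = [2 * t + e for t, e in zip(temperature, ice_noise)]
shorts = [3 * t + e for t, e in zip(temperature, shorts_noise)]

# Raw relationship: looks strong.
_, raw_slope = fit_line(ice_cream, shorts)

# Controlling for temperature: relate the two sets of residuals.
ice_resid = residuals(temperature, ice_cream)
shorts_resid = residuals(temperature, shorts)
_, controlled_slope = fit_line(ice_resid, shorts_resid)

print(f"raw slope: {raw_slope:.3f}")
print(f"slope after controlling for temperature: {controlled_slope:.3f}")
```

Because of how the toy data were built, the relationship vanishes once temperature is taken out; with real data some relationship might remain.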

## Meeting Videos {.unnumbered}

### Cohort 1 {.unnumbered}

`r knitr::include_url("https://www.youtube.com/embed/URL")`

<details>
<summary>Meeting chat log</summary>

```
LOG
```

</details>
