more tasks
Szymon M. Kiełbasa authored and Szymon M. Kiełbasa committed Sep 10, 2024
1 parent d2d992b commit ced4b3f
Showing 1 changed file with 81 additions and 1 deletion.
82 changes: 81 additions & 1 deletion rcourse/task_concepts.Rmd
@@ -1179,7 +1179,87 @@ d |>
d |> select( weight, height ) |> mutate( bmi = weight / (height/100)^2 ) |> arrange( bmi ) |> head()
```

## Grouping and summarizing rows of a table. {#topic:ExGroupSummarize} {#needs:ExMutate} {#function:group_by} {#function:summarize} {#function:mean} {#function:sd} {#function:n} {#function:n_distinct} {#function:quantile} {#function:median} {#function:sum} {#function:sd} {#function:var} {#function:range} {#function:min} {#function:max} {#function:fivenum} {#function:unique}
## Grouping and summarizing rows of a table. {#topic:ExGroupSummarize} {#needs:ExMutate} {#function:group_by} {#function:summarize} {#function:count} {#function:mean} {#function:sd} {#function:n}

The `group_by` function is used to group rows of a table into separate parts, based on the values in one or more columns.
Then, the `summarize` function calculates summary statistics separately for each group (the output table contains one row per group).
The `mutate` function can also be used with `group_by` to perform calculations within each group (but with `mutate` the number of rows does not change).
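
The difference between `summarize` and `mutate` on grouped data can be checked by hand on a tiny table. The sketch below is only an illustration: the table `toy` and its columns `group` and `value` are made up and are not part of the course data.

```r
library(dplyr)

# Hypothetical toy table (not the course data), small enough to verify by hand.
toy <- tibble(
  group = c( "a", "a", "b" ),
  value = c( 1, 3, 10 )
)

# summarize() collapses each group into a single row: 2 rows in the result.
perGroup <- toy |>
  group_by( group ) |>
  summarize( n = n(), meanValue = mean(value) )

# mutate() keeps all 3 rows and repeats the group-level result within each group.
perRow <- toy |>
  group_by( group ) |>
  mutate( n = n(), meanValue = mean(value) ) |>
  ungroup()

perGroup
perRow
```

The `ungroup()` at the end of the second pipe is a design choice: it removes the grouping so that later steps operate on the whole table again.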

Manually run the pieces of code given below. Compare the output of the following lines of code to each other.
Understand how you perform calculations on groups of rows in the table.
Observe how to build longer pipes by combining several actions on the tables.

```{r eval=FALSE,echo=TRUE}
library(tidyverse)
d <- readRDS( "rcourse/data/pulseNA.rds" )
d |> group_by( gender ) |> summarize( studentsNum=n() )
d |> group_by( gender ) |> mutate( studentsNum=n() ) # compare to above; not what you usually want
d |> count( gender ) # a shortcut
d |> count( gender, name = "studentsNum" ) # a shortcut with renaming
d |>
  group_by( gender ) |>
  summarize( n=n(), meanWeight=mean(weight), sdWeight=sd(weight) )
d |> count( gender, exercise )
d |> count( exercise, gender )
dd <- d |> count( gender, exercise )
dd
dd |>
  group_by( gender ) |>
  mutate( fracWithinGender = n/sum(n) )
d |>
  count( gender, exercise ) |>
  group_by( gender ) |>
  mutate( percentWithinGender = round( 100*n/sum(n), 1 ) ) |>
  arrange( gender, desc(percentWithinGender) )
```

Per gender, calculate the mean and the standard deviation of the pulse before the exercise.
Find out how to perform these calculations while ignoring missing values. Name the columns `meanPulseBefore` and `sdPulseBefore`.

How many students were there in each year of the experiment?

Per year, calculate the number of students and the number of missing values in the `exercise` column.
Provide the results in a single table with columns `year`, `studentsNum`, `missingExerciseNum`.

For each combination of `gender` and `ran` levels, build a table with the min, median, and max of the known pulses after the exercise.

```{r}
### SOLUTION
d |>
  group_by( gender ) |>
  summarize( meanPulseBefore=mean(pulse1, na.rm=TRUE), sdPulseBefore=sd(pulse1, na.rm=TRUE) )
d |> # another possible solution
  filter( !is.na(pulse1) ) |>
  group_by( gender ) |>
  summarize( meanPulseBefore=mean(pulse1), sdPulseBefore=sd(pulse1) )
d |> count( year )
d |>
  group_by( year ) |>
  summarize( studentsNum=n(), missingExerciseNum=sum( is.na(exercise) ) )
d |>
  filter( !is.na(pulse2) ) |>
  group_by( gender, ran ) |>
  summarize( minPulse=min(pulse2), medianPulse=median(pulse2), maxPulse=max(pulse2) )
```

## Getting (pulling) a column from a table.

```{r}
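d$weight # base R: the column as a vector
d |> pull( weight ) # the same vector, usable at the end of a pipe
setNames( d$weight, d$name ) # base R: vector of weights named by the name column
d |> pull( weight, name ) # the same named vector with pull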
```
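
As a minimal sketch of what the calls above return, the example below uses a made-up table `toy` (its columns `name` and `weight` and their values are assumptions, not the course data). It shows that `pull` returns a plain vector, and that its second argument supplies the names of the elements.

```r
library(dplyr)

# Hypothetical toy table (not the course data).
toy <- tibble(
  name = c( "Ann", "Bob", "Cid" ),
  weight = c( 60, 82, 75 )
)

plainVector <- toy |> pull( weight )        # same values as toy$weight
namedVector <- toy |> pull( weight, name )  # weights, with names taken from the name column

# A named vector can be indexed by name.
namedVector[["Bob"]]
```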


## Sandbox
