upd pokemon plots
kevinwangstats committed Jul 17, 2021
1 parent 20d4e6b commit d6337c4
Showing 2 changed files with 226 additions and 65 deletions.
107 changes: 96 additions & 11 deletions wrangling_pokemon.Rmd
@@ -163,9 +163,9 @@ In both plots, notice that

From that perspective, both plots are presenting the same information to you. However, the second plot has one extra feature, in that each type gets its own little plot, or "facet".

We are now ready to talk about the grammar of graphics. There are hundreds of plots that statisticians can use, so we need a consistent way to describe them. The rules for describing plots are like the grammar that governs how English is used.

The **grammar of graphics** is a language that maps variables into graphical elements. If you look at how we described the plots above, each graphical element (x-axis, y-axis, etc.) corresponds to one and only one variable (`attack`, `defense`, etc.). Thus, instead of uniquely naming specialist plots, like barchart, pie chart, histogram, ..., the grammar of graphics provides a description of a plot that enables us to compare how plots are similar and different. It gives us a more formal, and yes, mathematical way to make data plots.

The grammar of graphics has these major elements:

@@ -180,12 +180,97 @@ ggplot(data = <DATA>) +
<FACET_FUNCTION>
```

- **DATA**: This is the `data.frame` or `tibble` that you are starting from
- **MAPPINGS**: Specific details of how a variable is mapped into the plot
- **GEOM**: The graphical element to use, e.g. point, line, rectangle, density, boxplot, ...
- **STAT**: A statistical calculation, if necessary, e.g. binning to make a histogram or counting to make a barchart.
- **POSITION**: Some types of plots, like barcharts, conventionally come in small design variations, such as stacked, side-by-side, or 100%. Position controls this kind of adjustment.
- **COORDINATE**: Most commonly we use Cartesian coordinates, but some plots benefit from polar coordinates, a special map ratio, or showing some variables on a log scale.
- **FACET**: Split the data into subsets and plot separately. Good for making comparisons across groups.
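
As a rough sketch of how these slots fit together, the chunk below fills in every slot of the template explicitly. The `toy_pokemon` tibble and its `legendary` column are made up for illustration and are not the real `pokemon` data. For `geom_bar`, `stat = "count"` and `position = "stack"` are already the defaults, so in practice they are usually left out.

```{r}
library(ggplot2)
library(tibble)

# Hypothetical toy data, standing in for the real `pokemon` data
toy_pokemon <- tibble(
  type1     = c("water", "water", "water", "fire", "fire", "grass"),
  legendary = c("no", "no", "yes", "no", "yes", "no")
)

p_template <- ggplot(data = toy_pokemon) +   # DATA
  geom_bar(                                  # GEOM
    mapping  = aes(x = type1),               # MAPPINGS
    stat     = "count",                      # STAT: count the rows in each type
    position = "stack"                       # POSITION
  ) +
  coord_flip() +                             # COORDINATE: horizontal bars
  facet_wrap(~legendary)                     # FACET: one panel per group

p_template
```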

The `+` operator behaves as you would expect: it "adds" further graphical elements or manipulations to the plot. This is similar to how you would draw a picture in real life: first find some content to draw (in our case, the data), then decide what should be in the picture (the geom), and finally add other elements (the mappings, e.g. colours).
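
One way to see the "adding" at work is to save the plot at each stage and add to it step by step. This is only a sketch: the `toy_pokemon` values are invented, and the names `p0`, `p1` and `p2` are arbitrary.

```{r}
library(ggplot2)
library(tibble)

# Hypothetical toy data, standing in for the real `pokemon` data
toy_pokemon <- tibble(
  attack  = c(49, 62, 52, 64, 48, 63),
  defense = c(49, 63, 43, 58, 65, 80),
  type1   = c("grass", "grass", "fire", "fire", "water", "water")
)

# Start with the data only: this draws an empty canvas
p0 <- ggplot(data = toy_pokemon)

# Add a geom together with its mappings: points now appear
p1 <- p0 + geom_point(mapping = aes(x = attack, y = defense))

# Add one more element on top: nicer axis labels
p2 <- p1 + labs(x = "Attack", y = "Defense")

p2
```

Printing `p0`, `p1` and `p2` in turn shows the picture being built up one layer at a time.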

Here are some examples.

```{r, fig.width=15}
ggplot(data = pokemon) +
  geom_bar(mapping = aes(x = type1))
```

This is a barchart of the types of pokemons. We can see that "water" is the most common type.
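
To check a reading like this against the numbers, one option is to tabulate the same counts that `geom_bar` draws. The sketch below uses a made-up `toy_pokemon` tibble; with the real data you would start from `pokemon` instead.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data, standing in for the real `pokemon` data
toy_pokemon <- tibble(
  type1 = c("water", "water", "water", "fire", "fire", "grass")
)

# Count the rows in each type, with the largest count first
type_counts <- toy_pokemon %>%
  count(type1, sort = TRUE)

type_counts
```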

## Making a scatter plot

We will now try to construct the scatter plots above. Notice that the way we described the mapping between variables and graphical elements is exactly how we construct the plot!

```{r, echo = FALSE}
ggplot(pokemon) +
  geom_point(aes(x = attack,
                 y = defense,
                 colour = type1))
```

This plot doesn't look too bad. However, there are a lot of types, and their points are overlaid on top of each other in the same plot. How could we improve on this? This is exactly why we used facets in the first example to separate out the points by type. Faceting is very easy in ggplot: it simply needs one extra line specifying which variable should be used for the facets.

```{r}
ggplot(pokemon) +
  geom_point(aes(x = attack,
                 y = defense,
                 colour = type1)) +
  facet_wrap(~type1)
```

This plot is better in that we can now compare the scatter of points across different types of pokemons, whereas before our eyes were too busy trying to identify the colours. In fact, you will notice that we used the `type1` variable twice, so it appears as two different visual elements: once as colour and once as facets. This is OK, but it is redundant. Having a consistent way of describing plots allows us to detect such things, and we may even decide to remove `type1` as a colour variable.
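
If you would rather keep the colours, one possible alternative is to hide the legend, since the facet labels already name each type. This is a sketch on an invented `toy_pokemon` tibble, not the real data. `show.legend = FALSE` is a standard argument of ggplot2 geoms.

```{r}
library(ggplot2)
library(tibble)

# Hypothetical toy data, standing in for the real `pokemon` data
toy_pokemon <- tibble(
  attack  = c(49, 62, 52, 64, 48, 63),
  defense = c(49, 63, 43, 58, 65, 80),
  type1   = c("grass", "grass", "fire", "fire", "water", "water")
)

p_facets <- ggplot(toy_pokemon) +
  geom_point(aes(x = attack, y = defense, colour = type1),
             show.legend = FALSE) +  # the facet labels already name each type
  facet_wrap(~type1)

p_facets
```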

**Try it:** remove `type1` as the colouring variable in the plot above. What is the default colour of `geom_point`?


## Making a heatmap (advanced)

Have you ever thought about why anyone would use plots at all? If our data is the original, complete information, then why don't we just interpret that information directly? After all, any plot that we make can only represent the data in limited ways, because there are only a limited number of visual elements we can put onto a plot.

The key to answering this question is that a plot should be a tool for communicating key information. Yes, a dataset may contain lots of information, but data are often huge, and without summarising them in clever ways nothing can be interpreted.

Let's see an example of this. In the `pokemon` data, there are `type1` and `type2` variables. These variables indicate the type of a pokemon; some pokemons have only `type1`, but many have both. So what can we do to understand the total number of pokemons in each combination of `type1` and `type2`?

We could certainly tabulate these counts. But we would end up with 166 categories, which is still too many for us to take in. We could look at the average or maximum of these counts, but this is very limiting. This is where data visualisation can help us to see important patterns.

```{r}
poke_counts = pokemon %>%
  group_by(type1, type2) %>%
  tally()
poke_counts
```
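
The average and maximum of the counts mentioned earlier could be computed along these lines. This is a sketch on a made-up `toy_pokemon` tibble, so the numbers are not those of the real data. The names `toy_counts`, `count_summary`, `mean_n` and `max_n` are arbitrary.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data; NA in `type2` marks a pokemon with only one type
toy_pokemon <- tibble(
  type1 = c("water", "water", "water", "fire", "fire", "grass"),
  type2 = c("ice", "ice", NA, "flying", NA, "poison")
)

# One row per (type1, type2) combination, with its count in `n`
toy_counts <- toy_pokemon %>%
  group_by(type1, type2) %>%
  tally()

# tally() leaves the result grouped by type1, so ungroup before summarising
count_summary <- toy_counts %>%
  ungroup() %>%
  summarise(mean_n = mean(n), max_n = max(n))

count_summary
```

Two numbers like these say little about which combinations are common, which is the limitation the heatmap addresses.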

In the plot below, each count is represented as a "tile", and the "fill" colour of the tile is determined by the number of pokemons in that combination of `type1` and `type2`.

```{r}
poke_counts %>%
  ggplot() +
  geom_tile(aes(x = type1, y = type2,
                fill = n))
```

We can make further adjustments to the plot to make it prettier. It is not necessary to understand the code below, but you should feel free to play with the different options and layers of the ggplot to see what each element is doing.

```{r}
poke_counts %>%
  ggplot(aes(x = fct_reorder(type1, n, .fun = max),
             y = fct_reorder(type2, n, .fun = max))) +
  geom_tile(aes(fill = n)) +
  geom_text(aes(label = n)) +
  scale_fill_distiller(palette = "Spectral",
                       breaks = c(0, 10, 20, 30, 40, 50, 60)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.2)) +
  labs(x = "Type 1",
       y = "Type 2",
       fill = "Num. Pokemons")
```

**Try it**:

+ What does `scale_fill_distiller` do? (HINT: comment this line out by adding a `#` in front of the line of code and run again)
+ What does `theme(axis.text.x = ...)` do? What if you change the `angle` to 45?
+ What does `labs(...)` do?
+ Replace `x = fct_reorder(type1, n, .fun = max)` in the second line with just `x = type1` as we had before. What happened to the plot? Can you guess what `fct_reorder` does?