François Briatte
Spring 2024. Work VERY MUCH in progress.
A follow-up to an introduction to data science with R, RStudio, and the {tidyverse} packages, still aimed at social scientists. This course requires some prior training in introductory statistics and regression modelling.
N.B. -- the current repo does not include the full set of datasets used during the semester, which are all publicly available. Future versions will include the full data and slides.
- Software
- Revisions
- SQL databases
- Web scraping
- Linear models
- Panel data
- Survey data
- Feedback
- Multilevel data
- Machine learning in R
- Machine learning in Python
- Dashboards
Bonus sections:
- R and RStudio
- R Markdown notebooks
- Code execution
A session to get started again with R and RStudio, this time through R Markdown notebooks, which are dynamic documents that can combine text and images with code as well as plots and other kinds of results.
> Demo: LGBTI inclusivity in OECD countries
- The tidyverse package bundle
- More R Markdown
- Data pivots
A general-revisions session that covers data wrangling and visualization with various packages of the tidyverse bundle. Now is the right time to take a look at cheatsheets and similar material.
> Demo: U.S. life expectancy (code by Kieran Healy)
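
As a reminder of what the data pivots listed for this session look like, here is a minimal sketch with {tidyr}. The `wide` table, its column names, and its values are made up for illustration and are not taken from the course data.

```r
library(tidyr)

# Hypothetical wide table: one row per country, one column per year
wide <- data.frame(
  country = c("A", "B"),
  `2000` = c(70.1, 65.3),
  `2010` = c(72.4, 68.0),
  check.names = FALSE
)

# Wide to long: one row per country-year
long <- pivot_longer(
  wide,
  cols = -country,
  names_to = "year",
  values_to = "life_exp"
)

# Long to wide again: one column per year
back <- pivot_wider(long, names_from = year, values_from = life_exp)
```

The long format is usually the one that {ggplot2} and {dplyr} expect; the wide format is mostly useful for display.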
- Row-wise operations and complex joins with dplyr
- SQL databases with dbplyr
- Regular expressions (regex) with stringr
A session focused on advanced data wrangling. SQL databases, in particular, are what you will need when you need speed and/or out-of-memory computation on very (possibly very, very) large data.
> Demo: Government cabinet composition (ParlGov data, code by Holger Döring)
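
To illustrate how {dbplyr} lets you query a SQL database with {dplyr} verbs, here is a minimal sketch that uses an in-memory SQLite database. The `cabinets` table and its values are hypothetical, and the code assumes that {DBI}, {RSQLite}, {dplyr} and {dbplyr} are installed.

```r
library(DBI)
library(dplyr)

# Throwaway in-memory database (nothing is written to disk)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table, for illustration only
dbWriteTable(con, "cabinets", data.frame(
  country = c("FRA", "FRA", "DEU"),
  seats = c(120, 80, 200)
))

# Lazy reference to the remote table: no data is pulled into R yet
cabinets <- tbl(con, "cabinets")

query <- cabinets %>%
  group_by(country) %>%
  summarise(total_seats = sum(seats, na.rm = TRUE))

# Inspect the SQL that {dbplyr} generates from the pipeline
show_query(query)

# Execute the query and bring the (small) result into memory
result <- collect(query) %>%
  arrange(country)

dbDisconnect(con)
```

The point of the lazy `tbl()` reference is that the aggregation runs inside the database, so only the summarised rows ever reach R.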
- HTTP with httr
- XPath with rvest and xml2
- API endpoints
Another session focused on advanced data wrangling. Web scraping is what you will need if your data are trapped online in Web pages.
> Demo: Locating nuclear reactors worldwide (data from the IAEA)
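
The XPath part of this session can be sketched without any network access by parsing an inline HTML snippet. The table below is invented for illustration and does not reflect the structure of the IAEA pages; in practice, the page would first be fetched over HTTP.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a downloaded Web page
page <- minimal_html('
  <table id="reactors">
    <tr><th>Name</th><th>Country</th></tr>
    <tr><td>Reactor A</td><td>France</td></tr>
    <tr><td>Reactor B</td><td>Japan</td></tr>
  </table>
')

# XPath: first cell of every row in the table with id "reactors"
reactor_names <- page %>%
  html_elements(xpath = "//table[@id='reactors']//td[1]") %>%
  html_text2()

# Shortcut for well-formed tables: parse the whole table at once
reactors <- page %>%
  html_element("table") %>%
  html_table()
```

`html_table()` is convenient when the markup is clean; XPath selectors are the fallback when the data are scattered across less regular elements.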
- Model estimation and manipulation with broom
- Linear diagnostics with performance
Mostly revisions of what was covered in the introductory course.
> Demo: Worldwide fertility rates (QOG/World Bank data)
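
As a quick refresher on model manipulation with {broom}, here is a minimal sketch. It uses the built-in `mtcars` data as a stand-in, since the fertility data used in the demo are not part of this example.

```r
library(broom)

# Ordinary least squares on a built-in dataset (stand-in for the course data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One row per coefficient, with confidence intervals
coefs <- tidy(fit, conf.int = TRUE)

# One row per model: R-squared, AIC, residual standard error, etc.
stats <- glance(fit)

# One row per observation: fitted values, residuals, leverage, etc.
diagnostics <- augment(fit)
```

Because all three functions return data frames, their output can be filtered, joined, and plotted like any other dataset.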
- Panel data structure
- Fixed-effects estimation with fixest and plm
- Cluster-robust standard errors (CRSEs)
> Demo: Worldwide fertility rates (QOG/World Bank data)
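
Here is a minimal sketch of fixed-effects estimation with cluster-robust standard errors, using {fixest} only ({plm} is not shown). The panel is simulated, with a true slope of 2 and a country effect that is correlated with the predictor, so all variable names and values are assumptions made for illustration.

```r
library(fixest)

set.seed(42)

# Simulated balanced panel: 10 countries observed over 8 years
panel <- data.frame(
  country = rep(letters[1:10], each = 8),
  year = rep(2001:2008, times = 10)
)

# Time-invariant country effect, correlated with the predictor
panel$country_effect <- rep(rnorm(10), each = 8)
panel$x <- rnorm(80) + panel$country_effect
panel$y <- 2 * panel$x + 3 * panel$country_effect + rnorm(80, sd = 0.5)

# Two-way fixed effects (country and year), standard errors clustered by country
fe <- feols(y ~ x | country + year, data = panel, cluster = ~country)

summary(fe)
```

A pooled `lm(y ~ x)` on the same data would be biased by the country effect, which is the reason for absorbing it with fixed effects.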
- Survey weighting
- Survey-weighted operations with survey and srvyr
- Generalized linear models (GLMs)
> Demo: EU skepticism and migration (ESS data, code by Holger Döring)
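
The survey-weighting workflow can be sketched as follows with {survey} ({srvyr} offers the same operations with {dplyr} syntax, not shown here). The data are simulated, and the variable names (`skeptic`, `age`, `weight`) are hypothetical, not those of the ESS.

```r
library(survey)

set.seed(1)

# Simulated respondents with made-up survey weights
respondents <- data.frame(
  skeptic = rbinom(200, 1, 0.4),
  age = sample(18:90, 200, replace = TRUE),
  weight = runif(200, 0.5, 2)
)

# Declare the design: no clustering (ids = ~1), weighted observations
design <- svydesign(ids = ~1, weights = ~weight, data = respondents)

# Survey-weighted proportion of skeptics
weighted_mean <- svymean(~skeptic, design)

# Survey-weighted logistic regression; quasibinomial() avoids warnings
# about non-integer counts that arise with weighted binary outcomes
glm_fit <- svyglm(skeptic ~ age, design = design, family = quasibinomial())
```

Real survey designs often also need strata and cluster identifiers, which go into the `strata` and `ids` arguments of `svydesign()`.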
Feedback on your first drafts, and recommendations for the coming weeks.
- Multilevel (hierarchical) data
- Multilevel (mixed) model estimation with lme4
> Demo: EU skepticism and migration, continued (ESS data, code by Holger Döring)
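
A minimal sketch of a multilevel model with {lme4}, using the `sleepstudy` data that ships with the package as a stand-in for the ESS data: repeated observations (level 1) are nested in subjects (level 2), in the same way that respondents are nested in countries.

```r
library(lme4)

# Random-intercept model: reaction times vary by subject around a common slope
fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Fixed effects: population-level intercept and slope
fixef(fit)

# Variance components: between-subject and residual variance
VarCorr(fit)

# Subject-specific deviations from the population intercept
subject_effects <- ranef(fit)$Subject
```

Adding a random slope, as in `(Days | Subject)`, would let the effect of `Days` vary by subject as well.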
- Machine learning essentials
- Decision trees and random forests
- The tidymodels package bundle
> Demo: White Trump voters (CCES data, code by Steven Miller)
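
As an illustration of the {tidymodels} workflow, here is a minimal sketch that fits a single decision tree (not a random forest) to the built-in `iris` data, which stand in for the CCES data. It assumes that {tidymodels} and {rpart} are installed.

```r
library(tidymodels)

set.seed(123)

# Train/test split, stratified on the outcome
iris_split <- initial_split(iris, prop = 0.8, strata = Species)
train <- training(iris_split)
test <- testing(iris_split)

# Model specification: a classification tree fitted with the rpart engine
tree_spec <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

tree_fit <- fit(tree_spec, Species ~ ., data = train)

# Predicted classes on the held-out data, next to the true values
preds <- predict(tree_fit, new_data = test) %>%
  bind_cols(test)

# Out-of-sample accuracy
acc <- accuracy(preds, truth = Species, estimate = .pred_class)
```

Swapping `decision_tree()` for `rand_forest()` with a suitable engine leaves the rest of the pipeline unchanged, which is the main appeal of the bundle.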
- Jupyter notebooks and Google Colab
- Text mining basics
- Example algorithms from the scikit-learn library
> Demo: Trump tweets (Twitter data, code by Bernhard Rieder)
- The flexdashboard package
- Maps with sf and Leaflet
- General wrap-up
> Demo: Worldwide air pollution (World Bank data, code by Paul Moraga)
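
The mapping step of this session can be sketched as follows with {sf} and {leaflet}. The monitoring sites, their coordinates, and their `pm25` values are entirely made up for illustration. The resulting `map` object is an HTML widget that can be placed in a {flexdashboard} document.

```r
library(sf)
library(leaflet)

# Hypothetical monitoring sites with made-up coordinates and values
sites <- data.frame(
  name = c("Site A", "Site B", "Site C"),
  lon = c(2.0, 10.0, 25.0),
  lat = c(48.0, 50.0, 45.0),
  pm25 = c(12, 25, 40)
)

# Convert to a spatial object with WGS 84 coordinates (EPSG:4326)
sites_sf <- st_as_sf(sites, coords = c("lon", "lat"), crs = 4326)

# Interactive map: one circle per site, sized by the (made-up) PM2.5 value
map <- leaflet(sites_sf) %>%
  addTiles() %>%
  addCircleMarkers(radius = ~sqrt(pm25) * 2, label = ~name)
```

Printing `map` at the console, or in an R Markdown chunk, renders the interactive map.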
pkg_data <- c("countrycode", "rsdmx", "RSQLite", "sf", "tidyverse")
# ... also installs {DBI} and {rvest}, inter alia
pkg_models <- c("easystats", "lme4", "plm", "fixest", "tidymodels")
# ... installs a lot of essentials, such as {performance}
pkg_tables <- c("broom", "broom.mixed", "DT", "modelsummary", "texreg")
pkg_varia <- c("flexdashboard", "leaflet")
# install.packages("remotes") # run this first if {remotes} is not installed
for (i in c(pkg_data, pkg_models, pkg_tables, pkg_varia)) {
  remotes::install_cran(i)
}
The DSR README has a list of relevant credits.
More to come.