Version: 15 August 2017
- Instructor: Matthias Haber
- Office: 3.60
- Email: [email protected]
Instructor Information
Matthias Haber is a Postdoctoral Research Scientist for the Governance Report at the Hertie School of Governance. He is a political scientist with research interests in party politics, electoral behavior, machine learning, survey experiments, and measurement problems. He was previously a Research Associate at the Collaborative Research Center ‘Political Economy of Reforms’ at the University of Mannheim. He holds degrees from the University of Mannheim, the University of Essex, and the University of Potsdam.
As data becomes increasingly available online, data analysis has replaced data acquisition as the bottleneck to empirical research in the social sciences. 80% of the time spent on empirical research goes to sourcing, cleaning, and preparing often noisy data, while the remaining 20% is actual data analysis. Extracting knowledge from heterogeneous datasets requires not only computational tools but also the programming skills to use them effectively.
This course introduces the computational methods needed for data generation, data manipulation, data visualization, and data reproducibility, and gives students the ability to apply them to their own projects. Demand is growing inside and outside of academia for the skills to collect, transform, and analyze data effectively and to present results to a range of audiences, which makes this course equally relevant for students seeking scientific or business careers.
The course is organized in three parts. The first part introduces students to ways to effectively visualize and transform data. The second part focuses on importing different data formats, storing them in a consistent format, and narrowing in on observations of interest. The final part shows ways to communicate the results to others.
The course is intended for students with experience in working with R. If you have had little to no exposure to R but nevertheless want to take this course, you have to complete the Introduction to R course on DataCamp and be willing to invest more time in learning R in addition to the regular course work.
All lecture materials and their source files will be hosted in the course's GitHub repository and on Moodle. You are highly encouraged to suggest changes to the lecture material with a pull request (we'll learn how to do this during the first week of class) if you see ways to improve clarity or relevance or to fix typos.
All of the software used in this course will be open source, i.e. free. If you can, please bring your own laptop to class and download and install R and RStudio.
A certificate is granted for regular attendance, active participation, the completion of small, weekly homework exercises, and a final data project. Political science thrives on collaboration and co-authorship. Hence, participants are allowed (but not required) to complete their homework exercises and their final projects in two-person teams. The data project is due in the final exam week.
1. Homework Exercises
Each week students have to complete small homework exercises that allow them to directly apply the techniques they learned in class. Homework exercises contribute 5% to the final grade each and students are encouraged to complete them in pairs.
2. Final project
For the final data project, students receive a large dataset, analyze it, and present their results, drawing on their own ideas and the skills learned throughout the course.
3. Participation
Students are expected to be present and prepared for every class session and to engage actively in class discussions. You are also encouraged to participate in online discussions and to make pull requests to the main course material if you find an error or think of an improvement. Your GitHub contributor statistics will therefore be used as part of the evaluation of your participation.
Composition of the Final Grade
Name | Percent of Final Mark | Due |
---|---|---|
Homework Exercises | 50% | Weekly |
Data Project | 40% | Final Exam Week |
Attendance/active Participation | 10% | - |
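As a rough illustration of how the weights in the table combine, the following R sketch computes a final mark as a weighted sum of the three components. The helper name `final_mark`, the 0-100 scale, and the example scores are assumptions made for illustration only; they are not part of the official grading procedure.

```r
# Weights taken from the table above; they sum to 1.
weights <- c(homework = 0.50, project = 0.40, participation = 0.10)

# Hypothetical helper: weighted sum of component scores on an assumed 0-100 scale.
final_mark <- function(scores, w = weights) {
  # Every component must be present exactly once.
  stopifnot(setequal(names(scores), names(w)))
  sum(scores[names(w)] * w)
}

# Invented example scores, for illustration only.
example_scores <- c(homework = 80, project = 90, participation = 100)
final_mark(example_scores)
```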
Late submission of assignments
For each day an assignment is turned in late, the grade will be reduced by 10% (e.g., a submission two days after the deadline would result in a 20% grade deduction).
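The rule can be sketched in R as follows. The sketch assumes that the deduction is a share of the original grade, that it grows linearly with the number of days late (matching the two-day example above), and that a grade cannot drop below zero. The function name `apply_late_penalty` and the 0-100 scale are likewise assumptions rather than official policy.

```r
# Hypothetical helper: apply a linear late penalty of `rate` per day late.
apply_late_penalty <- function(grade, days_late, rate = 0.10) {
  stopifnot(all(days_late >= 0))
  # Total deduction as a share of the grade, capped at 100% (assumption).
  deduction <- pmin(rate * days_late, 1)
  grade * (1 - deduction)
}

# A grade of 90 submitted two days late loses 20% of its value.
apply_late_penalty(90, 2)
```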
Attendance
Students are expected to be present and prepared for every class session. Active participation during lectures and seminar discussions is essential. If unavoidable circumstances arise which prevent attendance or preparation, the instructor should be advised by email with as much advance notice as possible. Please note that students cannot miss more than two sessions. For further information please consult the Examination Rules §4.
Academic integrity
The Hertie School of Governance is committed to the standards of good academic and ethical conduct. Any violation of these standards shall be subject to disciplinary action. Plagiarism, deceitful actions as well as free-riding in group work are not tolerated. See Examination Rules §11.
If R is your first programming language, this is a great place to start learning how to program with it:
- Grolemund, G. 2014. Hands-On Programming with R: Write Your Own Functions and Simulations. O'Reilly Media, Inc.
If you have existing programming experience, I recommend reading this guide:
- Wickham, H. 2014. Advanced R. CRC Press.
A great guide on how to create dynamic and highly reproducible research:
- Gandrud, C. 2015. Reproducible Research with R and RStudio. 2nd Edition. Chapman & Hall/CRC Press, Oxford.
Finally, the core background reading for the course is:
- Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly.
Session | Session Date | Session Title |
---|---|---|
1 | 04.09.2017 | Introduction to the Course |
2 | 11.09.2017 | Data Visualization |
3 | 18.09.2017 | Data Transformation |
4 | 25.09.2017 | Exploratory Data Analysis |
5 | 02.10.2017 | Data Import |
6 | 09.10.2017 | Tidy Data |
7 | 16.10.2017 | Working with Relational Data |
Mid-term Exam Week | 23-27 October 2017 | No class |
8 | 30.10.2017 | Working with Strings |
9 | 06.11.2017 | Web Scraping |
10 | 13.11.2017 | Markup Languages |
11 | 20.11.2017 | Graphics for Communication |
12 | 27.11.2017 | Guest Lecture – Work of a Data Scientist |
Part I: Data Exploration
Session 1: 04.09.2017 | Introduction to the Course |
---|---|
Aim | Learn about the course structure and how to use GitHub |
Required Readings | - Harrison, E. 2015. RStudio and GitHub. R-bloggers.com |
| | - Interactive introduction to Git from the Code School |
Additional Readings |
Session 2: 11.09.2017 | Data Visualization |
---|---|
Aim | Learn about the grammar of graphics with ggplot2 |
Required Readings | - Wickham, Hadley. 2010. “A Layered Grammar of Graphics”. Journal of Computational and Graphical Statistics 19 (1): 3–28. |
| | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 3. |
Additional Readings |
Session 3: 18.09.2017 | Data Transformation |
---|---|
Aim | Learn how to transform data with dplyr |
Required Readings | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 5. |
Additional Readings |
Session 4: 25.09.2017 | Exploratory Data Analysis |
---|---|
Aim | Learn how to combine the power of ggplot2 and dplyr to explore data |
Required Readings | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 7. |
Additional Readings |
Part II: Data Wrangling
Session 5: 02.10.2017 | Data Import |
---|---|
Aim | Learn how to read different file formats into R. |
Required Readings | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 11. |
Additional Readings |
Session 6: 09.10.2017 | Tidy Data |
---|---|
Aim | Learn how to organize data consistently with tidyr |
Required Readings | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 12. |
| | - Wickham, Hadley. 2014. “Tidy Data”. Journal of Statistical Software 59 (10). |
Additional Readings |
Session 7: 16.10.2017 | Working with Relational Data |
---|---|
Aim | Learn how to work with multiple tables of data |
Required Readings | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 13. |
Additional Readings |
Mid-term Exam Week: 23-27 October 2017 – no class
Session 8: 30.10.2017 | Working with Strings |
---|---|
Aim | Learn how to effectively manipulate strings with stringr |
Required Readings | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 14. |
| | - Wickham, Hadley. 2010. “stringr: modern, consistent string processing”. The R Journal 2 (2): 38-40. |
Additional Readings |
Session 9: 06.11.2017 | Web Scraping |
---|---|
Aim | Learn how to automatically collect data off the web and interact with APIs |
Required Readings | - Munzert, S., C. Rubba, P. Meißner and D. Nyhuis. 2015. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. Chapter 9. |
| | - Law, J. and J. Rosenblum. 2015. rvest tutorial: scraping the web using R. |
| | - Bacon, Greg. Regular Expressions. Stackoverflow.com |
Additional Readings |
Part III: Data Communication
Session 10: 13.11.2017 | Markup Languages |
---|---|
Aim | Learn how to collaborate and communicate with literate programming |
Required Readings | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapters 27, 29, 30. |
| | - RStudio. 2015. RMarkdown--Dynamic Documents for R. |
| | - RStudio. 2015. Pandoc Markdown. |
Additional Readings | - RStudio. 2015. Presentations with ioslides. |
Session 11: 20.11.2017 | Graphics for Communication |
---|---|
Aim | Learn how to (dynamically) communicate your data to others |
Required Readings | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 28. |
| | - Gelman, Andrew and Antony Unwin. 2012. “Infovis and Statistical Graphics: Different Goals, Different Looks.” Journal of Computational and Graphical Statistics 22(1): 2-28. |
| | - Plotly |
| | - Shiny |
Additional Readings |
Session 12: 27.11.2017 | Guest Lecture – Work of a Data Scientist |
---|---|
Aim | Learn about the job of a data scientist |
Required Readings | - DataCamp Blog. 2017. The Periodic Table of Data Science. R-bloggers.com |
Additional Readings |