[WIP] Streamlining #138

Merged
merged 45 commits into from
Oct 16, 2022
Changes from 33 commits
Commits
36f4f9e
removed all "details" tags
Sep 30, 2022
01a921b
rips out fancier features which base jupyter lacks
Sep 30, 2022
051c636
changed spellings
Sep 30, 2022
c2a4253
Delete settings.json
Sep 30, 2022
fff64ed
Delete checklist.md
Sep 30, 2022
69fd1b4
grammar for m1.1
Sep 30, 2022
9be9c0b
pre grammar script
Oct 3, 2022
c72bf86
applied some grammar scripts
Oct 3, 2022
3f543d3
tweaked commas
Oct 3, 2022
7dd8c9a
Minor tweaks for - stopping to move m2
Oct 3, 2022
f2a23f9
Fixed naming for m2
Oct 3, 2022
c39dd66
increase accessibility of landing page
Oct 4, 2022
137ef94
move setting up environment to dedicated page
Oct 4, 2022
b6e3a23
add WIP glossary and contact pages
Oct 4, 2022
83c2fb9
add appendix to toc
Oct 4, 2022
4a9e652
rename module 4 notebooks for consistency
Oct 4, 2022
e4828f9
fix links and typo
Oct 4, 2022
d3de947
edit overviews for consistent styling
Oct 4, 2022
6d12e3d
increase accessibility of landing page
Oct 4, 2022
bb1805b
move setting up environment to dedicated page
Oct 4, 2022
863378b
add WIP glossary and contact pages
Oct 4, 2022
95752ec
add appendix to toc
Oct 4, 2022
c503cfa
rename module 4 notebooks for consistency
Oct 4, 2022
1e63cd6
fix links and typo
Oct 4, 2022
023dd9c
edit overviews for consistent styling
Oct 4, 2022
bc7d97e
typo in toc
Oct 4, 2022
6038441
typo in index
Oct 4, 2022
ada4a4b
change 'Exploring and Wrangling' to 'Data Wrangling'
Oct 4, 2022
dbab6f9
change title of 2.2 to Data Wrangling
Oct 5, 2022
38b85d7
typo
Oct 5, 2022
0cde0fd
large rewrite of 1.1
Oct 5, 2022
1e0ad32
Merge branch 'streamlining' into streamlining-cm
Oct 5, 2022
9f3d20d
Merge pull request #140 from alan-turing-institute/streamlining-cm
Oct 5, 2022
0aad8fc
add definition of RDS @AoifeHughes
Oct 5, 2022
7f5f166
small 4.1. tweaks
Oct 5, 2022
617001d
Shifted disclaimer to be more general
Oct 7, 2022
575e4c6
grammar!!!
Oct 7, 2022
044550f
minor grammar
Oct 7, 2022
d4d92a5
Added some placeholders
Oct 7, 2022
d5dde5c
73 warnings!
Oct 14, 2022
f2254e5
tweaked to not run on hands-on
Oct 16, 2022
f0f4bb0
70 warnings...
Oct 16, 2022
7c37bc3
Remove appendix + refs
Oct 16, 2022
fb7984c
added data
Oct 16, 2022
e3c7b0f
resolve merge
Oct 16, 2022
1 change: 1 addition & 0 deletions .gitignore
@@ -5,3 +5,4 @@ data/*
node_modules/
package-lock.json
package.json
*.sh
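The added `*.sh` pattern contains no slash, so git applies it to file names at any directory level. The sketch below approximates that behaviour with Python's `fnmatch`. It is an illustration only: the paths are hypothetical, and `fnmatch` does not reproduce every `.gitignore` rule (for example negation or directory-only patterns).

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def ignored_by_basename_pattern(path: str, pattern: str = "*.sh") -> bool:
    """Approximate a slash-free .gitignore pattern by matching the file name only."""
    return fnmatch(PurePosixPath(path).name, pattern)


# Hypothetical paths, for illustration only.
examples = ["build.sh", "scripts/grammar.sh", "coursebook/index.md"]
ignored = [p for p in examples if ignored_by_basename_pattern(p)]
print(ignored)
```

For an authoritative answer on a real checkout, `git check-ignore -v <path>` reports which pattern, if any, matches a path.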
57 changes: 31 additions & 26 deletions coursebook/_toc.yml
@@ -25,33 +25,33 @@ parts:
- file: modules/m1/hands-on
- caption: "Module 2: Handling data"
chapters:
- file: modules/m2/2-overview
- file: modules/m2/overview
# getting data
- file: modules/m2/2-01-GettingLoading
- file: modules/m2/2.1-GettingLoading
sections:
- file: modules/m2/2-01-01-WhereToFindData
- file: modules/m2/2-01-02-LegalityAndEthics
- file: modules/m2/2-01-03-PandasIntro
- file: modules/m2/2-01-04-DataSourcesAndFormats
- file: modules/m2/2-01-05-ControllingAccess
- file: modules/m2/2.1.1-WhereToFindData
- file: modules/m2/2.1.2-LegalityAndEthics
- file: modules/m2/2.1.3-PandasIntro
- file: modules/m2/2.1.4-DataSourcesAndFormats
- file: modules/m2/2.1.5-ControllingAccess
# cleaning and wrangling
- file: modules/m2/2-02-ExploringWrangling
- file: modules/m2/2.2-DataWrangling
sections:
- file: modules/m2/2-02-01-DataConsistency
- file: modules/m2/2-02-02-ModifyingColumnsAndIndices
- file: modules/m2/2-02-03-FeatureEngineering
- file: modules/m2/2-02-04-DataManipulation
- file: modules/m2/2.2.1-DataConsistency
- file: modules/m2/2.2.2-ModifyingColumnsAndIndices
- file: modules/m2/2.2.3-FeatureEngineering
- file: modules/m2/2.2.4-DataManipulation
sections:
- file: modules/m2/2-02-04-01-TimeAndDateData
- file: modules/m2/2-02-04-02-TextData
- file: modules/m2/2-02-04-03-CategoricalData
- file: modules/m2/2-02-04-04-ImageData
- file: modules/m2/2-02-05-PrivacyAndAnonymisation
- file: modules/m2/2-02-06-LinkingDatasets
- file: modules/m2/2-02-07-MissingData
- file: modules/m2/2.2.4.1-TimeAndDateData
- file: modules/m2/2.2.4.2-TextData
- file: modules/m2/2.2.4.3-CategoricalData
- file: modules/m2/2.2.4.4-ImageData
- file: modules/m2/2.2.5-PrivacyAndAnonymisation
- file: modules/m2/2.2.6-LinkingDatasets
- file: modules/m2/2.2.7-MissingData
# hands on
- file: modules/m2/2-hands-on
- file: modules/m2/2-hands-on-complete
- file: modules/m2/hands-on
- file: modules/m2/hands-on-complete
- caption: "Module 3: Data Visualisation & Exploration"
chapters:
- file: modules/m3/overview
@@ -64,9 +64,14 @@ parts:
- caption: "Module 4: Introduction to Modelling"
chapters:
- file: modules/m4/overview
- file: modules/m4/4.1_What_and_Why
- file: modules/m4/4.2_Fitting_Models
- file: modules/m4/4.3_Building_simple_model
- file: modules/m4/4.4_Evaluating_a_model
- file: modules/m4/hands-on
- file: modules/m4/4.1-WhatAndWhy
- file: modules/m4/4.2-ModelFitting
- file: modules/m4/4.3-ModelBuilding
- file: modules/m4/4.4-ModelEvaluation
- file: modules/m4/hands-on
- caption: "Appendix"
chapters:
- file: modules/appendix/A.1-Glossary
- file: modules/appendix/A.2-SettingUp
- file: modules/appendix/A.3-ContactUs
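
Because this change renames many notebooks, a `file:` entry can easily end up pointing at a path that no longer exists. The following is a minimal sketch of a check for that, assuming entries are written without a file extension (as above) and resolve to either an `.ipynb` or `.md` file under the book root. The helper names `toc_files` and `missing_toc_files` are hypothetical and are not part of Jupyter Book.

```python
import re
from pathlib import Path
from typing import List

# Matches lines such as "  - file: modules/m2/overview" and captures the path.
FILE_ENTRY = re.compile(r"^\s*-\s*file:\s*(\S+)\s*$")


def toc_files(toc_text: str) -> List[str]:
    """Return every `file:` entry in a _toc.yml, in order of appearance."""
    entries = []
    for line in toc_text.splitlines():
        match = FILE_ENTRY.match(line)
        if match:
            entries.append(match.group(1))
    return entries


def missing_toc_files(toc_text: str, book_root: Path) -> List[str]:
    """List entries with no matching .ipynb or .md file under book_root."""
    missing = []
    for entry in toc_files(toc_text):
        # Append the extension as text: names like "2.1-GettingLoading"
        # contain dots, so Path.with_suffix would cut the name short.
        candidates = [book_root / f"{entry}{ext}" for ext in (".ipynb", ".md")]
        if not any(path.is_file() for path in candidates):
            missing.append(entry)
    return missing
```

It could be run from the repository root as `missing_toc_files(Path("coursebook/_toc.yml").read_text(), Path("coursebook"))`; an empty list would mean every entry resolves to a file.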

62 changes: 38 additions & 24 deletions coursebook/index.md
@@ -1,34 +1,41 @@
# Welcome to The Alan Turing Institute's Introduction to Research Data Science course
# Welcome!

This Research Data Science online training course was
developed by [The Alan Turing Institute's](https://www.turing.ac.uk/)
Welcome to an **Introduction to Research Data Science**, developed by [The Alan Turing Institute's](https://www.turing.ac.uk/)
[Research Engineering Group](https://www.turing.ac.uk/research-engineering).

The course consists of four modules, each involving a half-day taught session and a half-day hands-on session. The material can be used for synchronous online attendance or asynchronous study.

## Summary
## Introduction

Data science methods and tools have become commonplace in research projects across academia, government and industry. Researchers increasingly need to collaborate with multi-disciplinary teams of data scientists, software engineers and other stakeholders.

This course is designed for researchers interested in understanding and using data science methods in their work. The course will help learners move beyond data science principles, to learn how to tackle real, complex and sometimes vaguely defined research data science projects. They will learn how to do this in a collaborative environment, with an emphasis on practical techniques and technologies and with an overarching awareness of ethics and diversity issues. This is an intensive, hands-on course, informed by REG’s experience with research data science projects and aiming to bring learners in touch with day-to-day research data science practices.
The goal of this course is to introduce how you can use data science principles to tackle real, complex, and sometimes vaguely defined research data science projects. The course is not a handbook of data science methods. Rather, the focus is on how to begin using these methods on collaborative research projects, with an emphasis on awareness of ethics and diversity issues.

The course consists of:
- Taught modules that will introduce learners to key concepts, methodology and ways of solving problems.
- Hands-on modules where learners will work in teams to tackle a real research data science problem, including scoping it, discussing it from an equality, diversity and inclusion (EDI) point of view and coding collaboratively to produce a data science solution.

This course complements the Turing’s Research Software Engineering with Python course (found [here](https://alan-turing-institute.github.io/rsd-engineeringcourse/)).
## Who?

## Key objectives and learning outcomes
The main objectives of the course are the following:
- Teach attendees how to use research data science (RDS) methods in an interdisciplinary research environment.
- Move beyond core principles and methodology, towards a hands-on, practical understanding, focused on collaboration, reproducibility and openness.
- Provide exposure to a real-world RDS project and demonstrate the decision-making process used to choose the right method and tools for each setting and in each project step.
- Embed data ethics, diversity and inclusion awareness into the learners’ approach to all stages of an RDS project, providing multiple examples.
**We are** a group of data scientists and software engineers who work on a wide range of research problems.

**You are** someone interested in learning about, or using, data science methods in research. To follow the whole course you will need some basic programming experience; see [Prerequisites](#prerequisites) for more information.



## Course materials

This free and open course is primarily the Jupyter Book you're reading. You can work through the material by yourself; see the [Syllabus](#syllabus).


Some tips on **how to use this course**:

- You will get a lot out of simply reading the online course book. However, the course is built from executable Jupyter notebooks that you can run yourself, and we encourage learners to try the hands-on sections, where we tackle a real research data science problem. Visit the [Setting Up](./modules/appendix/A.2-SettingUp.ipynb) page to set up your computer to follow along.

- There are some benefits to reading the course in order: the same dataset is used throughout the modules, especially in the hands-on sessions. However, much of the material is self-contained and can be read independently.


- If you are a self-learner and have questions, see how you can contact an instructor or other students on the [Contact us](./modules/appendix/A.3-ContactUs.ipynb) page.


- There is also a synchronous, taught version of the course, where each module is split into a half-day taught session and a half-day hands-on session.

The learning outcomes are the following:
- Attendees will understand fundamental RDS methods (e.g., data wrangling, visualisation, exploration, modeling) and know when/how to apply them to their research in order to draw data-driven insights or create data-driven tools.
- Attendees will be familiar with the stages of a collaborative RDS project, from scoping and data exploration to visualisation and modeling and will become aware of the challenges of tackling real-world problems.
- Attendees will be able to recognise power imbalances, bias and diversity issues in their technical work and in their ways of working with others and challenge them.

## Syllabus

@@ -79,10 +86,17 @@ Hands-on session:


## Prerequisites
Participants are expected to:
- Be comfortable with basic Python, either through working on a project or through attending a training course. Indicatively, they should be comfortable with the concepts covered in the “Introduction to Python” module from the Turing’s Research Software Engineering with Python course. The Programming with Python Software Carpentry also covers some of these concepts. Familiarity with Matplotlib, NumPy and Pandas is beneficial but not required.
- Have some basic knowledge of Git (setting up repositories, commits) through using it in projects or by attending training, e.g. the Software Carpentry’s Version Control with Git (Sections 1 to 4 and 7 to 9).
- Have read the first two sections of the Turing Way’s Guide for Collaboration (“Getting Started in GitHub” and “Maintainers and Reviewers in GitHub”).

There is no code in Module 1. Students will get more out of Modules 2-4 if they:

- Are comfortable with basic Python, as presented in:
  - The [Introduction to Python](https://alan-turing-institute.github.io/rse-course/html/module01_introduction_to_python/index.html) module from the Turing's Research Software Engineering with Python course.
  - Software Carpentry's [Programming with Python](https://swcarpentry.github.io/python-novice-inflammation/).
- Have some basic knowledge of using Git for version control, for example from Software Carpentry’s [Version Control with Git](https://swcarpentry.github.io/git-novice/) (Sections 1 to 4 and 7 to 9).
- Have basic knowledge of using GitHub for collaboration. See the first two sections of the Turing Way’s Guide for Collaboration ([Getting Started in GitHub](https://the-turing-way.netlify.app/collaboration/github-novice.html) and [Maintainers and Reviewers in GitHub](https://the-turing-way.netlify.app/collaboration/maintain-review.html)).


This course complements the [Turing’s Research Software Engineering with Python](https://alan-turing-institute.github.io/rse-course/) course.


## Acknowledgement
37 changes: 37 additions & 0 deletions coursebook/modules/appendix/A.1-Glossary.ipynb
@@ -0,0 +1,37 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Glossary \n",
"\n",
"A collection of explainers on useful terms and concepts used throughout the course.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.6 ('.venv': poetry)",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.6"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "b9e8921bd18fbd36d3a09ae9691fc21c58beec206524d0083259030e87e84f05"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
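
The notebooks in this diff appear as raw JSON, which makes the prose hard to review. As a rough aid, the sketch below pulls the markdown text out of an nbformat 4 notebook using only the standard library. The function name `markdown_cells` is hypothetical, and the example input is a cut-down copy of the glossary notebook above rather than the real file.

```python
import json
from typing import List


def markdown_cells(notebook_json: str) -> List[str]:
    """Return the text of each non-empty markdown cell in an nbformat 4 notebook."""
    notebook = json.loads(notebook_json)
    texts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", [])
        # nbformat allows `source` to be a single string or a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        if text.strip():
            texts.append(text)
    return texts


# A cut-down copy of the glossary notebook, for illustration only.
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Glossary \n", "\n", "A collection of explainers.\n"]},
        {"cell_type": "markdown", "metadata": {}, "source": []},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
})
print(markdown_cells(example))
```

Empty cells, such as the second cell in the glossary notebook, are skipped so that placeholder cells do not clutter the output.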
149 changes: 149 additions & 0 deletions coursebook/modules/appendix/A.2-SettingUp.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,149 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setting up your environment\n",
"\n",
"Here we learn how to set up your local environment so that you can run the notebooks. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"To setup and run the commands in this notebook you will need a (preferably bash/similar) shell with these installed:\n",
"- Python 3.7 or later\n",
" - Check by running `python --version` or `python3 --version` in your shell\n",
"- Git (optional)\n",
" - Check by running `git --version` in your shell\n",
"- Curl (optional)\n",
" - Check by running `curl --version` in your shell\n",
"\n",
"If you don't have these we have instructions in our [Research Software Engineering course](https://alan-turing-institute.github.io/rse-course/html/course_prerequisites/index.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clone the Course Repository\n",
"\n",
"In order to work locally with this notebook, you should clone the course repository.\n",
"\n",
"1. Go to the GitHub repository in a web browser: https://github.com/alan-turing-institute/rds-course\n",
"2. Click on the green \"Code\" button and copy the address under \"Clone - HTTPS\".\n",
"3. In your shell, run the following command from a sensible location (this will create a new dir for the course in current dir):\n",
" git clone https://github.com/alan-turing-institute/rds-course.git\n",
"4. Change directory to the repository root\n",
" cd rds-course\n",
"5. We're currently using the `develop` branch, so check that out\n",
" git checkout develop\n",
"\n",
"**Troubleshooting:**\n",
"- **If you don't have `git`:** We recommend using git, but if you don't have it installed you can download a zip of the code by clicking on \"Download Zip\" in step 2 above instead, and unpack it locally.\n",
"- **If you previously cloned/downloaded the repo:** Please run `git checkout develop` and then `git pull` from the `rds-course` directory to ensure you have the latest version of the material."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a Local Python Environment\n",
"\n",
"We need to install third-party packages necessary for the course, with the same package versions as it was developed with to ensure compatibility and reproducibility.\n",
"\n",
"### Managing Python Versions\n",
"\n",
"As well as the versions of packages your codebase should specify which version(s) of Python itself that it's compatible with. The code for this course should run with Python 3.7 or above. We don't cover it here to speedup setup, but if you need to use multiple versions of Python on your system we recommend [Pyenv](https://github.com/pyenv/pyenv) and [Conda](https://conda.io/projects/conda/en/latest/index.html#)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating a Virtual Environment with `Poetry`\n",
"\n",
"The Python ecosystem has many different ways of managing packaging and installing dependencies ([this page](https://packaging.python.org/key_projects/#pipenv) lists somem). The most well-known is `pip` with dependencies listed in a `requirements.txt` file.\n",
"\n",
"In this course we use the tool [Poetry](https://python-poetry.org/), which can help manage [multiple environments](https://python-poetry.org/docs/managing-environments/), in particular [switching between environments ](https://python-poetry.org/docs/managing-environments/#switching-between-environments).\n",
"\n",
"Dependencies are listed in `pyproject.toml` and have versions fixed in `poetry.lock`. `Poetry` will pick these files up and install the required packages in a predictable manner, and into a virtual environment isolated from other projects on your system.\n",
"\n",
"1. Install `Poetry` by following their instructions [here](https://python-poetry.org/docs/#installation).\n",
"\n",
"2. Change to the `rds-course` directory (the directory of the git repository cloned above), if you're not there already:\n",
" cd /path/to/rds-course\n",
"\n",
"3. Set the relevant Python executable for Poetry to use:\n",
" - If `python --version` returns a version number of 3.7 or above:\n",
" - Skip to step 4\n",
" - If `python --version` is less than 3.7 (e.g., 2.7), but `python3 --version` gives 3.7 or above:\n",
" - Run `poetry env use python3`\n",
" - If you have a Python 3.7+ environment available somewhere else:\n",
" - Run `poetry env use /full/path/to/python`\n",
" - If you don't have Python 3.7+ installed or don't know where to find it:\n",
" - Refer back to the instructions in the prerequisites and/or ask for help.\n",
"\n",
"4. Run the following command to create the virtual environment and install the third-party packages necessary for the course:\n",
" poetry install\n",
"\n",
"5. Check the details of the virtual environment that's been created:\n",
" poetry env info\n",
"\n",
"6. Initialise the environment:\n",
" poetry shell\n",
"\n",
"The last step creates a new shell setup to use the Python virtual environment we just created (e.g., `which python`, should now show the path returned earlier by `poetry env info` above, rather the path to your global Python executable). If you want to stop using the virtual environment `exit` the shell.\n",
"\n",
"**Troubleshooting:**\n",
"\n",
"- **If you don't have `curl`**: `curl` is used to download a Python script (currently [this script](https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py) but check the [Poetry documentation](https://python-poetry.org/docs/#installation) for the latest URL). Save this script as `get-poetry.py` and then run `python get-poetry.py` to install Poetry.\n",
"- **If you don't want to use `Poetry`**: You can install the course dependencies by running `pip install .` from the `rds-course` directory, but we recommend doing this in an alternative virtual environment of your choice (not in your global Python installation)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Jupyter\n",
"\n",
"We recommend use of [JupyterLab](https://jupyter.org/) for running through the hands-on notebooks in this course.\n",
"\n",
"JuypyterLab was installed into your Poetry environment in the previous step. We can launch a local instance, from the poetry environment terminal, with:\n",
"\n",
"jupyter lab # from the root of the cloned github repository! \"rds-course\" directory\n",
"\n",
"We recommend following the rest of the notebook via the JupyterLab instance that should spawn!\n",
"\n",
"Click through the file explorer in the left-hand pane to bring up this notebook.\n",
"\n",
"The notebook should be present at: `rds-course/coursebook/modules/m2/2-hands-on.ipynb`\n",
"\n",
"If you've not used `Jupyter` before you might find their [Notebook basics](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Notebook%20Basics.html) and [Running code](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Running%20Code.html) documentation helpful."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.6 ('.venv': poetry)",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.6"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "b9e8921bd18fbd36d3a09ae9691fc21c58beec206524d0083259030e87e84f05"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
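
The prerequisite checks in the setup notebook (`python --version`, `git --version`, `curl --version`) can also be run in one go from Python. This is a minimal sketch, not part of the course material: the function name `check_prerequisites` is hypothetical, the 3.7 minimum is taken from the notebook text, and a tool counts as "available" only in the sense that an executable of that name is on `PATH`.

```python
import shutil
import sys
from typing import Dict

MIN_PYTHON = (3, 7)  # minimum version stated in the setup notebook


def check_prerequisites() -> Dict[str, bool]:
    """Report which of the tools named in the setup notes are available."""
    return {
        "python>=3.7": sys.version_info[:2] >= MIN_PYTHON,
        # shutil.which returns None when the executable is not on PATH.
        "git": shutil.which("git") is not None,
        "curl": shutil.which("curl") is not None,
        "poetry": shutil.which("poetry") is not None,
    }


if __name__ == "__main__":
    for tool, found in check_prerequisites().items():
        print(f"{tool}: {'ok' if found else 'missing'}")
```

Git and curl are marked optional in the notebook, so a "missing" result for either is a hint rather than an error. The check covers only the interpreter running the script, which may differ from the one Poetry selects.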
32 changes: 32 additions & 0 deletions coursebook/modules/appendix/A.3-ContactUs.ipynb
@@ -0,0 +1,32 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To add:\n",
"- Direct Self-Learners to Github Discussion board\n",
"- Add info to contact course organisers (can either use issues, discussion board, or perhaps an email)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.6 ('.venv': poetry)",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.6"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "b9e8921bd18fbd36d3a09ae9691fc21c58beec206524d0083259030e87e84f05"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}