DAT7 Course Repository

Course materials for General Assembly's Data Science course in Washington, DC (6/1/15 - 8/12/15).

Instructor: Kevin Markham (Data School blog, email newsletter, YouTube channel)

| Monday | Wednesday |
| --- | --- |
| 6/1: Introduction to Data Science | 6/3: Command Line and Version Control |
| 6/8: Data Reading and Cleaning | 6/10: Exploratory Data Analysis |
| 6/15: Visualization | 6/17: Machine Learning |
| 6/22: Getting Data (Project Discussion Deadline) | 6/24: K-Nearest Neighbors (Project Question and Dataset Due) |
| 6/29: Basic Model Evaluation | 7/1: Linear Regression |
| 7/6: Logistic Regression | 7/8: Advanced Model Evaluation |
| 7/13: First Project Presentation | 7/15: Naive Bayes and Text Data |
| 7/20: Natural Language Processing | 7/22: Kaggle Competition |
| 7/27: Decision Trees (Draft Paper Due) | 7/29: Ensembling |
| 8/3: Advanced scikit-learn and Clustering (Peer Review Due) | 8/5: Course Review |
| 8/10: Final Project Presentation | 8/12: Final Project Presentation |

Python Resources

Submission Forms


Class 1: Introduction to Data Science

Homework:

Resources:


Class 2: Command Line and Version Control

  • Command line exercise (code)
  • Git and GitHub (slides)
  • Intermediate command line
  • Wrap up: Course schedule, office hours

Homework:

  • Complete the homework exercise listed in the command line introduction. Create a Markdown document that includes your answers and the code you used to arrive at those answers. Add this file to a GitHub repo that you'll use for all of your coursework, and submit a link to your repo using the homework submission form.
  • Review the code from the beginner and intermediate Python workshops. If you don't feel comfortable with any of the content (up through the "dictionaries" section), you should spend some time this weekend practicing Python. Here are my recommended resources:
    • If you like learning from a book, Python for Informatics has useful chapters on strings, lists, and dictionaries.
    • If you prefer interactive exercises, try these lessons from Codecademy: "Python Lists and Dictionaries" and "A Day at the Supermarket".
    • If you have more time, try these much longer lessons from DataQuest: "Find the US city with the lowest crime rate" and "Discover weather patterns in LA".
    • If you've already mastered these topics and want more of a challenge, try solving the second Python Challenge and send me your code in Slack.
  • If there are specific Python topics you want me to cover next week, send me a Slack message.

Git and Markdown Resources:

  • Pro Git is an excellent book for learning Git. Read the first two chapters to gain a deeper understanding of version control and basic commands.
  • If you want to practice a lot of Git (and learn many more commands), Git Immersion looks promising.
  • If you want to understand how to contribute on GitHub, you first have to understand forks and pull requests.
  • GitRef is my favorite reference guide for Git commands, and Git quick reference for beginners is a shorter guide with commands grouped by workflow.
  • Markdown Cheatsheet provides a thorough set of Markdown examples with concise explanations. GitHub's Mastering Markdown is a simpler and more attractive guide, but is less comprehensive.

Command Line Resources:

  • If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
  • If you want to do more at the command line with CSV files, try out csvkit, which can be installed via pip.

Class 3: Data Reading and Cleaning

  • Git and GitHub assorted tips (slides)
  • Review command line homework (solution)
  • Python:
    • Spyder interface
    • Review of list comprehensions
    • Lesson on file reading with airline safety data (code, data, article)
    • Data cleaning exercise
    • Walkthrough of homework with Chipotle order data (code, data, article)
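
As a rough illustration of the file reading and list comprehension topics above, here is a minimal sketch using only the standard library. The column names and rows are made up for the example (they are not the airline safety or Chipotle data), and `io.StringIO` stands in for a real file on disk.

```python
import csv
import io

# Hypothetical stand-in for a CSV file; with a real file you would use
# open('your_file.csv', newline='') instead of io.StringIO(...)
sample = io.StringIO(
    "airline,incidents,fatalities\n"
    "Alpha Air,2,0\n"
    "Beta Airways,5,12\n"
    "Gamma Jet,0,0\n"
)

# Read every row into a list of lists
with sample as f:
    rows = [row for row in csv.reader(f)]

# Separate the header row from the data rows
header = rows[0]
data = rows[1:]

# List comprehension: convert one column from strings to integers
incidents = [int(row[1]) for row in data]

# List comprehension with a condition: airlines with zero fatalities
safe_airlines = [row[0] for row in data if int(row[2]) == 0]

print(header)
print(incidents)
print(safe_airlines)
```

Everything read by `csv.reader` is a string, so numeric columns need an explicit conversion (here `int`) before you can do arithmetic or comparisons on them.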

Homework:

  • Complete the homework assignment with the Chipotle data, and add a commented Python script to your GitHub repo. If you are unable to complete a part, try writing some pseudocode instead! You have until Monday to complete this assignment.

Resources:

  • PEP 8 is Python's "classic" style guide, and is worth a read if you want to write readable code that is consistent with the rest of the Python community.

Class 4: Exploratory Data Analysis

Homework:

Resources:


Class 5: Visualization

  • Part 2 of Exploratory Data Analysis with Pandas (code)
  • Visualization with Pandas and Matplotlib (code)

Homework:

Pandas Resources:

Visualization Resources:


Class 6: Machine Learning

Homework:

  • Your deadline for discussing your project ideas with an instructor is Monday, and your project question and dataset are due Wednesday.

Resources:


Class 7: Getting Data

Homework:

API Resources:

Web Scraping Resources:


Class 8: K-Nearest Neighbors

Homework:

  • Reading assignment on the bias-variance tradeoff
  • Browse through the scikit-learn documentation for KNN to get a sense of how it's organized: user guide, module reference, class documentation
  • Work on your project... your first project presentation is in less than three weeks!
  • Optional: Read the Teaching Assistant Evaluation dataset into Pandas, create the X and y objects (the response variable is "class attribute"), and go through scikit-learn's 4-step modeling process. (There's no need to submit your code unless you have a question or would like feedback!)
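
For reference, the 4-step modeling pattern mentioned in the optional exercise can be sketched as follows. This is an illustration under stated assumptions: it uses scikit-learn's bundled iris dataset rather than the Teaching Assistant Evaluation data, and the value `n_neighbors=5` and the new observation are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris

# Step 1: import the class you plan to use
from sklearn.neighbors import KNeighborsClassifier

# Load a bundled example dataset and create the feature matrix X
# and the response vector y
iris = load_iris()
X = iris.data
y = iris.target

# Step 2: instantiate the estimator (n_neighbors=5 is an arbitrary choice)
knn = KNeighborsClassifier(n_neighbors=5)

# Step 3: fit the model with data (the model learns the relationship
# between X and y)
knn.fit(X, y)

# Step 4: predict the response for a new observation
# (the measurements below are made up)
new_observation = [[3, 5, 4, 2]]
prediction = knn.predict(new_observation)
print(prediction)
```

With a different dataset, only the lines that build `X` and `y` need to change; the import, instantiate, fit, predict steps stay the same.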

KNN Resources:

Reproducibility Resources:

Other Resources:

  • If you would like to learn the IPython Notebook, the official Notebook tutorials are useful.
  • To get started with Seaborn for visualization, the official website has a series of tutorials and an example gallery.

Class 9: Basic Model Evaluation

Homework:

Resources:


Class 10: Linear Regression

Homework:

Resources:


Class 11: Logistic Regression

Homework:

Resources:


Class 12: Advanced Model Evaluation

  • Advanced model evaluation (notebook, notebook code)
    • Null accuracy, handling missing values
    • Confusion matrix
    • Handling categorical features
  • ROC curves and AUC
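
The evaluation ideas listed above can be worked through by hand on a tiny example. The sketch below uses made-up true and predicted labels for a binary problem and plain Python only; it is meant to show the arithmetic, not to replace the functions in `sklearn.metrics`.

```python
from collections import Counter

# Made-up true labels and predicted labels for a binary classification problem
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Null accuracy: the accuracy achieved by always predicting the most
# frequent class in y_true
most_common_class, count = Counter(y_true).most_common(1)[0]
null_accuracy = count / len(y_true)

# Confusion matrix counts
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

# Metrics derived from the confusion matrix
accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)          # true positive rate
false_positive_rate = fp / (fp + tn)

print("null accuracy:", null_accuracy)
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("accuracy:", accuracy)
print("sensitivity:", sensitivity)
print("false positive rate:", false_positive_rate)
```

Comparing `accuracy` against `null_accuracy` shows how much the model improves on a trivial baseline. Each point on an ROC curve is one (false positive rate, true positive rate) pair like the one computed here, obtained at a particular classification threshold.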

Homework:

  • Your first project presentation is on Monday! Please submit a link to your project repository (with slides, code, data, and visualizations) before class using the submission form.

ROC Resources:

Other Resources:


Class 13: First Project Presentation

  • Project presentations!

Homework:


Class 14: Naive Bayes and Text Data

Homework:

  • Confirm that you have TextBlob installed by running import textblob from within your preferred Python environment. If it's not installed, run pip install textblob at the command line (not from within Python).
  • Complete the Yelp review text homework, and add a Python script (or IPython notebook) to your GitHub repo. This assignment is due on Monday.
  • There is a video/reading assignment on cross-validation, for those of you who have not already watched the video or would prefer a reading instead.
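
If you would rather check for TextBlob without triggering an `ImportError`, one option is the small sketch below. The helper name `is_installed` is made up for this example; it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("textblob"):
    print("TextBlob is installed.")
else:
    # Installation happens at the command line, not from within Python
    print("TextBlob is missing: run `pip install textblob` at the command line.")
```

Run this in the same Python environment you use for class, since a package installed for one environment may not be visible in another.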

Resources:

  • For more on conditional probability, read these slides, or read section 2.2 of the OpenIntro Statistics textbook (14 pages).
  • For an intuitive explanation of Naive Bayes classification, read this post on airport security.
  • For more details on Naive Bayes classification, Wikipedia has two excellent articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has a good Q&A.
  • When applying Naive Bayes classification to a dataset with continuous features, it is best to use GaussianNB rather than MultinomialNB. Wikipedia has a short description of Gaussian Naive Bayes, as well as an excellent example of its usage.
  • These slides from the University of Maryland provide more mathematical details on both logistic regression and Naive Bayes, and also explain how Naive Bayes is actually a "special case" of logistic regression.
  • Andrew Ng has a paper comparing the performance of logistic regression and Naive Bayes across a variety of datasets.
  • If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
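
To make the Gaussian Naive Bayes idea above more concrete, here is a from-scratch sketch in plain Python. It is not scikit-learn's `GaussianNB` implementation: the function names and the toy data are made up for illustration, and the small constant added to each variance is an assumption to avoid division by zero.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small constant keeps the variance strictly positive
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        # "Naive" assumption: features are independent given the class,
        # so the per-feature Gaussian log likelihoods are simply added
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy data with two continuous features per observation
X = [[180, 80], [175, 77], [183, 85], [160, 55], [165, 58], [158, 52]]
y = ["a", "a", "a", "b", "b", "b"]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [178, 79]))
print(predict_gaussian_nb(model, [161, 54]))
```

Because each feature is modeled with a normal distribution, continuous values can be used directly, whereas a multinomial model is designed for counts such as word frequencies.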

Class 15: Natural Language Processing

Homework:

  • Download the competition files, move them to the DAT7/data directory, and make sure you can open the CSV files using Pandas. If you have any problems opening the files, you probably need to turn off real-time virus scanning (especially Microsoft Security Essentials).
  • Come up with some theories about which features might be relevant to predicting the response, and then explore the data to see if those theories appear to be true.
  • Optional: Think about some features that might be worth creating from the data, and then figure out how to actually create those features.
  • Optional: Watch my project presentation video (16 minutes) for a tour of the end-to-end machine learning process for a Kaggle competition, including the creation of new features. (Or, just read through the slides.)

NLP Resources:

Cross-Validation Resources:


Class 16: Kaggle Competition

Homework:

  • Your draft paper is due on Monday! Please submit a link to your project repository (with paper, code, data, and visualizations) before class using the submission form.
  • Optional: Keep working on this competition! You can make up to 5 submissions per day, and the competition doesn't close until 6:30pm ET on Wednesday, August 5 (class 20).

Resources:


Class 17: Decision Trees

Homework:

Resources:

Installing GraphViz (optional):

  • Mac: Download and install PKG file
  • Windows: Download and install MSI file, and then add GraphViz to your path:
    • Go to Control Panel, System, Advanced System Settings, Environment Variables
    • Under system variables, edit "Path" to include the path to the "bin" folder, such as: C:\Program Files (x86)\Graphviz2.38\bin
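
After installing, one way to confirm that GraphViz is reachable from Python is to look for its `dot` executable on your path. This is a small sketch using the standard library; it assumes the GraphViz install provides a `dot` command, and you may need to restart your terminal or IDE after editing the path.

```python
import shutil

# shutil.which returns the full path to an executable found on the
# system path, or None if it cannot be found
dot_path = shutil.which("dot")

if dot_path is None:
    print("GraphViz 'dot' was not found on your path.")
else:
    print("GraphViz 'dot' found at:", dot_path)
```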

Class 18: Ensembling

Resources:


Class 19: Advanced scikit-learn and Clustering

Homework:

scikit-learn Resources:

Clustering Resources:


Class 20: Course Review

Homework:

  • Your final project is due next week!

Resources:


Classes 21 and 22: Final Project Presentation


Bonus Resources

Databases and SQL

Tidy Data

Regular Expressions ("Regex")

  • RegexOne is an interactive tutorial for learning the basics of regular expressions.
  • Google's Python Class includes an excellent introductory lesson on regular expressions (which also has an associated video).
  • Python for Informatics has a nice chapter on regular expressions. (If you want to run the examples, you'll need to download mbox.txt and mbox-short.txt.)
  • My reference guide to regular expressions includes lots of short explanations and simple examples.
  • regex101 is an online tool for testing your regular expressions in real time.
  • If you want to go really deep with regular expressions, RexEgg includes endless articles and tutorials.
  • Exploring Expressions of Emotions in GitHub Commit Messages is a fun example of how regular expressions can be used for data analysis, and Emojineering explains how Instagram uses regular expressions to detect emoji in hashtags.
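
As a quick taste of regular expressions in Python, here is a minimal sketch using the standard `re` module. The sample lines are made up, loosely in the style of a mailbox file, and the patterns are simple illustrations rather than robust email or time validators.

```python
import re

# Made-up sample lines in the style of a mailbox file
lines = [
    "From alice@example.com Sat Jan  5 09:14:16 2008",
    "Subject: weekly report",
    "From bob.smith@example.org Fri Jan  4 18:10:48 2008",
]

# ^From      line must start with the word "From" and a space
# (\S+@\S+)  capture non-whitespace characters around an "@" sign
sender_pattern = re.compile(r"^From (\S+@\S+)")

senders = []
for line in lines:
    match = sender_pattern.search(line)
    if match:
        senders.append(match.group(1))

# \d{2} matches exactly two digits, so this finds HH:MM:SS timestamps
times = re.findall(r"\d{2}:\d{2}:\d{2}", lines[0])

print(senders)
print(times)
```

Raw strings (the `r` prefix) keep backslashes such as `\S` and `\d` from being interpreted as Python string escapes.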

Regularization

Recommendation Systems