MichiganDataScienceTeam/W23-Webscraping

webscraping

Winter 2023 MDST Webscraping Project

Introduction

Have you ever felt like stalking someone's 📱insta485gram📱, but were too embarrassed to do it while logged into your account? Maybe you wanted to pull quotes from someone's favorite 🎥movie🎥 so you could serenade them with a random quote every day! Now imagine these kinds of tasks, but on a much larger scale. Some more MDST examples...

Description

Since there is no single way to scrape websites, we won't work on just one project for the entire semester. Instead, we have a few mini projects (one is completely self-guided) that give us experience scraping many different kinds of websites. This should also give us some appreciation for the work Google does to make its crawlers work.

The culminating project is a unified app that scrapes information about all UofM professors from their websites and cross-references it with relevant reviews from Atlas. One use case is to show the open research positions a professor has alongside reviews of their teaching.

Goals

  1. Webscrape structured and unstructured data, and learn good ways to display/visualize it
  2. Dive into a self-guided mini project that interests YOU (⚡ talk?)
  3. Create a "one-stop shop" that UofM students could use to search for research positions in areas they are interested in
  4. Have fun and learn something! 😃

A Look at the Data

We scrape our data!

Project Roadmap

Week of 1/29: Intro to Webscraping

  • Kickoff!
  • Introductions
  • Familiarize ourselves with BeautifulSoup

Week of 2/5: Scrape well-tabulated websites

  • MLB website
  • Tennis rankings
  • Pretty much any competitive sport
  • Instagram
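
As a rough illustration of what scraping a well-tabulated page involves, here is a minimal sketch that turns an HTML table into a list of records. It uses only the standard library's html.parser so it runs without installing anything; in the project, BeautifulSoup would do the same job more concisely. The sample markup, the TableParser class, and the parse_table helper are all hypothetical and are not taken from this repo.

```python
from html.parser import HTMLParser

# Hypothetical sample markup standing in for a real rankings page.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Player</th><th>Points</th></tr>
  <tr><td>1</td><td>Player A</td><td>9000</td></tr>
  <tr><td>2</td><td>Player B</td><td>8500</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return one dict per body row, keyed by the header row's cells."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = parse_table(SAMPLE_HTML)
for record in records:
    print(record)
```

Each record is a plain dict, so the result can be passed straight to pandas.DataFrame for the visualization work later in the semester.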

Weeks of 2/12-3/12: Begin individual projects

  • Sub-teams!
  • Find something to scrape
  • (At some point) Intro to Selenium (interactive webscraping)

2/25-3/5: Spring Break!


Week of 3/19: Wrap up individual projects

  • Make visualizations of our data

Weeks of 3/26-4/16: Develop Michigan Web Crawler

  • Plan out application design
  • Flesh out basic API to interact with webpage
  • Test it!

Week of 4/16: Finishing Touches

  • Complete the write-up
  • Prepare for final presentations!

Week of 4/23: Final Expo

  • Show what we've been working on!

Setup

First, clone this repo (via SSH):

git clone git@github.com:MichiganDataScienceTeam/webscraping.git

Virtual Environment

You can choose whether or not to use a virtual environment for this project (though it is recommended). The steps below create one with Python's built-in venv module and install packages with pip, but Conda works too if you prefer. The important thing is that you can run the commands found in the Good to go section.

We are going to initialize a Python virtual environment with all the required packages. A virtual environment isolates the project's development environment from the rest of your computer, which avoids leaving messes behind and keeps each project's setup contained.

First, create a virtual environment (the project targets Python 3.8). The command for Linux/MacOS is below; it uses whichever Python version python3 points to on your system:

python3 -m venv venv

Now that you have created a virtual environment, you need to activate it. The exact command depends on your system, but on Linux/MacOS it is

source ./venv/bin/activate

Now your computer will know to use the Python installation in the virtual environment rather than your default installation.
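
If you want to confirm that activation worked, a small sketch like the one below can help. It relies on the fact that inside a venv-created environment sys.prefix differs from sys.base_prefix; the in_virtualenv helper name is made up for this example, and the check does not apply to Conda environments.

```python
import sys


def in_virtualenv():
    """Return True when Python is running inside a venv-style environment."""
    # A venv points sys.prefix at the environment directory, while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

With the environment active, the interpreter path should point inside the venv directory you just created.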

After the virtual environment has been activated, we can install the required dependencies into this environment using

pip install -r requirements.txt

Good to go

If it is set up correctly, you should be able to open a dev server and see the app for some intro webscraping by moving to the "flaskr" directory and then running the app:

cd flaskr
flask run

Open up the server to see if it works! (ctrl + click on http://127.0.0.1:5000)
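
If you would rather check from a script than a browser, the following sketch tries to reach the dev server using only the standard library. The server_is_up helper is hypothetical, and the URL is simply the address mentioned above; adjust it if your server reports a different one.

```python
import urllib.error
import urllib.request


def server_is_up(url="http://127.0.0.1:5000", timeout=2.0):
    """Return True if something answers HTTP requests at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just with an error status (e.g. 404).
        return True
    except OSError:
        # Connection refused, timeout, or another network-level failure.
        return False


if __name__ == "__main__":
    print("Dev server reachable:", server_is_up())
```

Run it in a second terminal while flask run is still active in the first.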

Other relevant stuff

MDST Calendar

Required Skills

Intermediate Python, Pandas (enough that it won't impede progress)

Learned Skills

HTML, CSS, BeautifulSoup, Selenium, RegEx
