Exercise: Shell scripts

Saagar Deshpande edited this page Sep 25, 2013 · 4 revisions

Building our Amazon image web scraper

Let's build the scraper that we introduced at our very first meeting! Recall the command (it is also located in scraper/scraper.sh):

cutpoint=$(echo "http://ecx.images-amazon.com/images/I/" | wc -m | grep '[0-9]\{1,\}' --only-matching); mkdir -p images; curl --silent http://www.amazon.com/s/\&field-keywords\=ocaml | grep 'http://ecx.images-amazon.com/images/I/[0-9A-Za-z\.\_\,\%\\-]\{0,\}.jpg' --only-matching | while read image; do suffix=$(echo $image | cut -c $cutpoint-); wget $image -O images/$suffix; done

Let's break this down piece by piece and rewrite it more elegantly as a shell script!
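The least obvious piece of the one-liner is how it turns an image URL into a file name. Here is a minimal sketch of just that step, run on a made-up URL so that it needs no network access. The name example123.jpg is hypothetical, and -o is the short form of --only-matching.

```shell
#!/bin/sh
# Prefix shared by the image URLs (taken from the one-liner above).
prefix="http://ecx.images-amazon.com/images/I/"

# echo appends a newline, so wc -m reports the prefix length plus one:
# exactly the position of the first character after the prefix.
# grep -o keeps only the digits, dropping any padding that some wc
# implementations print around the count.
cutpoint=$(echo "$prefix" | wc -m | grep -o '[0-9]\{1,\}')

# Hypothetical image URL, used only for illustration.
image="${prefix}example123.jpg"

# cut -c N- keeps everything from character N onward, i.e. the file name.
suffix=$(echo "$image" | cut -c "$cutpoint"-)

echo "cutpoint=$cutpoint suffix=$suffix"
```

With the 38-character prefix above, this should print cutpoint=39 suffix=example123.jpg.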

Finding sources to scrape

How did I know to scrape images from http://ecx.images-amazon.com/images/I/? Go to www.amazon.com, do a search for ocaml, then right-click a result image and click "Inspect Element":

[Screenshot: the "Inspect Element" context menu]

This should open up the web inspector with the corresponding HTML highlighted.

[Screenshot: the web inspector]

If you look more carefully, you'll see the source of the image:

[Screenshot: the web inspector showing the image source]

It starts with http://ecx.images-amazon.com/images/I/! If you poke around at the other images, you'll see that they all share this prefix.
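Because every image URL shares that prefix, grep can pull the URLs out of the page source. The sketch below tries this on a hand-written HTML fragment rather than a real Amazon page. The fragment and its file names are invented for illustration, and the character class is a simplified variant of the one in the original command.

```shell
#!/bin/sh
prefix='http://ecx.images-amazon.com/images/I/'

# Hypothetical HTML fragment standing in for a search-results page.
sample_html='<div><img src="http://ecx.images-amazon.com/images/I/abc123._AA160_.jpg" alt="book"></div>
<div><img src="http://example.com/logo.png"></div>
<div><img src="http://ecx.images-amazon.com/images/I/xyz-789.jpg"></div>'

# -o (short for --only-matching) prints each matched URL on its own line
# instead of the whole line that contains it.
urls=$(printf '%s\n' "$sample_html" | grep -o "${prefix}[0-9A-Za-z._,%-]*\.jpg")

printf '%s\n' "$urls"
```

Only the two URLs that begin with the prefix should be printed. The logo.png line is skipped.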

Build the scraper

Make sure you have the bootcamp code repo (same as the scavenger hunt exercise).

git clone https://github.com/hcs/bootcamp-unix.git
cd bootcamp-unix/exercise-scraper

Open up the file scraper.sh in your favorite text editor, and then follow the instructions in the file to finish building the scraper.

When you think you're done, run the scraper with ./scraper.sh. This should create an images directory with the images! To remove the images:

rm -rf images

The solution is located in ./scraper_solution.sh. To see the original scraper from the first week and the prettified version of it:

cd ../scraper
ls

There should be two files: scraper.sh (the original version from the demo) and scraper_pretty.sh (a more readable version of it).
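For comparison, here is one way a more readable layout might look. This is a sketch only, not the contents of scraper_pretty.sh or scraper_solution.sh. The function names are hypothetical, and it strips the prefix with the shell's ${var#pattern} expansion instead of the wc and cut approach used in the original command. The search URL is the one from the original command and may no longer return the same page.

```shell
#!/bin/sh
# Sketch of a readable layout; function names are illustrative only.
prefix='http://ecx.images-amazon.com/images/I/'
search_url='http://www.amazon.com/s/&field-keywords=ocaml'
outdir='images'

# Print every image URL found in the HTML read from standard input.
extract_image_urls() {
    grep -o "${prefix}[0-9A-Za-z._,%-]*\.jpg"
}

# Strip the shared prefix from a URL, leaving only the file name.
image_filename() {
    printf '%s\n' "${1#"$prefix"}"
}

# Fetch the search page and download each image into $outdir.
main() {
    mkdir -p "$outdir"
    curl --silent "$search_url" | extract_image_urls | while read -r image; do
        wget "$image" -O "$outdir/$(image_filename "$image")"
    done
}

# Uncomment to run against the live site:
# main
```

Keeping the download step inside main means the two helper functions can be tried on their own, for example by piping a saved HTML file into extract_image_urls.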

Finish bootcamp

Go back to the main page.