Scraper

Image: Picks the biggest image on the page by file size
Title: OG title, else the title in the meta tags, else the HTML title
Description: OG description, else the meta description
Price: Picks a non-zero dollar price that matches a regular expression
Site Name: OG site_name

Note: Image selection also takes the OG image into consideration.
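
The fallback order for the text fields above can be sketched as follows. This is a minimal illustration, not the code in scraper.js: the helper names (firstNonEmpty, pickTitle, pickDescription, pickPrice), the shape of the meta object, and the price pattern are all assumptions made for the example. Image selection is left out because it depends on fetching each candidate image.

```javascript
// Hypothetical helpers; the project's real logic lives in scraper.js.

// Return the first candidate that is a non-empty string, else null.
function firstNonEmpty(candidates) {
  for (const value of candidates) {
    if (typeof value === 'string' && value.trim() !== '') {
      return value.trim();
    }
  }
  return null;
}

// Title: OG title, else the title meta tag, else the HTML <title>.
function pickTitle(meta) {
  return firstNonEmpty([meta.ogTitle, meta.metaTitle, meta.htmlTitle]);
}

// Description: OG description, else the meta description.
function pickDescription(meta) {
  return firstNonEmpty([meta.ogDescription, meta.metaDescription]);
}

// Price: first non-zero dollar amount found in the text, else null.
// The pattern is an assumption, not the expression used by the project.
function pickPrice(text) {
  const pattern = /\$\s*(\d+(?:,\d{3})*(?:\.\d{1,2})?)/g;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    const amount = parseFloat(match[1].replace(/,/g, ''));
    if (amount > 0) {
      return amount;
    }
  }
  return null;
}
```

Zero amounts such as "$0.00" are skipped, so a page that shows a placeholder price before the real one still yields the real one.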

Extensions

A custom user agent can be defined for PhantomJS (see the PhantomJS API documentation and use it in conjunction with the Node.js library for PhantomJS).
A custom element in which to look for the price can be specified per domain (see function getPrice).
Custom URL transformations (e.g. from mobile to desktop) can be specified (see functions applyUrlRules, getAsosToUSFromUK, urbanTransformers, shopBigBop).

In general, custom rules for any of these aspects can be added easily.
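
As an illustration of a URL transformation rule, the sketch below rewrites a mobile URL to its desktop equivalent. The names rewriteUrl and urlRules are hypothetical and do not reflect the signatures of applyUrlRules or the other functions in scraper.js; the "m." subdomain convention is also an assumption.

```javascript
// Hypothetical rule table: each rule decides whether it applies to a
// parsed URL and, if so, returns the rewritten URL object.
const urlRules = [
  {
    // Assumed convention: mobile sites live on an "m." subdomain.
    matches: (parsed) => parsed.hostname.startsWith('m.'),
    transform: (parsed) => {
      parsed.hostname = 'www.' + parsed.hostname.slice(2);
      return parsed;
    },
  },
];

// Apply every matching rule in order and return the resulting URL string.
function rewriteUrl(rawUrl, rules) {
  let parsed = new URL(rawUrl);
  for (const rule of rules) {
    if (rule.matches(parsed)) {
      parsed = rule.transform(parsed);
    }
  }
  return parsed.toString();
}
```

Keeping the rules in a table means a new domain-specific rule is one more entry rather than a change to the rewriting function.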

Limitations

The current version of PhantomJS doesn't handle Refresh headers.

Pre-Requisites

Install Node.js and npm.
Install PhantomJS, a headless browser used for parsing.

Install Guide

After cloning the repo, run

npm install

Also make sure PhantomJS is installed:

brew install phantomjs

Running the Server

Install the up server:

npm install -g up

From the project root directory run

NODE_ENV=development up server.js

This brings up the server on http://localhost:3000/. It listens for POST requests on http://localhost:3000/scraper, where the variable url holds the address of the webpage to scrape.

Alternatively, you can set the NODE_ENV variable in your bash or zsh rc file so that all you have to do is run

up server.js
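
Once the server is running, a request to the /scraper endpoint might be sent from Node.js as sketched below. This assumes the server reads url from a form-encoded POST body and replies with JSON; neither detail is confirmed here, so adjust the encoding if server.js expects something else. The function names buildScrapeRequest and scrape are hypothetical.

```javascript
const http = require('http');
const querystring = require('querystring');

// Build the request options and body for a scrape request.
// Assumption: the server reads `url` from a form-encoded POST body.
function buildScrapeRequest(pageUrl) {
  const body = querystring.stringify({ url: pageUrl });
  const options = {
    hostname: 'localhost',
    port: 3000,
    path: '/scraper',
    method: 'POST',
    headers: {
      'Content-Type': 'application/x-www-form-urlencoded',
      'Content-Length': Buffer.byteLength(body),
    },
  };
  return { options, body };
}

// Send the request and hand the parsed response to the callback.
// Assumption: the response body is JSON.
function scrape(pageUrl, callback) {
  const { options, body } = buildScrapeRequest(pageUrl);
  const req = http.request(options, (res) => {
    let raw = '';
    res.setEncoding('utf8');
    res.on('data', (chunk) => { raw += chunk; });
    res.on('end', () => {
      let parsed;
      try {
        parsed = JSON.parse(raw);
      } catch (err) {
        return callback(err);
      }
      callback(null, parsed);
    });
  });
  req.on('error', callback);
  req.end(body);
}

// Usage (requires the server to be running):
// scrape('http://example.com/product', (err, result) => {
//   if (err) throw err;
//   console.log(result.status);
// });
```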

Testing

Testing requires nodeunit, which is listed in devDependencies.

To run the tests, run npm test from the root directory.

Response

A sample response looks like

{
    'status': 'ok',
    'title': 'Black skirt - GAP',
    'description': 'Just another skirt.',
    'price': 10,
    'image': 'http://path/to/biggest/image',
    'alternateImages': [
        'http://foo/bar',
        'http://for/bar/baz'
    ],
    'siteName': null,
    'finalDestination': 'http://final/redirect/url'
}

when fetching the URL doesn't cause errors. When it does, the response is

{
    'status': 'error'
}
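
A caller would typically check status before reading any other field, and allow for fields such as siteName being null. The sketch below shows one way to do that; the summarize function is hypothetical and only uses the field names from the sample responses above.

```javascript
// Turn a scraper response into a one-line summary, or null on error.
function summarize(response) {
  // Error responses carry only a status, so stop before reading other fields.
  if (!response || response.status !== 'ok') {
    return null;
  }
  // Price and site name are optional in the summary.
  const price = typeof response.price === 'number'
    ? ' ($' + response.price.toFixed(2) + ')'
    : '';
  const site = response.siteName ? ' [' + response.siteName + ']' : '';
  return response.title + price + site;
}
```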
