yolo-scraper

A simple way to structure your web scraper.

  • Define the request.
  • Extract the data from the response.
  • Validate the data against JSON Schema.

But what is web scraping?

install

Using NPM:

npm install yolo-scraper --save

usage

Define your scraper function.

var yoloScraper = require('yolo-scraper');

var scraper = yoloScraper.createScraper({

  request: function (username) {
    return 'https://www.npmjs.com/~' + username.toLowerCase();
  },

  extract: function (response, body, $) {
    return $('.collaborated-packages li').toArray().map(function (element) {
      var $element = $(element);
      return {
        name: $element.find('a').text(),
        url: $element.find('a').attr('href'),
        version: $element.find('strong').text()
      };
    });
  },

  schema: {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type" : "array",
    "items": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "name": { "type": "string" },
        "url": { "type": "string", "format": "uri" },
        "version": { "type": "string", "pattern": "^v\\d+\\.\\d+\\.\\d+$" }
      },
      "required": [ "name", "url", "version" ]
    }
  }

});

Then use it.

scraper('masterT')
  .then(function (data) {
    console.log(data)
  })
  .catch(function (error) {
    console.error(error)
  })

documentation

ValidationError

An Error instance with an additional property, errorObjects, which contains all the error information; see the ajv error objects.
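
As a rough sketch of how these errors might be read in a catch handler: the helper below, formatValidationErrors, is hypothetical and not part of yolo-scraper. It assumes each entry of errorObjects follows the ajv error object layout, with a message and either instancePath (newer ajv releases) or dataPath (older ones).

```javascript
// Hypothetical helper, not part of yolo-scraper.
// Turns the ajv error objects carried by a ValidationError into short strings.
function formatValidationErrors(error) {
  // Errors that are not validation errors have no errorObjects; treat them as empty.
  var errorObjects = Array.isArray(error.errorObjects) ? error.errorObjects : [];
  return errorObjects.map(function (errorObject) {
    // Assumption: the location is in instancePath (newer ajv) or dataPath (older ajv).
    var path = errorObject.instancePath || errorObject.dataPath || '(root)';
    return path + ' ' + errorObject.message;
  });
}
```

It could be called from the catch handler of a scraper, for example formatValidationErrors(error).forEach(function (line) { console.error(line) }).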

createScraper(options)

Returns a scraper function defined by the options.

var yoloScraper = require('yolo-scraper');

var options = {
  // ...
};
var scraper = yoloScraper.createScraper(options);

The scraper function returns a Promise that resolves with the validated extracted data or rejects with an Error.

scraper(params)
  .then(function (data) {
    console.log(data)
  })
  .catch(function (error) {
    console.error(error)
  })

options.paramsSchema

The JSON schema that defines the shape of the accepted arguments passed to options.request. When invalid, an Error will be thrown.

Optional
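
A minimal sketch of what such a schema could look like. It assumes the scraper is called with a single object argument such as { username: 'masterT' } and that the schema is applied to that argument; the username property is only an illustration.

```javascript
// Assumed call shape: scraper({ username: 'masterT' }).
// The "username" property is hypothetical and only used for illustration.
var paramsSchema = {
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "username": { "type": "string", "minLength": 1 }
  },
  "required": [ "username" ]
};

// It would sit next to the other options:
// yoloScraper.createScraper({ paramsSchema: paramsSchema, request: ..., extract: ..., schema: ... });
```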

options.request = function(params)

Function that takes the arguments passed to your scraper function and returns the options to pass to the axios module to make the network request.

Required
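
The usage example returns a plain URL string, but an axios request config object should work as well. The sketch below is an assumption-laden illustration: the URL, the query parameter name and the params.query property are placeholders, while method, url, params, headers and timeout are standard axios config fields.

```javascript
// Hypothetical request function; the URL and the "q" parameter are placeholders.
function request(params) {
  return {
    method: 'get',
    url: 'https://example.com/search',
    params: { q: params.query },        // axios serializes this into the query string
    headers: { 'Accept': 'text/html' },
    timeout: 5000                       // in milliseconds
  };
}
```

Returning an object instead of a string leaves room for headers, timeouts and other per-request settings without changing the rest of the scraper.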

options.extract = function(response, body, $)

Function that takes the axios response, the response body (String) and a cheerio instance. It returns the extracted data you want.

Required

options.schema

The JSON schema that defines the shape of your extracted data. When your data is invalid, the Promise returned by your scraper rejects with an Error containing the validation message.

Required

options.cheerioOptions

The options to pass to cheerio when it loads the response body.

Optional, default: {}

options.ajvOptions

The options to pass to ajv when it compiles the JSON schemas.

Optional, default: {allErrors: true} (checks all rules and collects all errors)
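
A small sketch of the two optional option objects, as an illustration only. xmlMode is a cheerio parsing option, and allErrors and verbose are ajv options. Whether custom ajvOptions are merged with the default or replace it is not stated here, so the sketch repeats allErrors explicitly.

```javascript
// Illustrative values only; pick the options your pages and schemas need.
var parserOptions = {
  // cheerio: parse the body as XML instead of HTML (useful for feeds or sitemaps).
  cheerioOptions: { xmlMode: true },
  // ajv: keep collecting every error and include schema and data details in each one.
  // allErrors is repeated on the assumption that custom options may replace the default.
  ajvOptions: { allErrors: true, verbose: true }
};

// These keys would be passed together with request, extract and schema:
// yoloScraper.createScraper({ request: ..., extract: ..., schema: ...,
//   cheerioOptions: parserOptions.cheerioOptions, ajvOptions: parserOptions.ajvOptions });
```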

dependencies

  • axios - Promise based HTTP client for the browser and node.js.
  • cheerio - Fast, flexible, and lean implementation of core jQuery designed specifically for the server.
  • ajv - The fastest JSON Schema Validator. Supports draft-04/06/07.

dev dependencies

  • jasmine - Simple JavaScript testing framework for browsers and node.js.
  • nock - HTTP server mocking and expectations library for Node.js.

test

npm test

license

MIT