A simple way to structure your web scraper.
- Define the request.
- Extract the data from the response.
- Validate the data against JSON Schema.
Using npm:

```shell
npm install yolo-scraper --save
```
Define your scraper function.

```js
var yoloScraper = require('yolo-scraper');

var scraper = yoloScraper.createScraper({

  request: function (username) {
    return 'https://www.npmjs.com/~' + username.toLowerCase();
  },

  extract: function (response, body, $) {
    return $('.collaborated-packages li').toArray().map(function (element) {
      var $element = $(element);
      return {
        name: $element.find('a').text(),
        url: $element.find('a').attr('href'),
        version: $element.find('strong').text()
      };
    });
  },

  schema: {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "array",
    "items": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "name": { "type": "string" },
        "url": { "type": "string", "format": "uri" },
        "version": { "type": "string", "pattern": "^v\\d+\\.\\d+\\.\\d+$" }
      },
      "required": ["name", "url", "version"]
    }
  }

});
```
Then use it.

```js
scraper('masterT')
  .then(function (data) {
    console.log(data);
  })
  .catch(function (error) {
    console.error(error);
  });
```
When the extracted data is invalid, the promise rejects with an Error instance that has an additional Object property, errorObjects, containing all the error information; see the ajv errors documentation.
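For instance, a small helper can turn those error objects into readable messages. This is only a sketch: the dataPath and message field names follow ajv's error format, and the helper name is made up here.

```js
// Sketch: format ajv error objects (as found in error.errorObjects)
// into readable lines. The dataPath and message fields follow ajv's
// error format; formatErrorObjects is a hypothetical helper name.
function formatErrorObjects(errorObjects) {
  return (errorObjects || []).map(function (errorObject) {
    return errorObject.dataPath + ' ' + errorObject.message;
  });
}
```

It could be called from the .catch handler above, e.g. `console.error(formatErrorObjects(error.errorObjects).join('\n'))`.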
Returns a scraper function defined by the options.

```js
var yoloScraper = require('yolo-scraper');

var options = {
  // ...
};
var scraper = yoloScraper.createScraper(options);
```
The scraper function returns a Promise that resolves with the valid extracted data, or rejects with an Error.

```js
scraper(params)
  .then(function (data) {
    console.log(data);
  })
  .catch(function (error) {
    console.error(error);
  });
```
The JSON schema that defines the shape of the accepted arguments passed to options.request. When the arguments are invalid, an Error is thrown.
Optional
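For the username example above, this schema might simply accept a non-empty string. This is a sketch; the exact constraints are an assumption, not part of the library.

```js
// Sketch: a possible schema for the username argument (constraints assumed).
var paramsSchema = {
  "type": "string",
  "minLength": 1
};
```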
Function that takes the arguments passed to your scraper function and returns the options to pass to the axios module to make the network request.
Required
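Because the return value is handed to axios, the function can return a full axios request config object instead of a plain URL string. A sketch, assuming the usage example above; the header value is made up:

```js
// Sketch: a request function returning an axios config object
// instead of a URL string. The User-Agent value is an assumption.
function request(username) {
  return {
    url: 'https://www.npmjs.com/~' + encodeURIComponent(username),
    headers: { 'User-Agent': 'yolo-scraper-example' }
  };
}
```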
Function that takes the axios response, the response body (String) and a cheerio instance. It returns the extracted data you want.
Required
The JSON schema that defines the shape of your extracted data. When your data is invalid, an Error with the validation message will reject the returned Promise.
Required
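As an illustration, the "version" pattern from the usage example above requires a leading "v" followed by three dot-separated numbers:

```js
// The same pattern as in the schema above, as a JavaScript RegExp.
var versionPattern = /^v\d+\.\d+\.\d+$/;

versionPattern.test('v1.2.3');  // true
versionPattern.test('1.2.3');   // false, missing the leading "v"
```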
The options to pass to cheerio when it loads the response body.
Optional, default: {}
The options to pass to ajv when it compiles the JSON schemas.
Optional, default: {allErrors: true}
- It checks all rules, collecting all errors.
- axios - Promise based HTTP client for the browser and node.js.
- cheerio - Fast, flexible, and lean implementation of core jQuery designed specifically for the server.
- ajv - The fastest JSON Schema Validator. Supports draft-04/06/07.
- jasmine - Simple JavaScript testing framework for browsers and node.js.
- nock - HTTP server mocking and expectations library for Node.js.
Run the tests:

```shell
npm test
```
MIT