domwaiter

A well-behaved URL scraper that brings you delicious DOM objects

Do you have a large collection of URLs you want to scrape? Scraping one page at a time is too slow. Scraping all the pages at once could put too much stress on the website, and could also crash your Node.js process due to excess memory usage. That's where this package comes in: its built-in rate limiter lets you collect those pages quickly (and respectfully), and its event-emitting API keeps memory usage low.

Features

Uses Promises so it's async/await friendly
Event-emitting API to keep a low memory footprint
Supports fetching JSON too (instead of HTML DOM)
Rate limiting powered by bottleneck
DOM parsing powered by cheerio (optional; can be disabled)
HTTP requests powered by got

Installation

npm install domwaiter

Usage

const domwaiter = require('domwaiter')

const pages = [
  { url: 'https://help.github.com/en', language: 'English' },
  { url: 'https://help.github.com/ja', language: 'Japanese' },
  { url: 'https://help.github.com/cn', language: 'Chinese' }
]

domwaiter(pages)
  .on('page', (page) => {
    console.log(page.language, page.$('title').text())
  })
  .on('error', (err) => {
    console.error(err)
  })
  .on('done', () => {
    console.log('Done!')
  })
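
If you would rather await the whole run than wire up handlers inline, the emitter can be wrapped in a Promise. The following is a minimal sketch, not part of domwaiter itself: collectFromPages and pickValue are hypothetical names, and it assumes that done is still emitted after one or more error events. It keeps only the small value you pick from each page, so the full bodies are not held in memory.

```javascript
// Sketch only: `collectFromPages` and `pickValue` are hypothetical helpers,
// not part of the domwaiter API.
// Assumption: 'done' is still emitted even if some pages produced 'error'.
function collectFromPages (emitter, pickValue) {
  return new Promise((resolve) => {
    const results = []
    const errors = []

    emitter
      // Keep only the value derived from each page, not the page itself.
      .on('page', (page) => results.push(pickValue(page)))
      // Record errors instead of rejecting, so one bad URL does not end the run.
      .on('error', (err) => errors.push(err))
      .on('done', () => resolve({ results, errors }))
  })
}

// Usage (inside an async function, with `pages` defined as above):
//
//   const domwaiter = require('domwaiter')
//   const { results, errors } = await collectFromPages(
//     domwaiter(pages),
//     (page) => ({ language: page.language, title: page.$('title').text() })
//   )
```

Resolving with both lists, rather than rejecting on the first error, is a design choice for this sketch; reject inside the error handler instead if a single failure should abort your script.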

API

This module exports a single function domwaiter:

`domwaiter(pages, [opts])`

pages Array (required) - Each item in the array must have a url property with a fully-qualified HTTP(S) URL. These objects can optionally have other properties, which will be included in the emitted page events. See below.
opts Object (optional)
- parseDOM Boolean - Defaults to true. Set to false if you don't need the parsed page.$ DOM object. Disabling DOM parsing will boost performance.
- json Boolean - Defaults to false. Set to true if you're fetching JSON instead of HTML. If true, a json property will be present on each emitted page object (and the $ and body properties will NOT be present).
- maxConcurrent Number - How many jobs can be executing at the same time. Defaults to 5. This option is passed to the underlying bottleneck instance.
- minTime Number - How long to wait after launching a job before launching another one. Defaults to 500 (milliseconds). This option is passed to the underlying bottleneck instance.
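
To get a feel for how minTime affects a run, here is a small sketch. The helper minLaunchWindowMs is hypothetical (it is not part of domwaiter or bottleneck) and the option values are only illustrative. Because each launch waits at least minTime after the previous one, launching N requests takes at least (N - 1) * minTime milliseconds, regardless of maxConcurrent.

```javascript
// Hypothetical helper, not part of domwaiter: a lower bound on the time
// needed just to launch every request, given the minTime spacing.
function minLaunchWindowMs (pageCount, minTime = 500) {
  return Math.max(0, pageCount - 1) * minTime
}

// Illustrative options for fetching JSON gently.
const opts = {
  json: true, // emitted pages carry a `json` property instead of `$` and `body`
  maxConcurrent: 2, // at most two requests in flight
  minTime: 1000 // at least one second between launches
}

console.log(minLaunchWindowMs(3)) // 1000 (default minTime of 500)
console.log(minLaunchWindowMs(100, opts.minTime)) // 99000

// Usage (assumes domwaiter is installed and `pages` is an array of { url }):
//
//   const domwaiter = require('domwaiter')
//   domwaiter(pages, opts).on('page', (page) => console.log(page.url, page.json))
```

This is only a floor: actual run time also depends on how long each response takes.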

Events

The domwaiter function returns an event emitter which emits the following events:

beforePageLoad - Emitted with page object for any optional prehandling you want to do, e.g. setting up a request timer.
page - Emitted after the page has been requested and the response is parsed. Emitted with an object which is a shallow clone of the original page object you provided, but with two added properties:
- body: the raw HTTP response body text
- $: The body parsed into a jQuery-like cheerio DOM object.
error - Emitted when an error occurs fetching a URL
done - Emitted when all the pages have been fetched.
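
The request timer mentioned under beforePageLoad could be sketched as follows. attachRequestTimer, onTiming, and now are hypothetical names, not part of domwaiter. Since the page event carries a shallow clone rather than the original object, this sketch matches the two events by url, which assumes every URL in your pages array is unique.

```javascript
// Sketch only: `attachRequestTimer`, `onTiming`, and `now` are hypothetical.
// Assumption: each URL appears once in the `pages` array.
function attachRequestTimer (emitter, onTiming, now = Date.now) {
  const startedAt = new Map()

  // Record the start time just before the request is made.
  emitter.on('beforePageLoad', (page) => {
    startedAt.set(page.url, now())
  })

  // The emitted page is a clone, so look the start time up by URL.
  emitter.on('page', (page) => {
    const start = startedAt.get(page.url)
    if (start === undefined) return
    startedAt.delete(page.url)
    onTiming({ url: page.url, elapsedMs: now() - start })
  })

  // Entries for pages that failed never see a 'page' event; clear them at the end.
  emitter.on('done', () => startedAt.clear())

  return emitter
}

// Usage (assumes domwaiter is installed and `pages` is defined):
//
//   const domwaiter = require('domwaiter')
//   attachRequestTimer(domwaiter(pages), ({ url, elapsedMs }) => {
//     console.log(`${url} took ${elapsedMs}ms`)
//   })
```

The now parameter exists only so the clock can be replaced in tests; in normal use the default Date.now is enough.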

Tests

npm install
npm test

Dependencies

bottleneck: Distributed task scheduler and rate limiter
cheerio: Tiny, fast, and elegant implementation of core jQuery designed specifically for the server
got: Human-friendly and powerful HTTP request library for Node.js

Dev Dependencies

jest: Delightful JavaScript Testing.
nock: HTTP server mocking and expectations library for Node.js
standard: JavaScript Standard Style

domwaiter

1.4.0

@zeke
