Stack Scraper is a system for efficiently scraping information from complex web sites in a repeatable way, exporting directly to a data store.
Stack Scraper is good at collecting lots of semi-structured data from complicated, or even poorly written, web sites in a repeatable manner.
See the example directory for a full sample scraper.
Stack Scraper provides the code to write a simple command-line application for downloading semi-structured data from complex web sites. However, you'll need to take a number of things into consideration when building your stack-scraper implementation, namely:
Arguments:
type: Type of scraper to load (e.g. 'images' or 'artists').
source: The name of the source to download (e.g. 'ndl' or '*').
Options:
--scrape: Scrape and process the results from the already-downloaded pages.
--process: Process the results from the already-downloaded pages.
--reset: Don't resume from where the last scrape left off.
--delete: Delete all the data associated with the particular source.
--debug: Output additional debugging information.
--test: Test scraping and extraction of a source.
--test-url: Test extraction against a specified URL.
Initialization Properties:
rootDataDir (String): A full path to the root directory where downloaded data is stored. (See "File System" for more information.)
scrapersDir (String): A full path to the directory where scraper .js files are stored. (See "Scrapers" for more information.)
directories (Object): ... rootDataDir. (See "File System" for more information.)
model (Function): A function representing the model in which extracted data will be stored. (See "Datastore and Data Models" for more information.)
logModel (Function): A function representing the log model for storing information about an in-progress site scrape. (See "Datastore and Data Models" for more information.)
postProcessors (Object, optional): An object whose keys are the names of model properties which should be processed and values are functions through which the data will be processed. (See "Post-Processors" for more information.)
MongoDB + Mongoose
dbFind(filter:Object, callback)
dbFindById(id:String, callback)
dbSave(data:Object, callback)
dbUpdate(data:Object, newData:Object, callback)
dbRemove(filter:Object, callback)
dbLog(data:Object, callback)
dbStreamLog(filter:Object) -> Stream
dbRemoveLog(filter:Object, callback)
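To make the shape of the eight methods above more concrete, here is a minimal sketch that implements them against in-memory arrays rather than MongoDB + Mongoose. Several things here are assumptions, not something the list above states: the names `createMemoryStore` and `matches` are hypothetical, callbacks are assumed to follow the Node.js `(err, result)` convention, filters are assumed to mean exact equality on each key, and the values handed to each callback are guesses. A real Mongoose-backed implementation would call the callbacks asynchronously.

```javascript
const { Readable } = require("stream");

// Hypothetical in-memory stand-in for a Mongoose-backed datastore.
// It only illustrates the method signatures listed above.
function createMemoryStore() {
    let nextId = 1;
    let records = [];
    let logs = [];

    // Assumption: a filter matches when every key in it strictly
    // equals the same key on the document. An empty filter matches all.
    function matches(doc, filter) {
        return Object.keys(filter || {}).every(function (key) {
            return doc[key] === filter[key];
        });
    }

    return {
        dbFind: function (filter, callback) {
            callback(null, records.filter(function (doc) {
                return matches(doc, filter);
            }));
        },
        dbFindById: function (id, callback) {
            const found = records.find(function (doc) {
                return doc._id === id;
            });
            callback(null, found || null);
        },
        dbSave: function (data, callback) {
            // Assign a string id unless the caller supplied one.
            const doc = Object.assign({ _id: String(nextId++) }, data);
            records.push(doc);
            callback(null, doc);
        },
        dbUpdate: function (data, newData, callback) {
            // `data` is expected to be a record returned by dbSave or dbFind,
            // so mutating it updates the stored copy.
            Object.assign(data, newData);
            callback(null, data);
        },
        dbRemove: function (filter, callback) {
            records = records.filter(function (doc) {
                return !matches(doc, filter);
            });
            callback(null);
        },
        dbLog: function (data, callback) {
            logs.push(data);
            callback(null, data);
        },
        dbStreamLog: function (filter) {
            // Readable.from turns the matching entries into an object-mode stream.
            return Readable.from(logs.filter(function (entry) {
                return matches(entry, filter);
            }));
        },
        dbRemoveLog: function (filter, callback) {
            logs = logs.filter(function (entry) {
                return !matches(entry, filter);
            });
            callback(null);
        }
    };
}
```

A store like this could be handy for exercising scraper logic in tests without a running database; under the same assumptions, swapping each method body for the equivalent Mongoose model call would leave the signatures unchanged.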