Example Web Scraper

Component ID

1062522

Component name

Example Web Scraper

Component type

module

Maintenance status

Development status

Component security advisory coverage

not-covered

Downloads

4336

Component created

Component changed

Component body

This example demonstrates how to build a Drupal-native web scraper. It imports events from a single month of Stanford University's calendar by navigating to the page for each day and creating a node for each event on that day's list. Each event node is then queued so that its detail page can be scraped in a second pass.
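
For orientation, the sketch below shows the pattern the packaged Feeds configuration automates: fetch a day page, extract each event with XPath, create a node for it, and queue the event's detail URL for the second pass. This is hypothetical illustrative code, not the module's implementation; the packaged importers do all of this declaratively, and the function, queue, and content-type names here are invented.

  <?php
  /**
   * Hypothetical sketch of the crawl pattern the packaged Feeds importers
   * implement declaratively (Drupal 7 APIs).
   */
  function example_web_scraper_import_day($day_url) {
    // Fetch the calendar page for a single day.
    $response = drupal_http_request($day_url);
    if ($response->code != 200) {
      return;
    }

    // Parse the day's event list with XPath, as Feeds XPath HTML Parser does.
    $doc = new DOMDocument();
    @$doc->loadHTML($response->data);
    $xpath = new DOMXPath($doc);

    // The query is illustrative; the real one depends on the calendar markup.
    foreach ($xpath->query('//div[@class="event"]/a') as $link) {
      // Create a stub node for the event (the node processor's job).
      $node = new stdClass();
      $node->type = 'event';
      node_object_prepare($node);
      $node->title = $link->textContent;
      $node->language = LANGUAGE_NONE;
      node_save($node);

      // Queue the detail page for a second scraping pass (the role of
      // Feeds Crawler and Feeds SelfNode Processor). Resolving relative
      // URLs is omitted for brevity.
      $queue = DrupalQueue::get('example_event_details');
      $queue->createItem(array(
        'nid' => $node->nid,
        'url' => $link->getAttribute('href'),
      ));
    }
  }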

All functionality is provided by Feeds, Feeds XPath HTML Parser, Feeds Crawler, Feeds SelfNode Processor, and Feeds Tamper; this module itself includes only an example configuration packaged with Features. Developers and site builders interested in web scraping may find it a helpful starting point.
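
As a rough illustration of what the Features packaging amounts to, a Feeds importer exported with Features reduces to an implementation of hook_feeds_importer_default(). The importer ID, plugin choices, and configuration keys below are assumptions, heavily trimmed for illustration; the module's actual export is much longer.

  <?php
  /**
   * Implements hook_feeds_importer_default().
   *
   * Trimmed, illustrative sketch of a Features-exported Feeds importer;
   * the IDs, plugin choices, and config shown here are assumptions.
   */
  function example_web_scraper_feeds_importer_default() {
    $export = array();

    $feeds_importer = new stdClass();
    $feeds_importer->disabled = FALSE;
    $feeds_importer->api_version = 1;
    $feeds_importer->id = 'event_day_importer';
    $feeds_importer->config = array(
      'name' => 'Event day importer',
      'fetcher' => array('plugin_key' => 'FeedsHTTPFetcher'),
      'parser' => array('plugin_key' => 'FeedsXPathParserHTML'),
      'processor' => array(
        'plugin_key' => 'FeedsNodeProcessor',
        'config' => array('content_type' => 'event'),
      ),
    );
    $export['event_day_importer'] = $feeds_importer;

    return $export;
  }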

Additional recommended modules:

  • Feeds UI - for viewing and modifying Feeds configurations
  • Feeds Tamper UI - for viewing and modifying Feeds Tamper configurations
  • Queue UI - for viewing queued jobs
  • Drush - for running groups of queued cron jobs on demand (see the sketch after this list)
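
On the Drush point: the queued scraping jobs are handled by a cron queue worker, so running cron with Drush (drush cron) processes batches of queued events on demand. Below is a minimal sketch of how such a worker is declared in Drupal 7, with hypothetical names.

  <?php
  /**
   * Implements hook_cron_queue_info().
   *
   * Hypothetical sketch: declares a worker for the detail-scraping queue
   * so that cron (including "drush cron") processes queued events.
   */
  function example_web_scraper_cron_queue_info() {
    return array(
      'example_event_details' => array(
        'worker callback' => 'example_web_scraper_scrape_details',
        // Maximum seconds to spend on this queue per cron run.
        'time' => 60,
      ),
    );
  }

  /**
   * Worker callback: scrape one queued event's detail page.
   */
  function example_web_scraper_scrape_details($item) {
    // $item carries the nid and detail URL queued during the day-page pass;
    // fetch the page, parse the details, and update the node here.
  }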

Note:

Please be mindful that running this software produces a somewhat higher server load than ordinary web browsing, so use this demonstration tool carefully to avoid abusing Stanford's resources. Before making configuration changes, consult the handbooks of each module or submit a support request. Once you have learned how to build scrapers suitable to your needs, please disable the provided importers.

Anticipated Questions:

Q. Is it ethical to pull content from Stanford's events calendar?
A. Stanford's Policies and Procedures indicate that the university supports feeding content to other websites, and its robots.txt allows all crawlers to index the site with the exception of /images and /xml. So yes, it's fine.

Q. Why use Stanford's calendar as a demonstration?
A. Two criteria were used when evaluating websites that offer structured data but no public API: the ethics of indexing the source's content (answered above) and whether the site has enough bandwidth to handle an increased traffic load. Stanford's calendar appears to be a perfect match on both counts.

Q. I have an idea/patch to improve the example's clarity, information architecture, crawling pattern, etc. Where can I submit it?
A. In this project's issue queue. However, if the issue concerns functionality provided by a module on which this project depends, please add it to that module's issue queue instead.

Q. I packaged another scraper. Where should I post it?
A. Please wait for more information to be posted.