WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes web pages automatically.
WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX class library.
The Crawler Workbench is a graphical user interface that lets you configure and control a customizable web crawler. Using the Crawler Workbench, you can:
- Visualize a collection of web pages as a graph
- Save pages to your local disk for offline browsing
- Concatenate pages together for viewing or printing them as a single document
- Extract all text matching a certain pattern from a collection of pages
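The pattern-extraction idea in the last point can be sketched in plain Java using the standard `java.util.regex` package. This is a generic illustration, not the Workbench's own matching engine; the class and method names here are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternExtractor {
    // Collect every substring of the page text that matches the pattern.
    static List<String> extract(String pageText, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(pageText);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">One</a> <a href=\"b.html\">Two</a>";
        // Pull out every href attribute from anchor tags.
        System.out.println(extract(html, "href=\"[^\"]*\""));
    }
}
```

Run over a collection of fetched pages, the same loop would accumulate all matches into one result set, which is essentially what the Workbench's extraction feature exposes through its GUI.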
WebSPHINX class library
The WebSPHINX class library provides support for writing web crawlers in Java. The class library offers a number of features:
- Multithreaded Web page retrieval in a simple application framework
- An object model that explicitly represents pages and links
- Support for reusable page content classifiers
- Tolerant HTML parsing
- Support for the robot exclusion standard
- Pattern matching, including regular expressions, Unix shell wildcards, and HTML tag expressions. Regular expressions are provided by the Apache jakarta-regexp regular expression library.
- Common HTML transformations, such as concatenating pages, saving pages to disk, and renaming links
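To make the robot exclusion point concrete, here is a minimal stand-alone sketch of the idea in plain Java. It is not WebSPHINX's actual implementation (the class name and the simplification of ignoring `User-agent` sections are assumptions for this example): parse the `Disallow` lines of a robots.txt file and refuse any path under a disallowed prefix.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotExclusion {
    private final List<String> disallowed = new ArrayList<>();

    // Parse the Disallow rules from robots.txt text.
    // Simplified sketch: ignores User-agent sections and applies every rule.
    public RobotExclusion(String robotsTxt) {
        for (String line : robotsTxt.split("\\r?\\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (!path.isEmpty()) {
                    disallowed.add(path);
                }
            }
        }
    }

    // A path may be fetched unless it falls under a Disallow prefix.
    public boolean allowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RobotExclusion robots =
            new RobotExclusion("User-agent: *\nDisallow: /private/\nDisallow: /tmp/");
        System.out.println(robots.allowed("/index.html"));     // true
        System.out.println(robots.allowed("/private/a.html")); // false
    }
}
```

A crawler would consult such a check before every fetch, which is the behavior the library's robot exclusion support automates.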
License: Apache License